Integrated Data-Parallel Computation
Why Data-Parallel Computation?
Operational intelligence requires that fast-changing data be analyzed quickly to provide immediate feedback to a live system. Beyond serving as a fast, scalable repository for that data, the server cluster that hosts an IMDG provides an ideal platform for performing data-parallel computation on it. The IMDG’s servers have CPU resources that can run analytics code directly on the in-memory data they store. This takes advantage of the IMDG’s data partitioning to shorten execution times and avoid the overhead of moving data across the network, reducing compute times to seconds (or less). It also leverages the IMDG’s scalability: adding servers enables growing workloads to be analyzed without lengthening compute times.
Another important advantage of data-parallel computation is its simplicity, which makes development easy and fast. To implement a data-parallel computation, a developer writes a single method that the IMDG executes in parallel on a large collection of objects held within the IMDG. Compared to task-parallel programs (for example, those built on Storm), which involve multiple tasks with explicit communication and coordination, data-parallel programs are significantly easier to develop.
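To make the contrast concrete, here is a minimal sketch of the data-parallel model in plain Java: the developer writes one analysis method, and the runtime applies it to every element of a collection in parallel. A parallel stream on a single JVM stands in for the IMDG's distributed execution; the class, method names, and the overheating threshold are illustrative assumptions, not part of any product API.

```java
import java.util.List;

public class DataParallelSketch {
    // The single "analysis" method the developer writes; it sees one object
    // at a time. The 90-degree threshold is an assumed, domain-specific rule.
    static boolean isOverheating(double temperatureC) {
        return temperatureC > 90.0;
    }

    // The runtime (here, the fork/join pool; in an IMDG, the grid servers)
    // applies the method to every element of the collection in parallel.
    static long countOverheating(List<Double> readings) {
        return readings.parallelStream()
                       .filter(DataParallelSketch::isOverheating)
                       .count();
    }

    public static void main(String[] args) {
        long hot = countOverheating(List.of(72.0, 95.5, 88.1, 101.2));
        System.out.println(hot); // prints 2
    }
}
```

Note that there is no explicit task creation, messaging, or coordination in the developer's code; partitioning the work is entirely the runtime's job.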
Object-Oriented Computing within an IMDG
Because IMDGs use an object-oriented view of data, ScaleOut’s compute engine follows an object-oriented model for structuring data-parallel computations. Called “parallel method invocation” (PMI), this feature lets developers specify an analysis (or “eval”) method that the compute engine executes in parallel on a collection of objects stored within the IMDG. This method analyzes an object and possibly updates it using a domain-specific algorithm (such as analyzing a machine for an impending failure or a shopping cart to make a recommendation).
The developer can also specify a second (“merge”) method for combining results created by the parallel computation so that they can be collected for delivery to the operational system. The compute engine executes the merge method in parallel on each grid server and then globally combines the results to produce a single object for return to the client application.
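The eval/merge pattern described above can be sketched as follows in plain Java. The class, record, and method names are illustrative assumptions, not ScaleOut's actual API: `eval` analyzes one object and emits a partial result, and `merge` combines two partial results (and so must be associative, since the engine is free to combine them in any order).

```java
import java.util.List;
import java.util.Optional;

public class EvalMergeSketch {
    // A shopping-cart-style example (assumed): eval scores one cart, and
    // merge keeps the higher-scoring recommendation candidate.
    record Cart(String id, double total) {}
    record Result(String cartId, double score) {}

    // "eval" method: analyze a single object, produce a partial result.
    static Result eval(Cart c) {
        return new Result(c.id(), c.total()); // trivial scoring for the sketch
    }

    // "merge" method: combine two partial results into one (associative).
    static Result merge(Result a, Result b) {
        return a.score() >= b.score() ? a : b;
    }

    // The engine would run eval on each server's locally stored objects and
    // merge results in parallel; a stream reduction shows the same semantics.
    static Optional<Result> invoke(List<Cart> carts) {
        return carts.parallelStream()
                    .map(EvalMergeSketch::eval)
                    .reduce(EvalMergeSketch::merge);
    }

    public static void main(String[] args) {
        List<Cart> carts = List.of(new Cart("a", 40.0),
                                   new Cart("b", 125.0),
                                   new Cart("c", 80.0));
        System.out.println(invoke(carts).get().cartId()); // prints b
    }
}
```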
Using an object-oriented approach enables the developer to encapsulate the analysis logic within the class that defines both the properties of the object collection and the methods which implement a data-parallel computation on those objects. This cleanly separates application code from the IMDG’s APIs, and it makes it straightforward to build an in-memory representation of real-world entities. Together, PMI’s eval and merge methods represent an object-oriented formulation of well-understood techniques from parallel supercomputing.
The IMDG runs eval and merge methods in parallel on all servers.
Automatic Code Shipping and Parallel Execution
ScaleOut’s IMDG automatically ships application code to all grid servers and starts execution environments (e.g., JVMs or .NET runtimes) on these servers to host and execute that code. To minimize startup times and reduce execution latency, the developer can also persist an execution environment (called an “invocation grid”) across several parallel method invocations.
The IMDG automatically ships code to invocation grids.
ScaleOut’s in-memory compute engine employs several techniques to maximize performance and scalability. It avoids data motion by running eval method invocations only on objects locally hosted by each server, so adding servers increases throughput while keeping execution time fixed. The compute engine also uses all available processors and cores to run the merge method, and it combines results across servers in parallel using a built-in binary merge tree to minimize execution time.
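The binary merge tree mentioned above can be illustrated with a short sketch (names and structure are assumptions, not the engine's implementation): per-server partial results are combined pairwise in rounds, so S partial results require only about log2(S) merge rounds rather than S−1 sequential merges, and each round's merges can run in parallel.

```java
import java.util.ArrayList;
import java.util.List;

public class MergeTreeSketch {
    // Any associative combining operation works; summing per-server counts
    // is used here as a stand-in for an application's merge method.
    static int merge(int a, int b) {
        return a + b;
    }

    static int mergeTree(List<Integer> partials) {
        List<Integer> level = new ArrayList<>(partials);
        while (level.size() > 1) {                    // one round per tree level
            List<Integer> next = new ArrayList<>();
            for (int i = 0; i + 1 < level.size(); i += 2) {
                next.add(merge(level.get(i), level.get(i + 1))); // pairwise
            }
            if (level.size() % 2 == 1) {              // odd leftover carries up
                next.add(level.get(level.size() - 1));
            }
            level = next;
        }
        return level.get(0);
    }

    public static void main(String[] args) {
        // e.g., partial counts reported by five grid servers
        System.out.println(mergeTree(List.of(3, 1, 4, 1, 5))); // prints 14
    }
}
```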
In-Memory Compute Engine Enables Hadoop MapReduce
ScaleOut’s in-memory compute engine for parallel method invocation supports the execution of MapReduce applications using open source Java libraries shipped with ScaleOut hServer. The map and reduce phases of execution run as two parallel method invocations, and the IMDG stores intermediate results in memory. Applications can store input and output key-value pairs either within the IMDG or in HDFS.
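The two-phase execution described above can be sketched with a classic word count kept entirely in memory. This is a conceptual illustration in plain Java, not ScaleOut hServer's library: the map phase emits key-value pairs from each input independently, the grouping ("shuffle") happens in an in-memory collection, and the reduce phase sums the values per key without any disk I/O.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class InMemoryMapReduce {
    // Map phase: each input line independently emits (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .map(w -> Map.entry(w, 1))
                     .collect(Collectors.toList());
    }

    // Shuffle + reduce phase: group pairs by key in memory and sum the
    // values for each key, producing the final key-value output.
    static Map<String, Integer> run(List<String> input) {
        return input.parallelStream()
                    .flatMap(line -> map(line).stream())
                    .collect(Collectors.groupingBy(
                            Map.Entry::getKey,
                            Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                run(List.of("live data live systems", "live feedback"));
        System.out.println(counts.get("live")); // prints 3
    }
}
```

In the IMDG setting, the intermediate (word, 1) pairs would live in grid memory between the two parallel method invocations instead of being spilled to disk between the map and reduce phases.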
This standalone, in-memory execution environment for MapReduce eliminates the overhead of batch scheduling as well as the data motion required for disk access. It uses additional techniques to further shorten execution time, resulting in a 40X speedup over Apache MapReduce in benchmark applications. For the first time, standard MapReduce can be used for operational intelligence on live data within live systems.