Integrated Data-Parallel Computation
Why Data-Parallel Computation?
Operational intelligence requires that fast-changing data be analyzed quickly to provide immediate feedback to a live system. Beyond serving as a fast, scalable repository for that data, the server cluster that hosts an IMDG provides an ideal platform for performing data-parallel computation on it. The IMDG’s servers have CPU resources that can run analytics code directly on the in-memory data they store. This takes advantage of the IMDG’s data partitioning to shorten execution times and avoid the overhead of moving data across the network, reducing compute times to seconds (or less). It also leverages the IMDG’s scalability: adding servers enables growing workloads to be analyzed without lengthening compute times.
Another important advantage of data-parallel computation is its simplicity, which makes development easy and fast. To implement a data-parallel computation, a developer writes a single method that the IMDG executes in parallel on a large collection of objects held within the IMDG. Compared to task-parallel programs (for example, those built on Storm), which involve multiple tasks with explicit communication and coordination, data-parallel programs are significantly easier to develop.
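To make the contrast concrete, here is a minimal sketch of the data-parallel model in plain Java: the developer writes one analysis method, and the runtime applies it to every element of a collection in parallel. A parallel stream on a single JVM stands in for the IMDG's distributed execution; the class, method names, and the overheating threshold are illustrative assumptions, not part of any product API.

```java
import java.util.List;

public class DataParallelSketch {
    // The single "analysis" method the developer writes; it sees one object
    // at a time. The 90-degree threshold is an assumed, domain-specific rule.
    static boolean isOverheating(double temperatureC) {
        return temperatureC > 90.0;
    }

    // The runtime (here, the fork/join pool; in an IMDG, the grid servers)
    // applies the method to every element of the collection in parallel.
    static long countOverheating(List<Double> readings) {
        return readings.parallelStream()
                       .filter(DataParallelSketch::isOverheating)
                       .count();
    }

    public static void main(String[] args) {
        long hot = countOverheating(List.of(72.0, 95.5, 88.1, 101.2));
        System.out.println(hot); // prints 2
    }
}
```

Note that there is no explicit task creation, messaging, or coordination in the developer's code; partitioning the work is entirely the runtime's job.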
Object-Oriented Computing within an IMDG
Because IMDGs use an object-oriented view of data, ScaleOut’s compute engine follows an object-oriented model for structuring data-parallel computations. Called “parallel method invocation” (PMI), this feature lets developers specify an analysis (or “eval”) method that the compute engine executes in parallel on a collection of objects stored within the IMDG. This method analyzes an object and possibly updates it using a domain-specific algorithm (such as analyzing a machine for an impending failure or a shopping cart to make a recommendation).
The developer can also specify a second (“merge”) method for combining results created by the parallel computation so that they can be collected for delivery to the operational system. The compute engine executes the merge method in parallel on each grid server and then globally combines the results to produce a single object for return to the client application.
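The eval/merge pattern described above can be sketched as follows in plain Java. The class, record, and method names are illustrative assumptions, not ScaleOut's actual API: `eval` analyzes one object and emits a partial result, and `merge` combines two partial results (and so must be associative, since the engine is free to combine them in any order).

```java
import java.util.List;
import java.util.Optional;

public class EvalMergeSketch {
    // A shopping-cart-style example (assumed): eval scores one cart, and
    // merge keeps the higher-scoring recommendation candidate.
    record Cart(String id, double total) {}
    record Result(String cartId, double score) {}

    // "eval" method: analyze a single object, produce a partial result.
    static Result eval(Cart c) {
        return new Result(c.id(), c.total()); // trivial scoring for the sketch
    }

    // "merge" method: combine two partial results into one (associative).
    static Result merge(Result a, Result b) {
        return a.score() >= b.score() ? a : b;
    }

    // The engine would run eval on each server's locally stored objects and
    // merge results in parallel; a stream reduction shows the same semantics.
    static Optional<Result> invoke(List<Cart> carts) {
        return carts.parallelStream()
                    .map(EvalMergeSketch::eval)
                    .reduce(EvalMergeSketch::merge);
    }

    public static void main(String[] args) {
        List<Cart> carts = List.of(new Cart("a", 40.0),
                                   new Cart("b", 125.0),
                                   new Cart("c", 80.0));
        System.out.println(invoke(carts).get().cartId()); // prints b
    }
}
```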
Using an object-oriented approach enables the developer to encapsulate the analysis logic within the class that defines both the properties of the object collection and the methods which implement a data-parallel computation on those objects. This cleanly separates application code from the IMDG’s APIs, and it makes it straightforward to build an in-memory representation of real-world entities. Together, PMI’s eval and merge methods represent an object-oriented formulation of well-understood techniques from parallel supercomputing.
The IMDG runs eval and merge methods in parallel on all servers.
Automatic Code Shipping and Parallel Execution
ScaleOut’s IMDG automatically ships application code to all grid servers and starts execution environments (e.g., JVMs or .NET runtimes) on these servers to host and execute that code. To minimize startup times and reduce execution latency, the developer can also persist an execution environment (called an “invocation grid”) across several parallel method invocations.
The IMDG automatically ships code to invocation grids.
ScaleOut’s in-memory compute engine employs several techniques to maximize performance and scalability. It avoids data motion by running eval method invocations only on objects locally hosted by each server, so adding servers increases throughput while keeping execution time fixed. The compute engine also uses all available processors and cores to run the merge method, and it combines results across servers in parallel using a built-in binary merge tree to minimize execution time.
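The binary merge tree mentioned above can be illustrated with a short sketch (names and structure are assumptions, not the engine's implementation): per-server partial results are combined pairwise in rounds, so S partial results require only about log2(S) merge rounds rather than S−1 sequential merges, and each round's merges can run in parallel.

```java
import java.util.ArrayList;
import java.util.List;

public class MergeTreeSketch {
    // Any associative combining operation works; summing per-server counts
    // is used here as a stand-in for an application's merge method.
    static int merge(int a, int b) {
        return a + b;
    }

    static int mergeTree(List<Integer> partials) {
        List<Integer> level = new ArrayList<>(partials);
        while (level.size() > 1) {                    // one round per tree level
            List<Integer> next = new ArrayList<>();
            for (int i = 0; i + 1 < level.size(); i += 2) {
                next.add(merge(level.get(i), level.get(i + 1))); // pairwise
            }
            if (level.size() % 2 == 1) {              // odd leftover carries up
                next.add(level.get(level.size() - 1));
            }
            level = next;
        }
        return level.get(0);
    }

    public static void main(String[] args) {
        // e.g., partial counts reported by five grid servers
        System.out.println(mergeTree(List.of(3, 1, 4, 1, 5))); // prints 14
    }
}
```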
In-Memory Compute Engine Enables Hadoop MapReduce
ScaleOut’s in-memory compute engine for parallel method invocation supports the execution of MapReduce applications using open source Java libraries shipped with ScaleOut hServer. The map and reduce phases of execution run as two parallel method invocations, and the IMDG stores intermediate results in memory. Applications can store input and output key-value pairs either within the IMDG or in HDFS.
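The two-phase execution described above can be sketched with a classic word count kept entirely in memory. This is a conceptual illustration in plain Java, not ScaleOut hServer's library: the map phase emits key-value pairs from each input independently, the grouping ("shuffle") happens in an in-memory collection, and the reduce phase sums the values per key without any disk I/O.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class InMemoryMapReduce {
    // Map phase: each input line independently emits (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .map(w -> Map.entry(w, 1))
                     .collect(Collectors.toList());
    }

    // Shuffle + reduce phase: group pairs by key in memory and sum the
    // values for each key, producing the final key-value output.
    static Map<String, Integer> run(List<String> input) {
        return input.parallelStream()
                    .flatMap(line -> map(line).stream())
                    .collect(Collectors.groupingBy(
                            Map.Entry::getKey,
                            Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                run(List.of("live data live systems", "live feedback"));
        System.out.println(counts.get("live")); // prints 3
    }
}
```

In the IMDG setting, the intermediate (word, 1) pairs would live in grid memory between the two parallel method invocations instead of being spilled to disk between the map and reduce phases.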
This standalone, in-memory execution environment for MapReduce eliminates the overhead of batch scheduling as well as the data motion required for disk access. It uses additional techniques to further shorten execution time, resulting in a 40X speedup over Apache MapReduce in benchmark applications. For the first time, standard MapReduce can be used for operational intelligence on live data within live systems.