Apache Spark Performance Acceleration

The performance of Apache Spark® applications can be accelerated by keeping data in a shared Apache Ignite® in-memory cluster. Spark works with Ignite as a data source similar to how it uses Hadoop or a relational database. You can start an Ignite cluster, set it as a data source for Spark workers, and continue using Spark RDDs or DataFrames APIs. You can gain even more speed by running Ignite SQL or compute APIs directly on the Spark dataset. Ignite can also be used as a distributed in-memory layer by Spark workers that need to share both data and state.

Apache Spark Performance Acceleration

The performance increase is achievable for several reasons. First, Ignite is designed to store data sets in memory across a cluster of nodes reducing latency of Spark operations that usually need to pull date from disk-based systems. Second, Ignite tries to minimize data shuffling over the network between its store and Spark applications by running certain Spark tasks, produced by RDDs or DataFrames APIs, in-place on Ignite nodes. This optimization helps to reduce the effect of network latency on the performance of Spark calls. Finally, the network impact can be further reduced if the native Ignite APIs, such as SQL, are called from Spark applications directly. By doing so, you can eliminate data shuffling between Spark and Ignite as long as Ignite SQL queries are always executed on Ignite nodes returning a much smaller final result set to the application layer.

Ignite Shared RDDs

Apache Ignite provides an implementation of the Spark RDD, which allows any data and state to be shared in memory as RDDs across Spark jobs. The Ignite RDD provides a shared, mutable view of the data stored in Ignite caches across different Spark jobs, workers, or applications.

The Ignite RDD is implemented as a view over a distributed Ignite table (aka. cache). It can be deployed with an Ignite node either within the Spark job executing process, on a Spark worker, or in a separate Ignite cluster. This means that depending on the chosen deployment mode, the shared state may either exist only during the lifespan of a Spark application (embedded mode), or it may out-survive the Spark application (standalone mode).

Ignite DataFrames

The Apache Spark DataFrame API introduced the concept of a schema to describe the data, allowing Spark to manage the schema and organize the data into a tabular format. To put it simply, a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database and allows Spark to leverage the Catalyst query optimizer to produce much more efficient query execution plans in comparison to RDDs, which are collections of elements partitioned across the nodes of the cluster.

Ignite supports DataFrame APIs allowing Spark to write to and read from Ignite through that interface. Furthermore, Ignite analyses execution plans produced by Spark's Catalyst engine and can execute parts of the plan on Ignite nodes directly, which will reduce data shuffling and consequently make your SparkSQL perform better.