Apache Hadoop Performance Acceleration
Apache Ignite® enables real-time analytics across Apache™ Hadoop® operational and historical data silos. The Ignite in-memory computing platform provides low-latency and high-throughput operations while Hadoop continues to be used for long-running OLAP workloads.
As the architecture diagram suggests, you can accelerate Hadoop-based systems by deploying Ignite as a separate distributed store that holds the data sets required for your low-latency operations and real-time reports.
First, depending on the data volume and available memory capacity, you can enable Ignite native persistence to keep historical data sets on disk while dedicating memory to operational records. Hadoop can continue to serve as storage for less frequently accessed data and for long-running or ad-hoc analytical queries.
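As a starting point, the sketch below shows one way a node might be started with native persistence enabled for the default data region; the 8 GB memory cap is an arbitrary assumption and would be tuned to your cluster.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterState;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class PersistentNodeStartup {
    public static void main(String[] args) {
        // Enable Ignite native persistence for the default data region so that
        // data sets larger than RAM spill over to disk.
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();
        storageCfg.getDefaultDataRegionConfiguration()
            .setPersistenceEnabled(true)
            // Hypothetical sizing: cap the in-memory portion at 8 GB.
            .setMaxSize(8L * 1024 * 1024 * 1024);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setDataStorageConfiguration(storageCfg);

        Ignite ignite = Ignition.start(cfg);

        // A persistent cluster starts in the INACTIVE state and must be activated
        // before it can serve reads and writes.
        ignite.cluster().state(ClusterState.ACTIVE);
    }
}
```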
Next, your applications and services should use Ignite native APIs to process the data residing in the in-memory cluster. Ignite provides SQL, compute (MapReduce-style), and machine learning APIs for various data-processing needs.
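As an illustration, the sketch below runs a native SQL query and a broadcast compute task against an Ignite cluster; the Trade table, its columns, and the bootstrap cache name are assumptions made for the example.

```java
import java.util.List;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class NativeApiExample {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // Any cache instance can act as an entry point for SQL; "bootstrap" is a placeholder name.
        IgniteCache<?, ?> cache = ignite.getOrCreateCache("bootstrap");

        // Create and populate a hypothetical operational table.
        cache.query(new SqlFieldsQuery(
            "CREATE TABLE IF NOT EXISTS Trade (id LONG PRIMARY KEY, symbol VARCHAR, amount DOUBLE)")
            .setSchema("PUBLIC")).getAll();

        cache.query(new SqlFieldsQuery(
            "INSERT INTO Trade (id, symbol, amount) VALUES (?, ?, ?)")
            .setSchema("PUBLIC").setArgs(1L, "ACME", 100.5)).getAll();

        // Low-latency SQL over in-memory data using Ignite's native SQL API.
        List<List<?>> rows = cache.query(new SqlFieldsQuery(
            "SELECT symbol, SUM(amount) FROM Trade GROUP BY symbol")
            .setSchema("PUBLIC")).getAll();

        rows.forEach(System.out::println);

        // Broadcast a compute task (MapReduce-style) to every node in the cluster.
        ignite.compute().broadcast(() -> System.out.println("Processing local data..."));
    }
}
```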
Finally, consider using the Apache Spark DataFrames API if an application needs to run federated or cross-database queries across the Ignite and Hadoop clusters. Ignite is integrated with Spark, which natively supports Hive/Hadoop. Cross-database queries should be reserved for the limited set of scenarios in which neither Ignite nor Hadoop holds the entire data set.
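A federated query of this kind might look like the following sketch, which combines operational data read from Ignite through the Spark DataFrame integration with historical data kept in a Hive table; the table names and the Ignite configuration path are assumptions made for the example.

```java
import org.apache.ignite.spark.IgniteDataFrameSettings;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FederatedQueryExample {
    public static void main(String[] args) {
        // Hive support lets Spark read the historical data kept in Hadoop.
        SparkSession spark = SparkSession.builder()
            .appName("ignite-hadoop-federated-query")
            .enableHiveSupport()
            .getOrCreate();

        // Operational data served by Ignite; the table name and config path are placeholders.
        Dataset<Row> recentTrades = spark.read()
            .format(IgniteDataFrameSettings.FORMAT_IGNITE())
            .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), "/path/to/ignite-config.xml")
            .option(IgniteDataFrameSettings.OPTION_TABLE(), "Trade")
            .load();

        // Historical data stored in a hypothetical Hive table backed by Hadoop.
        Dataset<Row> historicalTrades = spark.table("warehouse.trades_archive");

        // Cross-database (federated) query: combine the operational and historical views
        // and aggregate across both.
        recentTrades.select("symbol", "amount")
            .union(historicalTrades.select("symbol", "amount"))
            .groupBy("symbol")
            .sum("amount")
            .show();
    }
}
```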
How should you split data and operations between Ignite and Hadoop?
Consider using this approach:
- Use Apache Ignite for tasks that require low-latency response times (microseconds to seconds), high-throughput operations (thousands to millions of operations per second), and real-time processing.
- Continue using Apache Hadoop for high-latency operations (tens of seconds, minutes, or hours) and batch processing.
Getting Started Checklist
Follow the steps below to implement the discussed architecture in practice:
- Download and install Apache Ignite in your environment.
- Select the operations and reports to be executed against Ignite. The best candidates are those that require low-latency response times, high throughput, or real-time analytics.
- Depending on the data volume and available memory space, consider using Ignite native persistence. Alternatively, you can use Ignite as a pure in-memory cache or in-memory data grid that persists changes to Hadoop or another external database (see the write-through sketch after this checklist).
- Update your applications so that they use Ignite native APIs to process data stored in Ignite and the Spark DataFrames API for federated queries.
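For the third checklist item, the sketch below shows one way Ignite might be configured as an in-memory data grid that writes changes through to an external store. The ExternalStore class, cache name, and key/value types are hypothetical placeholders rather than a ready-made Hadoop connector.

```java
import javax.cache.configuration.FactoryBuilder;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.store.CacheStoreAdapter;
import org.apache.ignite.configuration.CacheConfiguration;

public class WriteThroughExample {

    /** Hypothetical store that forwards reads and writes to an external system. */
    public static class ExternalStore extends CacheStoreAdapter<Long, String> {
        @Override public String load(Long key) {
            // Read-through: fetch the value from the external store on a cache miss.
            return null; // Placeholder: replace with a real lookup.
        }

        @Override public void write(javax.cache.Cache.Entry<? extends Long, ? extends String> entry) {
            // Write-through: propagate the update to the external store.
            System.out.println("Persisting " + entry.getKey() + " externally");
        }

        @Override public void delete(Object key) {
            // Write-through: remove the record from the external store.
        }
    }

    public static void main(String[] args) {
        CacheConfiguration<Long, String> cacheCfg = new CacheConfiguration<>("operationalCache");
        cacheCfg.setCacheStoreFactory(FactoryBuilder.factoryOf(ExternalStore.class));
        cacheCfg.setReadThrough(true);
        cacheCfg.setWriteThrough(true);

        Ignite ignite = Ignition.start();
        IgniteCache<Long, String> cache = ignite.getOrCreateCache(cacheCfg);

        // The update lands in memory and, via write-through, in the external store.
        cache.put(1L, "record");
    }
}
```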