Implementation

Giraph is an Apache open source project. A Giraph computation runs as a Hadoop job, so any existing Hadoop user can benefit from Giraph immediately. Workers use ZooKeeper to elect a master that coordinates the computation. The graph is loaded and partitioned across the workers, and the master then dictates when workers should start computing each consecutive superstep. Once the computation has halted, the workers save the output. Checkpoints are written at user-defined intervals and are used to automatically restart the application when any worker fails. Any worker can act as the master, and one automatically takes over if the current master fails.
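For concreteness, the following is a minimal sketch of what a vertex program looks like under this model: in each superstep a vertex consumes the messages sent to it during the previous superstep, updates its value, sends messages along its edges, and votes to halt. The class and method names follow the public Giraph API; the max-value propagation logic and the class name MaxValueComputation are purely illustrative.

```java
import java.io.IOException;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

// Illustrative sketch: propagate the maximum vertex value through the graph.
public class MaxValueComputation extends
    BasicComputation<LongWritable, DoubleWritable, NullWritable, DoubleWritable> {

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    double max = vertex.getValue().get();
    // Fold in the values received from neighbors during the previous superstep.
    for (DoubleWritable message : messages) {
      max = Math.max(max, message.get());
    }
    // Propagate only when the value changes (or on the first superstep).
    if (getSuperstep() == 0 || max > vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(max));
      sendMessageToAllEdges(vertex, new DoubleWritable(max));
    }
    // A halted vertex is reactivated only if it receives a message,
    // so the job ends once no new maxima are being propagated.
    vertex.voteToHalt();
  }
}
```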

Giraph offers several mechanisms that help implement graph algorithms at scale. Vertices and edges can be loaded from any input source (see the input/output section); we support several Hadoop input formats as well as Hive tables. Aggregators allow applications to compute a global value from contributing values provided by each vertex (see the aggregators section). By default, vertex and edge values and messages are stored in workers' memory, but you can choose to store them on disk instead, for example on a Hadoop cluster with limited memory but ample disk space.
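The sketch below illustrates the aggregator mechanism: a master compute class registers a sum aggregator, every vertex contributes its out-degree in one superstep, and the aggregated global total becomes visible to all vertices in the next superstep. The aggregator name "edge.count", the class names, and the surrounding computation are illustrative, not part of Giraph itself.

```java
import java.io.IOException;
import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

// Illustrative sketch: count the total number of edges with an aggregator.
public class EdgeCountComputation extends
    BasicComputation<LongWritable, DoubleWritable, NullWritable, DoubleWritable> {
  /** Aggregator name shared between master and workers (illustrative). */
  public static final String EDGE_COUNT_AGG = "edge.count";

  /** Registers the aggregator before the first superstep. */
  public static class EdgeCountMasterCompute extends DefaultMasterCompute {
    @Override
    public void initialize() throws InstantiationException, IllegalAccessException {
      registerAggregator(EDGE_COUNT_AGG, LongSumAggregator.class);
    }
  }

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() == 0) {
      // Each vertex contributes its out-degree; contributions are summed
      // across all workers by the LongSumAggregator.
      aggregate(EDGE_COUNT_AGG, new LongWritable(vertex.getNumEdges()));
    } else {
      // In the following superstep the global total is visible to every vertex.
      LongWritable totalEdges = getAggregatedValue(EDGE_COUNT_AGG);
      vertex.setValue(new DoubleWritable(totalEdges.get()));
      vertex.voteToHalt();
    }
  }
}
```

In practice the master compute class is wired into the job configuration (for example via GiraphConfiguration's setMasterComputeClass); a value aggregated during one superstep is available to all vertices from the next superstep onward.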