Giraph : Large-scale graph processing on Hadoop

Web and online social graphs have been growing rapidly in size and scale
during the past decade. In 2008, Google estimated that the number of web
pages had surpassed one trillion. Online social networking and email
sites, including Yahoo!, Google, Microsoft, Facebook, LinkedIn, and
Twitter, have hundreds of millions of users and are expected to grow much
more in the future. Processing these graphs plays a big role in
delivering relevant and personalized information to users, such as
results from a search engine or news in an online social networking site.

Graph processing platforms that run large-scale algorithms (such as page
rank, shared connections, personalization-based popularity, etc.) have
become quite popular; recent examples include Pregel and HaLoop. For
general-purpose big data computation, the map-reduce computing model has
been widely adopted, and the most widely deployed map-reduce
infrastructure is Apache Hadoop. We have implemented Giraph, a
graph-processing framework that is launched as a typical Hadoop job, so
that it can leverage existing Hadoop infrastructure such as Amazon's EC2.

Giraph builds upon the graph-oriented nature of Pregel, but additionally
adds fault tolerance to the coordinator process by using ZooKeeper as its
centralized coordination service. Giraph follows the bulk-synchronous
parallel (BSP) model applied to graphs: the computation proceeds as a
sequence of supersteps, and during a given superstep each vertex can send
messages to other vertices, which receive them at the start of the next
superstep. Checkpoints are initiated by the Giraph infrastructure at
user-defined intervals and are used to automatically restart the
application when any worker fails. Any worker in the application can act
as the application coordinator, and another will automatically take over
if the current coordinator fails.
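To make the programming model concrete, here is a minimal sketch of a
vertex computation that propagates the maximum value seen anywhere in the
graph. It is illustrative only: the base class and method names (Vertex,
compute, getSuperstep, getVertexValue, setVertexValue, sendMsgToAllEdges,
voteToHalt) are assumptions modeled on the Pregel-style API described
above, not a verbatim copy of the Giraph interfaces.

  import java.util.Iterator;

  import org.apache.hadoop.io.FloatWritable;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;

  // Type parameters: vertex id, vertex value, edge value, and message
  // value. All are Hadoop Writables so that they can be serialized
  // between supersteps. The Vertex base class is assumed to be provided
  // by the framework.
  public class MaxValueVertex extends
      Vertex<LongWritable, IntWritable, FloatWritable, IntWritable> {
    @Override
    public void compute(Iterator<IntWritable> messages) {
      // Fold the messages sent to this vertex during the previous
      // superstep into a running maximum, starting from its own value.
      int max = getVertexValue().get();
      while (messages.hasNext()) {
        max = Math.max(max, messages.next().get());
      }
      if (getSuperstep() == 0 || max > getVertexValue().get()) {
        // New maximum (or the seeding superstep): record it and tell
        // the neighbors, who will see the messages next superstep.
        setVertexValue(new IntWritable(max));
        sendMsgToAllEdges(new IntWritable(max));
      }
      // Go inactive; an incoming message reactivates the vertex. The
      // job ends once all vertices have halted and no messages remain.
      voteToHalt();
    }
  }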
-------------------------------

Hadoop versions for use with Giraph:

Secure Hadoop versions:
- Apache Hadoop 0.20.203, 0.20.204 (other secure versions may work as
  well)
-- Other versions reported working include:
--- Cloudera CDH3u0, CDH3u1

Unsecure Hadoop versions:
- Apache Hadoop 0.20.1, 0.20.2, 0.20.3

Facebook Hadoop release (https://github.com/facebook/hadoop-20-warehouse):
- GitHub master

While we provide support for the unsecure and Facebook versions of Hadoop
with the maven profiles 'hadoop_non_secure' and 'hadoop_facebook',
respectively, we have been primarily focusing on secure Hadoop releases
at this time.

-------------------------------

Building and testing:

You will need the following:
- Java 1.6
- Maven 3 or higher. Giraph uses the munge plugin
  (http://sonatype.github.com/munge-maven-plugin/), which requires
  Maven 3, to support multiple versions of Hadoop. Also, the web site
  plugin requires Maven 3.

Use the maven commands with secure Hadoop to:
- compile (i.e. mvn compile)
- package (i.e. mvn package)
- test (i.e. mvn test)

For the non-secure versions of Hadoop, run the maven commands with the
additional argument '-Dhadoop=non_secure' to enable the maven profile
'hadoop_non_secure'. An example compilation command is:

  mvn -Dhadoop=non_secure compile

For the Facebook Hadoop release, run the maven commands with the
additional argument '-Dhadoop=facebook' to enable the maven profile
'hadoop_facebook', as well as a location for the hadoop core jar file.
An example compilation command is:

  mvn -Dhadoop=facebook -Dhadoop.jar.path=/tmp/hadoop-0.20.1-core.jar compile

How to run the unittests on a local pseudo-distributed Hadoop instance:

As mentioned earlier, Giraph supports several versions of Hadoop. In this
section, we describe how to run the Giraph unittests against a
single-node instance of Apache Hadoop 0.20.203.

Download Apache Hadoop 0.20.203
(hadoop-0.20.203.0/hadoop-0.20.203.0rc1.tar.gz) from a mirror listed at
http://www.apache.org/dyn/closer.cgi/hadoop/common/ and unpack it into a
local directory.

Follow the guide at
http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed
to set up a pseudo-distributed single-node Hadoop cluster.

Giraph's code assumes that you can run at least 4 mappers at once;
unfortunately, the default configuration allows only 2. Therefore you
need to update conf/mapred-site.xml:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>

After preparing the local filesystem with:

  rm -rf /tmp/hadoop-
  /path/to/hadoop/bin/hadoop namenode -format

you can start the local hadoop instance:

  /path/to/hadoop/bin/start-all.sh

and finally run Giraph's unittests:

  mvn clean test -Dprop.mapred.job.tracker=localhost:9001

Now you can open a browser, point it to http://localhost:50030, and watch
the Giraph jobs from the unittests running on your local Hadoop instance!

Notes:

Counter limit: In Hadoop 0.20.203.0 onwards, there is a limit on the
number of counters one can use, which is set to 120 by default. This
limit restricts the number of iterations/supersteps possible in Giraph.
The limit can be increased by setting the parameter
"mapreduce.job.counters.limit" in the job tracker's config file,
mapred-site.xml, as shown below.
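For example, a stanza like the following raises the limit to 512; the
value here is just an illustrative choice, so pick one large enough for
the number of supersteps you expect to run. The job tracker typically
needs to be restarted for the new limit to take effect.

  <property>
    <name>mapreduce.job.counters.limit</name>
    <!-- 512 is an illustrative value, not a recommended setting -->
    <value>512</value>
  </property>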