Welcome To Apache Giraph

Web and online social graphs have grown rapidly in size and scale over the past decade. In 2008, Google estimated that the number of web pages had passed one trillion. Online social networking and email sites, including Yahoo!, Google, Microsoft, Facebook, LinkedIn, and Twitter, have hundreds of millions of users and are expected to keep growing. Processing these graphs plays a big role in delivering relevant and personalized information to users, such as search engine results or the news feed of an online social networking site.

Graph processing platforms that run large-scale algorithms (such as PageRank, shared connections, and personalization-based popularity) have become quite popular. Some recent examples include Pregel and HaLoop. For general-purpose big data computation, the MapReduce computing model has been widely adopted, and the most widely deployed MapReduce infrastructure is Apache Hadoop. We have implemented a graph-processing framework that is launched as a typical Hadoop job to leverage existing Hadoop infrastructure, such as Amazon's EC2. Giraph builds upon the graph-oriented nature of Pregel but adds fault tolerance to the coordinator process by using ZooKeeper as its centralized coordination service.

Giraph applies the bulk-synchronous parallel (BSP) model to graphs: during a given superstep, vertices can send messages to other vertices. The Giraph infrastructure initiates checkpoints at user-defined intervals and uses them to restart the application automatically when any worker fails. Any worker in the application can act as the application coordinator, and one will automatically take over if the current coordinator fails.
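
To make the superstep idea concrete, here is a minimal single-process sketch in Python. It is not Giraph's API (Giraph applications are written in Java against Giraph's own classes); the function name run_bsp_max and its parameters are hypothetical, and the algorithm shown, propagating the largest vertex value through the graph, is just a convenient toy. It follows the usual BSP convention that a message sent in one superstep is only visible to its recipient in the next one, and it assumes every vertex named in an edge also has an initial value.

```python
from collections import defaultdict


def run_bsp_max(edges, values, max_supersteps=50):
    """Propagate the maximum vertex value through a graph, BSP-style.

    edges:  dict mapping vertex id -> list of out-neighbor ids
    values: dict mapping vertex id -> initial numeric value
    Returns (final_values, supersteps_executed).
    Hypothetical helper for illustration only; not part of Giraph.
    """
    values = dict(values)          # work on a copy of the caller's data
    inbox = defaultdict(list)      # messages readable in the current superstep
    active = set(values)           # every vertex is active in superstep 0

    superstep = 0
    while superstep < max_supersteps and (active or inbox):
        outbox = defaultdict(list)  # messages readable only in the next superstep

        # A vertex computes if it is still active or has received messages.
        for vertex in set(active) | set(inbox):
            incoming = inbox.get(vertex, [])
            new_value = max([values[vertex]] + incoming)
            changed = new_value != values[vertex]
            values[vertex] = new_value

            # Announce the value once at the start, then only when it changes.
            if superstep == 0 or changed:
                for neighbor in edges.get(vertex, []):
                    outbox[neighbor].append(new_value)

        # Every vertex halts after computing; only a message reactivates it.
        active = set()
        # Synchronization barrier: sent messages become next superstep's inbox.
        inbox = outbox
        superstep += 1

    return values, superstep


if __name__ == "__main__":
    ring = {1: [2], 2: [3], 3: [1]}
    final_values, steps = run_bsp_max(ring, {1: 3, 2: 6, 3: 2})
    print(final_values, steps)
```

The loop ends when no vertex is active and no messages are in flight, which mirrors how a Pregel-style computation terminates. Checkpointing and coordinator failover are deliberately left out of this sketch.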

News

  • February 6, 2012: Giraph 0.1-incubating released. The Giraph PPMC is excited to announce that version 0.1 has been released. Grab a copy of the release here.

Presentations

We're working hard to build a community of users and developers around Giraph. As part of that outreach, we're giving presentations and talks to help bring people up to speed.
  • Avery Ching introduced Giraph at Hadoop Summit 2011. Watch the video.
  • An updated slide deck of the Hadoop Summit talk was presented at Hortonworks. Read the slides.
  • Claudio Martella gave a talk about Giraph at FOSDEM 2012. Watch part 1 and part 2. The presentation slides can be found here.
  • André Kelpe presented an overview of the Giraph project at the Belgian Big Data group. Slides are available here.
  • Jakob Homan talked about Giraph at Berlin Buzzwords 2012. Slides are available here.

Supported versions of Apache Hadoop

Hadoop versions for use with Giraph:

  • Secure Hadoop versions: Apache Hadoop 0.20.203 and 0.20.204; other secure versions may work as well.
  • Non-secure Hadoop versions: Apache Hadoop 0.20.1, 0.20.2, 0.20.3. While we support non-secure Hadoop through the Maven profile 'hadoop_non_secure', we are primarily focusing on secure Hadoop releases at this time.
  • Other distributions that include Apache Hadoop and are reported to work: Cloudera CDH3u0, CDH3u1.

Getting involved

Giraph is a new project and we're looking to quickly build a community of users and contributors. All types of help are appreciated: contributing patches, writing documentation, posing and answering questions on the mailing list, even graphic design. Here's how to get involved with Giraph (or any Apache project):

  • Subscribe to the mailing lists, particularly the user and dev lists, and follow their activity for a while to get a feel for the state of the project and what the community is working on.
  • Browse through Giraph's JIRA, our issue tracking system, to find issues you may be interested in working on. To help new contributors pitch in quickly, we maintain a set of JIRAs that focus on the mechanics of generating a patch (downloading the source, changing a couple of lines, creating a patch, verifying its correctness, uploading it to JIRA, and working with the community) rather than on deep technical issues within Giraph itself. These are good issues with which to join the community. See below for detailed instructions on creating patches.
  • Try out the examples and play with Giraph on your cluster. Be sure to ask questions on the mailing list or open new JIRAs if you run into issues with your particular configuration.

Releases

Official releases of Giraph may be downloaded from an Apache mirror. Soon we will also publish our release artifacts to Apache's Maven repositories to make it easier to include Giraph in your projects.

Building and testing

You will need the following:

  • Java 1.6
  • Maven 3 or higher. Giraph uses the munge plugin, which requires Maven 3, to support multiple versions of Hadoop. Also, the web site plugin requires Maven 3.

Use the following Maven commands with secure Hadoop:

  • compile (i.e. mvn compile)
  • package (i.e. mvn package)
  • test (i.e. mvn test). To run the tests against a running Hadoop instance, pass the job tracker address (e.g. mvn test -Dprop.mapred.job.tracker=localhost:50300).

For the non-secure versions of Hadoop, run the Maven commands with the additional argument -Dhadoop=non_secure to enable the Maven profile hadoop_non_secure. An example compilation command is mvn -Dhadoop=non_secure compile.
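
If you drive these builds from a script, the variants above can be assembled programmatically. The following Python sketch only builds the argument list; the helper name maven_command and its parameters are hypothetical, while the goals and the two -D properties are the ones given in this section. Maven accepts -D properties either before or after the goal, so the ordering here is a matter of taste.

```python
def maven_command(goal, secure_hadoop=True, job_tracker=None):
    """Build the argument list for one of the Maven invocations above.

    goal:          "compile", "package", or "test"
    secure_hadoop: False adds -Dhadoop=non_secure (the hadoop_non_secure profile)
    job_tracker:   optional "host:port" of a running Hadoop instance for tests
    Hypothetical helper for illustration; it does not run Maven itself.
    """
    command = ["mvn"]
    if not secure_hadoop:
        command.append("-Dhadoop=non_secure")
    if job_tracker is not None:
        command.append("-Dprop.mapred.job.tracker=" + job_tracker)
    command.append(goal)
    return command


if __name__ == "__main__":
    # Print the commands rather than executing them.
    print(" ".join(maven_command("compile", secure_hadoop=False)))
    print(" ".join(maven_command("test", job_tracker="localhost:50300")))
```

To actually run a build, the returned list can be handed to subprocess.run(command, check=True) from the Giraph source root.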

Notes

Counter limit: From Hadoop 0.20.203.0 onwards, there is a limit on the number of counters a job can use, set to 120 by default. This limit restricts the number of supersteps (iterations) possible in Giraph. It can be raised by setting the parameter mapreduce.job.counters.limit in the job tracker's configuration file, mapred-site.xml.
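
As a rough illustration, the Python sketch below renders the property entry in the standard Hadoop name/value form, ready to paste inside the configuration element of mapred-site.xml. The helper name counters_limit_property is hypothetical, and the value 1000 is an arbitrary example, not a recommendation; choose a limit that suits the number of supersteps your job needs.

```python
import xml.etree.ElementTree as ET


def counters_limit_property(limit):
    """Return a Hadoop <property> element raising the job counter limit.

    Hypothetical helper for illustration; the parameter name
    mapreduce.job.counters.limit is the one named in the note above.
    """
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "mapreduce.job.counters.limit"
    ET.SubElement(prop, "value").text = str(limit)
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    # 1000 is an arbitrary example value.
    print(counters_limit_property(1000))
```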

Generating patches

Follow these steps to generate a patch that can be attached to a JIRA issue for review.

  • Check out the Giraph source, either from the Subversion repository or from a Git mirror. Note that the Git mirrors may lag slightly behind the Subversion repository.
  • Make the changes necessary for your particular issue. Try to avoid unnecessary changes, such as extra whitespace or formatting changes. Include a unit test, or be ready to justify in the JIRA why one isn't necessary.
  • Verify the new and existing tests continue to pass via mvn test. Verify the change works as expected on a real cluster, if possible. If one's not available for testing, mention it on the JIRA so another contributor can verify.
  • Verify that RAT is ok with the changes that you've made via mvn rat:check. Also check that the patch follows Giraph's style guidelines (found in the source root in CODE_CONVENTIONS).
  • Generate a patch with either svn diff > GIRAPH-{ISSUE_NUMBER}.patch or git diff --no-prefix trunk > GIRAPH-{ISSUE_NUMBER}.patch (the --no-prefix option is necessary to make the patch compatible with Apache's Subversion repository). If subsequent patches are needed, number each version so reviewers can track their progress.
  • Attach the patch to the JIRA issue (click More Actions and then Attach File from the top menu) using the comment to briefly explain what changes it contains and what testing was done. Mark the JIRA as Patch Available to let reviewers know it's ripe for evaluation.
  • Optionally, you can open a Review Board request for the patch, although not all reviewers use this tool.

A committer should review the patch shortly and either provide feedback for a new version, or commit it to the Giraph source.