We try to flag projects that are good for people getting started with the code base. You can find the list of projects here.

Current Work

Below is a list of major projects we know people are currently pursuing. If you have thoughts on these or want to help, please let us know.

Replication

Replication is currently the major focus for a number of us. This will turn Kafka into a fully replicated message log.

What is replication? Messages are currently written to a single broker with no replication between brokers. We would like to provide replication between brokers and let the producer block until a configurable number of replicas have acknowledged the message, so that the client can control the fault-tolerance semantics.
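
The acknowledgement option described above might look roughly like the following sketch. It is only an illustration and not the planned design: Replica, replicate, and required_acks are hypothetical names, and the replicas are simulated in memory and written to one after another instead of being contacted over the network.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    """Hypothetical in-memory stand-in for a broker holding a copy of the log."""
    name: str
    alive: bool = True
    log: List[bytes] = field(default_factory=list)

    def append(self, message: bytes) -> bool:
        # A replica that is down never acknowledges the write.
        if not self.alive:
            return False
        self.log.append(message)
        return True


def replicate(message: bytes, replicas: List[Replica], required_acks: int) -> bool:
    """Write the message to every replica and report whether at least
    required_acks of them acknowledged it (assumed semantics)."""
    acks = sum(1 for replica in replicas if replica.append(message))
    return acks >= required_acks


replicas = [Replica("a"), Replica("b", alive=False), Replica("c")]
print(replicate(b"hello", replicas, required_acks=2))
print(replicate(b"hello", replicas, required_acks=3))
```

A lower required_acks trades durability for latency; a higher value means the producer only treats a message as written once more copies exist.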

You can see more details on this plan here.

Project Ideas

Improved Stream Processing Libraries

Kafka supports partitioning data by key and doing distributed stream consumption and publication. It would be nice to have a small library for common processing operations like joins, filtering, grouping, etc.
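
As a rough idea of what such a library could offer, here is a minimal sketch of filter, group, and join operations. The function names and the (key, value) message shape are assumptions made for illustration, and the operations work on finite in-memory batches, not on a live Kafka stream.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Message = Tuple[str, str]  # (key, value); hypothetical message shape


def filter_stream(messages: Iterable[Message],
                  predicate: Callable[[Message], bool]) -> Iterator[Message]:
    """Lazily yield only the messages that satisfy the predicate."""
    return (m for m in messages if predicate(m))


def group_by_key(messages: Iterable[Message]) -> Dict[str, List[str]]:
    """Collect values per key, preserving arrival order within each key."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key, value in messages:
        groups[key].append(value)
    return dict(groups)


def join_by_key(left: Iterable[Message],
                right: Iterable[Message]) -> Iterator[Tuple[str, str, str]]:
    """Inner-join two finite batches on key, yielding (key, left, right)."""
    right_groups = group_by_key(right)
    for key, left_value in left:
        for right_value in right_groups.get(key, []):
            yield key, left_value, right_value


clicks = [("u1", "home"), ("u2", "cart"), ("u1", "buy")]
logins = [("u1", "web"), ("u3", "app")]
print(group_by_key(clicks))
print(list(join_by_key(clicks, logins)))
```

Because data is partitioned by key, all messages for a given key reach the same consumer, so per-key grouping and joins of this kind can in principle run locally on each consumer.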

Below is a list of projects which would be great to have but haven't yet been started. Ping the mailing list if you are interested in working on any of these.

Clients In Other Languages

We offer a JVM-based client for production and consumption and also a rather primitive native Python client. It would be great to expand this list. The lower-level protocols are well documented here and should be relatively easy to implement in any language that supports standard socket I/O.

Convert Hadoop InputFormat and OutputFormat to Scala

We have a Hadoop InputFormat and OutputFormat that were contributed and are in use at LinkedIn. This code is in Java, though, which means it doesn't quite fit in with the rest of the project. It would be good to convert this code to Scala to keep things consistent.

Syslogd Producer

We currently have a custom producer and also a log4j appender for "logging"-type applications. Outside the Java world, however, the standard for logging is syslogd. It would be great to have an asynchronous producer that works with syslogd to support these kinds of applications.
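
The asynchronous part of such a producer could be sketched as below. This is an assumption-laden illustration: AsyncLineProducer and the send_batch callback are hypothetical, how lines arrive from syslogd is left out, and batches are flushed only when full or on close (there is no time-based flush).

```python
import queue
import threading
from typing import Callable, List


class AsyncLineProducer:
    """Hypothetical asynchronous producer: callers enqueue log lines and
    a background thread hands them to send_batch in batches."""

    _STOP = object()  # sentinel that tells the worker to flush and exit

    def __init__(self, send_batch: Callable[[List[str]], None],
                 batch_size: int = 100) -> None:
        self._send_batch = send_batch  # stand-in for a real Kafka send
        self._batch_size = batch_size
        self._queue: queue.Queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def send(self, line: str) -> None:
        # Returns immediately; the queue is unbounded, so put() does not block.
        self._queue.put(line)

    def close(self) -> None:
        # Ask the worker to flush what is left, then wait for it to exit.
        self._queue.put(self._STOP)
        self._worker.join()

    def _run(self) -> None:
        batch: List[str] = []
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self._send_batch(batch)
                batch = []
        if batch:
            self._send_batch(batch)


sent: List[List[str]] = []
producer = AsyncLineProducer(sent.append, batch_size=2)
for line in ["sshd: login", "cron: job started", "kernel: oom"]:
    producer.send(line)
producer.close()
print(sent)
```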

Hierarchical Topics

Currently streams are divided into only two levels: topics and partitions. This is unnecessarily limiting. We should add support for hierarchical topics and allow subscribing to an arbitrary subset of paths. For example, one could have /events/clicks and /events/logins, and subscribe to either of these alone or get the merged stream by subscribing to the parent directory /events.

In this model, partitions are naturally just subtopics (for example /events/clicks/0 might be one partition). This reduces the conceptual weight of the system and adds some power.
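
A minimal sketch of the path matching this model implies is shown below. The function names are hypothetical, and the rule that a subscription matches itself and everything beneath it is an assumption drawn from the /events example above.

```python
from typing import List


def path_parts(path: str) -> List[str]:
    """Split '/events/clicks' into ['events', 'clicks']."""
    return [part for part in path.split("/") if part]


def matches(subscription: str, topic: str) -> bool:
    """True if topic equals the subscription path or lies beneath it.

    Paths are compared component by component, so '/events' does not
    accidentally match '/eventsource'.
    """
    sub, top = path_parts(subscription), path_parts(topic)
    return top[:len(sub)] == sub


def subscribed_topics(subscription: str, topics: List[str]) -> List[str]:
    """Return the topics (or partitions) covered by a subscription."""
    return [t for t in topics if matches(subscription, t)]


topics = ["/events/clicks/0", "/events/clicks/1", "/events/logins/0"]
print(subscribed_topics("/events", topics))         # the merged stream
print(subscribed_topics("/events/clicks", topics))  # clicks only
```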

Pluggable Consumer Offset Storage Strategies

Currently consumer offsets are persisted in ZooKeeper, which works well for many use cases. There is no inherent reason the offsets need to be stored there, however. We should expose a pluggable interface to allow alternate storage mechanisms.
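
One possible shape for such an interface is sketched below. The OffsetStore name, its commit and fetch methods, and both example backends are assumptions made for illustration; none of them are part of Kafka.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class OffsetStore(ABC):
    """Hypothetical pluggable interface for persisting consumer offsets."""

    @abstractmethod
    def commit(self, group: str, topic: str, partition: int, offset: int) -> None:
        """Record the latest consumed offset."""

    @abstractmethod
    def fetch(self, group: str, topic: str, partition: int) -> Optional[int]:
        """Return the last committed offset, or None if there is none."""


class InMemoryOffsetStore(OffsetStore):
    """Keeps offsets in a dict; useful for tests, lost on restart."""

    def __init__(self) -> None:
        self._offsets: Dict[Tuple[str, str, int], int] = {}

    def commit(self, group: str, topic: str, partition: int, offset: int) -> None:
        self._offsets[(group, topic, partition)] = offset

    def fetch(self, group: str, topic: str, partition: int) -> Optional[int]:
        return self._offsets.get((group, topic, partition))


class JsonFileOffsetStore(OffsetStore):
    """Keeps offsets in a single JSON file on local disk."""

    def __init__(self, path: str) -> None:
        self._path = path

    def _load(self) -> Dict[str, int]:
        if not os.path.exists(self._path):
            return {}
        with open(self._path) as f:
            return json.load(f)

    def commit(self, group: str, topic: str, partition: int, offset: int) -> None:
        offsets = self._load()
        offsets[f"{group}/{topic}/{partition}"] = offset
        # Write to a temporary file and rename it into place, so a crash
        # cannot leave a half-written offsets file behind.
        directory = os.path.dirname(os.path.abspath(self._path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(offsets, f)
        os.replace(tmp_path, self._path)

    def fetch(self, group: str, topic: str, partition: int) -> Optional[int]:
        return self._load().get(f"{group}/{topic}/{partition}")
```

A consumer written against OffsetStore would not need to know which backend is in use, so a ZooKeeper-backed implementation could sit alongside alternatives like these.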

RESTful Proxy

It would be great to have a REST proxy for Kafka to help integration with languages that don't have first-class clients. It would also make it easier for web applications to produce data to Kafka.
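
To give a feel for how small the produce side of such a proxy could be, here is a sketch using only the Python standard library. Everything specific in it is an assumption: the /topics/<topic> URL layout, the response texts, and the produce callback, which stands in for a real Kafka producer.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Tuple

# Hypothetical hook into a real producer: produce(topic, message_bytes).
Produce = Callable[[str, bytes], None]


def route(path: str, body: bytes, produce: Produce) -> Tuple[int, str]:
    """Map a POST to /topics/<topic> onto a produce call.

    Returns (HTTP status code, response text).
    """
    parts = [p for p in path.split("/") if p]
    if len(parts) != 2 or parts[0] != "topics":
        return 404, "unknown path\n"
    if not body:
        return 400, "empty message\n"
    produce(parts[1], body)
    return 200, "ok\n"


def make_handler(produce: Produce):
    """Build a request handler class bound to the given produce callback."""

    class ProxyHandler(BaseHTTPRequestHandler):
        def do_POST(self) -> None:
            length = int(self.headers.get("Content-Length", 0))
            status, text = route(self.path, self.rfile.read(length), produce)
            payload = text.encode()
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return ProxyHandler


def serve(produce: Produce, port: int = 8080) -> None:
    """Run the proxy until interrupted (the port number is arbitrary)."""
    HTTPServer(("localhost", port), make_handler(produce)).serve_forever()
```

With serve running, any language that can issue an HTTP POST could publish a message, for example by posting the message body to /topics/clicks.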