.. Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements. See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License. You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
==========================================
Flume 1.3.0-SNAPSHOT Developer Guide
==========================================
Introduction
============
Overview
--------
Apache Flume is a distributed, reliable, and available system for
efficiently collecting, aggregating and moving large amounts of log
data from many different sources to a centralized data store.
Apache Flume is a top level project at the Apache Software Foundation.
There are currently two release code lines available, versions 0.9.x and 1.x.
This documentation applies to the 1.x codeline.
Please click here for
`the Flume 0.9.x Developer Guide `_.
Architecture
------------
Data flow model
~~~~~~~~~~~~~~~
A unit of data flow is called an event, which is a byte payload accompanied by
an optional set of string attributes. A Flume agent is a process (JVM) that
hosts the components through which events flow from an external source to the
next destination.
.. figure:: images/DevGuide_image00.png
   :align: center
   :alt: Agent component diagram
A source consumes events delivered to it by an external source, such as a web
server, in a specific format. For example, an Avro source can be used to
receive Avro events from clients or other agents in the flow. When a source
receives an event, it stores it into one or more channels. The channel is a
passive store that keeps the event until it is consumed by a sink. An example
of a channel is the JDBC channel, which uses a file-system backed embedded
database. The sink removes the event from the channel and puts it into an
external repository like HDFS, or forwards it to the source of the next hop in
the flow. The source and sink within the given agent run asynchronously with
the events staged in the channel.
Reliability
~~~~~~~~~~~
The events are staged in the channel on each agent. They are then delivered to
the next agent or terminal repository (like HDFS) in the flow. The events are
removed from the channel only after they are stored in the channel of the next
agent or in the terminal repository. This is how the single-hop message
delivery semantics in Flume provide end-to-end reliability of the flow.

Flume uses a transactional approach to guarantee the reliable delivery of the
events. The sources and sinks encapsulate the storage and retrieval of the
events in a transaction provided by the channel. This ensures that the set of
events is reliably passed from point to point in the flow. In the case of a
multi-hop flow, the sink of the previous hop and the source of the next hop
both have their transactions open to ensure that the data is safely stored in
the channel of the next hop.
Building Flume
--------------
Getting the source
~~~~~~~~~~~~~~~~~~
Check out the code using Subversion. Click here for
`the SVN repository root `_.
The Flume 1.x development happens under the branch "trunk", so this command
line can be used::

  svn checkout http://svn.apache.org/repos/asf/flume/trunk flume-trunk
Alternatively, if you prefer using Git, you may use::

  git clone git://git.apache.org/flume.git
  cd flume
  git checkout trunk
Compile/test Flume
~~~~~~~~~~~~~~~~~~
The Flume build is mavenized. You can compile Flume using the standard Maven
commands:
#. Compile only: ``mvn clean compile``
#. Compile and run unit tests: ``mvn clean test``
#. Run individual test(s): ``mvn clean test -Dtest=<Test1>,<Test2>,... -DfailIfNoTests=false``
#. Create tarball package: ``mvn clean install``
#. Create tarball package (skip unit tests): ``mvn clean install -DskipTests``
Developing custom components
----------------------------
Client
~~~~~~
The client operates at the point of origin of events and delivers them to a
Flume agent. Clients typically operate in the process space of the application
they are consuming data from. Currently Flume supports Avro, log4j and syslog
as ways to transfer data from a remote source. Additionally, there’s an Exec
source that can consume the output of a local process as input to Flume.

It’s quite possible to have a use case where these existing options are not
sufficient. In that case you can build a custom mechanism to send data to
Flume. There are two ways of achieving this. The first is to create a custom
client that communicates with one of Flume’s existing sources like Avro or
syslog. Here the client should convert its data into messages understood by
these Flume sources. The other option is to write a custom Flume source that
directly talks to your existing client application using some IPC or RPC
protocol, and then converts the data into Flume events to send them upstream.
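For example, a minimal sketch of the first option might send an RFC 3164 style
syslog line over UDP to an agent running a syslog UDP source; the host and port
used here are hypothetical and must match the agent's source configuration:

.. code-block:: java

  import java.net.DatagramPacket;
  import java.net.DatagramSocket;
  import java.net.InetAddress;

  public class SyslogLineSender {
    public static void main(String[] args) throws Exception {
      String host = "flume-agent.example.com";  // assumed agent host
      int port = 5140;                          // assumed syslog UDP source port

      // <13> = facility "user", severity "notice"; the rest is the message
      String line = "<13>Oct 11 22:14:15 myhost myapp: something happened";
      byte[] payload = line.getBytes("UTF-8");

      DatagramSocket socket = new DatagramSocket();
      try {
        socket.send(new DatagramPacket(payload, payload.length,
            InetAddress.getByName(host), port));
      } finally {
        socket.close();
      }
    }
  }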
Client SDK
''''''''''
Though Flume contains a number of built-in mechanisms to ingest data, often one
wants the ability to communicate with Flume directly from a custom application.
The Client SDK is a library that enables applications to connect to Flume and
send data into Flume’s data flow over RPC.
RPC Client interface
''''''''''''''''''''
The ``RpcClient`` interface wraps the user data and attributes into an
``Event``, which is Flume’s unit of flow, and encapsulates the RPC mechanism
supported by Flume. The application can simply call ``append()`` or
``appendBatch()`` to send data and not worry about the underlying message
exchanges.
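For instance, a minimal sketch of batching (assuming an ``RpcClient`` that has
already been created via ``RpcClientFactory``, and UTF-8 string payloads) could
look like this:

.. code-block:: java

  import java.nio.charset.Charset;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.flume.Event;
  import org.apache.flume.EventDeliveryException;
  import org.apache.flume.api.RpcClient;
  import org.apache.flume.event.EventBuilder;

  // send several payloads in a single appendBatch() call
  public void sendBatchToFlume(RpcClient rpcClient, List<String> lines)
      throws EventDeliveryException {
    List<Event> batch = new ArrayList<Event>();
    for (String line : lines) {
      batch.add(EventBuilder.withBody(line, Charset.forName("UTF-8")));
    }
    // one RPC exchange delivers the whole batch
    rpcClient.appendBatch(batch);
  }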
Avro RPC Client
'''''''''''''''
As of Flume 1.1.0, Avro is the only supported RPC protocol. The
``NettyAvroRpcClient`` implements the ``RpcClient`` interface. The client needs
to create this object with the host and port of the Flume agent and can then
use it to send data into Flume. The following example shows how to use the
Client SDK API:
.. code-block:: java

  import java.nio.charset.Charset;

  import org.apache.flume.Event;
  import org.apache.flume.EventDeliveryException;
  import org.apache.flume.FlumeException;
  import org.apache.flume.api.RpcClient;
  import org.apache.flume.api.RpcClientFactory;
  import org.apache.flume.event.EventBuilder;

  // the client is shared between methods, so keep it as a field
  private RpcClient rpcClient;

  public void myInit () {
    // setup the RPC connection to the Flume agent at hostname/port
    rpcClient = RpcClientFactory.getDefaultInstance(hostname, port);
    ...
  }

  public void sendDataToFlume(String data) {
    // create a Flume event object that carries the payload
    Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
    try {
      rpcClient.append(event);
    } catch (EventDeliveryException e) {
      // clean up and recreate the rpcClient
      rpcClient.close();
      rpcClient = null;
      rpcClient = RpcClientFactory.getDefaultInstance(hostname, port);
    }
    ...
  }

  public void cleanUp () {
    // close the RPC connection
    rpcClient.close();
    ...
  }
Failover handler
''''''''''''''''
This class wraps the Avro RPC client to provide failover handling capability to
clients. It takes a list of host/port pairs of Flume agents. If there’s an
error communicating with the current agent, it automatically falls back to the
next agent in the list:
.. code-block:: java

  // setup properties for the failover client
  Properties props = new Properties();
  props.put("client.type", "default_failover");

  // list of host aliases
  props.put("hosts", "host1 host2 host3");

  // address/port pair for each host alias
  props.put("hosts.host1", host1 + ":" + port1);
  props.put("hosts.host2", host2 + ":" + port2);
  props.put("hosts.host3", host3 + ":" + port3);

  // create the client with failover properties
  FailoverRpcClient client =
      (FailoverRpcClient) RpcClientFactory.getInstance(props);
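Once created, the failover client is used the same way as the default client; a
brief sketch, reusing the ``client`` object built above:

.. code-block:: java

  // the failover client exposes the same append()/appendBatch() calls
  Event event = EventBuilder.withBody("some payload", Charset.forName("UTF-8"));
  try {
    client.append(event);
  } catch (EventDeliveryException e) {
    // delivery failed even after failing over across the configured hosts
  }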
Transaction interface
~~~~~~~~~~~~~~~~~~~~~
The ``Transaction`` interface is the basis of reliability for Flume. All the
major components (i.e. sources, sinks and channels) need to interface with a
Flume transaction.
.. figure:: images/DevGuide_image01.png
   :align: center
   :alt: Transaction sequence diagram
The transaction interface is implemented by each channel implementation. The
source and sink connected to a channel obtain a transaction object from it. The
sources actually use a channel selector interface that encapsulates the
transaction (discussed in later sections). The operations to stage or extract
an event are done inside an active transaction. For example:
.. code-block:: java

  Channel ch = ...
  Transaction tx = ch.getTransaction();
  try {
    tx.begin();
    ...
    // ch.put(event) or ch.take()
    ...
    tx.commit();
  } catch (ChannelException ex) {
    tx.rollback();
    ...
  } finally {
    tx.close();
  }
Here we get hold of a transaction from a channel. After the ``begin()`` method
is executed, the event is put in the channel and the transaction is committed.
Sink
~~~~
The purpose of a sink is to extract events from the channel and forward them to
the next agent in the flow, or to store them in an external repository. A sink
is linked to a channel instance as per the flow configuration. A sink runner
thread gets created for every configured sink and manages the sink’s
lifecycle. The sink needs to implement the ``start()`` and ``stop()`` methods
that are part of the ``LifecycleAware`` interface. The ``start()`` method
should initialize the sink and bring it to a state where it can forward the
events to its next destination. The ``process()`` method from the ``Sink``
interface should do the core processing of extracting an event from the channel
and forwarding it. The ``stop()`` method should do the necessary cleanup. The
sink also needs to implement the ``Configurable`` interface for processing its
own configuration settings:
.. code-block:: java

  // foo sink
  public class FooSink extends AbstractSink implements Configurable {

    @Override
    public void configure(Context context) {
      some_Param = context.get("some_param", String.class);
      // process some_param ...
    }

    @Override
    public void start() {
      // initialize the connection to the foo repository ...
    }

    @Override
    public void stop () {
      // cleanup and disconnect from the foo repository ...
    }

    @Override
    public Status process() throws EventDeliveryException {
      Status status = null;

      // start the transaction
      Channel ch = getChannel();
      Transaction tx = ch.getTransaction();
      tx.begin();
      try {
        Event e = ch.take();

        // send the event to foo
        // foo.some_operation(e);

        tx.commit();
        status = Status.READY;
      } catch (ChannelException e) {
        tx.rollback();
        status = Status.BACKOFF;
      } finally {
        tx.close();
      }
      return status;
    }
  }
Source
~~~~~~
The purpose of a Source is to receive data from an external client and store it
in the channel. As mentioned above, for sources the ``Transaction`` interface
is encapsulated by the ``ChannelSelector``. Similar to ``SinkRunner``, there’s
a ``SourceRunner`` thread that gets created for every configured source and
manages the source’s lifecycle. The source needs to implement the ``start()``
and ``stop()`` methods that are part of the ``LifecycleAware`` interface. There
are two types of sources, pollable and event-driven. The runner of a pollable
source repeatedly invokes the source’s ``process()`` method, which should check
for new data and store it in the channel. An event-driven source instead needs
to have its own callback mechanism that captures the new data:
.. code-block:: java

  // bar source
  public class BarSource extends AbstractSource implements Configurable, PollableSource {

    @Override
    public void configure(Context context) {
      some_Param = context.get("some_param", String.class);
      // process some_param ...
    }

    @Override
    public void start() {
      // initialize the connection to the bar client ...
    }

    @Override
    public void stop () {
      // cleanup and disconnect from the bar client ...
    }

    @Override
    public Status process() throws EventDeliveryException {
      try {
        // receive new data
        Event e = get_some_data();

        // store the event to the underlying channel(s)
        getChannelProcessor().processEvent(e);
      } catch (ChannelException ex) {
        return Status.BACKOFF;
      }
      return Status.READY;
    }
  }
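For the event-driven case there is no ``process()`` method; the source
registers its own callback and pushes events into the channel processor as data
arrives. A rough sketch (the ``MessageListener`` callback and the client
library it would be handed to are hypothetical) might look like this:

.. code-block:: java

  // sketch of an event-driven source: data is pushed in via a callback
  public class CallbackSource extends AbstractSource
      implements Configurable, EventDrivenSource {

    // hypothetical callback handed to whatever client library pushes the data
    public interface MessageListener {
      void onMessage(byte[] data);
    }

    private final MessageListener listener = new MessageListener() {
      @Override
      public void onMessage(byte[] data) {
        // store the event into the configured channel(s)
        getChannelProcessor().processEvent(EventBuilder.withBody(data));
      }
    };

    @Override
    public void configure(Context context) {
      // read the client endpoint settings from the context ...
    }

    @Override
    public void start() {
      // hand 'listener' to the external client library here ...
    }

    @Override
    public void stop() {
      // unregister 'listener' and disconnect from the client ...
    }
  }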
Channel
~~~~~~~
TBD