////////////////////
Licensed to Cloudera, Inc. under one or more contributor license agreements.
See the NOTICE file distributed with this work for additional information
regarding copyright ownership. Cloudera, Inc. licenses this file to you under
the Apache License, Version 2.0 (the "License"); you may not use this file
except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.
////////////////////

[[Quickstart]]
== Flume Single Node Quick Start

In this section, you will learn how to get a single Flume node running and
transmitting data. You will also learn about some data *sources*, and how to
configure Flume flows on a per-node basis. Each logical node consists of an
event-producing *source* and an event-consuming *sink*. Nodes pull data in
through their source, and push data out through their sink.

NOTE: This section assumes that the Flume node and Flume Master are running in
the foreground and not as daemons. You can stop the daemons by using
'/etc/init.d/flume-master stop' and '/etc/init.d/flume-node stop'.

=== Sources and the `dump` command

Start by getting a Flume node running that echoes data written to standard
input from the console back out to the console on stdout. You do this by
using the +dump+ command.

----
$ flume dump console
----

TIP: The Flume program has the general form `flume <command> [args ...]`. If
you installed from the tarball package, the command can be found in
+$FLUME_HOME/bin/+. If you installed from either RPM or DEB, then +flume+
should already be in your path.

TIP: The example above uses the `dump` command with `console` as its
argument. The command's syntax is `flume dump <source> [<output-format>]`. It
prints data from +<source>+ to the console. Optionally, an output format can
be specified; otherwise, the default text format is used.

NOTE: Some Flume configurations write to local disk by default. Initially the
default is '/tmp/flume'. This is fine for initial testing, but for production
environments the +flume.agent.logdir+ property should be set to a more
durable location.

NOTE: If the node refuses to run and exits with the message +agent.FlumeNode:
Aborting: Unexpected problem with environment. Failure to write in log
directory: '/tmp/flume'. Check permissions?+, then check the +/tmp/flume+
directory to make sure you have write permissions to it (change the owner or
have the user join the group). This is, by default, where various logging
information is kept. A quick permission check is sketched at the end of this
section.

You have started a Flume node where `console` is the source of incoming data.
When you run it, you should see some logging messages displayed to the
console. For now, you can ignore messages about Masters, back-off, and failed
connections (these are explained in later sections). When you type at the
console and press Enter, you should see a new log entry line appear showing
the data that you typed. If you entered `This is a test`, it should look
similar to this:

----
hostname [INFO Thu Nov 19 08:37:13 PST 2009] This is a test
----

To exit the program, press ^C.

NOTE: Some sources do not automatically exit and require a manual ^C to exit.

// TODO there are actually too many irrelevant events right now.
// Need to turn off heartbeat when using one shot option.
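
If you do hit the "Failure to write in log directory" error described above,
a quick shell check can confirm whether permissions are the problem. This is
a minimal sketch, assuming the default '/tmp/flume' log directory:

----
$ ls -ld /tmp/flume                 # inspect the directory's owner and mode
$ touch /tmp/flume/write-test       # fails if you cannot write there
$ sudo chown $(whoami) /tmp/flume   # one possible fix: take ownership
----
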
==== Reading from a text file, `text`

You can also specify other sources of events. For example, if you want to
use a text file as a source, where each line is a new event, run the
following command:

----
$ flume dump 'text("/etc/services")'
----

This command reads the file, and then outputs each line as a new event.

NOTE: The default console output escapes special characters with Java-style
escape sequences. Characters such as '"' and '\' are prefaced with an extra
'\'.

NOTE: You can also try this command with other files such as
`/var/log/messages`, `/var/log/syslog`, or `/var/log/hadoop/hadoop.log`.
However, Flume must run with appropriate permissions to read the files.

==== Tailing a file name, `tail` and `multitail`

If you want to tail a file instead of just reading it once, specify another
source by using `tail` instead of `text`:

----
$ flume dump 'tail("testfile")'
----

This command pipes data from the file into Flume and then out to the
console. The following message appears: "File 'testfile' does not currently
exist, waiting for file to appear". In another terminal, you can create and
write data to the file:

----
$ echo Hello world! >> testfile
----

New data should appear. When you delete the file:

----
$ rm testfile
----

the `tail` source detects this. If you then recreate the file, the `tail`
source detects the new file and follows it:

----
$ echo Hello world again! >> testfile
----

You should see your new message appear in the Flume node console.

You can also use the `multitail` source to follow multiple files by file
name:

----
$ flume dump 'multitail("test1", "test2")'
----

And send it data coming from the two different files:

----
$ echo Hello world test1! >> test1
$ echo Hello world test2! >> test2
----

By default, the `tail` source assumes `\n` as a delimiter and excludes the
delimiter from events. Optional delimiter arguments allow you to specify an
arbitrary regular expression as the delimiter, and to specify whether the
delimiter should be part of the previous event (`prev`), part of the next
event (`next`), or excluded entirely (`exclude`). Here are some examples and
scenarios to illustrate.

The following example tails a file where two or more consecutive newlines
are considered a delimiter. The newlines are excluded from the events:

----
tail("file", delim="\n\n+", delimMode="exclude")
----

This example tails a file, uses a closing XML tag (here, `</record>`) as a
delimiter, and appends the delimiter to the previous event. This could serve
as a quick-and-dirty XML record splitter:

----
tail("file", delim="</record>", delimMode="prev")
----

Finally, this example tails a file, uses the regex `\n\d\d\d\d` as a
delimiter, and appends the delimiter to the next event. This could be used
to gather entries from a stack dump in a log file in which each new entry
starts with four digits (like a year from a date stamp). A worked sketch of
this mode appears after the `synth` example below:

----
tail("file", delim="\\n\\d\\d\\d\\d", delimMode="next")
----

==== Synthetic sources, `synth`

Here's one more example, where you use a `synth` source to generate events:

----
$ flume dump 'asciisynth(20,30)'
----

You should get 20 events, each with 30 random ASCII bytes.
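
Returning to the `tail` delimiter modes: to watch `next` mode split a
date-stamped log into whole entries, you can generate a small test file and
tail it. This is only a sketch; 'app.log' and its contents are made up for
illustration:

----
$ printf '2009 ERROR request failed\n\tat Foo.bar(Foo.java:12)\n2009 INFO recovered\n' > app.log
$ flume dump 'tail("app.log", delim="\\n\\d\\d\\d\\d", delimMode="next")'
----

Because the file is only split where a newline is followed by four digits,
the indented stack-trace line stays attached to the `2009 ERROR` event
instead of becoming an event of its own.
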
==== Syslog as a source, `syslogUdp` and `syslogTcp`

As with files, you can also accept data from well-known wire formats such as
syslog. For example, you can start a traditional syslog-like UDP server
listening on port 5140 (the normal syslog UDP port is the privileged port
514) by running this command:

----
$ flume dump 'syslogUdp(5140)'
----

You can feed the source data by using netcat to send syslog-formatted data,
as shown in the example below:

----
$ echo "<37>hello via syslog" | nc -u localhost 5140
----

TIP: You may need to press ^C to exit this command.

NOTE: The extra +<37>+ is a syslog wire-format encoding of a message
category and priority level (the value is facility * 8 + severity, so
+<37>+ is facility 4 with severity 5).

Similarly, you can set up a syslog-ng compatible source that listens on TCP
port 5140 (the normal syslog-ng TCP port is the privileged port 514):

----
$ flume dump 'syslogTcp(5140)'
----

And send it data (netcat uses TCP by default):

----
$ echo "<37>hello via syslog" | nc localhost 5140
----

TIP: You may need to press ^C to exit this command.

Syslog backwards compatibility allows data normally created by syslog,
rsyslog, or syslog-ng to be sent to and processed by Flume. (A `logger`-based
alternative to netcat is sketched at the end of this section.)

=== Anatomy of an Event

The preceding sections described a number of data sources that Flume can
interoperate with. Before going any further, it will be helpful for you to
understand what Flume is actually sending and processing internally.

Flume internally converts every external source of data into a stream of
*events*. Events are Flume's unit of data, and are a simple and flexible
representation. An event is composed of a *body* and *metadata*. The event
body is a string of bytes representing the content of an event. For example,
a line in a log file is represented as an event whose body is the actual
byte representation of that line. The event metadata is a table of key/value
pairs that capture some detail about the event, such as the time it was
created or the name of the machine on which it originated. The table can be
appended to as an event travels along a Flume flow, and it can be read to
control the operation of individual components of that flow. For example,
the machine name attached to an event can be used to control the output path
where the event is written at the end of the flow.

An event's body can be up to 32KB long. Although this limit can be changed
via a system property, leaving it unchanged is recommended in order to
preserve performance.

=== Section Summary

In this section, you learned how to use Flume's +dump+ command to print data
from a variety of different input sources to the console. You also learned
about the *event*, the fundamental unit of data transfer in Flume. The
following table summarizes the sources described in this section.

.Flume Event Sources
+console+ :: Stdin console.
+text("filename")+ :: One-shot text file source. One line is one event.
+tail("filename")+ :: Similar to Unix's +tail -F+. One line is one event.
Stays open for more data and follows the file name if the file is rotated.
+multitail("file1"[, "file2"[, ...]])+ :: Similar to the +tail+ source, but
follows multiple files.
+asciisynth(msg_count,msg_size)+ :: A source that synthetically generates
+msg_count+ random messages of size +msg_size+, converting all characters
into printable ASCII characters.
+syslogUdp(port)+ :: Syslog over UDP port +port+. This is syslog compatible.
+syslogTcp(port)+ :: Syslog over TCP port +port+. This is syslog-ng
compatible.
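
Finally, as an alternative to netcat for feeding the syslog sources
summarized above, the util-linux `logger` utility builds the syslog framing
(such as the +<37>+ priority prefix) for you. This is a minimal sketch; it
assumes a `logger` build that supports the remote-server flags, which varies
by platform:

----
$ logger -n localhost -P 5140 -d "hello via logger (UDP)"   # -d sends UDP datagrams
$ logger -n localhost -P 5140 -T "hello via logger (TCP)"   # -T opens a TCP connection
----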