////////////////////
Licensed to Cloudera, Inc. under one or more contributor license agreements.
See the NOTICE file distributed with this work for additional information
regarding copyright ownership. Cloudera, Inc. licenses this file to you under
the Apache License, Version 2.0 (the "License"); you may not use this file
except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.
////////////////////

[[Quickstart]]
== Flume Single Node Quick Start

In this section, you will learn how to get a single Flume node running and
transmitting data. You will also learn about some data *sources*, and how to
configure Flume flows on a per-node basis. Each logical node consists of an
event-producing *source* and an event-consuming *sink*. Nodes pull data in
through their source, and push data out through their sink.

NOTE: This section assumes that the Flume node and Flume Master are running in
the foreground and not as daemons. You can stop the daemons by using
'/etc/init.d/flume-master stop' and '/etc/init.d/flume-node stop'.

=== Sources and the `dump` command

Start by getting a Flume node running that echoes data written to standard
input from the console back out to the console on stdout. You do this by
using the +dump+ command.

----
$ flume dump console
----

TIP: The Flume program has the general form `flume <command> [args ...]`. If
you installed from the tarball package, the command can be found in
+$FLUME_HOME/bin/+. If you installed from either RPM or DEB, then +flume+
should already be in your path.

TIP: The example above uses the `dump` command with `console` as its
argument. The command's syntax is `flume dump <source> [<output-format>]`. It
prints data from +<source>+ to the console. Optionally, an output format can
be specified; otherwise, the default text format is used.

NOTE: Some Flume configurations write to local disk by default. Initially the
default is '/tmp/flume'. This is fine for initial testing, but for production
environments the +flume.agent.logdir+ property should be set to a more
durable location.

NOTE: If the node refuses to run and exits with the message +agent.FlumeNode:
Aborting: Unexpected problem with environment. Failure to write in log
directory: '/tmp/flume'. Check permissions?+, then check the +/tmp/flume+
directory to make sure you have write permissions to it (change the owner or
have the user join the group). This is, by default, where various logging
information is kept. A quick permission check is sketched at the end of this
section.

You have started a Flume node where `console` is the source of incoming data.
When you run it, you should see some logging messages displayed to the
console. For now, you can ignore messages about Masters, back-off, and failed
connections (these are explained in later sections). When you type at the
console and press Enter, you should see a new log entry line appear showing
the data that you typed. If you entered `This is a test`, it should look
similar to this:

----
hostname [INFO Thu Nov 19 08:37:13 PST 2009] This is a test
----

To exit the program, press ^C.

NOTE: Some sources do not automatically exit and require a manual ^C to exit.

// TODO there are actually too many irrelevant events right now.
// Need to turn off heartbeat when using one shot option.
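
If you do hit the "Failure to write in log directory" error described above,
a quick shell check can confirm whether permissions are the problem. This is
a minimal sketch, assuming the default '/tmp/flume' log directory:

----
$ ls -ld /tmp/flume                 # inspect the directory's owner and mode
$ touch /tmp/flume/write-test       # fails if you cannot write there
$ sudo chown $(whoami) /tmp/flume   # one possible fix: take ownership
----
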
==== Reading from a text file, `text`

You can also specify other sources of events. For example, if you want to
use a text file as a source, where each line is a new event, run the
following command:

----
$ flume dump 'text("/etc/services")'
----

This command reads the file, and then outputs each line as a new event.

NOTE: The default console output escapes special characters with Java-style
escape sequences. Characters such as '"' and '\' are prefaced with an extra
'\'.

NOTE: You can also try this command with other files such as
`/var/log/messages`, `/var/log/syslog`, or `/var/log/hadoop/hadoop.log`.
However, Flume must run with appropriate permissions to read the files.

==== Tailing a file name, `tail` and `multitail`

If you want to tail a file instead of just reading it once, specify another
source by using `tail` instead of `text`:

----
$ flume dump 'tail("testfile")'
----

This command pipes data from the file into Flume and then out to the
console. The following message appears: "File 'testfile' does not currently
exist, waiting for file to appear". In another terminal, you can create and
write data to the file:

----
$ echo Hello world! >> testfile
----

New data should appear. When you delete the file:

----
$ rm testfile
----

the `tail` source detects this. If you then recreate the file, the `tail`
source detects the new file and follows it:

----
$ echo Hello world again! >> testfile
----

You should see your new message appear in the Flume node console.

You can also use the `multitail` source to follow multiple files by file
name:

----
$ flume dump 'multitail("test1", "test2")'
----

And send it data coming from the two different files:

----
$ echo Hello world test1! >> test1
$ echo Hello world test2! >> test2
----

By default, the `tail` source assumes `\n` as a delimiter and excludes the
delimiter from events. Optional delimiter arguments allow you to specify an
arbitrary regular expression as the delimiter, and to specify whether the
delimiter should be part of the previous event (`prev`), part of the next
event (`next`), or excluded entirely (`exclude`). Here are some examples and
scenarios to illustrate.

The following example tails a file where two or more consecutive newlines
are considered a delimiter. The newlines are excluded from the events:

----
tail("file", delim="\n\n+", delimMode="exclude")
----

This example tails a file, uses a closing XML tag (here, `</record>`) as a
delimiter, and appends the delimiter to the previous event. This could serve
as a quick-and-dirty XML record splitter:

----
tail("file", delim="</record>", delimMode="prev")
----

Finally, this example tails a file, uses the regex `\n\d\d\d\d` as a
delimiter, and appends the delimiter to the next event. This could be used
to gather entries from a stack dump in a log file in which each new entry
starts with four digits (like a year from a date stamp). A worked sketch of
this mode appears after the `synth` example below:

----
tail("file", delim="\\n\\d\\d\\d\\d", delimMode="next")
----

==== Synthetic sources, `synth`

Here's one more example, where you use a `synth` source to generate events:

----
$ flume dump 'asciisynth(20,30)'
----

You should get 20 events, each with 30 random ASCII bytes.
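
Returning to the `tail` delimiter modes: to watch `next` mode split a
date-stamped log into whole entries, you can generate a small test file and
tail it. This is only a sketch; 'app.log' and its contents are made up for
illustration:

----
$ printf '2009 ERROR request failed\n\tat Foo.bar(Foo.java:12)\n2009 INFO recovered\n' > app.log
$ flume dump 'tail("app.log", delim="\\n\\d\\d\\d\\d", delimMode="next")'
----

Because the file is only split where a newline is followed by four digits,
the indented stack-trace line stays attached to the `2009 ERROR` event
instead of becoming an event of its own.
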
==== Syslog as a source, `syslogUdp` and `syslogTcp`

As with files, you can also accept data from well-known wire formats such as
syslog. For example, you can start a traditional syslog-like UDP server
listening on port 5140 (the normal syslog UDP port is the privileged port
514) by running this command:

----
$ flume dump 'syslogUdp(5140)'
----

You can feed the source data by using netcat to send syslog-formatted data,
as shown in the example below:

----
$ echo "<37>hello via syslog" | nc -u localhost 5140
----

TIP: You may need to press ^C to exit this command.

NOTE: The extra +<37>+ is a syslog wire-format encoding of a message
category and priority level (the value is facility * 8 + severity, so
+<37>+ is facility 4 with severity 5).

Similarly, you can set up a syslog-ng compatible source that listens on TCP
port 5140 (the normal syslog-ng TCP port is the privileged port 514):

----
$ flume dump 'syslogTcp(5140)'
----

And send it data (netcat uses TCP by default):

----
$ echo "<37>hello via syslog" | nc localhost 5140
----

TIP: You may need to press ^C to exit this command.

Syslog backwards compatibility allows data normally created by syslog,
rsyslog, or syslog-ng to be sent to and processed by Flume. (A `logger`-based
alternative to netcat is sketched at the end of this section.)

=== Anatomy of an Event

The preceding sections described a number of data sources that Flume can
interoperate with. Before going any further, it will be helpful for you to
understand what Flume is actually sending and processing internally.

Flume internally converts every external source of data into a stream of
*events*. Events are Flume's unit of data, and are a simple and flexible
representation. An event is composed of a *body* and *metadata*. The event
body is a string of bytes representing the content of an event. For example,
a line in a log file is represented as an event whose body is the actual
byte representation of that line. The event metadata is a table of key/value
pairs that capture some detail about the event, such as the time it was
created or the name of the machine on which it originated. The table can be
appended to as an event travels along a Flume flow, and it can be read to
control the operation of individual components of that flow. For example,
the machine name attached to an event can be used to control the output path
where the event is written at the end of the flow.

An event's body can be up to 32KB long. Although this limit can be changed
via a system property, leaving it unchanged is recommended in order to
preserve performance.

=== Section Summary

In this section, you learned how to use Flume's +dump+ command to print data
from a variety of different input sources to the console. You also learned
about the *event*, the fundamental unit of data transfer in Flume. The
following table summarizes the sources described in this section.

.Flume Event Sources
+console+ :: Stdin console.
+text("filename")+ :: One-shot text file source. One line is one event.
+tail("filename")+ :: Similar to Unix's +tail -F+. One line is one event.
Stays open for more data and follows the file name if the file is rotated.
+multitail("file1"[, "file2"[, ...]])+ :: Similar to the +tail+ source, but
follows multiple files.
+asciisynth(msg_count,msg_size)+ :: A source that synthetically generates
+msg_count+ random messages of size +msg_size+, converting all characters
into printable ASCII characters.
+syslogUdp(port)+ :: Syslog over UDP port +port+. This is syslog compatible.
+syslogTcp(port)+ :: Syslog over TCP port +port+. This is syslog-ng
compatible.
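
Finally, as an alternative to netcat for feeding the syslog sources
summarized above, the util-linux `logger` utility builds the syslog framing
(such as the +<37>+ priority prefix) for you. This is a minimal sketch; it
assumes a `logger` build that supports the remote-server flags, which varies
by platform:

----
$ logger -n localhost -P 5140 -d "hello via logger (UDP)"   # -d sends UDP datagrams
$ logger -n localhost -P 5140 -T "hello via logger (TCP)"   # -T opens a TCP connection
----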