//////////////////// Licensed to Cloudera, Inc. under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. Cloudera, Inc. licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. //////////////////// == Trying out Flume sources and sinks First this section will describe some basic tips for testing sources, sinks and logical nodes. === Testing sources Testing sources is a straightforward process. The +flume+ script has a +dump+ command that allows you test data source configuration by displaying data at the console. ---- flume dump source ---- NOTE: +source+ needs to be a single command line argument, so you may need to add 'quotes' or "quotes" to the argument if it has quotes or parenthesis in them. Using single quotes allow you to use unescaped double quotes in the configuration argument. (ex: `'text("/tmp/foo")'` or `"text(\"/tmp/foo\")"`). Here are some simple examples to try: ---- $ flume dump console $ flume dump 'text("/path/to/file")' $ flume dump 'file("/path/to/file")' $ flume dump syslogTcp $ flume dump 'syslogUdp(5140)' $ flume dump 'tail("/path/to/file")' $ flume dump 'tailDir("path/to/dir", "fileregex")' $ flume dump 'rpcSource(12346)' ---- Under the covers, this dump command is actually running the +flume node_nowatch+ command with some extra command line parameters. ---- flume node_nowatch -1 -s -n dump -c "dump: $1 | console;" ---- Here's a summary of what the options mean. [horizontal] +-1+ :: one shot execution. This makes the node instance not use the heartbeating mechanism to get a config. +-s+ :: starts the Flume node without starting the node's http status web server. If the status web server is started, a Flume node's status server will keep the process alive even if in one-shot mode. If the -s flag is specified along with one-shot mode (-1), the Flume node will exit after all logical nodes complete. +-c "node:src|snk;"+ :: Starts the node with the given configuration definition. NOTE: If not using -1, this will be invalidated upon the first heartbeat to the master. +-n node+ :: gives the node the physical name node. You can get info on all of the Flume node commands by using this command: ---- $ flume node -h ---- === Testing sinks and decorators Now that you can test sources, there is only one more step necessary to test arbitrary sinks and decorators from the command line. A sink requires data to consume so some common sources used to generate test data include synthetic datasets (+asciisynth+), the console (+console+), or files (+text+). For example, you can use a synthetic source to generate 1000 events, each of 100 "random" bytes data to the console. You could use a text source to read a file like /etc/services, or you could use the console as a source and interactively enter lines of text as events: ---- $ flume node_nowatch -1 -s -n dump -c 'dump: asciisynth(1000,100) | console;' $ flume node_nowatch -1 -s -n dump -c 'dump: text("/etc/services") | console;' $ flume node_nowatch -1 -s -n dump -c 'dump: console | console;' ---- You can also use decorators on the sinks you specify. For example, you could rate limit the amount of data that pushed from the source to sink by inserting a delay decorator. In this case, the 'delay' decorator waits 100ms before sending each synthesized event to the console. ---- $ flume node_nowatch -1 -s -n dump -c 'dump: asciisynth(1000,100) | { delay(100) => console};' ---- Using the command line, you can send events via the direct best-effort (BE) or disk-failover (DFO) agents. The example below uses the +console+ source so you can interactively generate data to send in BE and DFO mode to collectors. NOTE: Flume node_nowatch must be used when piping data in to a Flume node's console source. The watchdog program does not forward stdin. ---- $ flume node_nowatch -1 -s -n dump -c 'dump: console | agentBESink("collectorHost");' $ flume node_nowatch -1 -s -n dump -c 'dump: console | agentDFOSink("collectorHost");' ---- WARNING: Since these nodes are executed with configurations entered at the command line and never contact the master, they cannot use the automatic chains or logical node translations. Currently, the acknowledgements used in E2E mode go through the master piggy-backed on node-to-master heartbeats. Since this mode does not heartbeat, E2E mode should not be used. Console sources are useful because we can pipe data into Flume directly. The next example pipes data from a program into Flume which then delivers it. ---- $ | flume node_nowatch -1 -s -n foo -c 'foo:console|agentBESink("collector");' ---- Ideally, you could write data to a named pipe and just have Flume read data from a named pipe using +text+ or +tail+. Unfortunately, this version of Flume's +text+ and +tail+ are not currently compatible with named pipes in a Linux environment. However, you could pipe data to a Flume node listening on the stdin console: ---- $ tail -f namedpipe | flume node_nowatch -1 -s -n foo -c 'foo:console|agentBESink;' ---- Or you can use the exec source to get its output data: ---- $ flume node_nowatch -1 -s -n bar -c 'bar:exec("cat pipe")|agentBESink;' ---- === Monitoring nodes While outputting data to a console or to a logfile is an effective way to verify data transmission, Flume nodes provide a way to monitor the state of sources and sinks. This can be done by looking at the node's report page. By default, you can navigate your web browser to the nodes TCP 35862 port. (http://node:35862/) This page shows counter information about all of the logical nodes on the physical node as well as some basic machine metric information. TIP: If you have multiple physical nodes on a single machine, there may be a port conflict. If you have the auto-find port option on (+flume.node.http.autofindport+), the physical node will increment the port number until it finds a free port it can bind to.