////////////////////
Licensed to Cloudera, Inc. under one or more contributor license
agreements.  See the NOTICE file distributed with this work for
additional information regarding copyright ownership.  Cloudera, Inc.
licenses this file to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance with the
License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied.  See the License for the specific language governing
permissions and limitations under the License.
////////////////////

== Integrating Flume with your Data Sources

.WARNING
********************
This section is incomplete.
********************

Flume's source interface is designed to be simple yet powerful, and to
enable logging of all kinds of data -- from unstructured blobs of
bytes and semi-structured blobs with structured metadata to completely
structured data.

////
Issues: language neutrality, reliability, push vs pull, one shot vs
continuous.
////

In this section we describe some of the basic mechanisms that can be
used to get data into Flume. Generally, this comes in three flavors:
*pushing* data to Flume, having Flume *poll* for data, or *embedding*
Flume or Flume components into an application. These mechanisms have
different trade-offs based on the semantics of the operation. Also,
some sources can be *one shot* or *continuous* sources.

=== Push Sources

+syslogTcp+, +syslogUdp+ ::
  wire compatibility with the syslog and syslog-ng logging protocols.
+scribe+ ::
  wire compatibility with the Scribe log collection system.

=== Polling Sources

+tail+, +multitail+ ::
  watches one or more files for appended data.
+exec+ ::
  runs an existing program; good for extracting custom data.
+poller+ ::
  gathers information from Flume nodes themselves.

=== Embedding Sources

WARNING: These features are incomplete.

+log4j+

+simple client library+

// move this to gathering data from sources

=== Logging via log4j Directly

Flume includes specific integration support for Apache log4j that
allows end user applications to log directly to a Flume agent with no
code modification. This support comes in the form of a log4j appender
and can be configured in an application's +log4j.properties+ or
+log4j.xml+ file just as you would configure any of the built-in
appenders. The appender uses Flume's +avroSource()+ and converts each
log4j +LoggingEvent+ into a Flume Avro event that can be handled
natively by Flume.

To configure log4j to log to Flume:

. Ensure the proper jar files are on the application's classpath.
. Configure the
+com.cloudera.flume.log4j.appender.FlumeLog4jAvroAppender+ appender in
the log4j configuration file.

To use the Flume Avro appender, you must have the following jars on
your application's classpath:

- +flume-log4j-appender-_version_.jar+
- +flume-core-_version_.jar+

The Avro jar files and their dependencies are also required. The
simplest way to ensure all dependencies are properly included in your
application's classpath is to use a build system such as Maven that
handles transitive dependencies for you. Flume's log4j appender is
available as a Maven project and will properly include the Avro
dependencies.
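Because the integration happens entirely in the log4j configuration,
application code keeps using the standard log4j API and contains
nothing Flume specific. The sketch below shows the application side of
this arrangement; the class, logger, and messages are hypothetical, and
it assumes only the standard log4j 1.x +Logger+ API. With a
configuration such as the example +log4j.properties+ shown below, each
of these logging calls is forwarded to the local Flume agent's
+avroSource()+.

.Example application code (hypothetical)
--------------------------------------
import org.apache.log4j.Logger;

// A hypothetical application class.  Note that nothing here refers to
// Flume; delivery to the Flume agent is determined entirely by the
// log4j configuration (see the example log4j.properties below).
public class OrderProcessor {

  private static final Logger LOG = Logger.getLogger(OrderProcessor.class);

  public void process(String orderId) {
    // Routed to Flume's avroSource() by the "flume" appender.
    LOG.info("Processing order " + orderId);
    try {
      // ... application logic ...
    } catch (RuntimeException e) {
      // Errors and stack traces travel through the same appender.
      LOG.error("Failed to process order " + orderId, e);
    }
  }

  public static void main(String[] args) {
    new OrderProcessor().process("order-42");
  }
}
--------------------------------------

Since no Flume classes are imported, an application can be pointed at
Flume (or away from it) purely by editing its log4j configuration.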
The Flume Avro appender has a number of options users can set to
affect its behavior. The only parameter that absolutely must be set is
the port on which the Flume +avroSource()+ is listening. The appender
assumes the Flume agent is running locally and can be reached via the
hostname +localhost+. Users can also control the number of times the
appender attempts to reconnect before a logging call fails.

.Parameters
+hostname+ ::
  The hostname or IP address to which events should be sent.
  (default: +localhost+)
+port+ ::
  The port on which Flume's +avroSource()+ is configured to listen.
  (required)
+reconnectAttempts+ ::
  The maximum number of times to attempt to connect to the
  +avroSource()+ before throwing an exception. A setting of 0 (zero)
  means to try forever. (default: 10)

.Example log4j.properties
--------------------------------------
log4j.debug = true
log4j.rootLogger = INFO, flume

log4j.appender.flume = com.cloudera.flume.log4j.appender.FlumeLog4jAvroAppender
log4j.appender.flume.layout = org.apache.log4j.TTCCLayout
log4j.appender.flume.port = 12345
log4j.appender.flume.hostname = localhost
log4j.appender.flume.reconnectAttempts = 10
--------------------------------------

.Example Flume configuration
--------------------------------------
my-app : avroSource(12345) | agentE2ESink("my-app-col", 12346)
my-app-col : collectorSource(12346) | collectorSink("hdfs://...", "my-app-")
--------------------------------------

Note how the port referenced in the log4j.properties example matches
that of the +avroSource()+ in the Flume configuration example.

.Notes
The +FlumeLog4jAvroAppender+ uses no buffering internally. This is
because buffering would create a case where, even if a Flume node is
configured to be end-to-end durable, events sitting in the appender's
internal buffer could be lost in the event of a failure. By setting
the +reconnectAttempts+ parameter to zero (i.e. retry forever), you
can ensure that the end user application blocks should the Flume agent
become unavailable. This is meant to satisfy users with a zero data
loss requirement, for whom it is better to stop the service than to be
unable to log what occurred.

////
WARNING: These instructions are out of date and currently untested.

Modify hadoop-daemon.sh so that it includes Flume.

Places where log4j is mentioned:

----
bin/hadoop-daemon.sh -- defaults to INFO,DRFA
conf/hadoop-env.sh -- can be set but can cause perms issues
bin/hadoop -- defaults to INFO,console
conf/log4j.properties -- loggers are defined here, but the default root logger is ignored
src/java/..../TaskRunner.java -- INFO,TLA
----
////

==== Example of Logging Hadoop Jobs

////
For jobs initiated by the user, the easiest mechanism to enable Flume
logging is to modify conf/hadoop-env.sh to include:

export HADOOP_ROOT_LOGGER=INFO,flume,console
////

==== Logging Hadoop Daemons

////
To log events generated by Hadoop's daemons (tasktracker, jobtracker,
datanode, secondarynamenode, namenode), modify bin/hadoop-daemon.sh so
that HADOOP_ROOT_LOGGER is set to

export HADOOP_ROOT_LOGGER=INFO,flume,DRFA

TODO (jon) this doesn't seem to be working right now -- need to figure
out why.
////