////////////////////
Licensed to Cloudera, Inc. under one or more contributor license
agreements.  See the NOTICE file distributed with this work for
additional information regarding copyright ownership.  Cloudera, Inc.
licenses this file to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance with the
License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied.  See the License for the specific language governing
permissions and limitations under the License.
////////////////////

== Using Flume Agents for Apache 2.x Web Server Logging

To connect Flume to Apache 2.x servers, you will need to:

. Configure web log file permissions
. Tail the web logs or use piped logs to enable Flume to get data from the web server.

This section steps through basic setup on default Ubuntu Lucid and default
CentOS 5.5 installations, and then describes various ways of integrating
Flume.

=== If you are using CentOS / Red Hat Apache servers

By default, CentOS's Apache writes weblogs to files owned by user +root+ and
group +adm+, in 0644 (rw-r--r--) mode.  Because these files are
world-readable, the Flume node, which runs as the +flume+ user, is able to
read the logs.

Apache on CentOS/Red Hat servers defaults to writing logs to two files:

----
/var/log/httpd/access_log
/var/log/httpd/error_log
----

The simplest way to gather data from these files is to tail them by
configuring Flume nodes to use Flume's +tail+ source:

----
tail("/var/log/httpd/access_log")
tail("/var/log/httpd/error_log")
----

=== If you are using Ubuntu Apache servers

By default, Ubuntu writes weblogs to files owned by user +root+ and group
+adm+, in 0640 (rw-r-----) mode.  Flume runs as the +flume+ user and by
default will *not* be able to read these files.  One approach to allow the
+flume+ user to read them is to add it to the +adm+ group.

Apache on Ubuntu defaults to writing logs to three files:

----
/var/log/apache2/access.log
/var/log/apache2/error.log
/var/log/apache2/other_vhosts_access.log
----

The simplest way to gather data from these files is by configuring Flume
nodes to use Flume's +tail+ source:

----
tail("/var/log/apache2/access.log")
tail("/var/log/apache2/error.log")
tail("/var/log/apache2/other_vhosts_access.log")
----

=== Getting log entries from Piped Log files

The Apache 2.x documentation (http://httpd.apache.org/docs/2.2/logs.html)
describes using a piped logfile with the +CustomLog+ directive.  Its example
uses +rotatelogs+ to periodically write data to new files with a given
prefix.  Here are some example directives that could appear in the
+httpd.conf+/+apache2.conf+ file:

----
LogFormat "%h %l %u %t \"%r\" %>s %b" common
CustomLog "|/usr/sbin/rotatelogs /var/log/apache2/foo_access_log 3600" common
----

TIP: In Ubuntu Lucid, these directives are in
+/etc/apache2/sites-available/default+.  In CentOS 5.5, they are in
+/etc/httpd/conf/httpd.conf+.

These directives configure Apache to write a new log file under
+/var/log/apache2/foo_access_log.xxxxx+ every hour (3600 seconds) using the
"common" log format.
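With this directive, +rotatelogs+ appends a timestamp suffix to each new
file, so over time the log directory accumulates a series of files.  The
listing below is only an illustration (the numeric suffixes are example Unix
timestamps, not actual output):

----
$ ls /var/log/apache2/
foo_access_log.1273870800
foo_access_log.1273874400
foo_access_log.1273878000
----

Because each rotation starts a fresh file, the +tailDir+ source described
next matches on a file-name pattern rather than a single path.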
You can use Flume's +tailDir+ source to read all of these files without
modifying the Apache settings:

----
tailDir("/var/log/apache2/", "foo_access_log.*")
----

The first argument is the directory, and the second is a regex that is
matched against the file names.  +tailDir+ watches the directory and tails
every file whose name matches.

=== Using piped logs

Instead of writing data to disk and then having Flume read it, you can have
Flume ingest data directly from Apache.  To do so, modify the web server's
parameters to use its piped log feature by adding one of the following
directives to Apache's configuration.  Using best-effort delivery:

----
CustomLog "|flume node_nowatch -1 -s -n apache -c \'apache:console|agentBESink(\"collector\");\'" common
----

Or, using disk-failover delivery:

----
CustomLog "|flume node_nowatch -1 -s -n apache -c \'apache:console|agentDFOSink(\"collector\");\'" common
----

WARNING: By default, CentOS does not have the Java required by the Flume
node in user +root+'s path.  You can use +alternatives+ to create a managed
symlink in +/usr/bin/+ for the +java+ executable.

Using piped logs can be more efficient, but it is riskier because Flume can
deliver messages without saving them to disk, which increases the
probability of event loss.  From a security point of view, this Flume node
instance runs as Apache's user, which according to the Apache manual is
often +root+.

NOTE: You could configure the one-shot-mode node to deliver data directly to
a collector.  This can only be done at the best-effort or disk-failover
level.

The prior examples use Flume nodes in one-shot mode, which runs without
contacting a master.  Unfortunately, this means that one-shot mode cannot
directly use the automatic chains or the end-to-end (E2E) reliability mode,
because the automatic chains are generated by the master and because E2E
mode currently delivers acknowledgements through the master.  However, you
can have a one-shot Flume node deliver data to a local Flume node daemon,
where the reliable E2E mode can be used.  In this setup we would have the
following Apache directive:

----
CustomLog "|flume node_nowatch -1 -s -n apache -c \'apache:console|agentBESink(\"localhost\", 12345);\'" common
----

Then you can set up a Flume node daemon to listen with the following
configuration:

----
node : rpcSource(12345) | agentE2ESink("collector");
----

Since this daemon node is connected to the master, it can also use the
auto*Chains:

----
node : rpcSource(12345) | autoE2EChain;
----

NOTE: End-to-end mode attempts to ensure the delivery of any data that
enters the E2E sink.  In this one-shot-node to reliable-node scenario, data
is not safe until it gets to the E2E sink.  However, since this is a local
connection, it should only fail when the machine or the processes fail.  The
one-shot node can be set to disk-failover (DFO) mode in order to reduce the
chance of message loss if the daemon node's configuration changes.
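As a minimal sketch of that last suggestion, the piped-log directive can
point a disk-failover sink at the local daemon instead of a best-effort
sink.  This simply combines the +agentDFOSink+ and +localhost+ examples
shown earlier; port 12345 remains the illustrative value used above, and the
daemon's +rpcSource(12345)+ configuration is unchanged:

----
CustomLog "|flume node_nowatch -1 -s -n apache -c \'apache:console|agentDFOSink(\"localhost\", 12345);\'" common
----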