////////////////////
Licensed to Cloudera, Inc. under one or more contributor license
agreements.  See the NOTICE file distributed with this work for
additional information regarding copyright ownership.  Cloudera, Inc.
licenses this file to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance with the
License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied.  See the License for the specific language governing
permissions and limitations under the License.
////////////////////

== Using Flume Agents for Apache 2.x Web Server Logging

To connect Flume to Apache 2.x servers, you will need to:

. Configure web log file permissions
. Tail the web logs or use piped logs to enable Flume to get data from the web server.

This section steps through basic setup on default Ubuntu Lucid and default
CentOS 5.5 installations, and then describes various ways of integrating
Flume.

=== If you are using CentOS / Red Hat Apache servers

By default, CentOS's Apache writes weblogs to files owned by user +root+ and
group +adm+, in 0644 (rw-r--r--) mode.  Because these files are
world-readable, the Flume node, which runs as the +flume+ user, is able to
read the logs.

Apache on CentOS/Red Hat servers defaults to writing logs to two files:

----
/var/log/httpd/access_log
/var/log/httpd/error_log
----

The simplest way to gather data from these files is to tail them by
configuring Flume nodes to use Flume's +tail+ source:

----
tail("/var/log/httpd/access_log")
tail("/var/log/httpd/error_log")
----

=== If you are using Ubuntu Apache servers

By default, Ubuntu writes weblogs to files owned by user +root+ and group
+adm+, in 0640 (rw-r-----) mode.  Flume runs as the +flume+ user and by
default will *not* be able to read these files.  One approach to allow the
+flume+ user to read them is to add it to the +adm+ group.

Apache on Ubuntu defaults to writing logs to three files:

----
/var/log/apache2/access.log
/var/log/apache2/error.log
/var/log/apache2/other_vhosts_access.log
----

The simplest way to gather data from these files is by configuring Flume
nodes to use Flume's +tail+ source:

----
tail("/var/log/apache2/access.log")
tail("/var/log/apache2/error.log")
tail("/var/log/apache2/other_vhosts_access.log")
----

=== Getting log entries from Piped Log files

The Apache 2.x documentation (http://httpd.apache.org/docs/2.2/logs.html)
describes using a piped logfile with the +CustomLog+ directive.  Its example
uses +rotatelogs+ to periodically write data to new files with a given
prefix.  Here are some example directives that could appear in the
+httpd.conf+/+apache2.conf+ file:

----
LogFormat "%h %l %u %t \"%r\" %>s %b" common
CustomLog "|/usr/sbin/rotatelogs /var/log/apache2/foo_access_log 3600" common
----

TIP: In Ubuntu Lucid, these directives are in
+/etc/apache2/sites-available/default+.  In CentOS 5.5, they are in
+/etc/httpd/conf/httpd.conf+.

These directives configure Apache to write a new log file under
+/var/log/apache2/foo_access_log.xxxxx+ every hour (3600 seconds) using the
"common" log format.
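With this directive, +rotatelogs+ appends a timestamp suffix to each new
file, so over time the log directory accumulates a series of files.  The
listing below is only an illustration (the numeric suffixes are example Unix
timestamps, not actual output):

----
$ ls /var/log/apache2/
foo_access_log.1273870800
foo_access_log.1273874400
foo_access_log.1273878000
----

Because each rotation starts a fresh file, the +tailDir+ source described
next matches on a file-name pattern rather than a single path.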
You can use Flume's +tailDir+ source to read all of these files without
modifying the Apache settings:

----
tailDir("/var/log/apache2/", "foo_access_log.*")
----

The first argument is the directory, and the second is a regex that is
matched against the file names.  +tailDir+ watches the directory and tails
every file whose name matches.

=== Using piped logs

Instead of writing data to disk and then having Flume read it, you can have
Flume ingest data directly from Apache.  To do so, modify the web server's
parameters to use its piped log feature by adding one of the following
directives to Apache's configuration.  Using best-effort delivery:

----
CustomLog "|flume node_nowatch -1 -s -n apache -c \'apache:console|agentBESink(\"collector\");\'" common
----

Or, using disk-failover delivery:

----
CustomLog "|flume node_nowatch -1 -s -n apache -c \'apache:console|agentDFOSink(\"collector\");\'" common
----

WARNING: By default, CentOS does not have the Java required by the Flume
node in user +root+'s path.  You can use +alternatives+ to create a managed
symlink in +/usr/bin/+ for the +java+ executable.

Using piped logs can be more efficient, but it is riskier because Flume can
deliver messages without saving them to disk, which increases the
probability of event loss.  From a security point of view, this Flume node
instance runs as Apache's user, which according to the Apache manual is
often +root+.

NOTE: You could configure the one-shot-mode node to deliver data directly to
a collector.  This can only be done at the best-effort or disk-failover
level.

The prior examples use Flume nodes in one-shot mode, which runs without
contacting a master.  Unfortunately, this means that one-shot mode cannot
directly use the automatic chains or the end-to-end (E2E) reliability mode,
because the automatic chains are generated by the master and because E2E
mode currently delivers acknowledgements through the master.  However, you
can have a one-shot Flume node deliver data to a local Flume node daemon,
where the reliable E2E mode can be used.  In this setup we would have the
following Apache directive:

----
CustomLog "|flume node_nowatch -1 -s -n apache -c \'apache:console|agentBESink(\"localhost\", 12345);\'" common
----

Then you can set up a Flume node daemon to listen with the following
configuration:

----
node : rpcSource(12345) | agentE2ESink("collector");
----

Since this daemon node is connected to the master, it can also use the
auto*Chains:

----
node : rpcSource(12345) | autoE2EChain;
----

NOTE: End-to-end mode attempts to ensure the delivery of any data that
enters the E2E sink.  In this one-shot-node to reliable-node scenario, data
is not safe until it gets to the E2E sink.  However, since this is a local
connection, it should only fail when the machine or the processes fail.  The
one-shot node can be set to disk-failover (DFO) mode in order to reduce the
chance of message loss if the daemon node's configuration changes.
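As a minimal sketch of that last suggestion, the piped-log directive can
point a disk-failover sink at the local daemon instead of a best-effort
sink.  This simply combines the +agentDFOSink+ and +localhost+ examples
shown earlier; port 12345 remains the illustrative value used above, and the
daemon's +rpcSource(12345)+ configuration is unchanged:

----
CustomLog "|flume node_nowatch -1 -s -n apache -c \'apache:console|agentDFOSink(\"localhost\", 12345);\'" common
----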