////////////////////
Licensed to Cloudera, Inc. under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  Cloudera, Inc. licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////////////////////

== Flume Agents for Syslog data

+syslog+ is the standard unix single machine logging service.  Events
are generally emitted as lines with a time stamp, "facility" type,
priority, and message.  Syslog can be configured to send data to
remote destinations.  The default syslog remote delivery was
originally designed to provide best effort delivery service.  Today,
there are several more advanced syslog services that deliver messages
with improved reliability (TCP connections with memory buffering on
failure).  The reliability guarantees however are one hop and weaker
than Flume's more reliable delivery mechanism.

This section describes collecting syslog data using two methods.  The
first part describes a file tailing approach.  The latter parts
describe syslog system configuration guidance that enables directly
feeding Flume's +syslog*+ sources.

=== Tailing files

The quickest way to record syslog messages is to tail syslog generated
log files.  These files generally live in +/var/log+.

Some examples include:
----
/var/log/auth.log
/var/log/messages
/var/log/syslog
/var/log/user.log
----

These files could be tailed by Flume nodes with tail sources:

----
tail("/var/log/auth.log")
tail("/var/log/messages")
tail("/var/log/syslog")
tail("/var/log/user.log")
----

Depending on your system configuration, there may be permissions
issues when accessing these files from the Flume node process.

NOTE: Red Hat/CentOS systems default to writing log files owned by
root, in group root, and with 0600 (-rw-------) permissions. Flume
could be run as root, but this is not advised because Flume can be
remotely configured to execute arbitrary programs.

NOTE: Ubuntu systems default to writing logs files owned by syslog, in
group adm, and with 0640 (-rw-r-----) permissions.  By adding the user
"flume" to group "adm", a Flume node running user "flume" should be
able to read the syslog generated files.

NOTE: When tailing files, the time when the event is read is used as the
time stamp.

=== Delivering Syslog events via sockets

The original syslog listens to the +/dev/log+ named pipe, and can be
configured to listen on UDP port
514. (http://tools.ietf.org/search/rfc5424). More advanced versions
(rsyslog, syslog-ng) can send and recieve over TCP and may do
in-memory queuing/buffering. For example, syslog-ng and rsyslog can
optionally use the default UDP port 514 or use TCP port 514 for better
recovery options.

NOTE: By default only superusers can listen on on UDP/TCP ports 514.
Unix systems usually only allow ports <1024 to be bound by
superusers. While Flume can run as superuser, from a security stance
this is not advised.  The examples provide directions to route to the
user-bindable port 5140.

For debugging syslog configurations, you can just use 'flume dump'
with syslog sources.  This command outputs received syslog data to the
console.  To test if syslog data is coming in to the proper port you
can run this command from the command line:

----
$ flume dump 'syslogUdp(5140)' 
----

This will dump all incoming events to the console.

If you are satisfied with your connection, you can have a Flume node
run on the machine configure its sink for the reliability level you
desire.

Using a +syslog*+ Flume source will save the entire line of event
data, use the timestamp found in the original data, extract a +host+,
and attempt to extract a service from the syslog line.  All of these
map to a Flume event's fields except for +service+ so this is added as
extra metadata field to each event (this is a convention with
syslog defined in RFC).

So, a syslog entry whose body is this:

----
Sep 14 07:57:24 blitzwing dhclient: bound to 192.168.126.212 -- renewal in 710 seconds.
----

will have the Flume event body:

----
Sep 14 07:57:24 blitzwing dhclient: bound to 192.168.126.212 -- renewal in 710 seconds.
----

The event will also translated the "Sep 14 07:57:24" date+time data so
that it will be bucketable.  Since this date does not have a year, it
assumes the current year and since it has no timezone it assumes the
local timezone.  The host field should be "blitzwing", and the
optional "service" metadata field will contain "dhclient".

====  Configuring +syslogd+

The original syslog is +syslogd+.  It is configured by an
+/etc/syslog.conf+ file.  Its format is fairly simple.

Syslog recieves messages and then sends to out to different facilities
that have associated names
(http://tools.ietf.org/search/rfc5424#section-6.2).

The +/etc/syslog.conf+ file essentially contains lists of facilities
and "actions".  These "actions" are destinations such as regular
files, but can also be named pipes, consoles, or remote machines.  One
can specify a remote machine by prefixing an '@' symbol in front the
destination host machine.  If no port is specified, events are sent
via UDP port 514.

The example below specifies delivery to machine localhost on port
5140.

----
user.*     @localhost:5140 
----

A Flume node daemon running on this machine would have a +syslogUdp+
source listening for new log data.

----
host-syslog : syslogUdp(5140) | autoE2EChain ;
----

==== Configuring +rsyslog+ 

+rsyslog+ is a more advanced drop-in replacement for syslog and the
default syslog system used by Ubuntu systems.  It supports basic
filtering, best effort delivery, and queuing for handling one-hop
downstream failures.

+rsyslog+ actually extends the syslog configuration file format.
Similar to regular +syslogd+ you can send data to a remote machine on
listening on UDP port 514 (standard syslog port).

----
*.*   @remotehost
----

Moreover, +rsyslog+ also allows you to use the more reliable TCP
protocol to send data to a remote host listening on TCP port 514.  In
+rsyslog+ configurations, an '@@' prefix dictates the use of TCP.

----
*.*  @@remotehost
----

Similarly, you can also append a suffix port number to have it deliver
to a particular port.  In this example, events are delivered to
localhost TCP port 5140.

----
*.*  @@localhost:5140
----

Assuming you have a Flume node daemon running on the local host, you
can capture syslog data by adding a logical node with the following
configuration:

----
host-syslog : syslogTcp(5140) | autoE2EChain ; 
----

////

TODO: (this requires new FileReaderSource)

You can also log data to a named pipe that flume can listen on.

Named pipes
----
*.*  |/dev/flume
----

////

==== Configuring +syslog-ng+

Syslog-ng is another common replacement for the default syslog logging
system.  Syslog-ng has a different configuration file format but
essentially gives the operator the ability to send syslog data from
different facilities to different remote destinations.  TCP or UDP can
be used.

Here is an example of modifications to a +syslog-ng.conf+ (often found
in +/etc/syslog-ng/+) file.

----
## set up logging to loghost (which is flume) 
destination loghost { 
        tcp("localhost" port(5140)); 
}; 

# send everything to loghost, too 
log { 
        source(src); 
        destination(loghost); 
}; 
----

Assuming you have a Flume node daemon running on the local host, you
can capture syslog data by adding a logical node with the following
configuration:

----
host-syslog : syslogTcp(5140) | autoE2EChain ; 
----