After each log entry has been parsed into its constituent parts, it is passed on to the data-filtering stage of the importer pipeline. This is the stage at which a decision is made as to whether a log entry should be stored in the database.
If no filters have been specified, the importer process imports all entries into the database. If one or more filters are present, no entries are accepted by default: at least one filter must declare an interest in a log entry for it to be stored. Each log entry is passed through the entire stack of filters in the order in which they were specified; filtering does not stop after any one plugin has declared an interest.
As well as declaring an interest in a particular log entry, filters are permitted to attach tags and associate other information with an entry, which will be stored in the database for later retrieval (e.g. when generating reports). As a result, subsequent filters can make decisions and do further processing based on the results of filters earlier in the pipeline. Note, however, that an entry will be accepted if ANY filter declares an interest; it is not possible for filters later in the stack to overturn the decisions of those earlier in the stack. There are no limits on the processing you may do on each log entry, but since a filter is run on every entry and there may be millions to process, it is worth doing only the minimum required if you want the entire process to complete in a reasonable amount of time.
When an entry is successfully parsed, the separate parts are placed into a simple Perl hash; for speed reasons this is not done in an Object-Oriented style. The following elements are available for querying during the filtering stage.
Note that syslog entries come in a huge variety of styles, so some fields, such as program and pid, are not always specified. Even when they are present it is not always easy to extract the information without making the regular expression wildly complicated. For details of how the strings are parsed see the documentation for BuzzSaw::Parser and BuzzSaw::Parser::RFC3339.
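As an illustration, a parsed entry might end up looking something like the following sketch. Only the program, pid and message fields are named in the text above; the exact set of hash keys and the sample values are assumptions for this example.

```perl
# Hypothetical parsed log entry. Only program, pid and message are
# named in the surrounding text; treat the rest as illustration, not
# the definitive BuzzSaw structure.
my %event = (
    program => 'sshd',
    pid     => 12345,
    message => 'Accepted password for fred from 10.0.0.1 port 2200 ssh2',
);

# Fields such as program and pid may be missing for some syslog
# styles, so check with exists() before using them.
if ( exists $event{program} ) {
    print "program: $event{program}\n";    # prints "program: sshd"
}
```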
A BuzzSaw filter is implemented as a Perl class using the Moose Object-Oriented framework. It must implement the BuzzSaw::Filter role and provide a check() method; example filters are shown below.
For every parsed log entry the check() method will be called with four arguments, which the examples below unpack as $self, $event, $votes and $results: the filter object itself, a reference to the hash of parsed event data, and the votes and results accumulated by the filters earlier in the stack.
The method must return one of three possible values.
The first two options are fairly straightforward. The third may seem a little peculiar, but it becomes useful when you need to write a filter which is designed to make decisions based on the results of other filter modules placed earlier in the stack. For example, the BuzzSaw::Filter::UserClassifier module will classify the value of the userid field if it has been added by a previous filter (e.g. SSH or Cosign), and extra information will be associated with the event.
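Ignoring the Moose boilerplate shown in the examples below, the core of such a results-driven filter might look like the following sketch. The classification rule is invented for this example, and where the real UserClassifier module would use the special third return value, this sketch simply casts a plain false vote.

```perl
# Sketch of a check() method that only inspects the work of earlier
# filters: if one of them (e.g. SSH or Cosign) has added a userid
# field, attach some extra information to the event. The
# classification rule itself is purely illustrative.
sub classifier_check {
    my ( $self, $event, $votes, $results ) = @_;

    if ( exists $event->{userid} ) {
        my $class = $event->{userid} =~ m/\d\z/ ? 'student' : 'staff';
        $event->{extra_info}{user_class} = $class;
    }

    # Cast no vote of its own; the entry is stored only if another
    # filter has declared an interest.
    return 0;
}
```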
Optionally, a filter may also return a list of tags (simple strings) which should be associated with this log entry when it is stored.
Here is a particularly trivial first example: a filter which returns true if the event has a value for the program field and it matches the string kernel.
package BuzzSaw::Filter::Kernel;
use Moose;

with 'BuzzSaw::Filter';

sub check {
    my ( $self, $event, $votes, $results ) = @_;

    return ( exists $event->{program} && $event->{program} eq 'kernel' );
}

1;
Returning a list of tags is useful to aid later searching and reporting. It is not obligatory, but it is clearly simpler to write an SQL query which states "show me all events with the 'authfail' tag" than it is to parse the various strings (again) to search for SSH login events which contain particular error messages. If nothing else, this stores the results of the filter process, which avoids duplicating code and effort in two different languages. The set of tags collected from all filters in the stack which express an interest in the entry is de-duplicated and stored in the tags table in the database.
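For instance, the authfail query mentioned above might be built along these lines. Apart from the tags table named in the text, the table and column names here are assumptions about the schema, not the real BuzzSaw definitions.

```perl
# Illustrative query for "all events with the 'authfail' tag".
# Apart from the tags table mentioned in the text, the table and
# column names are assumptions for this sketch.
my $sql = <<'SQL';
SELECT event.*
  FROM event
  JOIN tags ON tags.event = event.id
 WHERE tags.name = ?
SQL

# With DBI this would be run as something like (untested sketch):
#   my $sth = $dbh->prepare($sql);
#   $sth->execute('authfail');
print $sql;
```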
Here is a slightly more involved version of the previous example which shows how to add a simple tag (named segfault) when the event message contains the word segfault.
package BuzzSaw::Filter::Kernel;
use Moose;

with 'BuzzSaw::Filter';

sub check {
    my ( $self, $event, $votes, $results ) = @_;

    my $accept = 0;
    my @tags;

    if ( exists $event->{program} && $event->{program} eq 'kernel' ) {
        push @tags, 'kernel';
        $accept = 1;

        if ( $event->{message} =~ m/segfault/o ) {
            push @tags, 'segfault';
        }
    }

    return ( $accept, @tags );
}

1;
As mentioned previously, it is also possible to attach extra information to a log entry which is going to be stored. This is done via the extra_info hash element, which is a reference to a simple Perl hash of keys and string values. For example, the SSH filter uses this approach to store the source address for each SSH login event. These keys and values will be stored in the extra_info table in the database. Extra information can be specified like this:
$event->{extra_info}{source_address} = '10.0.0.0';
$event->{extra_info}{auth_method}    = 'password';
Note that, for data-protection reasons, the stored log messages are anonymised after a certain period of time. The tag data is assumed to be safe and is kept in full. Currently all other extra information is considered to be risky and is deleted when an event is anonymised, so do not rely on the extra information being available for long-term statistical analysis.
It is very tempting, for the sake of speed and simplicity, to write a filter which just declares an interest in every event with the correct program string. In a few cases this might be the right thing to do, but more often it is better to do further filtering based on the message to see whether it really is of genuine interest. BuzzSaw is designed to store only events of real interest; filling the database with events you will never subsequently examine adds noise to the stored data, makes processing and reporting take longer, and is generally rather pointless. For example, a typical syslog can contain hundreds of varied entries related to the kernel, most of which are of little consequence; we are likely to be interested only in serious issues such as panics, oopses and out-of-memory conditions. It is also worth noting that, in general, any program can insert a syslog entry containing any information it likes, so you should never completely trust the data.
If BuzzSaw is being used to process logs daily on a central server then these filter methods could potentially be called hundreds of thousands of times, so speed is of the essence. It is worth spending a little time considering whether you can achieve your goals with simple string-equality checks (e.g. is the program string equal to "kernel"?) rather than regular expressions. Where regular expressions are required, it is best to use the /o modifier to ensure each one is compiled only once; it is also well worth declaring the regular expressions globally using the qr// operator. The SSH and Kernel filters shipped as part of the BuzzSaw package are good guides to best practice.
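The advice above can be sketched as follows; the regexp and the check_fast() helper are illustrative, not part of BuzzSaw itself.

```perl
# Compile the regular expression once, at load time, using qr//;
# inside a match the /o modifier achieves a similar effect, but a
# package-level qr// makes the intent explicit.
my $segfault_re = qr/\bsegfault\b/;

# Illustrative helper: cheap checks first, regexps last.
sub check_fast {
    my ($event) = @_;

    # Cheap string-equality test first ...
    return 0 if !exists $event->{program}
             || $event->{program} ne 'kernel';

    # ... and only then the (more expensive) regexp match.
    return $event->{message} =~ $segfault_re ? 1 : 0;
}
```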