Fast SpamAssassin Score Learning Tool
Henry Stern
Faculty of Computer Science
Dalhousie University
6050 University Avenue
Halifax, NS Canada
B3H 1W5
henry@stern.ca
January 8, 2004
1. WHAT IS IT?
This program is used to compute scores for SpamAssassin rules. It makes
use of data files generated by the suite of scripts in
spamassassin/masses. The program outputs the generated scores in a file
titled 'perceptron.scores'.
The advantage of this program over the genetic algorithm (GA)
implementation in spamassassin/masses/craig_evolve.c is that while the GA
requires several hours to run on high-end machines, the perceptron
requires only about 15 seconds of CPU time on an Athlon XP 1700+ system.
This makes incremental updates and score personalization practical for the
end-user and gives developers a better idea of just how useful a new rule is.
2. OPTIONS
There are four options that can be passed to the perceptron program.
-p ham_preference
This increases the number of non-spam messages in the training
set. It does this by adding 1 + (number of tests hit) *
ham_preference instances of non-spam messages to the training set.
This is intended to reduce false positives by encouraging the
training program to look at the harder-to-classify ham messages
more often. By default, this parameter is 2.0.
-e num_epochs
This parameter sets how many passes the perceptron will make
through the training set before terminating. On each pass, the
training set is shuffled and then iterated through. By default,
it will make 15 passes.
-l learning_rate
This parameter modifies the learning rate of the perceptron. The
error gradient is computed for each instance. This program uses a
logsig activation function y = (1/(1+exp(-x))), so the error
gradient is computed as E(x) = y*(1-y)*(is_spam - y). For
each instance and score hit in the training set, the scores are
modified by adding E(x) / (number of tests hit + 1) *
learning_rate. The default value for this is 2.0, but it can be
whatever you want.
-w weight_decay
To keep the scores from growing too large (or to actively shrink them,
if you want), the scores and network bias are multiplied by the weight
decay before each epoch. This is disabled by default (a value of 1.0),
but it can be useful.
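As a concrete illustration of the -p option, here is a minimal Python
sketch of the oversampling it describes. The data structures (a list of
(is_spam, tests_hit) pairs) are illustrative, not the program's actual
internals.

```python
def build_training_set(messages, ham_preference=2.0):
    """messages: list of (is_spam, tests_hit) pairs.

    Each ham message is added 1 + (number of tests hit) * ham_preference
    times, so hams that hit many rules (the harder-to-classify ones) are
    seen more often. Fractional counts are truncated here; the real
    program may round differently.
    """
    training_set = []
    for is_spam, tests_hit in messages:
        if is_spam:
            copies = 1
        else:
            copies = 1 + int(len(tests_hit) * ham_preference)
        training_set.extend([(is_spam, tests_hit)] * copies)
    return training_set
```

With the default ham_preference of 2.0, a ham message hitting two rules
is added 1 + 2 * 2.0 = 5 times, while every spam message is added once.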
3. HOW DOES IT WORK?
This program implements the "Stochastic Gradient Descent" method of
training a neural network. It uses a single perceptron with a logsig
activation function and maps the weights to SpamAssassin score space.
The perceptron is the simplest form of neural network. It consists of a
transfer function and an activation function. Together, they simulate the
average firing rate of a biological neuron over time.
The transfer function is the sum of the product of weights and inputs.
It simulates the membrane potential of a neuron. When the accumulated
charge on the membrane exceeds a certain threshold, the neuron fires,
sending an electrical impulse down the axon. This implementation uses a
linear transfer function:
[1] f(x) = bias + sum_{i=1}^{n} w_i * x_i
This is quite similar to how the SpamAssassin score system works. If you
set the bias to be 0, w_i to be the score for rule i and x_i to be whether
or not rule i is activated by a given message, the transfer function will
return the score of the message.
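Restating that correspondence in code (a sketch; the real program stores
weights in C arrays rather than a dictionary):

```python
def message_score(scores, tests_hit, bias=0.0):
    # Equation [1] with bias = 0: summing w_i over the rules a message
    # hits (x_i = 1 for hit rules, 0 otherwise) yields the familiar
    # SpamAssassin score of the message.
    return bias + sum(scores[rule] for rule in tests_hit)
```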
The activation function simulates the electrical spike travelling down the
axon. It takes the output of the transfer function as input and applies
some sort of transformation to it. This implementation uses a logsig
activation function:
[2] y(x) = 1 / (1 + exp(-f(x)))
This non-linear function constrains the output of the transfer function
to between 0 and 1. When plotted, it is S-shaped, with horizontal
asymptotes at y=0 as x approaches -infinity and y=1 as x approaches
infinity. The slope of the function is greatest at x=0 and tapers off as
it approaches the asymptotes.
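The shape claims above are easy to verify numerically (a sketch):

```python
import math

def logsig(x):
    # Equation [2], applied to the transfer function's output
    return 1.0 / (1.0 + math.exp(-x))

# Bounded between 0 and 1, crossing 0.5 at x = 0:
assert logsig(0.0) == 0.5
assert logsig(-20.0) < 1e-8           # near the y = 0 asymptote
assert logsig(20.0) > 1.0 - 1e-8      # near the y = 1 asymptote

# The slope is greatest at x = 0 and tapers off toward the asymptotes:
assert logsig(1.0) - logsig(0.0) > logsig(2.0) - logsig(1.0)
```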
Lastly, the performance of the perceptron needs to be measured using an
error function. Two error functions are commonly used: mean squared
error and entropic error. By default, this implementation uses mean
squared error but entropic error may be substituted by adding a compiler
directive.
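For reference, the two error functions have the following standard forms
for a single training instance with target t = y_expected (assuming
"entropic error" means the usual cross-entropy; this document does not
write it out). The 1/2 in the mean squared error is what makes the
gradient in equation [3] below come out without a stray factor of 2.

```latex
E_{mse}(x) = \tfrac{1}{2}\,\bigl(t - y(x)\bigr)^2

E_{ent}(x) = -\,t \ln y(x) \;-\; (1 - t)\,\ln\bigl(1 - y(x)\bigr)
```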
The most common method of training neural networks is called gradient
descent. It involves iteratively tuning the parameters of the network so
that the mean error rate always decreases. This is done by finding the
direction of steepest descent down the "error gradient," reducing the
value of the error function for the next iteration of the algorithm.
If the transfer function and activation function are both differentiable,
the error gradient of a neural network can be calculated with respect to
the weights and bias. Without getting into calculus, the error gradient
for a perceptron with a linear transfer function, logsig activation
function and mean squared error function is:
[3] E(x) = y(x) * (1-y(x)) * (y_expected - y(x))
The weights are updated using the function:
[4] w_i = w_i + E(x) * x_i * learning_rate
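For readers who do want the calculus, equations [3] and [4] follow from
the chain rule applied to the mean squared error, using the logsig
derivative dy/df = y(1-y):

```latex
E_{mse} = \tfrac{1}{2}\,\bigl(y_{expected} - y(x)\bigr)^2

\frac{\partial E_{mse}}{\partial w_i}
  = -\bigl(y_{expected} - y(x)\bigr)\,y(x)\,\bigl(1 - y(x)\bigr)\,x_i
  = -E(x)\,x_i

\Delta w_i = -\text{learning\_rate} \cdot \frac{\partial E_{mse}}{\partial w_i}
           = \text{learning\_rate} \cdot E(x)\,x_i
```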
Since the SpamAssassin rule hits are sparse, the basic gradient descent
algorithm is impractical. This implementation uses a variation called
"stochastic gradient descent." Instead of doing one batch update per
epoch, the training set is walked through in random order, doing
incremental updates. In addition, in this implementation, the learning
rate is scaled down by the number of rule hits for a given training
instance.
Together, these allow good weights to be computed for
infrequently-occurring rules.
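Putting sections 2 and 3 together, the whole training loop can be
sketched as follows. This is a Python illustration of the described
algorithm, not the program's actual C implementation; the bias update
and the (number of tests hit + 1) scaling follow the descriptions of the
-l option above.

```python
import math
import random

def logsig(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(training_set, num_tests, num_epochs=15, learning_rate=2.0,
          weight_decay=1.0, seed=0):
    """Stochastic gradient descent over (is_spam, tests_hit) pairs.

    is_spam is 1.0 or 0.0; tests_hit is a list of rule indices.
    Note: shuffles the caller's training_set list in place.
    """
    rng = random.Random(seed)
    weights = [0.0] * num_tests
    bias = 0.0
    for _ in range(num_epochs):
        # -w: decay weights and bias before each epoch (1.0 = no decay)
        weights = [w * weight_decay for w in weights]
        bias *= weight_decay
        # -e: one pass per epoch, shuffled, with incremental updates
        rng.shuffle(training_set)
        for is_spam, tests_hit in training_set:
            y = logsig(bias + sum(weights[t] for t in tests_hit))
            e = y * (1.0 - y) * (is_spam - y)           # equation [3]
            step = e * learning_rate / (len(tests_hit) + 1)
            for t in tests_hit:
                weights[t] += step                      # equation [4], scaled
            bias += step
    return weights, bias
```

On a toy corpus where rule 0 only fires on spam and rule 1 only on ham,
the loop drives weight 0 positive and weight 1 negative, as expected.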
Once the perceptron has finished running, the weights are converted to
scores and exported using the familiar file format. Weights are converted
to scores using this function:
[5] score(weight) = -threshold * weight / bias
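A sketch of that conversion. The default threshold of 5.0 below is an
assumption (SpamAssassin's usual required_score), not something this
document specifies:

```python
def weights_to_scores(weights, bias, threshold=5.0):
    # Equation [5].  The scaling is chosen so that a hit whose weight
    # equals -bias maps to exactly the spam threshold:
    #   score(-bias) = -threshold * (-bias) / bias = threshold
    # threshold = 5.0 is an illustrative default, not from this README.
    return [-threshold * w / bias for w in weights]
```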
4. ACKNOWLEDGEMENTS
I would like to thank my PhD supervisor, Michael Shepherd, for not getting
mad at me while I worked on this. I'd also like to thank Thomas
Trappenberg for his invaluable assistance while I was tweaking the
performance of the learning algorithm. I would like to thank Daniel
Quinlan, Justin Mason and Theo Van Dinter for their valuable input and
constructive criticism.
--
hs
8/1/2004