Fast SpamAssassin Score Learning Tool

Henry Stern
Faculty of Computer Science, Dalhousie University
6050 University Avenue, Halifax, NS, Canada B3H 1W5
henry@stern.ca

January 8, 2004


1. WHAT IS IT?

This program is used to compute scores for SpamAssassin rules.  It makes
use of data files generated by the suite of scripts in
spamassassin/masses.  The program writes the generated scores to a file
titled 'perceptron.scores'.

The advantage of this program over the genetic algorithm (GA)
implementation in spamassassin/masses/craig_evolve.c is speed: while the
GA requires several hours to run on high-end machines, the perceptron
requires only about 15 seconds of CPU time on an Athlon XP 1700+ system.
This makes incremental updates and score personalization practical for
the end-user and gives developers a better idea of just how useful a new
rule is.


2. OPTIONS

There are four options that can be passed to the perceptron program.

-p ham_preference

    This increases the number of non-spam messages in the training set.
    It does this by adding 1 + (number of tests hit) * ham_preference
    instances of non-spam messages to the training set.  This is
    intended to reduce false positives by encouraging the training
    program to look at the harder-to-classify ham messages more often.
    By default, this parameter is 2.0.

-e num_epochs

    This parameter sets how many passes the perceptron will make through
    the training set before terminating.  On each pass, the training set
    is shuffled and then iterated through.  By default, it will make 15
    passes.

-l learning_rate

    This parameter sets the learning rate of the perceptron.  The error
    gradient is computed for each instance.  This program uses a logsig
    activation function, y = 1 / (1 + exp(-x)), so the error gradient is
    computed as E(x) = y * (1 - y) * (is_spam - y).  For each instance
    and score hit in the training set, the scores are modified by adding
    E(x) / (number of tests hit + 1) * learning_rate.  The default value
    is 2.0, but it can be whatever you want.

-w weight_decay

    To prevent the scores from getting too high (or even to force them
    down, if you want), before each epoch the scores and network bias
    are multiplied by the weight decay.  This is off by default (1.0),
    but it can be useful.


3. HOW DOES IT WORK?

This program implements the "stochastic gradient descent" method of
training a neural network.  It uses a single perceptron with a logsig
activation function and maps the weights to SpamAssassin score space.

The perceptron is the simplest form of neural network.  It consists of a
transfer function and an activation function.  Together, they simulate
the average firing rate of a biological neuron over time.

The transfer function is the sum of the products of the weights and
inputs.  It simulates the membrane potential of a neuron.  When the
accumulated charge on the membrane exceeds a certain threshold, the
neuron fires, sending an electrical impulse down the axon.  This
implementation uses a linear transfer function:

    [1]  f(x) = bias + sum_{i=1}^{n} w_i * x_i

This is quite similar to how the SpamAssassin score system works.  If
you set the bias to 0, w_i to the score for rule i and x_i to whether or
not rule i is activated by a given message, the transfer function
returns the score of the message.
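To make this correspondence concrete, here is a minimal C sketch of the
linear transfer function [1] applied to one message's rule hits.  The
names and values below are purely illustrative and are not taken from
perceptron.c or the masses data files.

    #include <stdio.h>

    /* Equation [1]: f(x) = bias + sum_i w_i * x_i.  With the bias set
     * to 0 and the weights set to rule scores, this is exactly the
     * SpamAssassin score of the message. */
    static double transfer(double bias, const double *weights,
                           const int *hits, int num_rules)
    {
        double sum = bias;
        for (int i = 0; i < num_rules; i++)
            sum += weights[i] * hits[i];  /* x_i is 1 if rule i fired, else 0 */
        return sum;
    }

    int main(void)
    {
        double scores[] = { 2.5, 0.5, 1.1 };  /* hypothetical rule scores */
        int hits[]      = { 1, 0, 1 };        /* rules 0 and 2 fired */
        printf("message score: %.1f\n", transfer(0.0, scores, hits, 3));
        return 0;
    }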
The activation function simulates the electrical spike travelling down
the axon.  It takes the output of the transfer function as input and
applies some sort of transformation to it.  This implementation uses a
logsig activation function:

    [2]  y(x) = 1 / (1 + exp(-f(x)))

This non-linear function constrains the output of the transfer function
to between 0 and 1.  When plotted, it looks somewhat S-shaped, with
horizontal asymptotes at 0 as x approaches -infinity and at 1 as x
approaches infinity.  The slope of the function is greatest at x = 0 and
tapers off as it approaches the asymptotes.

Lastly, the performance of the perceptron needs to be measured using an
error function.  Two error functions are commonly used: mean squared
error and entropic error.  By default, this implementation uses mean
squared error, but entropic error may be substituted by adding a
compiler directive.

The most common method of training neural networks is called gradient
descent.  It involves iteratively tuning the parameters of the network
so that the mean error always decreases.  This is done by finding the
direction of steepest descent down the "error gradient", reducing the
value of the error function for the next iteration of the algorithm.  If
the transfer function and activation function are both differentiable,
the error gradient of a neural network can be calculated with respect to
the weights and bias.  Without getting into calculus, the error gradient
for a perceptron with a linear transfer function, logsig activation
function and mean squared error function is:

    [3]  E(x) = y(x) * (1 - y(x)) * (y_expected - y(x))

The weights are updated using the function:

    [4]  w_i = w_i + E(x) * x_i * learning_rate

Since the SpamAssassin rule hits are sparse, the basic gradient descent
algorithm is impractical.  This implementation uses a variation called
"stochastic gradient descent".  Instead of doing one batch update per
epoch, the training set is randomly walked through, doing incremental
updates.  In addition, in this implementation, the learning rate is
scaled by the number of rule hits for a given training instance.
Together, these allow good weights to be computed for
infrequently-occurring rules.

Once the perceptron has finished running, the weights are converted to
scores and exported using the familiar file format.  Weights are
converted to scores using this function:

    [5]  score(weight) = -threshold * weight / bias


4. ACKNOWLEDGEMENTS

I would like to thank my PhD supervisor, Michael Shepherd, for not
getting mad at me while I worked on this.  I'd also like to thank Thomas
Trappenberg for his invaluable assistance while I was tweaking the
performance of the learning algorithm.  I would like to thank Daniel
Quinlan, Justin Mason and Theo Van Dinter for their valuable input and
constructive criticism.

-- hs 8/1/2004
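APPENDIX: SKETCH OF ONE STOCHASTIC UPDATE

For readers who prefer code to equations, the C fragment below sketches
a single training-instance update as described in section 3: the forward
pass of equations [1] and [2], the error gradient of equation [3], the
weight update of equation [4] scaled by 1 / (number of tests hit + 1) as
described under -l, and the score conversion of equation [5].  All names
are illustrative and do not match the variables or functions used in
perceptron.c; the bias is updated in the standard way, with its input
treated as a constant 1.

    #include <math.h>

    /* Logsig activation, equation [2]. */
    static double logsig(double x)
    {
        return 1.0 / (1.0 + exp(-x));
    }

    /*
     * One stochastic gradient descent update for a single training
     * instance.  hits[] holds the indices of the rules the message
     * triggered, num_hits is how many there are, and is_spam is 1 for
     * spam and 0 for ham.
     */
    static void update(double *weights, double *bias,
                       const int *hits, int num_hits,
                       int is_spam, double learning_rate)
    {
        /* Transfer function [1], summing only the rules that fired. */
        double f = *bias;
        for (int i = 0; i < num_hits; i++)
            f += weights[hits[i]];

        double y = logsig(f);

        /* Error gradient, equation [3]. */
        double e = y * (1.0 - y) * ((double)is_spam - y);

        /* Per-instance learning rate, as described under -l. */
        double rate = learning_rate / (num_hits + 1);

        /* Weight update, equation [4]; x_i is 1 for every rule that
         * hit and 0 otherwise, so only the hit rules change. */
        for (int i = 0; i < num_hits; i++)
            weights[hits[i]] += e * rate;
        *bias += e * rate;
    }

    /* Weight-to-score conversion, equation [5]. */
    static double weight_to_score(double weight, double bias,
                                  double threshold)
    {
        return -threshold * weight / bias;
    }

After the final epoch, weight_to_score() would be applied to each weight
(with the threshold set to the SpamAssassin spam threshold, 5.0 by
default) to produce the values written to 'perceptron.scores'.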