TESTING WITH BAYES
------------------

Dan said: "I think we need guidelines on how to train and mass-check
Bayes using our spam and non-spam corpuses.  Maybe you could check
something in? *nudge*".  OK then!

If you're testing Bayes, or collating results on a change to the
algorithms, please try to stick to these guidelines:

  - train with at least 1000 spam and 1000 ham messages

  - try to use at least as many ham mails as spam mails

  - use mail from your own mail feed, not public corpora, if possible;
    many of the important signs are taken from headers and are
    specific to you and your systems

  - try to train with older messages, and test with newer, if possible

  - as with the conventional "mass-check" runs, avoiding spam over 6
    months old is a good idea, since older spam uses old techniques
    that are no longer seen in the wild

  - DO NOT test with any of the messages you trained with.  This will
    produce over-inflated success rates.

These are just guidelines (well, apart from the last one), so they can
be bent slightly if needs be ;)


A SAMPLE LOG OF A BAYES 10FCV RUN
---------------------------------

First, I made the corpus to test with:

  mkdir ch ; cp ~/Mail/deld/10* ch
  mkdir cs ; cp ....spam... cs

This is simply one-file-per-message, RFC-2822 format, as usual.

Now, set the SADIR env var to where your SpamAssassin source tree can
be found:

  export SADIR=/home/jm/ftp/spamassassin

Then split the test corpus into folds:

  mkdir -p cor/ham cor/spam
  $SADIR/tools/split_corpora -n 10 -p cor/ham/bucket ch
  $SADIR/tools/split_corpora -n 10 -p cor/spam/bucket cs

That reads the messages from "ch" and "cs" and generates mboxes, each
containing a 10% fold, as "cor/ham/bucket{1,2,3,4,5,6,7,8,9,10}" and
likewise under "cor/spam".

I then created a set of items I wanted to test:

  mkdir testdir
  mkdir testdir/{base,bug3118} [...etc.]
  cp ~/ftp/spamassassin/lib/Mail/SpamAssassin/Bayes.pm testdir/base/Bayes.pm
  cp ~/ftp/spamassassin/lib/Mail/SpamAssassin/Bayes.pm testdir/bug3118/Bayes.pm

In other words, I created a directory for each test and copied
Bayes.pm into each one.  I then edited the "Bayes.pm" files in the
testdirs to enable whatever tweaks I wanted to test.  "base" remains
the same as current SVN, however, so it acts as a baseline.

Finally, I ran the driver script:

  sh -x $SADIR/masses/bayes-testing/run-multiple testdir/*

That takes a long time, running through the dirs and doing a 10-fold
cross-validation (10FCV) for each one.
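Conceptually, each fold of that cross-validation trains on nine of the
ten buckets and tests on the one held out.  Here is a minimal sketch
of a single fold, assuming a standard sa-learn install ("run-multiple"
automates the real thing; the "bayes-tmp" db path and the use of
mass-check here are purely illustrative):

  # Sketch only: train a throwaway Bayes db on buckets 2-10,
  # holding out bucket1 as the test fold.
  rm -rf bayes-tmp ; mkdir bayes-tmp
  for j in 2 3 4 5 6 7 8 9 10 ; do
    sa-learn --dbpath bayes-tmp --mbox --ham  cor/ham/bucket$j
    sa-learn --dbpath bayes-tmp --mbox --spam cor/spam/bucket$j
  done
  # ...then score cor/ham/bucket1 and cor/spam/bucket1 against that
  # db (e.g. with masses/mass-check) to produce the per-bucket logs.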
The results are written to each test dir in a new directory,
"results", which looks like this:

  : jm 1204...; ls -l base/results/
  total 7028
  drwxrwxr-x  2 jm jm    4096 Mar 12 02:41 bucket1
  drwxrwxr-x  2 jm jm    4096 Mar 12 03:21 bucket10
  drwxrwxr-x  2 jm jm    4096 Mar 12 02:46 bucket2
  drwxrwxr-x  2 jm jm    4096 Mar 12 02:50 bucket3
  drwxrwxr-x  2 jm jm    4096 Mar 12 02:54 bucket4
  drwxrwxr-x  2 jm jm    4096 Mar 12 02:59 bucket5
  drwxrwxr-x  2 jm jm    4096 Mar 12 03:03 bucket6
  drwxrwxr-x  2 jm jm    4096 Mar 12 03:08 bucket7
  drwxrwxr-x  2 jm jm    4096 Mar 12 03:12 bucket8
  drwxrwxr-x  2 jm jm    4096 Mar 12 03:17 bucket9
  drwxrwxr-x  4 jm jm    4096 Mar 12 03:17 config
  -rw-rw-r--  1 jm jm    1401 Mar 12 03:21 hist_all
  -rw-rw-r--  1 jm jm 4424927 Mar 12 03:21 nonspam_all.log
  -rw-rw-r--  1 jm jm 2596942 Mar 12 03:21 spam_all.log
  -rw-rw-r--  1 jm jm   86338 Mar 12 03:21 test.log
  -rw-rw-r--  1 jm jm    1322 Mar 12 12:03 thresholds.static
  -rw-rw-r--  1 jm jm    3192 Mar 12 03:21 thresholds_all

The important items are:

  - thresholds.static: FP/FN/Unsure counts of the Bayes score
    distribution across all messages.  See "THRESHOLDS SCRIPT" below.

  - hist_all: an ASCII-art histogram of the Bayes score distribution
    across all messages.  Good for viewing differences at a glance;
    however, nowadays our tweaks all have much less effect than the
    "big ones" like hapax use or case-sensitivity did, so it's not as
    useful as it once was.  See "THE HISTOGRAM" below.

  - thresholds_all: a version of the thresholds output that is
    optimized for the lowest "cost" figure; basically it searches the
    entire score distribution for the optimal thresholds.  Nowadays we
    have chosen static thresholds that work OK, so this isn't much use
    any more.

  - the "bucket*" dirs, "nonspam_all.log" and "spam_all.log" can be
    ignored unless you need to dig into why a run didn't work the way
    you expected it to; they are there for debugging, basically.

"thresholds.static" is by far the most important, containing the
FP/FN figures for various points on the score distribution.  That's
what needs to be used to compare different Bayes tweaks.


THRESHOLDS SCRIPT
-----------------

The "thresholds" script is an emulation of the spambayes testing
methodology: it computes ham/spam hits across a corpus for each
algorithm; then, by dividing those hits into FPs, FNs and "unsure"s,
and attaching a "cost" to each of those, it computes the optimum spam
and ham cutoff points.  (It also outputs TCRs.)

Sample output:

  Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$804.50
  Total ham:spam:  39987:23337
  FP:        3   0.008%
  FN:      360   1.543%
  Unsure: 4145   6.546%  (ham: 193 0.483%  spam: 3952 16.934%)
  TCRs: l=1 5.408  l=5 5.393  l=9 5.378

BTW, the idea of cutoffs is a spambayes one; the range

  0.0 .......... ham_cutoff ........ spam_cutoff ......... 1.0

maps to

  MAIL IS HAM          UNSURE              MAIL IS SPAM

SpamAssassin is more sophisticated, turning the Bayes value into
scores across a range of [ -4.0, 4.0 ]; however, the "unsure" value
still provides a good way to visualise the shape of the graph, even if
we don't use the same scoring system.

The important thing for our tests is that the threshold results,
together with the histograms, give a good picture of how the algorithm
scatters the results across the table.  Ideally, we want:

  - all ham clustered around 0.0
  - all spam clustered around 1.0
  - as little ham and spam as possible in the "unsure" middle ground

So the best algorithms are the ones that come closest to this ideal;
in terms of these results, that means the pecking order for good
results, strongest indicators first, is:

  - a low cost figure
  - low FPs
  - low FNs
  - low unsures
  - a large difference between the two thresholds

We can then tweak the threshold-to-SpamAssassin-score mapping so that
we maximise the output of the Bayes rules in SpamAssassin score terms,
by matching our score ranges to the ham_cutoff and spam_cutoff points.
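To make the "cost" figure concrete: the sample output above is exactly
reproduced by the usual spambayes cost weightings ($10 per FP, $1 per
FN, $0.10 per unsure -- an assumption about the script's exact
weights, but the arithmetic checks out):

  cost = (3 x $10.00) + (360 x $1.00) + (4145 x $0.10)
       = $30.00 + $360.00 + $414.50
       = $804.50

Likewise, the TCR lines are consistent with counting spam "unsure"s as
misses:

  TCR(l) = Nspam / (l*FP + FN + spam_unsures)

  e.g. for l=1:  23337 / (1*3 + 360 + 3952) = 5.408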
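Once two or more test dirs have finished, the quickest comparison is
to put their static-threshold figures side by side; for the sample run
above, something like:

  diff -u testdir/base/results/thresholds.static \
          testdir/bug3118/results/thresholds.static

Lower cost, FP, FN and unsure figures in the tweaked dir's output
indicate a win over the baseline.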
THE HISTOGRAM
-------------

A histogram from 'draw-bayes-histogram' looks like this:

  SCORE NUMHIT    DETAIL    OVERALL HISTOGRAM (. = ham, # = spam)
  0.000 (99.047%) ..........|.......................................................
  0.000 ( 0.977%) ##########|#
  0.040 ( 0.145%) ..        |
  0.040 ( 0.141%) ##        |
  0.080 ( 0.113%) .         |
  0.080 ( 0.056%) #         |
  0.120 ( 0.065%) .         |
  0.120 ( 0.069%) #         |
  0.160 ( 0.060%) .         |
  0.160 ( 0.086%) #         |
  0.200 ( 0.040%)           |
  0.200 ( 0.111%) ##        |
  0.240 ( 0.043%)           |
  0.240 ( 0.103%) ##        |
  0.280 ( 0.030%)           |
  0.280 ( 0.090%) #         |
  0.320 ( 0.050%) .         |
  0.320 ( 0.167%) ###       |
  0.360 ( 0.055%) .         |
  0.360 ( 0.184%) ###       |
  0.400 ( 0.048%) .         |
  0.400 ( 0.184%) ###       |
  0.440 ( 0.085%) .         |
  0.440 ( 0.548%) ########  |
  0.480 ( 0.195%) ..        |
  0.480 ( 9.860%) ##########|#######
  0.520 ( 0.010%)           |
  0.520 ( 2.031%) ##########|##
  0.560 ( 0.005%)           |
  0.560 ( 1.268%) ##########|#
  0.600 ( 0.003%)           |
  0.600 ( 1.157%) ##########|#
  0.640 ( 0.990%) ##########|#
  0.680 ( 0.005%)           |
  0.680 ( 1.011%) ##########|#
  0.720 ( 0.947%) ##########|#
  0.760 ( 1.033%) ##########|#
  0.800 ( 1.123%) ##########|#
  0.840 ( 1.307%) ##########|#
  0.880 ( 1.607%) ##########|#
  0.920 ( 2.554%) ##########|##
  0.960 ( 0.003%)           |
  0.960 (72.396%) ##########|#######################################################

The format of each line is:

  GROUP (PCT%) ZOOM | FULL

"GROUP" is the part of the [ 0.0, 1.0 ] range that the mails fall
into, and "PCT%" is the percentage of the corpus that fell into that
range.  "FULL" is the scaled histogram of the number of messages, so
you can see at a glance what the proportions look like; "ZOOM" is a
zoomed-in view of the very bottom of the histogram, magnified by a
factor of 10, for closer inspection.
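For example, take the two "0.480" lines above: 0.195% of the ham and
9.860% of the spam scored in that bucket.  The spam registers in both
columns, with a short FULL bar roughly in proportion to the dominant
72.396% bucket, but the ham count is far too small to show up in FULL
at all; only the ".." in the ZOOM column reveals it.  Spotting that
kind of barely-visible ham in the middle of the range is exactly what
the zoomed view is for.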