TESTING WITH BAYES
------------------

Dan said: "I think we need guidelines on how to train and mass-check
Bayes using our spam and non-spam corpuses.  Maybe you could check
something in?  *nudge*".  OK then!

If you're testing Bayes, or collating results on a change to the
algorithms, please try to stick to these guidelines:

- Train with at least 1000 spam and 1000 ham messages.

- Try to use at least as many ham mails as spam mails.

- Use mail from your own mail feed, not public corpora, if possible.
  Many of the important signs are taken from headers and are specific
  to you and your systems.

- Try to train with older messages, and test with newer ones, if
  possible.

- As with the conventional "mass-check" runs, avoiding spam over 6
  months old is a good idea; older spam uses old techniques that are
  no longer seen in the wild.

- DO NOT test with any of the messages you trained with.  This will
  produce over-inflated success rates.

These are just guidelines (well, apart from the last one), so they can
be bent slightly if needs be ;)


ABOUT THE THRESHOLDS SCRIPT
---------------------------

The "thresholds" script is an emulation of the spambayes testing
methodology: it computes ham/spam hits across a corpus for each
algorithm, then, by dividing those hits into FPs, FNs, and "unsure"s,
and attaching a "cost" to each of those, it computes optimum spam and
ham cutoff points.  (It also outputs TCRs.)

By the way, the idea of cutoffs comes from spambayes; the range

    0.0 .......... ham_cutoff ........ spam_cutoff ......... 1.0

maps to

    MAIL IS HAM        UNSURE           MAIL IS SPAM

SpamAssassin can be more sophisticated, turning the Bayes value into
scores across a range of [ -4.0, 4.0 ].  However, the insight the
"unsure" band provides is still useful for visualising the shape of
the graph, even though we don't use the same scoring system.  The
important thing for our tests is that the threshold results, together
with the histograms, give a good picture of how the algorithm scatters
the results across the table.
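To make the bucketing concrete, here is a minimal sketch of how scores
are divided into ham/spam/unsure and costed.  The function names and
the cost weights are illustrative assumptions (the weights follow the
commonly cited spambayes convention of FP=10, FN=1, unsure=0.2); they
are not taken from the actual "thresholds" script.

```python
def bucket(score, ham_cutoff, spam_cutoff):
    """Map a Bayes probability in [0, 1] to a verdict using the two cutoffs."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"

def total_cost(ham_scores, spam_scores, ham_cutoff, spam_cutoff,
               fp_cost=10.0, fn_cost=1.0, unsure_cost=0.2):
    """Total cost for a pair of cutoffs; FPs (ham judged spam) weigh most."""
    cost = 0.0
    for s in ham_scores:
        v = bucket(s, ham_cutoff, spam_cutoff)
        if v == "spam":        # false positive
            cost += fp_cost
        elif v == "unsure":
            cost += unsure_cost
    for s in spam_scores:
        v = bucket(s, ham_cutoff, spam_cutoff)
        if v == "ham":         # false negative
            cost += fn_cost
        elif v == "unsure":
            cost += unsure_cost
    return cost
```

Evaluating this cost over a grid of (ham_cutoff, spam_cutoff) pairs is
what lets the script report optimum cutoff points.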
Ideally, we want:

- all ham clustered around 0.0
- all spam clustered around 1.0
- as little ham and spam as possible in the "unsure" middle ground

So the best algorithms are the ones that are closest to this ideal; in
terms of the results below, that means this is the pecking order for
good results, strongest indicators first:

- a low cost figure
- low FPs
- low FNs
- low unsures
- a large difference between the thresholds

We can then tweak the threshold-to-SpamAssassin-score mapping so that
we maximise the output of the Bayes rules in SpamAssassin score terms,
by matching our score ranges to the ham_cutoff and spam_cutoff points.
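The "optimum cutoff" search described above can be sketched as a
brute-force grid search minimising the total cost.  This is a
self-contained illustration, not the real script: the step size and
cost weights are assumptions, and the helper is hypothetical.

```python
def best_cutoffs(ham_scores, spam_scores, step=0.05,
                 fp_cost=10.0, fn_cost=1.0, unsure_cost=0.2):
    """Return the (ham_cutoff, spam_cutoff) pair minimising total cost."""
    def cost(lo, hi):
        c = 0.0
        for s in ham_scores:
            if s > hi:
                c += fp_cost        # ham judged spam: false positive
            elif s >= lo:
                c += unsure_cost    # ham stuck in the unsure band
        for s in spam_scores:
            if s < lo:
                c += fn_cost        # spam judged ham: false negative
            elif s <= hi:
                c += unsure_cost    # spam stuck in the unsure band
        return c

    # Candidate cutoffs on a regular grid over [0, 1], with lo <= hi.
    grid = [round(i * step, 10) for i in range(int(1 / step) + 1)]
    candidates = [(lo, hi) for lo in grid for hi in grid if lo <= hi]
    return min(candidates, key=lambda pair: cost(*pair))
```

With ham clustered near 0.0 and spam near 1.0 (the ideal above), the
search finds cutoffs that separate the two clusters with zero cost and
a wide unsure band available between them.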