RESCORING SURVEY: HOW TO TAKE PART ---------------------------------- The tools in this directory are used to optimise the scoring system used for incoming mails, using a genetic algorithm to search for optimal values. Since this works best with a very large dataset, it would be *great* if you (as a user) could run this and submit the results. The analysis script will not include text from the mails themselves, so it will not give away private details from your mail spool. The only details you'll give away will be your email address (and I promise *NEVER* to give that out or use it for spammy stuff) -- and how many mails you have sitting around in folders! CONDITIONS ---------- 1. First of all, you must be running it on a UNIX system; it's not portable to other OSes yet. Also currently it only reads UNIX mailbox format files, or MH spool directories. 2. This will not work unless you have separated the mail messages you'll be analysing into separate "spam" and "non-spam" piles. It doesn't matter how many mailboxes contain spam, or how many mailboxes contain non-spam; you just need to be sure you know which set is which! The latter is most important. If you have occasional spams scattered through your mailboxes, or occasional non-spam messages in your trapped spam folder, the analysis will be useless. See the CORPUS_POLICY file for more details. HOW TO SUBMIT RESULTS BACK TO US -------------------------------- See the file CORPUS_SUBMIT in this directory. HOW IT WORKS ------------ If you're interested, here's a quick description of the rest of the stuff in this directory and what they do: mass-check : This script is used to perform "mass checks" of a set of mailboxes, Cyrus folders, and/or MH mail spools. It generates summary lines like this: Y 7 /home/jm/Mail/Sapm/1382 SUBJ_ALL_CAPS,SUPERLONG_LINE,SUBJ_FULL_OF_8BITS or for mailboxes, . 1 /path/to/mbox:<5.1.0.14.2.20011004073932.05f4fd28@localhost> TRACKER_ID,BALANCE_FOR_LONG listing the path to the message or its message ID, its score, and the tests that triggered on that mail. Using this info, and a score optimization tool, I can figure out which tests get good hits with few false positives, etc., and re-score the tests to optimise the ratio. This script relies on the spamassassin distribution directory living in "..". logs-to-c : Takes the "spam.log" and "nonspam.log" files and converts them into C source files and simplified data files for use by the C score optimization algorithm. (Called by "make" when you build the perceptron, so generally you won't need to run it yourself.) hit-frequencies : Analyses the log files and computes how often each test hits, overall, for spam mails and for non-spam. mk-baseline-results : Compute results for the baseline scores (read from ../rules/*). If you provide the name of a config directory as the first argument, it'll use that instead. It will output statistics on the current ruleset to ../rules/STATISTICS.txt, suitable for a release build of SpamAssassin. perceptron.c : Perceptron learner by Henry Stern. See "README.perceptron" for details. -- EOF