# mgp pres.mgp
# mgp -D out -O -o -g 800x600 pres.mgp
#
%include "default.mgp"
%default 1 bgrad 0 0 256 20 0 "black" "darkgreen"
%page
#%bquality 10
%size 7, font "standard", fore "white", vgap 20
%center
Filtering Spam With
%image "ninjalogo.png" 256 150 150 1
%size 3
Justin Mason
http://spamassassin.org/
%page

Why Bother Filtering Spam?

	Seems to be an average of 30% to 50% of \
incoming mail traffic, and increasing; up 400% in last year
# sources: HiWaay.net, HotMail, Brightmail
	Users are forced to waste time wading through their inbox, which \
costs their employers money
# 100 people reading spam for 5 minutes/day = spam costs 80 quid / day
	Impossible to unsubscribe
# FTC says 63% of unsubscribe links do not work, or just confirm yr addr
	Legal retaliation not possible - yet; too much lobbying going on
# usual internet story -- no borders.  Also pressure from DMA
	Just plain irritating!
%page

Spam Volume Is Increasing

%center
%image "brightmail.png"
(data from Brightmail.com)
%page

The Downsides Of Filtering

%center
%image "eatmycomix.png"
(with thanks to eatmycomix.com)
%page

History Of Filtering

	procmail - traditional UNIX mail filter
		Identify spam senders and delete on sight; \
later, match patterns in "disguised" spam messages, \
especially in the headers
		Very hard to configure, for most people
	DNS Blacklists: Paul Vixie's MAPS RBL
		Identify spam sources by IP, and route them down the plughole
		Third parties set up new DNS blacklists later
		Not quite so reliable anymore; many false positives
# since IPs are frequently recycled, or ISP-wide mailservers get listed;
# happened to eircom.net's mailserver.
# Most DNSBLs don't even track their FP rate
%page

Origin of SpamAssassin

	Developed initially for personal use
# Spam "itched" me, to use the open-source phrase
	Initially added rules to an existing filter script, filter.plx
	License unclear, so rewritten from scratch under open \
source license
# Perl's Artistic License
%page

SpamAssassin Concepts

	Zero-configuration where possible
	Lots of rules to determine if a mail is spam or not (about 600)
	No single rule alone can mark a mail as spam
	"Fuzzy logic": rules are assigned scores, and \
scores are combined to produce an overall score for each message
# high-confidence rules get high scores and vice versa
	If over a user-defined threshold, the mail is judged as spam
%page

SpamAssassin Concepts, pt.2

	Combines many systems for a "broad-spectrum" approach:
		Detect forged headers
		Spam-tool signatures (Message-ID patterns etc.)
		Text keyword scanner in the message body
		DNS blacklists
		Razor, DCC (Distributed Checksum Clearinghouse), Pyzor
# Means spammers cannot simply aim to defeat one system alone, \
# as the others will still catch them out
%page

Digression: Razor, DCC and Pyzor

	All operate by posting a checksum of the message to a central \
server
	If checksum matches a reported spam mail, it's spam
	Razor: lots of good press, semi-commercial
# good press because Marc Andreessen (ex-Netscape) works for CloudMark
# Razor does not allow an organisation to run their own servers
	Pyzor: reimplementation of Razor in Python, as free software
	DCC: older than Razor, but has some philosophical issues
# not entirely oriented towards spam, just measures a mail's
# "bulkiness", i.e. how many people it was sent to
%page

Large-Scale Use of SpamAssassin

	Modifications to filter.plx were UNIX-only, single-user. \
Not very useful for a lot of people
# run from .forward, deliver to /var/spool/mail.
	SpamAssassin took this into account:
		Modular, clean, flexible design
		Open Source / free software license (Artistic)
		Released to CPAN (Comprehensive Perl Archive Network)
%page

Spamd

	"spamd" contributed by Craig Hughes:
		Client-server interface to SpamAssassin, over TCP
		Pure-C client; much faster than forking perl interpreter
		Simple TCP interface
# Easy for anyone now to plug into SpamAssassin, without even \
# using perl: just open a socket, write a mail message to it, \
# and read either the results, or a rewritten mail, back!
# Recent versions make libspamc.so, shared-library version. \
# Just call a C function and receive spam report back
%page

Integration of SpamAssassin

	Now about 20 different integrations that I know of
	Integration into MTAs: sendmail "milters", qmail-scanner, \
Exim, Postfix
	Integration into virus-scanner MTA plugins: MIMEDefang, \
amavisd-new
	IMAP/POP proxies and clients
	Plug-ins for Windows clients: MS Outlook, Eudora
	Also unofficial hacks adding SpamAssassin filtering into \
mailing list software etc.
	Plus the usual mail filters for "normal" UNIX users: procmail, \
mailfilter, Mail::Audit plugin etc.
%page

Accuracy: Evolve A Better Filter

	SpamAssassin's scores are assigned using a Genetic Algorithm:
		Given a big corpus of mail, divided into known piles of spam \
and nonspam
		Determine what tests each mail triggers and \
then "evolve" an efficient score set
	Exactly the kind of problem a genetic algorithm is good at
%page

False Positives (Non-spam Marked As Spam)

	No classifier is perfect; even humans get it wrong
# MessageLabs where Matt works, have humans who make that decision,
# and Matt tells me it's quite hard sometimes.  Also BrightMail have
# a whole NOC of people to do this
	False Positives (non-spam marked as spam) are much worse \
than spam getting through; much more inconvenient to user
	SpamAssassin is 98% accurate on our test corpora: \
0.4% false positives, 87% of all spam caught correctly
# with the default settings.
# threshold can be moved up or down to
# make it more or less aggressive
	Highest rate available among present tools
# improving constantly, very hard range of mails used to train
	Reduce FPs by increasing the threshold, ditto vice-versa
# test corpus has only 1 FP out of 200000 mails at threshold 15, still
# catches 50% of spam
%page

False Positives At Different Thresholds

%center
%image "fp-graph.png"
# total = 202804, 5.0 = 168973 / 695 / 4307
# 15.0 = 169667 / 1 / 24005
%page

What To Do When You've Caught It

	Since classifiers are imperfect, blind deletion is bad
	Better to mark, and allow user to check over them infrequently
# However, higher threshold can work better; some systems will bounce
# back high-scoring mails to the sender.  threshold of 12 caught
# only 0.00005% false positives on our test corpus, and still blocked
# 50% of spam
	Also good to mark for legal reasons
# in UK, illegal to withhold mail or stall it for more than 3 days
%page

Features For Large-Scale Use

	Can load user preferences from an SQL database
	Spamd will compile entire rule-set into RAM, like mod_perl \
(contributed by Matt Sergeant)
	Since the spamc/spamd protocol is TCP, load balancing is no problem
# Many large sites run a "farm" of spamd machines and use \
# round-robin DNS A records to share the load.
	Deployed at several large organisations and ISPs: \
The Well, Salon.com, Panix, Transmeta, SourceForge
# sonic.net, kde.org, plenty more
%page

Large-Scale Filtering For Your Network

	Different from filtering for yourself:
		Many users get little spam
		Should use conservative settings
		Better to use "opt-out by default"; notify that spam \
filtering is available, and ask them if they want it
%page

How Can ISPs, Net Admins Fight Spam?

	Scan for Open Relays & Proxies on your network: they are bad
# run periodic scans to test for open mail relays
# biggest problem at the moment; used by between 20 and 40% of spammers
# now to disguise origins of mail and do relaying; can relay into your
# own outgoing mailserver and get you blacklisted
	Block proxy ports at the firewall
# Even if someone does leave an open proxy, this makes it a lot harder
# for a spammer to find it
	Audit webservers for FormMail or other web-to-mail scripts
# FormMail extremely trivial to exploit.  Wrote an advisory about it
# earlier this year.  NMS FormMail not vulnerable
	Use "conservative" DNSBLs on your MXes
# block well-known spammers at the edge of your network.  Use DNSBLs
# with low FPs: Spamhaus Block List, opm.blitzed.org
	Spamtraps reporting to Razor, DCC, Pyzor
# should only use addresses that have not been valid for about 6 months,
# and take care to log them for a month and unsub any newsletters
# still coming in.
	Run SpamAssassin ;)
%page

How Do The Spammers Feel?

	Already hurting: CBS quotes a spammer as saying he's gone through \
"unbelievable hardships" to keep spamming ... "My \
operating costs have gone up 1,000 percent this year, \
just so I can figure out how to get around all these filters"
	Spam relies on low overheads and extremely cheap costs of delivery
	Increase the overheads and the spammers will give up!
# (Hopefully ;)
%page

Future Directions for SpamAssassin: \
Bayesian Probability

	Naive Bayesian classification -- as seen on slashdot ;)
	Given a corpus of mail messages, and knowledge that each \
mail is spam or non-spam: break the message down into \
"words", and track how frequently each word appears in \
spam vs. non-spam.
	From this you can determine the probability that a mail is \
spam, based on the words used within it.
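The word-counting idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SpamAssassin's implementation: the names `tokenize`, `NaiveBayesFilter`, `train` and `spam_probability` are hypothetical, the tokenizer is deliberately crude, add-one smoothing is assumed so that unseen words do not zero out the result, and only the words present in a message are considered.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Break a message into lowercase "words" (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class NaiveBayesFilter:
    """Hypothetical toy classifier: counts, per class, how many messages contain each word."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one known message; label must be "spam" or "ham"."""
        # Count each word once per message, so counts mean "messages containing the word".
        self.word_counts[label].update(set(tokenize(text)))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | words present in text), assuming words are independent."""
        n_spam = self.message_counts["spam"]
        n_ham = self.message_counts["ham"]
        # Start from the (smoothed) prior odds of spam vs. non-spam.
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for word in set(tokenize(text)):
            # Add-one smoothing (an assumption of this sketch).
            p_word_spam = (self.word_counts["spam"][word] + 1) / (n_spam + 2)
            p_word_ham = (self.word_counts["ham"][word] + 1) / (n_ham + 2)
            log_odds += math.log(p_word_spam / p_word_ham)
        # Clamp before exponentiating to avoid floating-point overflow.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

To try it, call `train(text, "spam")` or `train(text, "ham")` on a saved corpus, then compare `spam_probability(text)` against a cut-off such as 0.5. Working in log-odds keeps the arithmetic stable when a message contains many words.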
# Normally trained, by feeding a saved corpus of spam and \
# non-spam into the filter beforehand
#
# SpamAssassin variant will train itself, based on incoming \
# mail and SpamAssassin's scores... hopefully
%page

Future Directions: Hash-Cash

	Hash-cash: impose a penalty for sending a mail
	Mail currently more-or-less free for sender
	With hash-cash, each recipient requires CPU time for the \
sender
	Classic chicken-and-egg problem
	But SpamAssassin can provide "bonus points"
# ditto for PGP, Habeas etc.
%page

Fin

	http://spamassassin.org/
		SpamAssassin for UNIX \
(free software)
	http://www.deersoft.com/
		SpamAssassin for Windows: MS Outlook, Exchange \
(commercial version)

%% vim:sw=8:tw=74:noexpandtab: