# mgp pres.mgp
# mgp -D out -O -o -g 800x600 pres.mgp
#
%include "default.mgp"
%default 1 bgrad 0 0 256 20 0 "black" "darkgreen"
%page
#%bquality 10
%size 7, font "standard", fore "white", vgap 20

%center


Filtering Spam With
%image "ninjalogo.png" 256 150 150 1


%size 3
Justin Mason
<jm@jmason.org>
http://spamassassin.org/

%page

Why Bother Filtering Spam?


	Spam makes up an estimated 30% to 50% of \
	incoming mail traffic, and is increasing; up 400% in the last year
# sources: HiWaay.net, HotMail, Brightmail

	Users are forced to waste time wading through their inbox, which \
	costs their employers money
# 100 people reading spam for 5 minutes/day = spam costs 80 quid / day

	Impossible to unsubscribe
# FTC says 63% of unsubscribe links do not work, or just confirm your address

	Legal retaliation not possible - yet; too much lobbying going on
# usual internet story -- no borders.    Also pressure from DMA

	Just plain irritating!


%page

Spam Volume Is Increasing
%center
%image "brightmail.png"
(data from Brightmail.com)
%page

The Downsides Of Filtering
%center
%image "eatmycomix.png"
(with thanks to eatmycomix.com)
%page

History Of Filtering


	procmail - traditional UNIX mail filter

		Identify spam senders and delete on sight; \
		later, match patterns in "disguised" spam messages, \
		especially in the headers

		Very hard to configure, for most people

	DNS Blacklists: Paul Vixie's MAPS RBL

		Identify spam sources by IP, and route them down the plughole

		Third parties set up new DNS blacklists later

		Not quite so reliable anymore; many false positives
# since IPs are frequently recycled, or ISP-wide mailservers get listed;
# happened to eircom.net's mailserver.  Most DNSBLs don't even track
# their FP rate

%page

Origin of SpamAssassin


	Developed initially for personal use
# Spam "itched" me, to use the open-source phrase

	Initially added rules to an existing filter script, filter.plx

	License unclear, so rewritten from scratch under open \
	source license
# Perl's Artistic License

%page

SpamAssassin Concepts


	Zero-configuration where possible

	Lots of rules to determine if a mail is spam or not (about 600)

	No single rule can mark a mail as spam on its own

	"Fuzzy logic": rules are assigned scores, and \
	scores are combined to produce an overall score for each message
# high-confidence rules get high scores and vice versa

	If the total is over a user-defined threshold, the mail is \
	judged to be spam
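# A minimal sketch of the scoring idea above, in Python rather than
# SpamAssassin's own Perl.  The rule names, patterns and scores are all
# invented for illustration; they are not the real rule set.
```python
import re

# Hypothetical rules: (name, pattern, score).  A negative score marks a
# rule that suggests the mail is legitimate.
RULES = [
    ("SUBJ_ALL_CAPS", re.compile(r"^Subject: [^a-z]*$", re.M), 1.5),
    ("CLICK_HERE", re.compile(r"click here", re.I), 2.0),
    ("MILLION_DOLLARS", re.compile(r"million dollars", re.I), 2.5),
    ("HAS_PATCH", re.compile(r"^diff -u", re.M), -2.0),
]


def score_message(text, threshold=5.0):
    """Return (is_spam, total, hits): sum the scores of every rule that fires."""
    hits = [(name, score) for name, pattern, score in RULES if pattern.search(text)]
    total = sum(score for _, score in hits)
    return total > threshold, total, hits


spam = "Subject: MAKE MONEY FAST\n\nClick here to earn a million dollars!"
ham = "Subject: patch for review\n\ndiff -u a/foo.c b/foo.c"
print(score_message(spam))
print(score_message(ham))
```
# No rule in this sketch scores above the threshold by itself, so a mail
# is only flagged when several rules agree.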

%page

SpamAssassin Concepts, pt.2 


	Combines many systems for a "broad-spectrum" approach:

		Detect forged headers

		Spam-tool signatures (Message-ID patterns etc.)

		Text keyword scanner in the message body

		DNS blacklists

		Razor, DCC (Distributed Checksum Clearinghouse), Pyzor
# Means spammers cannot simply aim to defeat one system alone, \
#	as the others will still catch them out

%page

Digression: Razor, DCC and Pyzor


	All operate by posting a checksum of the message to a central \
	server

	If checksum matches a reported spam mail, it's spam

		Razor: lots of good press, semi-commercial
# good press because Marc Andreessen (ex-Netscape) works for Cloudmark
# Razor does not allow an organisation to run their own servers

		Pyzor: reimplementation of Razor in Python, as free software

		DCC: older than Razor, but has some philosophical issues
# not entirely oriented towards spam, just measures a mail's 
# "bulkiness", ie. how many people it was sent to 
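# The checksum lookup described on this slide could be sketched roughly
# as below.  This is an illustration only: the in-memory ChecksumServer
# is a hypothetical stand-in for the central server, and a plain SHA-1 of
# the normalised body is an assumption, not the checksum that Razor,
# Pyzor or DCC actually use.
```python
import hashlib


def body_checksum(message):
    """SHA-1 of the body with case and whitespace normalised (a simplification)."""
    _, _, body = message.partition("\n\n")  # skip the headers
    normalised = " ".join(body.lower().split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


class ChecksumServer:
    """In-memory stand-in for the central server (hypothetical)."""

    def __init__(self):
        self.reported = set()

    def report(self, message):
        self.reported.add(body_checksum(message))

    def check(self, message):
        return body_checksum(message) in self.reported


server = ChecksumServer()
server.report("From: a@example.com\n\nBuy  cheap pills NOW")
# Same body from a different sender still matches the reported checksum.
print(server.check("From: b@example.net\n\nbuy cheap pills now"))
print(server.check("From: c@example.org\n\nLunch on Friday?"))
```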

%page

Large-Scale Use of SpamAssassin


	Modifications to filter.plx were UNIX-only, single-user. \
	Not very useful for a lot of people
# run from .forward, deliver to /var/spool/mail.

	SpamAssassin took this into account:

		Modular, clean, flexible design

		Open Source / free software license (Artistic)

		Released to CPAN (Comprehensive Perl Archive Network)

%page

Spamd


	"spamd" contributed by Craig Hughes:

		Client-server interface to SpamAssassin, over TCP

		Pure-C client, "spamc"; much faster than forking a perl interpreter

		Simple TCP interface
# Easy for anyone now to plug into SpamAssassin, without even \
# using perl: just open a socket, write a mail message to it, \
# and read either the results, or a rewritten mail, back!
# Recent versions make libspamc.so, shared-library version. \
# Just call a C function and receive spam report back
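# A rough sketch of "open a socket, write a mail message, read the
# results", in Python.  The request and reply layout (CHECK SPAMC/1.2, a
# Content-length header, a "Spam:" reply header) and port 783 are
# assumptions taken from the spamd protocol notes; check them against the
# version you run.  The function names are made up for this example.
```python
import socket


def build_check_request(message: bytes) -> bytes:
    """Build a CHECK request: command line, Content-length, blank line, mail."""
    header = "CHECK SPAMC/1.2\r\nContent-length: {}\r\n\r\n".format(len(message))
    return header.encode("ascii") + message


def parse_check_reply(reply: bytes):
    """Parse e.g. b'SPAMD/1.1 0 EX_OK\\r\\nSpam: True ; 7.3 / 5.0\\r\\n\\r\\n'."""
    lines = reply.decode("ascii", "replace").split("\r\n")
    status = lines[0].split(" ", 2)
    if len(status) < 2 or status[1] != "0":
        raise ValueError("spamd error: {!r}".format(lines[0]))
    for line in lines[1:]:
        if line.lower().startswith("spam:"):
            verdict, _, numbers = line[5:].partition(";")
            score, _, threshold = numbers.partition("/")
            return verdict.strip().lower() == "true", float(score), float(threshold)
    raise ValueError("no Spam: header in reply")


def check_with_spamd(message: bytes, host="localhost", port=783):
    """Send one message to a running spamd; return (is_spam, score, threshold)."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(build_check_request(message))
        sock.shutdown(socket.SHUT_WR)  # tell spamd the message is complete
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return parse_check_reply(b"".join(chunks))
```
# check_with_spamd() needs a running spamd to talk to; the two helper
# functions can be tried without one.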

%page

Integration of SpamAssassin


	Now about 20 different integrations that I know of

		Integration into MTAs: sendmail "milters", qmail-scanner, \
		Exim, Postfix

		Integration into virus-scanner MTA plugins: MIMEDefang, \
		amavisd-new

		IMAP/POP proxies and clients

		Plug-ins for Windows clients: MS Outlook, Eudora

		Also unofficial hacks adding SpamAssassin filtering into \
		mailing list software etc.

	Plus the usual mail filters for "normal" UNIX users: procmail, \
	mailfilter, Mail::Audit plugin etc.

%page

Accuracy: Evolve A Better Filter


	SpamAssassin's scores are assigned using a Genetic Algorithm:

	Given a big corpus of mail, divided into known piles of spam \
	and nonspam

	Determine what tests each mail triggers and \
	then "evolve" an efficient score set

	Exactly the kind of problem a genetic algorithm is good at
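# A toy version of the idea, to show its shape.  This is not
# SpamAssassin's actual scoring tool: the corpus, the 10x penalty for a
# false positive, the population size and the mutation scheme are all
# assumptions made for the sketch.
```python
import random


def errors(scores, corpus, threshold=5.0):
    """Count misclassified mails; a false positive is weighted 10x (an assumption)."""
    total = 0
    for hits, is_spam in corpus:
        flagged = sum(scores[i] for i in hits) > threshold
        if flagged and not is_spam:
            total += 10
        elif is_spam and not flagged:
            total += 1
    return total


def evolve(corpus, n_rules, generations=50, pop_size=30, seed=0):
    """Evolve one score per rule that minimises errors() on the corpus."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(n_rules)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda s: errors(s, corpus))
        parents = population[: pop_size // 2]  # keep the fitter half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child[rng.randrange(n_rules)] += rng.gauss(0, 1)  # mutate one score
            children.append(child)
        population = parents + children
    return min(population, key=lambda s: errors(s, corpus))


# Each mail is (indices of the rules it triggered, is_spam); made-up data.
corpus = [([0, 1], True), ([0, 2], True), ([1, 2], True),
          ([3], False), ([0, 3], False), ([], False)]
best = evolve(corpus, n_rules=4)
print(best, errors(best, corpus))
```
# Because the fitter half survives each generation, the best score set
# never gets worse from one generation to the next.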

%page

False Positives (Non-spam Marked As Spam)


	No classifier is perfect; even humans get it wrong
# MessageLabs, where Matt works, has humans who make that decision,
# and Matt tells me it's quite hard sometimes.  Brightmail also has
# a whole NOC of people to do this

	False Positives (non-spam marked as spam) are much worse \
	than spam getting through; much more inconvenient to user

	SpamAssassin is 98% accurate on our test corpora: \
	0.4% false positives, 87% of all spam caught correctly
# with the default settings.  threshold can be moved up or down to
# make it more or less aggressive

	Highest accuracy among the tools currently available
# improving constantly; a very difficult range of mails is used for training

	Raise the threshold to reduce FPs; lower it to catch more spam
# test corpus has only 1 FP out of 200000 mails at threshold 15, still
# catches 50% of spam

%page

False Positives At Different Thresholds
%center
%image "fp-graph.png"
# total = 202804, 5.0 = 168973 / 695 / 4307
# 15.0 = 169667 / 1 / 24005
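# Worked arithmetic for the counts in the notes above.  Reading each
# triple as (correct non-spam, false positives, missed spam) is an
# assumption; under that reading, threshold 5.0 works out to roughly the
# 0.4% false-positive and 87% spam-caught figures quoted earlier.
```python
# Counts copied from the notes; the meaning of each column is assumed.
TOTAL = 202804
COUNTS = {5.0: (168973, 695, 4307), 15.0: (169667, 1, 24005)}


def rates(threshold):
    """Derive false-positive rate, spam caught and accuracy for a threshold."""
    ok_ham, false_pos, missed = COUNTS[threshold]
    ham = ok_ham + false_pos   # all non-spam mails
    spam = TOTAL - ham         # everything else is spam
    return {
        "fp_rate": false_pos / ham,
        "spam_caught": (spam - missed) / spam,
        "accuracy": (TOTAL - false_pos - missed) / TOTAL,
    }


for t in sorted(COUNTS):
    r = rates(t)
    print("threshold {:4.1f}: FP {:.4%}, caught {:.1%}, accuracy {:.1%}".format(
        t, r["fp_rate"], r["spam_caught"], r["accuracy"]))
```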
%page

What To Do When You've Caught It


	Since classifiers are imperfect, blind deletion is bad

	Better to mark spam, and let the user check the marked mail \
	occasionally
# However, higher threshold can work better; some systems will bounce
# back high-scoring mails to the sender.  threshold of 12 caught
# only 0.00005% false positives on our test corpus, and still blocked
# 50% of spam

	Also good to mark for legal reasons
# in UK, illegal to withhold mail or stall it for more than 3 days

%page

Features For Large-Scale Use


	Can load user preferences from an SQL database

	Spamd keeps the entire compiled rule-set in RAM, like mod_perl \
	(contributed by Matt Sergeant)

	Since the spamc/spamd protocol is TCP, load balancing is no problem
#		Many large sites run a "farm" of spamd machines and use \
#		round-robin DNS A records to share the load.

	Deployed at several large organisations and ISPs: \
	The Well, Salon.com, Panix, Transmeta, SourceForge
#		sonic.net, kde.org, plenty more

%page

Large-Scale Filtering For Your Network


	Different from filtering for yourself:

	Many users get little spam

	Should use conservative settings

	Better to leave filtering off by default (opt-in): notify users \
	that spam filtering is available, and ask them if they want it

%page

How Can ISPs, Net Admins Fight Spam?

	Scan for Open Relays & Proxies on your network: they are bad
# run periodic scans to test for open mail relays
# biggest problem at the moment; used by between 20 and 40% of spammers
# now to disguise origins of mail and do relaying; can relay into your
# own outgoing mailserver and get you blacklisted

	Block proxy ports at the firewall
# Even if someone does leave an open proxy, this makes it a lot harder
# for a spammer to find it

	Audit webservers for FormMail or other web-to-mail scripts
# FormMail is trivial to exploit.  I wrote an advisory about it
# earlier this year.  NMS FormMail is not vulnerable

	Use "conservative" DNSBLs on your MXes
# block well-known spammers at the edge of your network.  Use DNSBLs
# with low FPs: Spamhaus Block List, opm.blitzed.org

	Spamtraps reporting to Razor, DCC, Pyzor
# should only use addresses that have not been valid for about 6 months,
# and take care to log them for a month and unsub any newsletters
# still coming in.

	Run SpamAssassin ;)

%page

How Do The Spammers Feel?


	Already hurting:

		CBS quotes a spammer as saying he's gone through \
		"unbelievable hardships" to keep spamming ... "My \
		operating costs have gone up 1,000 percent this year, \
		just so I can figure out how to get around all these filters"

	Spam relies on low overheads and extremely cheap delivery

	Increase the overheads and the spammers will give up! 
# (Hopefully ;)

%page

Future Directions for SpamAssassin: \
Bayesian Probability


	Naive Bayesian classification -- as seen on Slashdot ;)

		Given a corpus of mail messages, and knowledge that each \
		mail is spam or non-spam: break the message down into \
		"words", and track how frequently each word appears in \
		spam vs. non-spam.

		From this you can determine the probability that a mail is \
		spam, based on the words used within it.

#		Normally trained, by feeding a saved corpus of spam and \
#		non-spam into the filter beforehand
#
#		SpamAssassin variant will train itself, based on incoming \
#		mail and SpamAssassin's scores... hopefully
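# A minimal sketch of the word-frequency idea, assuming a pre-sorted
# training corpus.  The class name, the tokeniser and the add-one
# smoothing are choices made for this example, not a description of what
# SpamAssassin will ship.  It needs at least one spam and one non-spam
# mail before it can score anything.
```python
import math
import re
from collections import Counter


def words(text):
    """Very rough tokeniser: lower-case runs of letters, digits, $ and '."""
    return re.findall(r"[a-z0-9$']+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}  # word counts per class
        self.mails = {True: 0, False: 0}

    def train(self, text, is_spam):
        self.counts[is_spam].update(words(text))
        self.mails[is_spam] += 1

    def spam_probability(self, text):
        vocab = len(set(self.counts[True]) | set(self.counts[False]))
        n_spam = sum(self.counts[True].values())
        n_ham = sum(self.counts[False].values())
        log_odds = math.log(self.mails[True] / self.mails[False])  # prior odds
        for w in words(text):
            # Add-one smoothing so an unseen word does not zero the result.
            p_spam = (self.counts[True][w] + 1) / (n_spam + vocab)
            p_ham = (self.counts[False][w] + 1) / (n_ham + vocab)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


nb = NaiveBayes()
nb.train("cheap pills buy now", True)
nb.train("buy cheap watches", True)
nb.train("meeting notes for friday", False)
nb.train("lunch on friday", False)
print(nb.spam_probability("cheap pills"))
print(nb.spam_probability("friday meeting"))
```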

%page

Future Directions: Hash-Cash


	Hash-cash: impose a penalty for sending a mail

		Mail currently more-or-less free for sender

		With hash-cash, the sender must spend CPU time for each \
		recipient

	Classic chicken-and-egg problem

	But SpamAssassin can give "bonus points" to mail carrying a \
	hash-cash stamp
# ditto for PGP, Habeas etc.
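# A sketch of the proof-of-work idea: the sender searches for a stamp
# whose hash starts with a number of zero bits, and the recipient checks
# it with a single hash.  The "recipient:counter" stamp layout and the
# 12-bit cost are simplifications assumed for this example, not the real
# hash-cash stamp format.
```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(recipient: str, bits: int = 12) -> str:
    """Sender's side: try counters until the hash has `bits` leading zero bits."""
    for counter in count():
        stamp = "{}:{}".format(recipient, counter)  # simplified stamp layout
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp


def verify(stamp: str, recipient: str, bits: int = 12) -> bool:
    """Recipient's side: one hash, so checking is cheap."""
    digest = hashlib.sha1(stamp.encode()).digest()
    return stamp.startswith(recipient + ":") and leading_zero_bits(digest) >= bits


stamp = mint("jm@example.org")
print(stamp, verify(stamp, "jm@example.org"))
```
# Each extra bit doubles the sender's expected work, while the
# recipient's check stays at one hash per stamp.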

%page

Fin


	http://spamassassin.org/

		SpamAssassin for UNIX \
		(free software)

	http://www.deersoft.com/

		SpamAssassin for Windows: MS Outlook, Exchange \
		(commercial version)


%% vim:sw=8:tw=74:noexpandtab: