A growing issue for users of email over the past several years has been the increasing quantity of unsolicited and unwanted mail, usually but not necessarily featuring advertising, referred to generally and hereafter in this document as "spam." Adding to the issue is the difficulty (to say the least) in finding server-wide solutions that do not result in user mail being filtered (and perhaps deleted) without user consent. Therefore, spam filtering is generally considered the province of the user. This document attempts to address some ways users of CSA (the Oklahoma State University primary Computer Science server) might alleviate the spam problem in their own user experiences. A quick rundown of the technical steps included in this document can be found in this quickref.
There are a few general approaches to spam filtering that are worth knowing a little about before discussing the specific packages that implement them. The first is "heuristic filtering," where a set of static rules is applied one by one against a message to determine whether the message is probably spam. The upside of heuristic filtering is that it is very automatic -- typically no user involvement is required. The downside is that it is prone to a substantial number of false negatives (where actual spam is mischaracterized as legitimate email) and some number of false positives (where legitimate email is mischaracterized as spam) as well. For this reason, the "hands-off" characteristic of the heuristic approach can be misleading, as dealing with false positives and negatives has to be done manually and can be time-consuming. On the other hand, doing so is usually straightforward and can generally be done via any email client.
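To make the heuristic idea concrete, here is a minimal sketch of a rule-and-score filter. The three rules, their scores, and the threshold are invented for illustration only; they are not taken from SpamAssassin or any other package, which ship far larger and more carefully tuned rule sets.

```python
import re

# Hypothetical rules: each pairs a pattern with the score it adds on a match.
RULES = [
    (re.compile(r"free money", re.IGNORECASE), 2.5),
    (re.compile(r"click here", re.IGNORECASE), 1.5),
    # Three or more exclamation marks in the Subject header.
    (re.compile(r"^Subject:.*!{3,}", re.IGNORECASE | re.MULTILINE), 2.0),
]

# Assumed cutoff: a message scoring at or above this is treated as spam.
THRESHOLD = 5.0


def heuristic_score(message: str) -> float:
    """Sum the scores of every rule whose pattern occurs in the message."""
    return sum(score for pattern, score in RULES if pattern.search(message))


def is_probably_spam(message: str) -> bool:
    """Apply the static threshold to the summed rule scores."""
    return heuristic_score(message) >= THRESHOLD
```

A message that spells its pitch as "Fr3e m0ney" matches none of these patterns and slips through, which is exactly the kind of false negative described above.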
Another approach is the more recent so-called "Bayesian" approach, wherein a piece of software is trained over time to recognize spam by "showing" it pieces of spam and non-spam and informing it which is which. (The details of this approach were introduced and then improved by Paul Graham in his papers "A Plan for Spam" and "Better Bayesian Filtering," respectively.) The upside is that this has been shown to generate far fewer false negatives and next to no false positives (although as spammers try to find ways to defeat the Bayesian approach, this may hold less true -- time will tell). The downside is that the process of training the system may be laborious and often requires interaction directly on the server (in this case, CSA), which complicates matters for users of POP3 and IMAP.
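The training idea can be sketched with a very small naive Bayes classifier. This is a simplified textbook version with add-one smoothing, offered only to show what "showing" the filter spam and non-spam means; it is not Graham's exact formula, nor is it how Spamprobe or any other package is implemented. The class and method names are hypothetical.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy naive Bayes spam classifier trained on labelled messages."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}
        self.messages = {"spam": 0, "good": 0}

    @staticmethod
    def tokens(text):
        # Lowercase words, digits, and a few symbols common in spam.
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, text, label):
        """Record the tokens of one message under 'spam' or 'good'."""
        self.counts[label].update(self.tokens(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        """Return a value in (0, 1); above 0.5 leans toward spam."""
        vocab = max(len(set(self.counts["spam"]) | set(self.counts["good"])), 1)
        spam_total = sum(self.counts["spam"].values())
        good_total = sum(self.counts["good"].values())
        # Start from the smoothed prior odds of spam versus good mail.
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["good"] + 1))
        for tok in self.tokens(text):
            p_spam = (self.counts["spam"][tok] + 1) / (spam_total + vocab)
            p_good = (self.counts["good"][tok] + 1) / (good_total + vocab)
            log_odds += math.log(p_spam / p_good)
        # Clamp so the logistic function below cannot overflow.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Each call to train corresponds to one correction by the user, which is why accuracy improves with use and also why the training step takes ongoing effort.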
Further approaches, usable both on their own and in addition to the above, include "whitelisting," wherein a list of "good" email addresses is exempted from the filtering process; "blacklisting," wherein a list of "bad" email addresses is always refused outright; and "challenge-response," wherein an unknown sender receives an automated challenge telling them to email back to a specific address, at which time they will be whitelisted automatically.
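Whitelisting and blacklisting amount to a lookup on the sender address before any other filtering happens. The sketch below assumes the lists are simple in-memory sets and uses made-up example addresses; a real setup would read them from a file or from the mail client's address book.

```python
from email.utils import parseaddr

# Hypothetical lists; the addresses are placeholders, not real senders.
WHITELIST = {"advisor@example.edu"}
BLACKLIST = {"offers@spam.example"}


def classify_sender(from_header: str) -> str:
    """Decide what to do with a message based only on its From header.

    Returns "accept" for whitelisted senders, "reject" for blacklisted
    ones, and "filter" for everyone else, meaning the message still goes
    through heuristic or Bayesian filtering (or, in a challenge-response
    system, triggers a challenge).
    """
    address = parseaddr(from_header)[1].lower()
    if address in WHITELIST:
        return "accept"
    if address in BLACKLIST:
        return "reject"
    return "filter"
```

Checking the whitelist first means a trusted sender is never refused, even if the same address somehow ends up on both lists.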
We primarily recommend three software approaches as regards CSA mail. Before getting into the two pieces of software installed on CSA, a discussion of IMAP and POP3 clients is in order. It is becoming more common for such email clients -- for two examples, Thunderbird and Mail.app for OS X -- to include client-side filtering, often combining several of the approaches previously mentioned (typically at least Bayesian and whitelisting). For users of IMAP and POP3, this can be an ideal approach, as the interface for training is often simple and intuitive and does not require logging into CSA whenever training is needed. It has been pointed out, however, that users on very slow links may not find this approach desirable, as spam is not filtered until it is downloaded. For many people, though, this approach may be excellent.
In cases where client-side filtering is not desired or usable, there are two pieces of software available (and up-to-date as of this writing) on CSA. The first is SpamAssassin, a Perl program that historically has used the heuristic approach, although recently it has begun to provide a Bayesian training system as well. This document will primarily concern itself with the "hands-off" use of SpamAssassin heuristics, although the user is welcome to read more about the Bayesian aspect. The file /pub/htdocs/spamassassin.ex is an example of a file that procmail can use to implement SpamAssassin filtering; generally a user could simply do cp /pub/htdocs/spamassassin.ex .procmailrc in his or her home directory and begin seeing results. (Note that this overwrites any existing .procmailrc, so users who already have one should merge the example into it instead.) The effect is that all mail presumed by SpamAssassin to be spam is filed in a folder named Spam (under the directory ~/mail) instead of the user's inbox.
This leads to an important digression: why not just delete mail assumed to be spam? The answer is earlier in this document: the non-zero likelihood of false positives; if the user deletes presumed spam automatically, then false positives will be lost without ever being seen, which is typically considered unacceptable. Consequently, this document recommends instead filing presumed spam in a separate folder and reviewing this folder frequently before deleting its contents.
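Reviewing the folder need not mean opening every message. As a rough sketch, the following lists the sender and subject of each message in the Spam folder so that false positives can be spotted at a glance. It assumes the folder is a single mbox-format file at ~/mail/Spam, which is what procmail typically produces when delivering to a plain file name; the function name is hypothetical.

```python
import mailbox
from pathlib import Path


def summarize_folder(path):
    """Return (sender, subject) pairs for every message in an mbox file."""
    box = mailbox.mbox(str(path), create=False)
    try:
        return [
            (str(msg.get("From", "(no sender)")),
             str(msg.get("Subject", "(no subject)")))
            for msg in box
        ]
    finally:
        box.close()


if __name__ == "__main__":
    # Assumed location of the folder that presumed spam is filed into.
    spam_folder = Path.home() / "mail" / "Spam"
    if spam_folder.exists():
        for sender, subject in summarize_folder(spam_folder):
            print(f"{sender}\t{subject}")
```

Anything in the listing that looks legitimate can then be moved back to the inbox with an ordinary mail client before the rest is deleted.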
Spamprobe, a Bayesian filter, is also available on CSA. Unfortunately, it is somewhat harder to give a "plug and play" solution as was previously done with SpamAssassin. Generally, there are a few stages to implementing Spamprobe:
1. spamprobe -c good mail/Saved (trains the filter on the legitimate mail in mail/Saved)
2. spamprobe -c spam mail/Spam (trains the filter on the spam in mail/Spam)
3. spamprobe train-spam File (if it's a false negative; if it's a false positive -- far less common -- use train-good instead.)
4. spamprobe cleanup periodically -- at least daily is vital to keep disk usage down. (Failure to do so may cause disk overuse, which in turn may result in deletion of spamprobe databases, requiring training to start over.) One way to automate this would be to run crontab -e and add a line like this:
0 1 * * * /usr/local/bin/spamprobe cleanup
which runs spamprobe cleanup every morning at 1am.
Questions about these or other approaches are always welcome at the usual system manager address. Please be patient with us, as this is a difficult problem that dedicated individuals struggle with daily. Any errors in this document or in the example files given should be pointed out, and we will make every effort to correct them. Hopefully this document will help you with your struggle against spam.
See also: man spamassassin, man spamprobe, man procmail, man procmailrc, man procmailex, Procmail Quick Start