Spam Filtering with Bayesian Filters

Bayesian filters are a popular way to identify spam. The idea is that the content of spam differs from the content of regular email. A Bayesian filter estimates the probability that a message is spam from two quantities learned during training: how often each word appears in spam, and how often it appears in email overall. For example, the words Viagra, win, and $$$ are mostly seen in spam, whereas your first name, city, and nickname are unlikely to appear in spam. Common words such as hello and good appear about equally often in spam and legitimate email, so they carry little information. If an email contains Viagra, win, and $$$, but not your first name or your nickname, it is very likely to be spam. Bayesian filters commonly identify spam correctly 95 percent of the time or more.
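The way per-word probabilities combine into a single score can be sketched with the naive Bayes rule. This is only an illustration of the general technique, not SpamProbe's actual scoring algorithm; the function name combine_probs and the sample probabilities are assumptions made up for the example, and each probability must lie strictly between 0 and 1.

```shell
#!/bin/sh
# Hypothetical helper, not part of SpamProbe: combine per-word spam
# probabilities p1..pn with the naive Bayes rule
#   P = (p1*...*pn) / (p1*...*pn + (1-p1)*...*(1-pn))
combine_probs() {
    printf '%s\n' "$@" | awk '
        BEGIN { spam = 1; good = 1 }
        { spam *= $1; good *= 1 - $1 }
        END { printf "%.4f\n", spam / (spam + good) }'
}

# Two spam-leaning words, one neutral word, and one word typical of good
# email (made-up illustration values).
combine_probs 0.99 0.95 0.50 0.10
```

With these sample values the sketch prints 0.9952: the neutral word (0.50) does not move the result, and a single good-leaning word is outweighed by two strongly spam-leaning ones.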

SpamProbe

SpamProbe is a Bayesian filter designed to identify spam (http://spamprobe.sourceforge.net/). To use it, first feed it both good email and spam. Run it with the -c argument the first time to create the database directory $HOME/.spamprobe:

[julien@asus ~]$ spamprobe -c good good.file
[julien@asus ~]$ spamprobe spam bad.file

Tip

SpamProbe stores the MD5 of each file it is given. If the same file is sent twice to SpamProbe, it is not used to update the database the second time.

This updates the SpamProbe database with good email and spam email.

The more data you use, the better SpamProbe is at separating good email from spam. It is best to feed SpamProbe your own good email rather than someone else's, since the topics and words may be different. Spam can come from anywhere.

It is important to give SpamProbe roughly the same amount of good email and spam. If SpamProbe is trained with 100 examples of good email but 1,000 examples of spam, future good email is more likely to be reported as spam, simply because the spam database contains many more distinct words.
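A quick way to check the balance before training is to count the messages in each file. The sketch below assumes the training files are in mbox format, where every message starts with a "From " separator line; the function names count_messages and training_ratio are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count the messages in an mbox file by counting the
# "From " separator lines that start each message.
count_messages() {
    grep -c '^From ' "$1"
}

# Print the spam-to-good ratio of two training mailboxes (good first).
training_ratio() {
    good=$(count_messages "$1")
    spam=$(count_messages "$2")
    awk -v g="$good" -v s="$spam" 'BEGIN {
        if (g == 0) { print "no good email"; exit 1 }
        printf "%.2f\n", s / g
    }'
}
```

For example, training_ratio good.file bad.file prints a value close to 1.00 when the two collections are balanced; a value far above 1 means spam dominates the training data.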

After feeding SpamProbe, you can test it with new emails:

[julien@asus ~]$ spamprobe score spam.mail
SPAM 0.9904065 f290c2059f31c90b20cc7a63b9607f03

SpamProbe classifies this mail as spam with a confidence of 99 percent. The last string is the MD5 signature of the email.
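A script can act on this result line by checking the verdict and the score. The following is a minimal sketch, assuming the three-field format shown above (verdict, score, MD5); the function name is_confident_spam and the 0.9 threshold are illustrative choices, not part of SpamProbe.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) when a SpamProbe result line
# such as "SPAM 0.9904065 <md5>" is a SPAM verdict at or above a threshold.
is_confident_spam() {
    printf '%s\n' "$1" |
        awk -v t="$2" '{ exit !($1 == "SPAM" && $2 + 0 >= t + 0) }'
}

line='SPAM 0.9904065 f290c2059f31c90b20cc7a63b9607f03'
if is_confident_spam "$line" 0.9; then
    echo "treat as spam"
else
    echo "deliver normally"
fi
```

For the sample line this prints "treat as spam". In practice the line would come from the output of spamprobe score rather than from a fixed string.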

To see how each word in the email counts, use the -T option:

[julien@asus ~]$ spamprobe -T score other_spam.mail
SPAM 0.9252232 8f1ac523bd19afde07e3fbf15b950c6a
    Spam Prob   Count    Good    Spam  Word
    0.9999962      10       0       5  acquisition
    0.9999989       8       0      17  investor
    0.9999976       6       0       8  featured
    0.9999969       6       0       6  U_lt
    0.9999962       6       0       5  subsidiary
    0.0000041       3      28       0  cap
    0.0000047       3      24       0  of any
    0.0000052       3      22       0  contained
    0.0000081       3      14       0  contained in
    0.9999983       3       0      11  shareholders
    0.9999981       3       0      10  companies that
    0.9999962       3       0       5  atlanta
    0.0000010       1     229       0  database
    0.0000010       1     132       0  use the
    0.0000010       1     122       0  box

Automate the Learning Phase

The SpamProbe database must be filled first. This can be done with a collection of spam and good email from the local drive, or on the mail server with the user mailboxes. It is recommended to give users a shared IMAP folder where they can upload their spam (Microsoft Exchange supports the IMAP Protocol). On a Linux server, use the following script:

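#!/bin/bash

# Train SpamProbe on each message in the shared spam folder, then delete it.
for SPAM in "$HOME"/Maildir/.spam/cur/*
do
    [ -f "$SPAM" ] || continue    # skip the unexpanded pattern when the folder is empty
    spamprobe train-spam "$SPAM" && rm -f "$SPAM"
done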

The script should be run regularly as users upload more spam to the shared folder. cron can be used to run the script every hour:

[julien@asus ~]$ crontab -e
0 * * * * /home/julien/spamprobe-train.sh

SpamProbe can be used in association with SpamAssassin (see SpamAssassin with Procmail) to feed the database initially.

If the IMAP server and the SMTP server are two different machines, fetchmail can be used to retrieve the email. After installing fetchmail on the SMTP server, $HOME/.fetchmailrc must be created for the user who downloads all the email automatically:

poll domain.net proto imap:
user "imap_user" with password "PASSWORD"

Email should not go to a mailbox on the SMTP server, but directly through SpamProbe:

[julien@asus ~]$ fetchmail -a -n -m 'spamprobe spam'

Tip

If spam is not contained in the Inbox folder, add the argument --folder Folder to retrieve mail from Folder.

The -n argument prevents fetchmail from modifying the email headers.

This command can be placed in crontab to be run every hour:

[julien@asus ~]$ crontab -e
0 * * * * fetchmail -a -n -m 'spamprobe spam'

Maintenance

You should clean up the database in order to remove words with a low count that are not encountered often. If this cleanup is not done, the database grows larger and filtering spam takes longer. Add a command to cron to run every Sunday to remove words with a count of two or lower that were not updated for seven days:

[julien@asus ~]$ crontab -e
0 0 * * 0 /usr/bin/spamprobe cleanup 2 7

It is important not to interrupt SpamProbe abruptly with a kill signal, because this could corrupt the database. If the database is corrupted, $HOME/.spamprobe/ must be deleted and a new database created.
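Because a corrupted database means retraining from scratch, it may be worth keeping a copy of the database directory. The sketch below is one possible approach, not a SpamProbe feature: the function name backup_spamprobe_db and the archive naming are assumptions, and the copy should only be taken while no spamprobe process is using the database.

```shell
#!/bin/sh
# Hypothetical helper: archive a SpamProbe database directory into a dated
# tarball and print the archive path. Run it only while no spamprobe process
# is reading or writing the database.
backup_spamprobe_db() {
    db_dir=$1      # for example "$HOME/.spamprobe"
    dest_dir=$2    # an existing directory that receives the archive
    archive="$dest_dir/spamprobe-$(date +%Y%m%d).tar.gz"
    tar -czf "$archive" -C "$(dirname "$db_dir")" "$(basename "$db_dir")" &&
        echo "$archive"
}
```

Restoring then amounts to unpacking the archive in place of the damaged directory, at the cost of losing whatever was learned since the backup.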

SpamProbe with Procmail

SpamProbe can be combined with Procmail to filter all incoming email automatically. First, the SpamProbe score is added to the email headers:

COMMON=/share/.spamprobe

:0
SCORE=| /usr/bin/spamprobe -D $COMMON train

:0 fw
| formail -I "X-SpamProbe: $SCORE"

Tip

When the -D argument is used, SpamProbe uses another database when a word is not found in the user database. It is a good idea to share the initial spam database with all users.

There are three main commands to score an email: receive, train, and score.

To force updates to the database (SPAM or GOOD depending on the score) with the words contained in the email, use:

spamprobe receive file.mail

To update the database if the message was difficult to score, use:

spamprobe train file.mail

To score a message without updating the database, use:

spamprobe score file.mail

It is recommended to use the spamprobe train command to keep the database up to date but with a minimum of database accesses to save resources.

The email can then be deleted if the score is more than 90 percent:

:0 a:
* ^X-SpamProbe: SPAM 0.9
/dev/null

Or the subject can be modified to show that the email is suspected to be spam:

SUBJ=`formail -xSubject:`

:0 afwh:
* ^X-SpamProbe: SPAM 0.9
| formail -I"Subject: ***SPAM*** ${SUBJ}"

Tip

Most email clients allow the creation of rules to filter email. A rule can be created to move email with the header X-SpamProbe: SPAM into a different email folder on the local machine.
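To check what such a rule would see, the header can be read back from a delivered message. This is a minimal sketch that assumes the message is stored as a single file (as in a Maildir) and that the header fits on one line; the function name spamprobe_header is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the X-SpamProbe header of a message
# file, looking only at the header section (everything before the first
# blank line).
spamprobe_header() {
    awk '
        /^$/ { exit }                       # end of headers: stop reading
        tolower($0) ~ /^x-spamprobe:/ {
            sub(/^[^:]*:[ \t]*/, "")        # drop the header name and colon
            print
            exit
        }' "$1"
}
```

Calling spamprobe_header on a message tagged by the recipes above would print a value such as "SPAM 0.9904065 f290c2059f31c90b20cc7a63b9607f03"; it prints nothing when the header is absent.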

Drawbacks

Using a Bayesian filter against spam can be inconvenient for a few reasons. First, enough spam has to be entered in order for the filter to do a good job. Second, SpamProbe cannot be used during its learning phase. Third, the proportion of spam and good email that updates the database must be approximately the same, which can be very difficult to control.

The nature of spam changes: there was a time for Viagra advertisements; then came low-rate mortgages, then Vioxx, and so on. If the words used in the new types of spam are too different, spam may be classified as good. The database must be rectified manually by repeating the learning phase; users have to manually move false negatives (spam reported as good email) to an IMAP folder where they are entered into SpamProbe as spam.

Spammers are aware that Bayesian filters are used against them. You might have noticed that spam often contains several spelling mistakes. These are intentional. By spelling the same word in different ways (e.g., using 0 instead of o, _ in the middle of a word, missing letters), each word has a lower count in the database. Other anti-spam software (e.g., SpamAssassin) uses other techniques in addition to the Bayesian filter to detect spam.
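The effect of these misspellings, and one way a preprocessing step could undo the two tricks named above, can be sketched as follows. This is an illustration only, not something SpamProbe is known to do; the function name normalize_token is hypothetical, and a blanket 0-to-o substitution would also mangle legitimate numbers.

```shell
#!/bin/sh
# Hypothetical helper: fold a token to lowercase, map the digit 0 to the
# letter o, and drop underscores, so that obfuscated variants of a word
# collapse into a single database entry.
normalize_token() {
    printf '%s\n' "$1" | tr 'A-Z0' 'a-zo' | tr -d '_'
}

normalize_token 'M0rt_gage'   # prints "mortgage"
```

Without such a step, "mortgage", "M0rtgage", and "mort_gage" are counted as three separate words, each with a lower count than the combined total.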