Thu, 04 Mar 2004

More SpamAssassin Fun

It appears that we are all stuck with spam for the foreseeable future, so it is now up to us to do something about it. Here at work, I've installed SpamAssassin as a spam filter that works in conjunction with the rest of our mail delivery software (namely qmail, procmail, and Courier IMAP).

The path an incoming e-mail takes can be briefly summarized as follows: qmail receives the e-mail from another SMTP server on the internet. qmail then passes it to procmail, which is in charge of running the e-mail through SpamAssassin and then routing it to the appropriate mail folder (INBOX if it is not flagged as spam, or the Spam folder if it is). Finally, the user retrieves the message using an e-mail client which supports the IMAP protocol (almost all do; Mozilla Thunderbird is an open source, cross-platform client which I usually recommend to most people).
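
To make the routing step concrete, here is a minimal shell sketch of the decision procmail makes once SpamAssassin has tagged a message. It is only an illustration, not the procmail recipe actually in use here: the function name pick_folder is made up, and it assumes SpamAssassin marks spam by adding an "X-Spam-Flag: YES" header.

```shell
# pick_folder: hypothetical helper, not part of procmail or SpamAssassin.
# Reads one e-mail message on stdin and prints the folder it belongs in.
# Assumes SpamAssassin has already added an "X-Spam-Flag: YES" header
# to any message it considers spam.
pick_folder() {
    # Only inspect the header block, which ends at the first empty line,
    # so that text in the message body cannot fake the flag.
    if sed '/^$/q' | grep -qi '^X-Spam-Flag:[[:space:]]*YES'; then
        echo "Spam"
    else
        echo "INBOX"
    fi
}
```

Called as "pick_folder < message", it prints either Spam or INBOX. Restricting the check to the headers is a deliberate choice, since a spammer controls the body of the message.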

I have had this setup running for the last 9 months or so, and it has gone a long way towards minimizing the time wasted on weeding through spam. However, the version I had installed on the Debian GNU/Linux mail server was a bit long in the tooth. To install the newest version of SpamAssassin on the Debian stable server, I added a new repository to the /etc/apt/sources.list file, pointing it to “deb http://www.backports.org/debian/ woody spamassassin”. A simple ‘apt-get update; apt-get upgrade -u’ and a few seconds later the latest version of SpamAssassin was humming along nicely.
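
The upgrade steps can be sketched as a small shell snippet. The helper name add_apt_source is hypothetical and only there to avoid adding the same line twice; the repository line in the usage comment is the one quoted above, and the real commands need to be run as root.

```shell
# add_apt_source: hypothetical helper that appends a repository line to an
# APT sources file, but only if that exact line is not already present.
add_apt_source() {
    sources_file=$1
    repo_line=$2
    if ! grep -qxF "$repo_line" "$sources_file" 2>/dev/null; then
        printf '%s\n' "$repo_line" >> "$sources_file"
    fi
}

# Usage (as root), with the backports line from this post:
#   add_apt_source /etc/apt/sources.list \
#       "deb http://www.backports.org/debian/ woody spamassassin"
#   apt-get update && apt-get upgrade -u
```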

The next step was to implement some of the new features which had been added to SpamAssassin since the last version I had installed. It was time to turn the mail server into a merciless spam-terminating machine. SpamAssassin now has built into it a Bayesian filter module which, in essence, can learn to recognize spam and non-spam (called ‘ham’) e-mails, if you train it. All you need to do to train SpamAssassin is to let it look over 200+ spam and 200+ ham messages, and then it is ready to add the Bayesian filter check to its repertoire of tests.
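
As a rough sketch of that initial training step, the snippet below counts the messages in two Maildir folders and only reports success once both hold enough of them. The function names and the folder paths in the usage comment are assumptions for illustration; sa-learn --spam and sa-learn --ham are the actual training commands.

```shell
# count_messages: hypothetical helper that prints the number of regular
# files (one message per file in a Maildir) directly inside a directory.
# A missing directory counts as 0 messages.
count_messages() {
    find "$1" -maxdepth 1 -type f 2>/dev/null | wc -l | tr -d ' '
}

# enough_to_train: succeeds only when both folders hold at least the
# minimum number of messages (200 by default, the figure quoted above).
enough_to_train() {
    spam_dir=$1
    ham_dir=$2
    minimum=${3:-200}
    [ "$(count_messages "$spam_dir")" -ge "$minimum" ] &&
        [ "$(count_messages "$ham_dir")" -ge "$minimum" ]
}

# Usage (paths are examples only):
#   if enough_to_train ~/Maildir/.Spam/cur ~/Maildir/cur; then
#       sa-learn --spam ~/Maildir/.Spam/cur
#       sa-learn --ham  ~/Maildir/cur
#   fi
```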

Instead of doing that for each and every user on the system, I decided to go ahead and install the filter on a site-wide basis. This caused a few problems due to permission issues, but I finally hammered out a nice solution that seems to be working well. What is even cooler (in an admittedly geeky sort of way) is that I've created a process by which spam that gets through the varied defenses can be assimilated (very Borg-like, eh?) into the filter, thereby continually fine-tuning it over time. Basically, I created a sub-folder of the Spam folder called “Missed”. If a spam message sneaks past the defenses, then I've instructed the users here at work to slap it in the Missed folder. Every day the mail server then executes a shell script which goes through each user's Missed folder, learns from the spam it missed, and then deletes those messages! Here is the shell script, which is executed by the cron scheduler:


#!/bin/sh

# train-spam.sh
#
# Description: Checks each user's /home/<user>/Maildir/.Spam.Missed
# directory to see if the user placed any "missed" spam
# messages which got through SpamAssassin to their INBOX.
# If there are messages in this directory, then the script
# invokes sa-learn to update the site-wide tokens to try
# and improve the defenses for next time...
#

for home in /home/*; do

    missed="$home/Maildir/.Spam.Missed/cur"

    if [ -d "$missed" ]; then

        # printf is used because "echo -n" is not portable under /bin/sh
        printf 'missed spam for %s: ' "${home##*/}"

        # run sa-learn on the contents of the directory
        sa-learn --spam -C /etc/spamassassin --showdots --dir "$missed"

        # delete all of the missed messages
        rm -f "$missed"/*

        echo "Done!"

    fi # end if

done # end for loop

# set up permissions on site-wide token files to allow
# all users permission to read from and write to these files
# (done once, after all of the learning has finished)
chmod a+r /etc/spamassassin/bayes_*
chmod ug+w /etc/spamassassin/bayes_*
chgrp users /etc/spamassassin/bayes_*

echo "Done!"
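
For the daily run, a cron entry along the following lines would do. This is a sketch under assumptions: the schedule (03:15 every day), the install path of the script, the log file, and the helper name write_cron_entry are all invented for illustration, since none of them are given above. The line it writes uses the system crontab format (as in /etc/cron.d), which includes a user field.

```shell
# write_cron_entry: hypothetical helper that writes a one-line system
# crontab fragment which runs the given script as root at 03:15 daily
# and appends its output to a log file (all assumed values).
write_cron_entry() {
    target=$1      # e.g. /etc/cron.d/train-spam
    script=$2      # e.g. /usr/local/sbin/train-spam.sh
    printf '15 3 * * * root %s >> /var/log/train-spam.log 2>&1\n' "$script" > "$target"
}

# Usage (as root; both paths are examples only):
#   write_cron_entry /etc/cron.d/train-spam /usr/local/sbin/train-spam.sh
```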

Hopefully this solution will reduce spam which slips under the radar by 90% or more. Only time will tell...

posted at: 14:46