Spamcan: A Sendmail patch to capture spam by regular expression

Spamcan is a simple yet extremely effective and powerful junk mail capturing/nuking software patch to Sendmail v8. It is designed to capture junk mail site-wide and send it to a global capture file, to a user-owned spam mailbox if one exists, or to /dev/null.

While user-level filtering programs such as procmail may be used to filter out spam by regular expression, Spamcan is the first of its kind designed to capture spam site-wide by regular expression. In my experience, bouncing spam is futile and wastes bandwidth, because most spam carries forged sender addresses and invalid return addresses.

Spamcan was written with the philosophy that unsolicited email is an evil and disruptive presence. In my opinion it is unreasonable to ask or require each and every user at a site with hundreds of users to put up with unsolicited email or to manage their own anti-spam rules. I believe it is up to the system administration group to provide excellent service to the user community, and I believe Spamcan is a powerful tool that aids in providing it.

Spamcan works by attempting to match every header line of a mail message, including the subject line, against a set of regular expressions you've prepared. By default, matching is not case sensitive. The expression list typically contains lines such as the following:

x-advertisement
sexyhot
(^Subject: (.*(free|mak(e|ing)).*money))
fresh.*addresses
mega-mailer
ExtractorPro
((stealth|mass).*mailer)
make.*money.*fast
(^(From|To): (.*[0-9]{8,}@.*\.[a-z]{3}))
(cyber(market|shop|promo|gold))
((sales|srhot|foryou|allvip|mailman|succeed|success|everyon|megaweb|emailer|allinternetusers|market|4u|Friend)@.*\.[a-z]{3})
((sallynet|scholarship|shoppingplanet|answerme|onlineprofit|yourdomain|ispam|devotion|quantcom|savetrees|nowhere)\.[a-z]{3})
--- CLOAKED! ---
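
As a rough illustration of the matching described above, here is a minimal Python sketch. It is not Spamcan's own code (Spamcan is a patch to Sendmail), and the function name first_match is hypothetical. It assumes each header is available as one "Name: value" line and that every expression is tried against every line, ignoring case.

```python
import re

# A few expressions copied from the list above; the rest are omitted for brevity.
EXPRESSIONS = [
    r"x-advertisement",
    r"(^Subject: (.*(free|mak(e|ing)).*money))",
    r"((stealth|mass).*mailer)",
    r"(^(From|To): (.*[0-9]{8,}@.*\.[a-z]{3}))",
]

# Compile once, ignoring case, to mirror the default behaviour described above.
COMPILED = [re.compile(expr, re.IGNORECASE) for expr in EXPRESSIONS]


def first_match(header_lines):
    """Return (expression, header_line) for the first hit, or None."""
    for line in header_lines:
        for expr, regex in zip(EXPRESSIONS, COMPILED):
            if regex.search(line):
                return expr, line
    return None


# Example headers (invented for illustration).
headers = [
    "From: 1234567890@example.com",
    "Subject: Lunch on Friday?",
]
print(first_match(headers))
```

In this example the From: line is caught by the expression that looks for eight or more digits before the @ sign, while the Subject: line matches nothing.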

The envelope "from" is also examined. Using regular expressions is a very powerful and dangerous capability. Extreme caution must be taken in formulating your expressions. Fortunately, Spamcan does not discard or bounce email. Rather, Spamcan cans it to a spamcan. In the unfortunate event that you've managed to capture non-spam, it may be forwarded to the appropriate user. Spamcan adds the header X-Spamcan-Reason to every message it identifies as spam before sending it to the spamcan. It contains the reason for identifying the mail as spam and the intended recipient.
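
To make the X-Spamcan-Reason idea concrete, the following Python sketch tags a message before it would be canned. The layout of the header value is an assumption: the text above only says that the header carries the reason and the intended recipient, not how they are formatted. The function name tag_as_spam is hypothetical.

```python
from email.message import EmailMessage


def tag_as_spam(msg, expression, recipient):
    """Add an X-Spamcan-Reason header to msg and return it."""
    # Assumed value layout: matched expression, then intended recipient.
    msg["X-Spamcan-Reason"] = f"matched {expression!r}; intended for {recipient}"
    return msg


# Example message (invented for illustration).
msg = EmailMessage()
msg["From"] = "mega-mailer@example.com"
msg["To"] = "alice@example.org"
msg["Subject"] = "hello"
msg.set_content("body text\n")

tag_as_spam(msg, "mega-mailer", "alice")
print(msg["X-Spamcan-Reason"])
```

Because the original headers are left untouched, a message captured by mistake can still be forwarded to the user named in the added header.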

All spamcan diagnostics are logged via syslog with the mail facility at the debug level. Look for any problems relating to spamcan there.

Spamcan was designed to be installed on the internal mailhost, although this is not a requirement. It likes to look in /var/spool/mail for two files named username.spam and .nospamcan.username. Users at your site may choose not to have Spamcan scan their mail by touching /var/spool/mail/.nospamcan.username. Users at your site may also choose to have spam directed to a personal spam mailbox owned by them by touching /var/spool/mail/username.spam. The directory /var/spool/mail was chosen to be the root for these files because in many installations it is globally writable. Also, it's generally not a good idea to access user home directories because large mail distributions may cause your automounter to mount everyone's home. I recommend you don't do this with .forward either.
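
The per-user file checks described above could be sketched in Python as follows. This is an illustration only, not the patch's code. The function name choose_destination, the default path of the global capture file, and the rule that the opt-out marker takes precedence over a personal spam mailbox are all assumptions.

```python
import tempfile
from pathlib import Path


def choose_destination(username, spool=Path("/var/spool/mail"),
                       global_can=Path("/var/spool/mail/spamcan")):
    """Return where captured spam for username should go, or None to skip scanning."""
    # Opt-out marker: the user touched .nospamcan.username
    if (spool / f".nospamcan.{username}").exists():
        return None
    # Personal spam mailbox: the user touched username.spam
    personal = spool / f"{username}.spam"
    if personal.exists():
        return personal
    # Otherwise use the site-wide capture file (hypothetical default path);
    # a site could pass Path("/dev/null") here instead.
    return global_can


# Try it against a throwaway directory instead of the real mail spool.
with tempfile.TemporaryDirectory() as tmp:
    spool = Path(tmp)
    (spool / "alice.spam").touch()
    (spool / ".nospamcan.bob").touch()
    for user in ("alice", "bob", "carol"):
        print(user, choose_destination(user, spool=spool))
```

Only the mail spool directory is consulted, so no home directory is touched and the automounter is left alone.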

Spamcan was designed to stay out of the way of valid mail traffic. It examines every header except Message-ID. Internal mail and outgoing mail are not examined. Incoming messages containing an In-Reply-To header are also passed through unexamined. If spammers use this as a back door, this feature may have to be removed.
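
The pass-through rules above can be summarized in a short Python sketch. How Spamcan tells incoming mail from internal or outgoing mail is not described here, so the sketch takes that decision as a flag supplied by the caller. The function name headers_to_examine is hypothetical.

```python
from email.message import EmailMessage


def headers_to_examine(msg, is_incoming):
    """Return the header lines to scan, or [] if the message is passed through."""
    # Internal and outgoing mail is never examined (flag supplied by the caller).
    if not is_incoming:
        return []
    # Replies are passed through unexamined.
    if msg["In-Reply-To"] is not None:
        return []
    # Every header except Message-ID is a candidate for matching.
    return [f"{name}: {value}" for name, value in msg.items()
            if name.lower() != "message-id"]


# Example message (invented for illustration).
msg = EmailMessage()
msg["From"] = "someone@example.com"
msg["Message-ID"] = "<1@example.com>"
msg["Subject"] = "hi"
print(headers_to_examine(msg, is_incoming=True))
```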

Read INSTALL.SPAMCAN for detailed installation notes.

Implement a scoring system. Assign points to regular expressions in the configuration file. Add up the points for a mail message; if this number reaches some threshold, identify the message as spam. Otherwise deliver the mail with an X-Spamlevel header whose value is the number of points. This way the user community can filter out possible spam without having to know about formulating anti-spam rules. For example, a user's ~/.procmailrc could look something like the following.


# Recommend users add something like this to the end of their ~/.procmailrc

:0 h
SPAMLEVEL=|formail -xX-Spamlevel:

:0
* ? test $SPAMLEVEL -gt 60
Mail/spamorama

:0
* ? test $SPAMLEVEL -gt 40
Mail/spam

:0
* ? test $SPAMLEVEL -gt 30
Mail/junk

:0
* ? test $SPAMLEVEL -gt 20
Mail/junk_lists

:0
* ? test $SPAMLEVEL -gt 10
Mail/lists
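
On the server side, the proposed scoring could look roughly like the Python sketch below. Nothing here exists in Spamcan yet: the point values, the threshold of 60, and the names RULES, spam_level and classify are all invented for illustration.

```python
import re

# Hypothetical configuration: (points, expression). Point values are invented.
RULES = [
    (40, r"make.*money.*fast"),
    (25, r"((stealth|mass).*mailer)"),
    (15, r"x-advertisement"),
]
THRESHOLD = 60  # hypothetical score at which mail is treated as spam outright


def spam_level(header_lines):
    """Sum the points of every expression that matches at least one header."""
    total = 0
    for points, expr in RULES:
        regex = re.compile(expr, re.IGNORECASE)
        if any(regex.search(line) for line in header_lines):
            total += points
    return total


def classify(header_lines):
    """Return ('spam', score) once the threshold is reached, else ('deliver', score)."""
    score = spam_level(header_lines)
    return ("spam" if score >= THRESHOLD else "deliver"), score


# Example headers (invented for illustration).
print(classify(["Subject: Make money fast", "X-Mailer: Stealth Mass Mailer"]))
print(classify(["Subject: Make money fast"]))
```

With these invented weights the first message scores 65 and is identified as spam, while the second scores 40 and would be delivered with X-Spamlevel: 40 for the user's own procmail rules to act on.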

Do some limited sanity checking on headers. Look for missing To: or From: headers. Check if the local domain appears in To: or Cc:. Check if Reply-To: or From: is the same as To:. Add points for each of these to X-Spamlevel.
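
A possible shape for these header sanity checks, again as a Python sketch rather than anything Spamcan does today. The local domain, the weight of 10 points per check, and the function names are hypothetical. The sketch also assumes that it is the absence of the local domain from To: and Cc: that earns points, which the note above does not spell out.

```python
from email.message import EmailMessage
from email.utils import getaddresses

LOCAL_DOMAIN = "example.org"  # hypothetical local domain
POINTS_PER_CHECK = 10         # hypothetical weight for each failed check


def addresses(msg, *names):
    """Collect the lower-cased addresses found in the named headers."""
    values = []
    for name in names:
        values.extend(str(v) for v in msg.get_all(name, []))
    return {addr.lower() for _, addr in getaddresses(values) if addr}


def sanity_points(msg):
    """Return the points earned by the header sanity checks."""
    points = 0
    # Missing To: or From: header.
    if msg["To"] is None:
        points += POINTS_PER_CHECK
    if msg["From"] is None:
        points += POINTS_PER_CHECK
    # Assumption: mail not addressed to the local domain is suspicious.
    recipients = addresses(msg, "To", "Cc")
    if not any(a.endswith("@" + LOCAL_DOMAIN) for a in recipients):
        points += POINTS_PER_CHECK
    # Reply-To: or From: is the same as To:.
    if addresses(msg, "From", "Reply-To") & addresses(msg, "To"):
        points += POINTS_PER_CHECK
    return points


# Example message (invented for illustration): sender and recipient are identical.
suspect = EmailMessage()
suspect["From"] = "offer@example.net"
suspect["To"] = "offer@example.net"
print(sanity_points(suspect))
```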

Spamcan is known to work on Linux, Solaris, SunOS, Irix and NextStep, but I'm sure it would not take much to make it work on other platforms.

Please let me know if you find Spamcan useful. Send bug reports, enhancements, or comments to timb@transmeta.com

I'm always interested in hearing from sites that have feedback, good or bad.

Tim Berger
Systems Administrator
Transmeta Corporation
Santa Clara, CA
timb@transmeta.com