Windows IT Pro is the leading independent community for IT professionals deploying Microsoft Windows server and client applications and technologies.
  
  
February 18, 2003

Bayesian Spam Filters

One of the most promising antidotes to spam is so-called Bayesian filtering, which calculates the probability that a given message is spam, based on analysis of messages previously identified as being spam or not being spam. The Bayesian approach demands less maintenance than keyword-based spam filters that require constant updating of word and phrase lists.

Much of the buzz around this technique started with Paul Graham's August 2002 article "A Plan for Spam" (see the first URL below). Although there is some debate about whether Graham's approach is precisely Bayesian, organizations have been exploring Bayesian methods and applying them to spam for several years. Microsoft Research's antispam effort, spearheaded by a group of Bayesian researchers, began in 1997 and has resulted in a patent. If you want to keep up with spam-fighting techniques, some understanding of the Bayesian technique is in order.

The Bayes in Bayesian was an 18th-century British clergyman and amateur mathematician, Thomas Bayes, who suggested in a posthumously published paper that the probability of some event occurring in the future is related to the proportion of times that event occurred in the past under the same circumstances. Later, mathematicians refined Bayes's ideas and, in the 20th century, built a formal system of classification and decision-making and began applying it to many tasks in science and engineering. (I first encountered Bayesian inference in the context of economics.) A key element of the Bayesian approach is that it depends on having some prior information about the problem at hand.

To some extent, the Bayesian approach models our everyday experience of using probability to try to determine the possible outcome of an action and make decisions. The Bayesian interpretation of probability is different from the coin-flipping experiments that most of us did in school (and which, I'm convinced, are largely an effort to convince students of the futility of gambling). Life isn't a series of random experiments from which we calculate frequency distributions. We must make decisions taking into account the likelihood of different consequences arising from those decisions and whether those consequences are good or bad.

In the case of spam, Bayesian inference suggests that if a new message contains text that appeared often in spam in the past but rarely in legitimate messages, then the new message is likely to be spam. The formal methods of calculating such a probability can also take into account the fact that a single false positive--a legitimate message quarantined as spam--is far more costly than many false negatives, that is, spam messages left untouched in your Inbox.
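The idea in the preceding paragraph can be sketched in a few lines of Python. This is a minimal illustration, not Graham's exact formula or any product's implementation. It assumes equal prior odds of spam and legitimate mail, treats tokens as independent, and uses add-one smoothing so that unseen words never produce a probability of exactly 0 or 1. All function names and the 0.9 threshold are hypothetical choices made for the example.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9$'-]+")

def tokenize(text):
    """Return the set of distinct lowercase tokens in a message."""
    return set(TOKEN_RE.findall(text.lower()))

def train(messages):
    """Count how many messages each token appears in."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts

def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """P(spam | token) by Bayes' rule, assuming equal priors (an assumption)."""
    p_token_given_spam = (spam_counts[token] + 1) / (n_spam + 2)  # add-one smoothing
    p_token_given_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return p_token_given_spam / (p_token_given_spam + p_token_given_ham)

def sigmoid(x):
    """Numerically stable conversion from log-odds to probability."""
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    z = math.exp(x)
    return z / (1 + z)

def spam_probability(text, spam_counts, ham_counts, n_spam, n_ham):
    """Combine per-token probabilities by summing their log-odds."""
    log_odds = 0.0
    for token in tokenize(text):
        p = token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham)
        log_odds += math.log(p) - math.log(1 - p)
    return sigmoid(log_odds)

def is_spam(probability, threshold=0.9):
    """A high threshold trades missed spam for fewer false positives."""
    return probability >= threshold

# Tiny made-up training set, for illustration only.
spam = ["buy cheap pills now", "cheap pills online", "win money now"]
ham = ["meeting agenda for monday", "lunch on monday", "project status report"]
spam_counts, ham_counts = train(spam), train(ham)

p_spammy = spam_probability("cheap pills now", spam_counts, ham_counts, len(spam), len(ham))
p_normal = spam_probability("monday meeting", spam_counts, ham_counts, len(spam), len(ham))
```

Setting the threshold well above 0.5 is one simple way to reflect the higher cost of a false positive: a message is quarantined only when the evidence against it is strong.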

Graham's method analyzes not just the message body but also the message header, which might contain information about the sender's mail server, foreign character sets, and attachments. He claims that his filter catches 99.5 percent of spam with fewer than one false positive for every 1,000 messages received.

Graham presented an update at an antispam conference at MIT last month. He has expanded his list of "tokens"--telltale words and phrases to look for in incoming mail--to about 187,000 items. And his method can now handle a word differently depending on whether it appears in the subject, in a URL, or in an address field.
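One way to treat a word differently by location is to tag each token with the place it was found, so that "free" in a subject line and "free" in the body are counted separately. The sketch below uses Python's standard email module. The prefix format (such as subject*free), the choice of headers, and the function name are illustrative assumptions, not a description of Graham's code, and multipart messages are not handled.

```python
import re
from email import message_from_string

TOKEN_RE = re.compile(r"[a-z0-9$'-]+")
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)

def context_tokens(raw_message):
    """Return tokens tagged by where they appear (header, URL, or body)."""
    msg = message_from_string(raw_message)
    tokens = set()

    # Header tokens get a prefix naming the header they came from.
    for header in ("Subject", "From", "To"):
        value = msg.get(header, "")
        for token in TOKEN_RE.findall(value.lower()):
            tokens.add(f"{header.lower()}*{token}")

    body = msg.get_payload()
    if not isinstance(body, str):
        body = ""  # multipart messages are out of scope for this sketch

    # Tokens inside URLs get their own prefix.
    for url in URL_RE.findall(body):
        for token in TOKEN_RE.findall(url.lower()):
            tokens.add(f"url*{token}")

    # Remaining body text is tokenized without a prefix.
    body_without_urls = URL_RE.sub(" ", body)
    for token in TOKEN_RE.findall(body_without_urls.lower()):
        tokens.add(token)

    return tokens

raw = (
    "From: a@example.com\n"
    "Subject: Free offer\n"
    "\n"
    "Get it free at http://example.com/free now\n"
)
tagged = context_tokens(raw)
```

Because the tagged forms are distinct strings, a filter can learn a separate spam probability for each, at the cost of a larger token list.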

Others following Graham's lead are experimenting with variations that calculate the "spamminess" of messages differently. The open-source SpamBayes effort has produced an Outlook add-in (see the second URL below). Another free Outlook spam filter using a Bayesian technique is Spammunition, currently in beta. Spam Bully provides a commercial solution.

John Graham-Cumming, the author of POPFile, another open-source project (this one a mail proxy server using a Bayesian filter), reported to the MIT conference that, however well statistical filters might work, parsing email messages so that such filters can analyze them will continue to be a hard job. Technically savvy spammers constantly devise new ways to make their messages easy for a user to read but difficult for a program to analyze.

"A Plan for Spam" http://www.paulgraham.com/spam.html

SpamBayes http://spambayes.sourceforge.net

Spammunition http://www.upserve.com/spammunition/default.asp

Spam Bully http://spambully.com

POPFile http://popfile.sourceforge.net
