February 6, 2005

Towards a better spam filter

As far as I can tell, there's no prior art for this idea, so here goes.

One technique spammers use is to create odd spellings of Vgiara and P'ron to try to evade filters. They also stuff randomly generated text at the end of mails with qskyrjsna words.

I'd like to propose a modest enhancement to spam filters to trap these. I call it a Bayesian Trigraph Filter.

Traditional Bayesian filters use the frequencies of whole words to score mail as spam. The word "prescription" might be 75% likely to indicate spam; "baccarat", 99%; and "disintermediation", 0%. Bayes's rule just says how to combine these per-word probabilities into a single score for the message. Words the filter has not seen before are assumed to be mildly spammy. The problem with this is that "new" words are often legitimate ones, either technical terms or just text quoted in a foreign language.
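To make the combining step concrete, here is a minimal sketch of one common way to do it: the naive Bayes combination with equal priors for spam and non-spam. The function name `combine` and the clamping bounds of 0.01 and 0.99 are assumptions for illustration, not taken from any particular filter. Clamping stops a single 0% or 100% word from deciding the result on its own.

```python
from math import prod

def combine(probabilities, floor=0.01, ceiling=0.99):
    """Combine per-word spam probabilities into one message score.

    Assumes the words are independent and that spam and non-spam are
    equally likely beforehand (the usual naive Bayes simplification).
    """
    # Clamp so that one extreme word cannot force the score to 0 or 1.
    clamped = [min(max(p, floor), ceiling) for p in probabilities]
    spam = prod(clamped)                  # evidence for spam
    ham = prod(1.0 - p for p in clamped)  # evidence against spam
    return spam / (spam + ham)

# "prescription", "baccarat", "disintermediation" from the example above.
score = combine([0.75, 0.99, 0.0])
print(round(score, 3))
```

With these three words the 99% and the clamped 0% cancel each other out, so the message ends up scored by "prescription" alone.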

My additional filtering stage is to break up the text into sequential trigraphs. A trigraph is simply a sequence of three consecutive characters, with spaces and punctuation included. An example:

Input: [ "The quick brown fox." ]

Output: [ "The", "he ", "e q", " qu", "qui", "uic", "ick", "ck ", "k b", " br", "bro", "row", "own", "wn ", "n f", " fo", "fox", "ox." ]
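The split above is just a sliding window of width three. A minimal sketch, where the function name `trigraphs` is my own label rather than anything standard:

```python
def trigraphs(text):
    """Return every overlapping three-character window of text, in order."""
    # A text shorter than three characters yields an empty list.
    return [text[i:i + 3] for i in range(len(text) - 2)]

print(trigraphs("The quick brown fox."))
```

A text of n characters produces n - 2 trigraphs, so the 20-character example gives the 18 entries listed above.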

Now treat the trigraph sequence as the words of the input document, and apply standard Bayesian filtering. Certain letter combinations ("oth") are common; others are highly unlikely ("qqh"). Many mis-spellings generate unlikely combinations. Deliberately made-up "salt" words, intended to overload your spam dictionary, are also likely to be rejected.
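A rough sketch of how the whole stage might look, assuming a naive Bayes scorer over trigraph counts. The class name `TrigraphFilter`, the 0.6 score for unseen trigraphs, the 0.01/0.99 clamps and the lower-casing are all illustrative assumptions; a real filter would want tuning and far more training data than the two toy messages here.

```python
from collections import Counter
from math import exp, log

def trigraphs(text):
    """Every overlapping three-character window of text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

class TrigraphFilter:
    def __init__(self, unseen=0.6):
        self.spam = Counter()   # trigraph counts from spam
        self.ham = Counter()    # trigraph counts from legitimate mail
        self.spam_total = 0
        self.ham_total = 0
        self.unseen = unseen    # assumed "mildly spammy" default

    def train(self, text, is_spam):
        tokens = trigraphs(text.lower())
        if is_spam:
            self.spam.update(tokens)
            self.spam_total += len(tokens)
        else:
            self.ham.update(tokens)
            self.ham_total += len(tokens)

    def token_probability(self, token):
        # Relative frequency of the trigraph in each corpus.
        s = self.spam[token] / max(self.spam_total, 1)
        h = self.ham[token] / max(self.ham_total, 1)
        if s + h == 0:
            return self.unseen
        return min(max(s / (s + h), 0.01), 0.99)

    def score(self, text):
        # Sum log-odds instead of multiplying, to avoid underflow.
        log_odds = 0.0
        for token in trigraphs(text.lower()):
            p = self.token_probability(token)
            log_odds += log(p) - log(1.0 - p)
        log_odds = max(min(log_odds, 500.0), -500.0)  # keep exp() in range
        return 1.0 / (1.0 + exp(-log_odds))

f = TrigraphFilter()
f.train("the quick brown fox jumps over the other lazy dog", is_spam=False)
f.train("qskyrjsna xqzvk wqqh zzkq", is_spam=True)
print(f.score("the other fox"), f.score("qqh zzk xqz"))
```

The only change from a word-based filter is the tokenizer: everything after `trigraphs()` is the same counting and combining that a whole-word Bayesian filter already does.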

Advantages: fast, easy, simple, unlikely to generate significant false positives.

Disadvantages: Might trip up on embedded uuencoded data, which is the one case where you deliberately have strange letter combinations; doesn't do anything for spams padded with valid corpus text.

Letter trigraph frequencies will vary between languages, but the filter will quickly learn which languages you converse in without having to learn the entire repertoire of words in those languages.

This approach would also be good for providing an initial seed for an untrained spam filter, when the user-specific Bayes dictionary is empty. Your trigraph distribution is much more likely to resemble other people's than your word use is.
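One way the seeding could work, sketched under the assumption that a shared table of trigraph counts ships with the filter and the user's own counts are simply added on top. `SHARED_HAM` and `effective_counts` are hypothetical names, and the one-sentence seed text stands in for what would really be counts pooled from many people's mail.

```python
from collections import Counter

def trigraphs(text):
    """Every overlapping three-character window of text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

# Hypothetical shared seed: stands in for counts pooled from many users.
SHARED_HAM = Counter(trigraphs("the other thing is that this weather is nothing new"))

def effective_counts(personal, seed=SHARED_HAM):
    """Seed counts plus whatever the user has trained so far."""
    return seed + personal

fresh_user = Counter()  # nothing trained yet
print(effective_counts(fresh_user)["the"])
```

As personal training accumulates, the user's own counts come to dominate the seed without any explicit switch-over.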

I'm due to upgrade my personal spam filter (dspam) at some point soon, so I may whip together some procmail/formail scripts and see how it goes. I've got a pretty decent personal corpus to test against. I'll let you know what I find!

Posted by Martin Geddes at 9:02 PM