The Evolution Spam Filter

From Computer Tyme Support Wiki

Revision as of 23:54, 26 January 2016 by Marc (Talk | contribs)
Jump to: navigation, search

The Evolution Filter

The last big advancement in spam filtering was done by Paul Graham - A Plan for Spam. He was the first to apply Bayesian Filtering to blocking email back in 2002. Since then not a lot has happened to make spam filtering significantly better, till now. The Evolution Filter is a new plan for spam.

Overview of How it Works

Have you ever looked at a list of email messages and get the feeling that you can classify 70% of them just seeing the sender name and the subject line? Have you ever wondered how it is so easy to recognize ham and spam yet computer can't seem to figure out the obvious. When you look at the list - how is it that you can tell? What's going through your mind?

The way you recognize spam and ham is that when you see a subject line that looks similar to good email and dissimilar to spam then it's good email. And if it looks similar to spam because it says stuff that good email never says, it's spam.

For example, the subject line is "let's get some dinner". We know the message is good because spammers never say that. So if the subject is something you've seen before in good email and something that spammers never say - it's good email.

In fact - that's how the Evolution filter works. If, for example, the subject has words and phrases that match good email but spammers never say then the message is good. If the subject has words and phrases that match messages that spammers have used, but never seen in good email, then it's spam.

But, you might ask, where do I get a list of words and phrases that spammers never say? It's easier than you think. What I do is create a set of every word and phrase spammers do say and test to see if it's NOT on the list. In other words, I store all the words and phrases that are said in ham, and all the words and phrases that are said in spam. If the test message matches ham and doesn't match spam, it's ham. And if it matches spam and doesn't match ham, it's spam.

So - what do I mean by words and phrases? I take the subject and I break it down.

the quick brown fox jumps over the lazy dog

becomes

"the" "quick" "the quick" "brown" "quick brown" "the quick brown" "fox" "brown fox" "quick brown fox" "the quick brown fox" "jumps" "fox jumps" "brown fox jumps" "quick brown fox jumps" "over" "jumps over" "fox jumps over" "brown fox jumps over" "the" "over the" "jumps over the" "fox jumps over the" "lazy" "the lazy" "over the lazy" "jumps over the lazy" "dog" "lazy dog" "the lazy dog" "over the lazy dog"

Personal tools