Coding Horror

programming and human factors

Bayesian Kryptonite - spoofed email

I use POPFile bayesian filtering to keep email spam at bay. With a little training, this works amazingly well-- I'm at 99.8% accuracy, and that's with a little over a month of "training" precipitated by a recent server migration. But bayesian filtering has one big weakness that I'm seeing more and more: spoofed emails.

You know what I mean-- emails titled Your Account Has Been Violated from, ostensibly from The body is a direct cut and paste from a real PayPal email:

Security Center Advisory!

We recently noticed one or more attempts to log in to your PayPal account from a foreign IP address and we have reasons to belive that your account was hacked by a third party without your authorization. If you recently accessed your account while traveling, the unusual log in attempts may have been initiated by you.

If you are the rightful holder of the account you must click the link below and then complete all steps from the following page as we try to verify your identity.

Of course, the spoofer is desperately hoping you won't notice that the crazy URLs in their email ..;13012399;10693575;h?;13012399;10693575;h?

.. aren't actually pointing to (or, and you'll key in your account and password on their servers.

These spoof emails contain so-called "kryptonite" because they so closely mimic actual emails from PayPal with valid words and phrases. Bayesian filtering is useless against this type of spam; if the spammer knows what any email in your actual inbox looks like, he can construct one that will beat any Bayesian filter. This is a a strict requirement at the very heart of bayesian filtering itself; any knowledge of valid contents (eg, things that "get through") has to be strictly eliminated.

I usually just delete these emails from my inbox; what else can I do? One thing is for sure: popular web-based services can no longer communicate via email with their customers. That's like giving spoofers a free pass; once they have the "template" email they can copy and paste it into a spoof email that is almost guaranteed to get past bayesian filtering for users of that service.

eBay, for example, has almost given up altogether on email communication. You have to visit and check your web-based "message center" to communicate with them. I can't say I blame them; what other choice do they have?

Written by Jeff Atwood

Indoor enthusiast. Co-founder of Stack Overflow and Discourse. Disclaimer: I have no idea what I'm talking about. Find me here: