Coding Horror

programming and human factors

Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?

I'm not a huge fan of The Daily WTF for reasons I've previously outlined. There is, however, the occasional gem – such as this one posted by ezrec:

Browsing through a web archive of some old computer club conversations, I ran across this sentence:

"Apple made the clbuttic mistake of forcing out their visionary - I mean, look at what NeXT has been up to!"

Hmm. "clbuttic".

Google "clbuttic" - thousands of hits!

There's someone who calls his car 'clbuttic'.

There are "Clbuttic Steam Engine" message boards.

Webster's dictionary - no help.

Hmm. What can this be?

As programmers, this isn't much of a mystery to us; it seems every day a brand new software developer is born and immediately begins repeating all the same mistakes we made years ago. I can't resist linking to Language Log again on this topic, where a commenter disputes whether or not this is an actual real world problem:

The "clbuttics" story may be a little exaggerated if not actually a web-legend. Sure, Google returns 4,000 hits – but by the time one reaches page 2 (in search of a page that isn't reporting on the silliness, or reporting on the reports, etc.) we're down to 200 hits.

Almost all of those 200 seem to have a "clbuttic mistake" by Apple at their core. Google's redundancy-compacting routines are only invoked when requested, it seems, and even then, the variety of information in 200 hits may be small.

In short, it's an echo chamber. 200 or 4,000 or however many hits today aren't as impressive as the same number last year, etc. All the more so as web sites of all kinds put randomly chosen (even Googled!) words out there just to game Google.

While I agree this particular manifestation of the mistake is probably over-reported (because, haha, butts are funny) and fairly rare on the open web, I still get this shiner on page one of my search results:

Is the song Dueling Banjos considered blue grbutt?

Poor Bluegrass World. I'm pretty sure that site is legitimate, though I have no idea how they'd post an article in that state. Obligatory link to dueling banjos scene from Deliverance. I'm inclined to believe this is, in fact, still a problem. There are many, many examples besides "clbuttic" out there. Perhaps you've heard of the United States Consbreastution?

Of course, what we have here is failed obscenity filters implemented by (extremely) newbie developers with regular expressions. I could explain, but as they say, a picture is worth a thousand words, particularly when it's a picture of my very bestest friend, RegexBuddy:

RegexBuddy: Consbreastitution

Oh, great, an inexperienced developer had a problem, and thought they would use regular expressions. Now they have two problems. Well, technically through Google they now have many thousands of problems, but who's counting.

I'm not sure regular expressions are to blame here. The replacement is so mind-bendingly naive that it might as well have been a simple Replace operation. We, being extra-smart-gets-things-done developers, would write a superior regular expression using the \b word boundary qualifier around the match, and use some capturing parens to handle both the singular and plural cases.

RegexBuddy: Great Tits

How about those Great Tits, eh?
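
If you want to reproduce both failure modes at home, here's a minimal C# sketch -- the word list and replacements are my own illustrative picks, not anyone's actual filter:

using System;
using System.Text.RegularExpressions;

class ObscenityFilterDemo
{
    static void Main()
    {
        // the naive version: a bare substring replacement, no word boundaries at all
        Console.WriteLine(Regex.Replace("a classic mistake", "ass", "butt"));
        // prints: a clbuttic mistake

        // the "improved" version: \b word boundaries plus a captured plural --
        // which still happily censors a perfectly innocent bird species
        Console.WriteLine(Regex.Replace("Great Tits are songbirds", @"\btit(s?)\b", "breast$1",
            RegexOptions.IgnoreCase));
        // prints: Great breasts are songbirds
    }
}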

Proving, yet again, that bad ideas are just plain bad ideas, regardless of language or implementation choices. Obscenity filters are like blacklists; using one is tantamount to admitting failure before you've even started.

But it still happens all the time. One of the most famous incidents was when the Yahoo! email developers created the accidental non-word Medireview. They weren't trying to filter obscenities, but JavaScript webmail exploits.

In 2001 Yahoo! silently introduced an email filter which changed some strings in HTML emails considered to be dangerous. While it was intended to stop spreading JavaScript viruses, no attempts were made to limit these string replacements to script sections and attributes, out of fear this would leave some loophole open. Additionally, word boundaries were not respected in the replacement.

The list of replacements:

Javascript → java-script
Jscript → j-script
Vbscript → vb-script
Livescript → live-script
Eval → review
Mocha → espresso
Expression → statement

Some side-effects of this implementation:

medieval → medireview
evaluation → reviewuation
expressionist → statementist
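
A one-line sketch reproduces the effect, assuming the filter was a plain case-insensitive replacement with no word boundaries:

using System;
using System.Text.RegularExpressions;

class MedireviewDemo
{
    static void Main()
    {
        // "eval" matches inside "medieval" and "evaluation", hence the famous side-effects
        Console.WriteLine(Regex.Replace("medieval evaluation", "eval", "review", RegexOptions.IgnoreCase));
        // prints: medireview reviewuation
    }
}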

medireview.com is currently occupied by domain squatters. Perhaps that's a fitting end for this "company", though I perversely almost want the company to exist, as wholly formed from our imaginations, sort of like Jamcracker.

I can't help wondering just how freaked out the brass at Yahoo must have been about then-new JavaScript browser exploits to actually deploy such a brain-damaged "solution". To be fair, it was seven years ago, but still – did it not occur to anyone that such broad replacement criteria might have some serious side-effects? Or that replacing one thing with another, when it comes to human beings and written language, is an activity that is fraught with peril even in the best possible circumstances?

Obscenity filtering is an enduring, maybe even timeless problem. I'm doubtful it will ever be possible to solve this particular problem through code alone. But it seems some companies and developers can't stop tilting at that windmill. Which means you might want to think twice before you move to Scunthorpe.

Discussion

Programming Is Hard, Let's Go Shopping!

A few months ago, Dare Obasanjo noticed a brief exchange my friend Jon Galloway and I had on Twitter. Unfortunately, Twitter makes it unusually difficult to follow conversations, but Dare outlines the gist of it in Developers, Using Libraries is not a Sign of Weakness:

The problem Jeff was trying to solve is how to allow a subset of HTML tags while stripping out all other HTML so as to prevent cross site scripting (XSS) attacks. The problem with Jeff's approach which was pointed out in the comments by many people including Simon Willison is that using regexes to filter HTML input in this way assumes that you will get fairly well-formed HTML. The problem with that approach which many developers have found out the hard way is that you also have to worry about malformed HTML due to the liberal HTML parsing policies of many modern Web browsers. Thus to use this approach you have to pretty much reverse engineer every HTML parsing quirk of common browsers if you don't want to end up storing HTML which looks safe but actually contains an exploit. Thus to utilize this approach Jeff really should have been looking at using a full fledged HTML parser such as SgmlReader or Beautiful Soup instead of regular expressions.

The sad thing is that Jeff Atwood isn't the first nor will he be the last programmer to think to himself "It's just HTML sanitization, how hard can it be?". There are many lists of Top HTML Validation Bloopers that show how tricky it is to get the right solution to this seemingly trivial problem. Additionally, it is sad to note that despite his recent experience, Jeff Atwood still argues that he'd rather make his own mistakes than blindly inherit the mistakes of others as justification for continuing to reinvent the wheel in the future. That is unfortunate given that it is a bad attitude for a professional software developer to have.

My response?

Bad attitude? I think that's a matter of perspective.

(The phrase "programming is hard, let's go shopping!" is a snowclone. As usual, Language Log has us covered. Ironically, we later had a brief run-in with Consultant Barbie "herself" on Stack Overflow – who you may know from reddit. There's no trace of her left on SO, but as griefing goes, it was fairly benign and even arguably on-topic.)

In the development of Stack Overflow, I determined early on that we'd be using Markdown for entering questions and answers in the system. Unfortunately, Markdown allows users to intermix HTML into the markup. It's part of the spec and everything. I sort of wish it wasn't, actually – one of the great attractions of pseudo-markup languages like BBCode is that they have nothing in common with HTML and thus sanitizing the input becomes trivial. Users have two choices:

  • Enter approved pseudo-markup.
  • Trick question. There is no other choice!

With BBCode, if the user enters HTML you blow it away with extreme prejudice – it's encoded, without exceptions. Easy. No thinking and barely any code required.
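
For contrast, the entire "sanitizer" for a BBCode-style system could be little more than this sketch (the step that expands the approved [b]-style pseudo-markup afterwards is left out):

using System;
using System.Web; // HttpUtility lives here; requires a reference to System.Web.dll

class BbCodeEncodeDemo
{
    static string MakeSafe(string userInput)
    {
        // encode *everything* the user typed; the approved [b]...[/b] pseudo-markup
        // survives encoding untouched and can be expanded into real HTML afterwards
        return HttpUtility.HtmlEncode(userInput);
    }

    static void Main()
    {
        Console.WriteLine(MakeSafe("[b]hello[/b] <script>alert(1)</script>"));
        // prints: [b]hello[/b] &lt;script&gt;alert(1)&lt;/script&gt;
    }
}
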
Since we use Markdown, we don't have that luxury. Like it or not, we are now in the nasty, brutish business of distinguishing "good" HTML markup from "evil" HTML markup. That's hard. Really hard. Dare and Jon are right to question the competency and maybe even the sanity of any developer who willingly decided to bite off that particular problem.

But here's the thing: deeply understanding HTML sanitization is a critical part of my business. Users entering markdown isn't just some little tickbox in a feature matrix for me, it is quite literally the entire foundation that our website is built on.

Here's a pop quiz from way back in 2001. See how you do.

  1. Code Reuse is:
    1. Good
    2. Bad
  2. Reinventing the Wheel is:
    1. Good
    2. Bad
  3. The Not-Invented-Here Syndrome is:
    1. Good
    2. Bad

I'm sure most developers are practically climbing over each other in their eagerness to answer at this point. Of course code reuse is good. Of course reinventing the wheel is bad. Of course the not-invented-here syndrome is bad.

Except when it isn't.

Joel Spolsky explains:

If it's a core business function – do it yourself, no matter what.

Pick your core business competencies and goals, and do those in house. If you're a software company, writing excellent code is how you're going to succeed. Go ahead and outsource the company cafeteria and the CD-ROM duplication. If you're a pharmaceutical company, write software for drug research, but don't write your own accounting package. If you're a web accounting service, write your own accounting package, but don't try to create your own magazine ads. If you have customers, never outsource customer service.

Being a "professional" developer, if there really is such a thing – I still have my doubts – doesn't mean choosing third-party libraries for every possible programming task you encounter. Nor does it mean blindly writing everything yourself out of a misguided sense of duty or the perception that's what gonzo, hardcore programming types do. Rather, experienced developers learn what their core business functions are and write whatever software they deem necessary to perform those functions extraordinarily well.

Do I regret spending a solid week building a set of HTML sanitization functions for Stack Overflow? Not even a little. There are plenty of sanitization solutions outside the .NET ecosystem, but precious few for C# or VB.NET. I've contributed the core code back to the community, so future .NET adventurers can use our code as a guidepost (or warning sign, depending on your perspective) on their own journey. They can learn from the simple, proven routine we wrote and continue to use on Stack Overflow every day.

(Sadly this code outlasted the refactormycode website, so I've reprinted the final version here; consider this MIT licensed.)

// matches any HTML tag, including one left unterminated at the end of the input
private static Regex _tags = new Regex("<[^>]*(>|$)",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);

// simple whitelisted tags: b, blockquote, code, dd, dt, dl, del, em, h1-h3, i, kbd,
// li, ol, p, pre, s, sub, sup, strong, strike, ul, plus <br> and <hr>
private static Regex _whitelist = new Regex(@"
    ^</?(b(lockquote)?|code|d(d|t|l|el)|em|h(1|2|3)|i|kbd|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)>$|
    ^<(b|h)r\s?/?>$",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

// whitelisted <a> tags: href must be an in-page anchor or an http/https/ftp URL,
// with an optional title attribute
private static Regex _whitelist_a = new Regex(@"
    ^<a\s
    href=""(\#\d+|(https?|ftp)://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+)""
    (\stitle=""[^""<>]+"")?\s?>$|
    ^</a>$",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

// whitelisted <img> tags: src must be an http/https URL, with optional
// width, height, alt, and title attributes
private static Regex _whitelist_img = new Regex(@"
    ^<img\s
    src=""https?://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+""
    (\swidth=""\d{1,3}"")?
    (\sheight=""\d{1,3}"")?
    (\salt=""[^""<>]*"")?
    (\stitle=""[^""<>]*"")?
    \s?/?>$",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);


/// <summary>
/// sanitize any potentially dangerous tags from the provided raw HTML input using 
/// a whitelist based approach, leaving the "safe" HTML tags
/// CODESNIPPET:4100A61A-1711-4366-B0B0-144D1179A937
/// </summary>
public static string Sanitize(string html)
{
    if (String.IsNullOrEmpty(html)) return html;

    string tagname;
    Match tag;

    // match every HTML tag in the input
    MatchCollection tags = _tags.Matches(html);
    // iterate in reverse, so removing a tag doesn't shift the indexes
    // of the matches we haven't processed yet
    for (int i = tags.Count - 1; i > -1; i--)
    {
        tag = tags[i];
        tagname = tag.Value.ToLowerInvariant();
        
        if(!(_whitelist.IsMatch(tagname) || _whitelist_a.IsMatch(tagname) || _whitelist_img.IsMatch(tagname)))
        {
            html = html.Remove(tag.Index, tag.Length);
            System.Diagnostics.Debug.WriteLine("tag sanitized: " + tagname);
        }
    }

    return html;
}
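
To get a feel for what does and doesn't survive the whitelist, here's a quick hypothetical before-and-after -- the input string is mine, not part of the original snippet:

// assumes the Sanitize method and the regexes above are in scope
string input = "<b>bold</b> <script>alert(0)</script> " +
               "<a href=\"http://example.com/\">a link</a> <iframe src=\"http://evil.example\">";
string output = Sanitize(input);
// <b>, </b>, <a href="http://example.com/">, and </a> match the whitelists and survive;
// <script>, </script>, and <iframe ...> are removed. Note that the text between the
// script tags ("alert(0)") is left behind, since only the tags themselves are stripped:
// "<b>bold</b> alert(0) <a href=\"http://example.com/\">a link</a> "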

Honestly, I'm not that great of a developer. I'm not so much talented as competent and loud. Start writing and talking and you can be loud, too. But I'll tell you this: in choosing to fight that HTML sanitizer battle, I've earned the scars of experience. I don't have to take anybody's word for it – I don't have to trust "libraries". I can look at the code, examine the input and output, and predict exactly what kinds of problems might arise. I have a deep and profound understanding of the risks, pitfalls, and tradeoffs of HTML sanitization.. and cross-site scripting vulnerabilities.

What I cannot create, I do not understand.

As Richard Feynman so famously wrote on his last blackboard, what I cannot create, I do not understand.

This is exactly the kind of programming experience I need to keep watch over Stack Overflow, and I wouldn't trade it for anything. You may not be building a website that depends on users entering markup, so you might make a different decision than I did. But surely there's something, some core business competency, so important that you feel compelled to build it yourself, even if it means making your own mistakes.

Programming is hard. But that doesn't mean you should always go shopping for third party libraries instead of writing code. If it's a core business function, write that code yourself, no matter what. If other programmers don't understand why it's so critically important that you sit down and write that bit of code – well, that's their problem.

They're probably too busy shopping to understand.

Discussion

Preventing CSRF and XSRF Attacks

In Cross-Site Request Forgeries and You I urged developers to take a close look at possible CSRF / XSRF vulnerabilities on their own websites. They're the worst kind of vulnerability -- very easy to exploit by attackers, yet not so intuitively easy to understand for software developers, at least until you've been bitten by one.

On the Freedom to Tinker blog, Bill Zeller offers one of the best, most concise explanations of XSRF that I've read to date:

CSRF vulnerabilities occur when a website allows an authenticated user to perform a sensitive action but does not verify that the user herself is invoking that action. The key to understanding CSRF attacks is to recognize that websites typically don't verify that a request came from an authorized user. Instead they verify only that the request came from the browser of an authorized user. Because browsers run code sent by multiple sites, there is a danger that one site will (unbeknownst to the user) send a request to a second site, and the second site will mistakenly think that the user authorized the request.

That's the key element to understanding XSRF. Attackers are gambling that users have a validated login cookie for your website already stored in their browser. All they need to do is get that browser to make a request to your website on their behalf. If they can either:

  1. Convince your users to click on an HTML page they've constructed
  2. Insert arbitrary HTML in a target website that your users visit

The XSRF game is afoot. Not too difficult, is it?

Bill Zeller and Ed Felten also identified new XSRF vulnerabilities in four major websites less than two weeks ago:

  1. ING Direct

    We discovered CSRF vulnerabilities in ING's site that allowed an attacker to open additional accounts on behalf of a user and transfer funds from a user's account to the attacker's account.

  2. YouTube

    We discovered CSRF vulnerabilities in nearly every action a user can perform on YouTube.

  3. MetaFilter

    We discovered a CSRF vulnerability in MetaFilter that allowed an attacker to take control of a user's account.

  4. The New York Times

    We discovered a CSRF vulnerability in NYTimes.com that makes user email addresses available to an attacker. If you are a NYTimes.com member, arbitrary sites can use this attack to determine your email address and use it to send spam or to identify you.

If major public facing websites are falling prey to these serious XSRF exploits, how confident do you feel that you haven't made the same mistakes? Consider carefully. I'm saying this as a developer who has already made these same mistakes on his own website. I'm just as guilty as anyone.

It's our job to make sure future developers don't repeat the same stupid mistakes we made -- at least not without a fight. The Felten and Zeller paper (pdf) recommends the "double-submitted cookie" method to prevent XSRF:

When a user visits a site, the site should generate a (cryptographically strong) pseudorandom value and set it as a cookie on the user's machine. The site should require every form submission to include this pseudorandom value as a form value and also as a cookie value. When a POST request is sent to the site, the request should only be considered valid if the form value and the cookie value are the same. When an attacker submits a form on behalf of a user, he can only modify the values of the form. An attacker cannot read any data sent from the server or modify cookie values, per the same-origin policy. This means that while an attacker can send any value he wants with the form, he will be unable to modify or read the value stored in the cookie. Since the cookie value and the form value must be the same, the attacker will be unable to successfully submit a form unless he is able to guess the pseudorandom value.

The advantage of this approach is that it requires no server state; you simply set the cookie value once, then every HTTP POST checks to ensure that one of the submitted <input> values contains the exact same cookie value. Any difference between the two means a possible XSRF attack.
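
Here's roughly how that check might look in ASP.NET. This is a sketch only -- the "xsrf-token" cookie and field names are made up, and real code would also want a constant-time comparison and an expiry on the cookie:

using System;
using System.Security.Cryptography;
using System.Web;

public static class XsrfGuard
{
    // issue the (cryptographically strong) pseudorandom value as a cookie
    public static void IssueToken(HttpResponse response)
    {
        var bytes = new byte[16];
        new RNGCryptoServiceProvider().GetBytes(bytes);
        string token = Convert.ToBase64String(bytes);
        response.Cookies.Add(new HttpCookie("xsrf-token", token));
    }

    // on every POST, the hidden form field must echo the cookie exactly
    public static bool IsValidPost(HttpRequest request)
    {
        HttpCookie cookie = request.Cookies["xsrf-token"];
        string formValue = request.Form["xsrf-token"];
        return cookie != null
            && !String.IsNullOrEmpty(formValue)
            && cookie.Value == formValue;
    }
}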

An even stronger, albeit more complex, prevention method is to leverage server state -- to generate (and track, with timeout) a unique random key for every single HTML FORM you send down to the client. We use a variant of this method on Stack Overflow with great success. That's why with every <form> you'll see the following:

<input id="fkey" name="fkey" type="hidden" value="df8652852f139" />

If you want to audit a website for XSRF vulnerabilities, start by asking this simple question about every single HTML form you can find: "where's the XSRF value?"

Discussion

The Importance of Sitemaps

So I've been busy with this Stack Overflow thing over the last two weeks. By way of apology, I'll share a little statistic you might find interesting: the percentage of traffic from search engines at stackoverflow.com.

  • Sept 16th (one day after public launch): 10%
  • October 11th (less than one month after public launch): 50%

I try to be politically correct in discussing web search, avoiding the g-word whenever possible, desperately attempting to preserve the illusion that web search is actually a competitive market. But it's becoming a transparent and cruel joke at this point. When we say "web search" we mean one thing, and one thing only: Google. Rich Skrenta explains:

I'm not a professional analyst, and my approach here is pretty back-of-the-napkin. Still, it confirms what those of us in the search industry have known for a long time.

The New York Times, for instance, gets nearly 6 times as much traffic from Google as it does from Yahoo. Tripadvisor gets 8 times as much traffic from Google vs. Yahoo.

Even Yahoo's own sites are no different. While it receives a greater fraction of Yahoo search traffic than average, Yahoo's own flickr service gets 2.4 times as much traffic from Google as it does from Yahoo.

My favorite example: According to Hitwise, [ex] Yahoo blogger Jeremy Zawodny gets 92% of his inbound search traffic from Google, and only 2.7% from Yahoo.

That was written almost two years ago. Guess which way those numbers have gone since then?

Google generally does a great job, so they deserve their success wholeheartedly, but I have to tell you: Google's current position as the start page for the internet kind of scares the crap out of me, in a way that Microsoft's dominance over the desktop PC never did. I mean, monopoly power over a desktop PC is one thing -- but the internet is the whole of human knowledge, or something rapidly approaching that. Do we really trust one company to be a benevolent monopoly over.. well, everything?

But I digress. Our public website isn't even a month old, and Google is already half our traffic. I'm perfectly happy to feed Google the kind of quality posts (well, mostly) fellow programmers are creating on Stack Overflow. The traffic graph provided by Analytics is amusingly predictable, as well.

stackoverflow.com traffic graph, sep. 16 - oct. 11

Giant peak of initial interest, followed by the inevitable trough of disillusionment, and then the growing weekly humpback pattern of a site that actually (shock and horror) appears to be useful to some people. Go figure. Guess they call it crackoverflow for a reason.

We knew from the outset that Google would be a big part of our traffic, and I wanted us to rank highly in Google for one very selfish reason -- writing search code is hard. It's far easier to outsource the burden of search to Google and their legions of server farms than it is for our tiny development team to build search ourselves on our one itty-bitty server -- at least, to build it well.

I'm constantly looking up my own stuff via Google searches, and I guess I've gotten spoiled. I expect to type in a few relatively unique words from the title and have whatever web page I know is there appear instantly in front of me. For the first two weeks, this was definitely not happening reliably for Stack Overflow questions. I'd type in the exact title of a question and get nothing. Sometimes I'd even get copies of our content from evil RSS scraper sites that plug in their own ads of questionable provenance, which was downright depressing. Other times, I'd enter a question title and get a perfect match. Why was old reliable Google letting me down? Our site is simple, designed from the outset to be easy for search engines to crawl. What gives?

What I didn't understand was the importance of a little file called sitemap.xml.

On a Q&A site like Stack Overflow, only the most recent questions are visible on the homepage. The URL to get to the entire list of questions looks like this:

http://stackoverflow.com/questions
http://stackoverflow.com/questions?page=2
http://stackoverflow.com/questions?page=3
..
http://stackoverflow.com/questions?page=931

Not particularly complicated. I naively thought Google would have no problem crawling all the questions in this format. But after two weeks, it wasn't happening. My teammate, Geoff, clued me in to Google's webmaster help page on sitemaps:

Sitemaps are particularly helpful if:

  • Your site has dynamic content.
  • Your site has pages that aren't easily discovered by Googlebot during the crawl process - for example, pages featuring rich AJAX or Flash.
  • Your site is new and has few links to it. (Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it may be hard for us to discover it.)
  • Your site has a large archive of content pages that are not well linked to each other, or are not linked at all.

I guess I was spoiled by my previous experience with blogs, which are almost incestuously hyperlinked, where everything ever posted has a permanent and static hyperlink attached to it, with simple monthly and yearly archive pages. With more dynamic websites, this isn't necessarily the case. The pagination links on Stack Overflow were apparently enough to prevent full indexing.

Enter sitemap.xml. The file itself is really quite simple; it's basically a non-spammy, non-shady way to have a "page" full of links that you feed to search engines. A way that is officially supported and endorsed by all the major web search engines. An individual record looks something like this:

<url>
<loc>http://stackoverflow.com/questions/24109/c-ide-for-linux</loc>
<lastmod>2008-10-11</lastmod>
<changefreq>daily</changefreq>
<priority>0.6</priority>
</url>
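
(For completeness: those <url> records all live inside a single top-level <urlset> element, with the namespace required by the sitemaps.org protocol. Truncated here, obviously.)

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://stackoverflow.com/questions/24109/c-ide-for-linux</loc>
    <lastmod>2008-10-11</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.6</priority>
  </url>
  <!-- ... one <url> element per question ... -->
</urlset>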

The above element is repeated for each one of the ~27,000 questions on Stack Overflow at the moment. Most search engines assume the file is at the root of your site, but you can inform them of an alternate location through robots.txt:

User-Agent: *
Allow: /
Sitemap: http://stackoverflow.com/sitemap.xml

There are also limits on size. A sitemap.xml file cannot exceed 10 megabytes, with no more than 50,000 URLs per file. But you can have multiple sitemap files listed in a sitemap index file, too. If you have millions of URLs, you can see where this starts to get hairy fast.
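
A sitemap index is just another small XML file that points at the individual sitemap files, per the same sitemaps.org protocol; something like this (the file names here are hypothetical):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://stackoverflow.com/sitemap-questions-1.xml</loc>
    <lastmod>2008-10-11</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://stackoverflow.com/sitemap-questions-2.xml</loc>
  </sitemap>
</sitemapindex>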

I'm a little aggravated that we have to set up this special file for the Googlebot to do its job properly; it seems to me that web crawlers should be able to spider down our simple paging URL scheme without me giving them an explicit assist.

The good news is that since we set up our sitemap.xml, every question on Stack Overflow is eminently findable. But when 50% of your traffic comes from one source, perhaps it's best not to ask these kinds of questions.

pixelated google overlords

Just smile and nod and follow the rules like everyone else. I, for one, welcome our pixelated Google overlords!

Discussion

Cross-Site Request Forgeries and You

As the web becomes more and more pervasive, so do web-based security vulnerabilities. I talked a little bit about the most common web vulnerability, cross-site scripting, in Protecting Your Cookies: HttpOnly. Although XSS is incredibly dangerous, it's a fairly straightforward exploit to understand. Do not allow users to insert arbitrary HTML on your site. The name of the XSS game is sanitizing user input. If you stick to a whitelist based approach -- only allow input that you know to be good, and immediately discard anything else -- then you're usually well on your way to solving any XSS problems you might have.

I thought we had our website vulnerabilities licked with XSS. I was wrong. Steve Sanderson explains:

Since XSS gets all the limelight, few developers pay much attention to another form of attack that's equally destructive and potentially far easier to exploit. Your application can be vulnerable to cross-site request forgery (CSRF) attacks not because you the developer did something wrong (as in, failing to encode outputs leads to XSS), but simply because of how the whole Web is designed to work. Scary!

It turns out I didn't understand how cross-site request forgery, also known as XSRF or CSRF, works. It's not complicated, necessarily, but it's more.. subtle.. than XSS.

Let's say we allow users to post images on our forum. What if one of our users posted this image?

<img src="http://foo.com/logout">

Not really an image, true, but it will force the target URL to be retrieved by any random user who happens to browse that page -- using their browser credentials! From the webserver's perspective, there is no difference whatsoever between a real user initiated browser request and the above image URL retrieval.

If our logout page was a simple HTTP GET that required no confirmation, every user who visited that page would immediately be logged out. That's XSRF in action. Not necessarily dangerous, but annoying. Not too difficult to envision much more destructive versions of this technique, is it?

There are two obvious ways around this sort of basic XSRF attack:

  1. Use an HTTP POST form submission for logout, not a garden variety HTTP GET.
  2. Make the user confirm the logout.

Easy fix, right? We probably should have been doing both of these things in the first place anyway. Duh!

Not so fast. Even with both of the above fixes, you are still vulnerable to XSRF attacks. Let's say I took my own advice, and converted the logout form to an HTTP POST, with a big button titled "Log Me Out" confirming the action. What's to stop a malicious user from placing a form like this on their own website ..

<body onload="document.getElementById('f').submit()">
<form id="f" action="http://foo.com/logout" method="post">
<input name="Log Me Out" value="Log Me Out" />
</form>
</body>

.. and then convincing other users to click on it?

Remember, the browser will happily act on this request, submitting this form along with all necessary cookies and credentials directly to your website. Blam. Logged out. Exactly as if they had clicked on the "Log Me Out" button themselves.

Sure, it takes a tiny bit more social engineering to convince users to visit some random web page, but it's not much. And the possibilities for attack are enormous: with XSRF, malicious users can initiate any arbitrary action they like on a target website. All they need to do is trick unwary users of your website -- who already have a validated user session cookie stored in their browser -- into clicking on their links.

So what can we do to protect our websites from these kinds of cross site request forgeries?

  1. Check the referrer. The HTTP referrer, or HTTP "referer" as it is now permanently misspelled, should always come from your own domain. You could reject any form posts from alien referrers. However, this is risky, as some corporate proxies strip the referrer from all HTTP requests as an anonymization feature. You would end up potentially blocking legitimate users. Furthermore, spoofing the referrer value is extremely easy. All in all, a waste of time. Don't even bother with referrer checks.

  2. Secret hidden form value. Send down a unique server form value with each form -- typically tied to the user session -- and validate that you get the same value back in the form post. The attacker can't simply scrape your remote form as the target user through JavaScript, thanks to same-domain request limits in the XmlHttpRequest function.

  3. Double submitted cookies. It's sort of ironic, but another way to prevent XSRF, essentially a cookie-based exploit, is to add more cookies! Double submitting means sending the cookie both ways in every form request: first as a traditional header value, and again as a form value -- read via JavaScript and inserted. The trick here is that remote XmlHttpRequest calls can't read cookies. If either of the values don't match, discard the input as spoofed. The only downside to this approach is that it does require your users to have JavaScript enabled, otherwise their own form submissions will be rejected.

If your web site is vulnerable to XSRF, you're in good company. Digg, GMail, and Wikipedia have all been successfully attacked this way before.

Maybe you're already protected from XSRF. Some web frameworks provide built in protection for XSRF attacks, usually through unique form tokens. But do you know for sure? Don't make the same mistake I did! Understand how XSRF works and ensure you're protected before it becomes a problem.

Discussion