Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?
Browsing through a web archive of some old computer club conversations, I ran across this sentence:
"Apple made the clbuttic mistake of forcing out their visionary - I mean, look at what NeXT has been up to!"
Google "clbuttic" - thousands of hits!
There's a someone who call his car 'clbuttic'.
There are "Clbuttic Steam Engine" message boards.
Webster's dictionary - no help.
Hmm. What can this be?
As programmers, this isn't much of a mystery to us; it seems every day a brand new software developer is born and immediately begins repeating all the same mistakes we made years ago. I can't resist linking to Language Log again on this topic, where a commenter disputes whether or not this is an actual real world problem:
The "clbuttics" story may be a little exaggerated if not actually a web-legend. Sure, Google returns 4,000 hits – but by the time one reaches page 2 (in search of a page that isn't reporting on the silliness, or reporting on the reports, etc.) we're down to 200 hits.
Almost all of those 200 seem to have a "clbuttic mistake" by Apple at their core. Google's redundancy-compacting routines are only invoked when requested, it seems, and even then, the variety of information in 200 hits may be small.
In short, it's an echo chamber. 200 or 4,000 or however many hits today aren't as impressive as the same number last year, etc. All the more so as web sites of all kinds put randomly chosen (even Googled!) words out there just to game Google.
While I agree this particular manifestation of the mistake is probably over-reported (because, haha, butts are funny) and fairly rare on the open web, I still get this shiner on page one of my search results:
Poor Bluegrass World. I'm pretty sure that site is legitimate, though I have no idea how they'd post an article in that state. Obligatory link to dueling banjos scene from Deliverance. I'm inclined to believe this is, in fact, still a problem. There are many, many examples besides "clbuttic" out there. Perhaps you've heard of the United States Consbreastution?
Of course, what we have here is failed obscenity filters implemented by (extremely) newbie developers with regular expressions. I could explain, but as they say, a picture is worth a thousand words, particularly when it's a picture of my very bestest friend, RegexBuddy:
Oh, great, an inexperienced developer had a problem, and thought they would use regular expressions. Now they have two problems. Well, technically through Google they now have many thousands of problems, but who's counting.
I'm not sure regular expressions are to blame here. The replacement is so mind-bendingly naive that it might as well have been a simple
Replace operation. We, being extra-smart-gets-things-done developers, would write a superior regular expression using the
b word boundary qualifier around the replacement, and use some capturing parens to handle both the singular and plural cases.
How about those Great Tits, eh?
Proving, yet again, that bad ideas are just plain bad ideas, regardless of language or implementation choices. Obscenity filters are like blacklists; using one is tantamount to admitting failure before you've even started.
The list of replacements:
Some side-effects of this implementation:
medireview.com is currently occupied by domain squatters. Perhaps that's a fitting end for this "company", though I perversely almost want the company to exist, as wholly formed from our imaginations, sort of like Jamcracker.
Obscenity filtering is an enduring, maybe even timeless problem. I'm doubtful it will ever be possible to solve this particular problem through code alone. But it seems some companies and developers can't stop tilting at that windmill. Which means you might want to think twice before you move to Scunthorpe.