Coding Horror

programming and human factors

Sorting for Humans : Natural Sort Order

The default sort functions in almost every programming language are poorly suited for human consumption. What do I mean by that? Well, consider the difference between sorting filenames in Windows Explorer, and sorting those very same filenames via Array.Sort() code:

[Image: the same filenames sorted two ways: the Explorer shell's alphabetical sort vs. Array.Sort()'s ASCIIbetical sort]

Quite a difference.

I can say without the slightest hint of exaggeration that this exact sorting problem has been a sore point on every single project I've ever worked on. Users will inevitably complain that their items aren't sorting properly, and file bugs on these "errors". Being card-carrying members of the homo logicus club, we programmers produce a weary sigh, and try to keep any obvious eye-rolling in check as we patiently inform our users that this isn't an error. Items are sorting in proper order. Proper ASCII order, that is. As we're walking away, hopefully you won't hear us mutter under our breath what we're actually thinking-- "Stupid users! They don't even understand how sorting works!"

I always felt a pang of regret when rejecting these requests. Honestly, look at those two lists-- what sane person would want ASCII order? It's a completely nonsensical ordering to anyone who doesn't have the ASCII chart committed to memory (and by the way, uppercase A is decimal 65). I never really understood that there was another way to sort, even though natural sort has been right in front of us all along in the form of Mac Finder and Windows Explorer file listings. I had language-induced blinders on. If our built-in sort returns in ASCII order, then that must be correct. It was bequeathed to us by the Language Gods. Can there be any other way?

Kate Rhodes is a bit up in arms about our collective ignorance of ASCIIbetical vs. Alphabetical. Can't say I blame her. I'm as guilty as anyone. Turns out the users weren't the stupid ones after all -- I was.

Silly me, I just figured that alphabetical sorting was such a common need (judging by the number of people asking how to do it I'm not wrong either) that I wouldn't have to write the damn thing. But I didn't count on the stupid factor. Jesus Christ people. You're programmers. You're almost all college graduates and none of you know what the f**k "Alphabetical" means. You should all be ashamed. If any of you are using your language's default sort algorithm, which is almost guaranteed to be ASCIIbetical (for good reason) to get alphabetical sorting you proceed to the nearest mirror and slap yourself repeatedly before returning to your desks and fixing your unit tests that didn't catch this problem.

It isn't called "Alphabetical sort"; it's collectively known as natural sort. But she's right about one thing: it's hard to find information on natural sorting, and many programmers are completely ignorant of it. None of the common computer languages (that I know of) implement anything other than ASCIIbetical sorts. There are a few places you can find natural sort algorithms, however; Ned Batchelder's compact Python implementation is one of the best known.

Don't let Ned's clever Python ten-liner fool you. Implementing a natural sort is more complex than it seems, and not just for the gnarly i18n issues I've hinted at above. But the Python implementations are impressively succinct. One of Ned's commenters posted this version, which is even shorter:

import re

def sort_nicely(l):
    """ Sort the given list in the way that humans expect. """
    convert = lambda text: int(text) if text.isdigit() else text
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    l.sort(key=alphanum_key)

I tried to come up with a clever, similarly succinct C# 3.0 natural sort implementation, but I failed. I'm not interested in a one-liner contest, necessarily, but it does seem to me that a basic natural sort shouldn't require the 40+ lines of code it takes in most languages.
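
For illustration, here's a rough C# sketch of the same split-digits-and-compare idea the Python version uses. Consider it a starting point, not a polished implementation: NaturalCompare is my own hypothetical helper, not a framework method, and it sidesteps the i18n issues entirely by comparing text ordinally.

using System;
using System.Text.RegularExpressions;

public static class NaturalSort
{
    // Split each string into alternating text and digit chunks, then compare
    // chunk by chunk: digit runs compare as numbers, everything else as
    // case-insensitive text.
    public static int NaturalCompare(string a, string b)
    {
        string[] ax = Regex.Split(a, "([0-9]+)");
        string[] bx = Regex.Split(b, "([0-9]+)");

        for (int i = 0; i < Math.Min(ax.Length, bx.Length); i++)
        {
            int na, nb;
            int result;
            if (int.TryParse(ax[i], out na) && int.TryParse(bx[i], out nb))
                result = na.CompareTo(nb);
            else
                result = string.Compare(ax[i], bx[i], StringComparison.OrdinalIgnoreCase);

            if (result != 0)
                return result;
        }

        return ax.Length.CompareTo(bx.Length);
    }
}

// usage: Array.Sort(filenames, NaturalSort.NaturalCompare);

It sorts the simple cases the way Explorer does, but note the fine print: digit runs too long to fit in an int silently fall back to text comparison, and there's no culture-aware collation. Those are exactly the gnarly bits that keep real implementations north of 40 lines.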

As programmers, we'd do well to keep Kate's lesson in mind: ASCIIbetical does not equal alphabetical. ASCII sorting serves the needs of the computer and the compiler, but what about us human beings? Perhaps a more human-friendly natural sort option should be built into mainstream programming languages, too.

Discussion

Blacklists Don't Work

Jon Galloway and I got into a heated debate a few weeks ago about the efficacy of anti-virus software. My position is that anti-virus software sucks, and worst of all, it doesn't work anyway. That's what I've been saying all along, and it's exactly what I told Jon, too:

The performance cost of virus scanning (lose 50% of disk performance, plus some percent of CPU speed) does not justify the benefit of a 33% detection rate and marginal protection. I would argue the illusion of protection is very, very dangerous as well.

Ask yourself this: why don't Mac users run anti-virus software? Why don't UNIX users run anti-virus software? Because they don't need to. They don't run as administrators. Sadly, the cost of running as non-admin is severe on Windows, because MS made some early, boneheaded architectural decisions and perpetuated them over a decade. But the benefit is substantial. There's almost nothing a virus, malware, or trojan can do to a user who isn't running as an administrator.

I believe we should invest our money, time, and effort in things that make sense, things that work. Things like running as a non-administrator. And we should stop wasting our time on voodoo, which is what anti-virus software ultimately is.

To be fair, anti-virus software is more effective than I realized. In the August 2007 Anti-Virus Comparatives, the lowest detection rate was 90%, and the highest was 99.6%.

But I have a problem with the test methodology that produced these results. If we build a library of tests using all the viruses and malware in all of recorded history, we'll get an absurdly high detection rate. But who really cares if Kaspersky can detect a year-old virus, much less a three- or four-year-old one? What matters most, I think, is the detection rate for new threats. That's what's really dangerous, not some ancient strain of a long-forgotten DOS virus. I'm sure anti-virus vendors love comparatives like this. It makes for great ad copy: we can detect 99.6% of threats! The bad news, which is hidden by a footnote marker and placed in 4-point text at the bottom of the page, is that 99.3% of them are so old as to be utterly irrelevant and meaningless. (Update: in a comment, Anders pointed out that a November 27th "proactive/retrospective" test (pdf) from the same site, using threats only a month old, showed far lower detection rates: between 80% and 33%.)

We could appeal to the data. Of the top 5 threats on the virus radar, only one is younger than six months. However, the youngest dates from December 4th, a mere eight days ago. And it only takes one. If anything gets through your anti-virus software, you're just as compromised as you would be if you were running no anti-virus software at all.

But for now, let's assume these comparative statistics are correct. The heroic anti-virus teams can detect 99.6% of all the evil code in the world, and protect you from it, in the name of truth, justice, and the American Way. But it's far from automatic. It only works if you stick to the plan. You know, the plan:

  1. Purchase the best, most effective third party anti-virus software available. On a subscription plan. And install it.
  2. Suck up the massive real-time virus check performance penalty.
  3. Keep your anti-virus religiously up to date at all times. (Hourly? Daily? Weekly?)
  4. Pray your anti-virus vendor can deliver signature updates faster than all the combined virus, trojan, and malware writers on the internet can create and deliver their payloads.

Wow, not much can go wrong there. And then you only have a 0.4% chance (or a 20% chance, depending on which set of data you believe) of getting in very big trouble. Problem solved!

Or you could just, y'know, not run as an administrator, and then you'd never have any chance of getting in trouble. Ever. Well, at least not from trojans, malware, or viruses. But evidently a few children's programs fail to run as non-administrator, and programming as a non-administrator is difficult, so that's a deal-breaker for Jon.

After a lot (really, a lot) of back and forth with Jon on this topic, I realized that my position boiled down to one core belief:

Blacklists don't work.

At its heart, anti-virus software is little more than a glorified blacklist. It maintains an internal list of evil applications and their unique byte signatures, and if it sees one on your system, kills it for you. Sure, anti-virus vendors will dazzle you with their ad copy, their heuristic this and statistical that; they'll tell you (with a straight face, even) that their software is far more than a simple blacklist. It's a blacklist with lipstick. It's the prettiest, shiniest, most kissable blacklist you've ever seen!
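
Strip away the marketing and the core mechanism looks something like this toy sketch. The signature bytes here are invented for illustration; real engines use smarter matching and ship millions of signatures, but the heart of the thing is still a list of known-bad patterns:

using System.Collections.Generic;
using System.IO;

public class ToyVirusScanner
{
    // The blacklist: byte patterns of known-bad files.
    // These signatures are made up for illustration only.
    static readonly Dictionary<string, byte[]> Signatures = new Dictionary<string, byte[]>
    {
        { "Example.Trojan.A", new byte[] { 0xDE, 0xAD, 0xBE, 0xEF } },
        { "Example.Worm.B",   new byte[] { 0xCA, 0xFE, 0xBA, 0xBE } }
    };

    // Returns the name of the first matching signature, or null if none match.
    public static string Scan(string path)
    {
        byte[] data = File.ReadAllBytes(path);
        foreach (KeyValuePair<string, byte[]> sig in Signatures)
        {
            for (int i = 0; i <= data.Length - sig.Value.Length; i++)
            {
                int j = 0;
                while (j < sig.Value.Length && data[i + j] == sig.Value[j])
                    j++;
                if (j == sig.Value.Length)
                    return sig.Key;  // on the list: kill it
            }
        }
        return null;  // not on the list -- which proves nothing about new threats
    }
}

The fatal flaw is baked into the data structure: anything that isn't on the list yet scans clean, no matter how evil it is.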

I could waste your time by writing a long diatribe here about how blacklisting is a deeply flawed approach to security. But I don't have to. We can turn to our old friend Mark Pilgrim for the most radical deconstruction of blacklisting you'll probably ever read.

I see from Jay's Comment Spam Clearinghouse that the latest and greatest tool available to us is a master [black]list of domain names and a few regular expressions. No offense to Jay or all the people who have contributed to the list so far, but how quaint! I mean really. Savor this moment, folks. You can tell your children stories of how, back in the early days of weblogging, you could print out the entire spam blacklist on a single sheet of paper. Maybe with two or three columns and a smallish font, but still. Boy, those were the days.

And they won't last. They absolutely won't last. They won't last a month. The domain list will grow so unwieldy so quickly, you won't know what hit you. It'll get so big that it will take real bandwidth just to host it. Keeping it a free download will make you go broke. Code is free, but bandwidth never will be. Do you have a business plan? You'll need one within 6 months.

And then people will start complaining because a regex matches their site. Or spammers will set up fake identities to report real sites and try to poison the list. Are you manually screening new contributions? That won't scale. Are you not manually screening new contributions? That won't work either. Weighing contributions with a distributed Whuffie system? Yeah, that's possible, but it's a tricky balance, and still open to manipulation.

It's all been done. It's all been done before, and it was completely all-consuming, and it still didn't work. Spammers register dozens of new domains each day; you can't possibly keep up with them. They're bigger and smarter and faster than you. It's an arms race, and you'll lose, and along the way there will be casualties, massive casualties as innocent bystanders start getting blacklisted. (You do have a process for people to object to their inclusion, right? Yeah, except the spammers will abuse that too.)

Oh, and it goes on. That's a mere slice. Read the rest. Like Mark, I get angry about blacklists. Angry because I have to waste my time manually entering values in a stupid blacklist. Angry because the resulting list really doesn't work worth a damn, and I'll have to do the same exact thing again tomorrow, like clockwork. And most of all, angry because they're a dark mirror into the absolute worst parts of human nature.

I've had plenty of experience with blacklists. A minuscule percentage of spammers have the resources to bypass my naive CAPTCHA: they hire human workers to enter spam comments. That's why I enter URLs into a blacklist every week on this very site. It's an ugly, thankless little thing, but it's necessary. I scrutinize every comment, and I remove a tiny percentage of them: they might be outright spam, patently off-topic, or just plain mean. I like to refer to this as weeding my web garden. It's a productivity tax you pay if you want to grow a bumper crop of comments, which, despite what Joel says, often bear such wonderful fruit. The labor can be minimized with improved equipment, but it's always there in some form. And I'm OK with that. The myriad benefits of a robust comment ecosystem outweigh the minor maintenance effort.

I've also had some experience with the fancy, distributed crowdsourcing style of blacklist. It's a sort of consensual illusion; many hands may make light work, but they won't miraculously fix the fundamentally broken security model of a blacklist. You'll have the same core problems I have with the unpleasant little blacklist I maintain, writ much larger. The world's largest decentralized blacklist is still, well, a blacklist.

So, in the end, perhaps I should apologize to Jon. I suppose anti-virus software does work, in a fashion... at a steep mental and physical cost. As with any blacklist, the effort necessary to maintain an anti-virus blacklist will slowly expand to occupy all available space and time. In philosophical terms, keeping an exhaustive and authoritative list of all the evil that men can do is an infinitely large task. At best, you can only hope to be ahead at any particular moment, if you're giving 110%, and if you're doing everything exactly the right way. Every single day. And sleep lightly, because tomorrow you'll wake up to face a piping hot batch of fresh new evil.

If a blacklist is your only option, then by all means, use it.

With comments, I'm stuck. There's no real alternative to the blacklist approach as a backup for my CAPTCHA. Furthermore, the ultimate value of a comment is subjective, so some manual weeding is desirable anyway. But when it comes to anti-virus we do have another option. A much better option. We can run as non-administrators. Running as a non-administrator has historically proven to be completely effective on OS X and UNIX, where the notion of anti-virus software barely exists.

Isn't that the way it should be? Relying on a blacklist model for security is tantamount to admitting failure before you've even started. Why perpetuate the broken anti-virus blacklist model when we don't have to?

Discussion

Are You a Doer or a Talker?

Today's lesson comes to you courtesy of your local Department of Transportation:

The Utah DOT is spending $6 million on a feasibility study for a bridge across a lake. Meanwhile, the DOT doesn't have enough money to put up traveler information video cameras at dangerous mountain passes and canyons to help motorists make safe driving choices. That $6 million could easily have paid for the cameras, while still leaving money for a reasonable analysis of the bridge's feasibility.

In Washington State, budgeted money for a drainage project has been tied up in studies. Meanwhile, recent flooding has wiped out several small cities with damages in the tens of millions, destroying many people's lives. And $16 million is being spent on a rail trail conversion study. That money would be enough to build a new freeway interchange, relieving millions of dollars in congestion delay and improving safety today.

John Taber, the author of the article, is professionally involved in exactly these kinds of studies. His opinion?

Heck, I make a living off transportation studies and even I'm saying there's too much study and not enough construction.

It's an easy conceptual trip from building bridges to building software. In software, some developers take up residence on planet architecture, an otherworldly place where software is eternally planned and discussed but never actually constructed. Having endless discussions about software in a conference room or mailing list feels like useful work-- but is it? Until you've produced a working artifact for the rest of the world to experience, have you really done anything?

One of my favorite quotes on this subject comes from Christopher Baus.

Software isn't about methodologies, languages, or even operating systems. It is about working applications. At Adobe I would have learned the art of building massive applications that generate millions of dollars in revenue. Sure, PostScript wasn't the sexiest application, and it was written in old school C, but it performed a significant and useful task that thousands (if not millions) of people relied on to do their job. There could hardly be a better place to learn the skills of building commercial applications, no matter the tools that were employed at the time. I did learn an important lesson at ObjectSpace. A UML diagram can't push 500 pages per minute through a RIP.

There are two types of people in this industry. Talkers and Doers. ObjectSpace was a company of talkers. Adobe is a company of doers. Adobe took in $430 million in revenue last quarter. ObjectSpace is long bankrupt.

So that's what you have to ask yourself: are you a doer, or a talker? Ideally, you'd be a little of both, as I've said many times here. There's certainly value in some discussion and planning on your team. But if you must pick one or the other, try to err on the side that produces useful, working code:

The best way to start an open source project is with code. Working code. Hack away at home on weekends, maybe get a couple of friends to help you out, and don't go public until you have something to show people that does something interesting, and that other people can use to build more interesting stuff on top of. You need this for a bunch of different reasons: it establishes the original contributor's bona fides in the open-source meritocracy, it shortcuts all sorts of damaging debates about coding styles and architecture that can stop a project before it starts, and so on.

Working code attracts people who want to code. Design documents attract people who want to talk about coding.

There's a definite upper bound on the value of pure, theoretical discussion and planning in software. The trick is knowing when you've reached that limit, and how to funnel that effort into producing real code assets instead of letting it dissipate in a harmless puff of hot air.

Discussion

Gifts for Geeks: 2007 Edition

In case you hadn't noticed, it's that time of year again: let the wholesale buying of crap begin!

As a technology enthusiast with a bad impulse purchase habit, I get a lot of complaints that I am difficult to buy for. That's sort of intentional. I spent my entire childhood waiting to grow into an adult partially so I could afford to buy myself all the crap my parents wouldn't buy me when I was a kid.

I now regret that. Well, a little. Man, it's fun to buy crap.

So here are my favorite lists of cool, quirky, offbeat geek gifts for 2007:

Phew. If that doesn't give you some solid gift ideas for your favorite geek (or incite your own techno-gadget-lust frenzy), then I can't help you.

This year, I treated myself to the D-Link DGL-4500 Xtreme N Gaming Router.

[Image: the D-Link DGL-4500 router]

Do I need a new router? Well, no, my DGL-4300 still works well enough. But just look at this thing -- it's bristling with awesomeness:

  • Onboard OLED real-time network activity display
  • Dual-band 2.4 GHz and 5 GHz alphabet soup 802.11, including 802.11n (draft 2.0)
  • Four gigabit Ethernet networking ports
  • Upstream and downstream Quality of Service "GameFuel" support
  • USB port for optional Windows Connect Now technology
  • Three, count 'em, three antennas

I loved my DGL-4300, and I love its big brother the DGL-4500 even more. I know QoS isn't exactly a new feature, but having it work out of the box is a huge perk -- I can saturate my connection with torrents and still experience perfect, lag-free online gaming. It's still amazing to me that this works. Although you pay a premium for a "gaming" class router versus a generic one, it's not too much of a premium -- the DGL-4500 is currently $169 at Amazon, and that includes the well-reviewed World in Conflict real-time strategy game. So, that's my indulgence, and I can recommend it.

Enjoy this year's crap-buying season. I'll leave you with one final bit of gifting advice-- if there's any serious doubt whether the recipient will find the gift useful, go for gift cards. The economics of gift cards may be a little wonky, but personally, I'd much prefer receiving a gift certificate/card over some physical item that requires trudging to the store to return.

Discussion

The Danger of Naïveté

In my previous post on shuffling, I glossed over something very important. The very first thing that came to mind for a shuffle algorithm is this:

// the naive shuffle: swap each position with a randomly
// chosen position anywhere in the deck
for (int i = 0; i < cards.Length; i++)
{
  int n = rand.Next(cards.Length);
  Swap(ref cards[i], ref cards[n]);
}

It's a nice, simple solution to the shuffling problem:

  1. Loop through each card in the deck.
  2. Swap the current card with another randomly chosen card.

At first blush, this seems like a perfectly reasonable way to shuffle. It's simple, it's straightforward, and the output looks correct. It's the very definition of a naïve algorithm:

A naïve algorithm is a very simple solution to a problem. It is meant to describe a suboptimal algorithm compared to a "clever" (but less simple) algorithm. Naïve algorithms usually consume larger amounts of resources (time, space, memory accesses, ...), but are simple to devise and implement.

An example of a naïve algorithm is bubble sort, which is only a few lines long and easy to understand, but has a O(n²) time complexity. A more "clever" algorithm is quicksort, which, although being considerably more complicated than bubble sort, has a O(n log n) average complexity.

But there's a deep, dark problem with this naïve shuffling algorithm, a problem that most programmers won't see. Do you see it? Heck, I had the problem explained to me and I still didn't see it.

Watch what happens when I use this naïve shuffling algorithm to shuffle a three-card deck 600,000 times. There are 3! or 6 possible combinations in that deck. If the shuffle is working properly, we should see each combination represented around 100,000 times.

[Chart: naive shuffle on a 3 card deck, 600k iterations]

As you can see, 231, 213, and 132 are over-represented, and the other three possibilities are under-represented. The naïve shuffle algorithm is biased and fundamentally broken. Moreover, the bias isn't immediately obvious; you'd have to shuffle at least a few thousand times to see real statistical evidence that things aren't working correctly. It's a subtle thing.
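
You can reproduce this experiment with a few lines of C#. This is a quick sketch, not necessarily the harness behind the chart above, but it shows the same skew:

using System;
using System.Collections.Generic;

class NaiveShuffleTest
{
    static void Main()
    {
        Random rand = new Random();
        var counts = new Dictionary<string, int>();

        for (int trial = 0; trial < 600000; trial++)
        {
            char[] cards = { '1', '2', '3' };
            for (int i = 0; i < cards.Length; i++)  // the naive shuffle
            {
                int n = rand.Next(cards.Length);
                char t = cards[i]; cards[i] = cards[n]; cards[n] = t;
            }

            string key = new string(cards);
            int c;
            counts[key] = counts.TryGetValue(key, out c) ? c + 1 : 1;
        }

        // 231, 213, and 132 come in well above 100,000;
        // the other three combinations come in well below.
        foreach (KeyValuePair<string, int> pair in counts)
            Console.WriteLine(pair.Key + ": " + pair.Value);
    }
}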

Usually, naïve algorithms aren't wrong -- just oversimplified and inefficient. The danger, in this case, is rather severe. A casual programmer would implement the naïve shuffle, run it a few times, see reasonably correct results, and move on to other things. Once it gets checked in, this code is a landmine waiting to explode.

Let's take a look at the correct Knuth-Fisher-Yates shuffle algorithm.

// the Knuth-Fisher-Yates shuffle: walk backwards, swapping each
// position only with a randomly chosen position at or below it
for (int i = cards.Length - 1; i > 0; i--)
{
  int n = rand.Next(i + 1);
  Swap(ref cards[i], ref cards[n]);
}

Do you see the difference? I missed it the first time. Compare the swaps for a 3 card deck:

Naïve shuffle      Knuth-Fisher-Yates shuffle
rand.Next(3);      rand.Next(3);
rand.Next(3);      rand.Next(2);
rand.Next(3);

The naïve shuffle results in 3³ (27) possible deck combinations. That's odd, because the mathematics tell us that there are really only 3! or 6 possible combinations of a 3 card deck. In the KFY shuffle, we start with an initial order, swap the card in the third position with any of the three cards, then swap the card in the second position with one of the remaining two.

[Diagram: Knuth-Fisher-Yates shuffle combinations]

The KFY shuffle produces exactly 3 * 2 = 6 combinations, as pictured above. Based on your experience shuffling physical cards, you might think that more shuffling means more randomness. Not here: with the naïve algorithm, more swapping produces worse, not better, results. That's where the naïve algorithm goes horribly wrong. Let's compare all possible permutations of a 3 card deck for each algorithm:

Permutation    Naïve shuffle (27 outcomes)    Knuth-Fisher-Yates shuffle (6 outcomes)
123            4                              1
132            5                              1
213            5                              1
231            5                              1
312            4                              1
321            4                              1

You can plainly see how unevenly the deck combinations are distributed across the 27 outcomes of the naïve algorithm. Stated mathematically, 27 is not evenly divisible by six, so a fair distribution over the six orderings is impossible.
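
You don't have to take the table's word for it, either. Since each of the 27 choice sequences is equally likely, a dozen lines of C# can enumerate them all and tally where each one lands (again, just a sketch):

using System;
using System.Collections.Generic;

class NaiveShufflePaths
{
    static void Main()
    {
        var counts = new Dictionary<string, int>();

        // enumerate every possible choice sequence (a, b, c) of the naive shuffle
        for (int a = 0; a < 3; a++)
        for (int b = 0; b < 3; b++)
        for (int c = 0; c < 3; c++)
        {
            char[] cards = { '1', '2', '3' };
            int[] picks = { a, b, c };
            for (int i = 0; i < 3; i++)
            {
                char t = cards[i]; cards[i] = cards[picks[i]]; cards[picks[i]] = t;
            }

            string key = new string(cards);
            int n;
            counts[key] = counts.TryGetValue(key, out n) ? n + 1 : 1;
        }

        // prints counts of 4 or 5 per permutation -- never an even split
        foreach (KeyValuePair<string, int> pair in counts)
            Console.WriteLine(pair.Key + ": " + pair.Value);
    }
}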

Enough theory. Let's see more results. How about a four card deck, shuffled 600,000 times?

[Chart: naive vs. Knuth shuffle, 4 card deck, 600k iterations]

600,000 divided by 24 is 25,000; that's almost exactly what we see right down the line for every possible combination of cards with the KFY shuffle algorithm. The naïve algorithm, in comparison, is all over the map.

It gets worse with larger decks. Here's the same comparison for a six card deck.

[Chart: naive vs. Knuth shuffle, 6 card deck, 600k iterations]

With a 6 card deck, the differences between the two algorithms grow even larger. The math, yet again, explains why.

6! = 720
6⁶ = 46,656

With 46,656 paths to only 720 real-world outputs, it's inevitable that some of those paths will be severely over-represented or under-represented in the output. And are they ever. If you shipped a real card game with a naïve shuffle, you'd have some serious exploits to deal with.

I know this may seem like remedial math to some of you, but I found this result strikingly counterintuitive. I had a very difficult time understanding why the naïve shuffle algorithm, which is barely different from the KFY shuffle algorithm, produces such terribly incorrect results in practice. It's a minute difference in the code, but a profound difference in results. Tracing through all the permutations and processing the statistics helped me understand why it was wrong.

Naïve implementations are often preferred to complex ones. Simplicity is a virtue. It's better to be simple, slow, and understandable than complex, fast, and difficult to grasp. Or at least it usually is. Sometimes, as we've seen here with our shuffling example, the simplicity of the naïve implementation can mislead. It is possible for the code to be both simple and wrong. I suppose the real lesson lies in testing. No matter how simple your code may be, there's no substitute for testing it to make sure it's actually doing what you think it is.

Discussion