Coding Horror

programming and human factors

Hashtables, Pigeonholes, and Birthdays

One of the most beloved of all data structures in computer science is the hash table.

A hash table is a data structure that associates keys with values. The primary operation it supports efficiently is lookup: given a key (e.g. a person's name), find the corresponding value (e.g. that person's telephone number). It works by transforming the key, via a hash function, into a hash: a number used as an index into an array to locate the bucket where the value is stored.
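
To make the mechanics concrete, here's a minimal sketch in C# of how a key might be mapped to a bucket index. The BucketFor helper is purely illustrative-- it's not how any particular hashtable implementation works, just the general idea:

int BucketFor(string key, int bucketCount)
{
    // hash the key, then reduce the hash to a bucket index;
    // masking off the sign bit keeps the modulus non-negative
    int hash = key.GetHashCode();
    return (hash & 0x7FFFFFFF) % bucketCount;
}

// the same key always lands in the same bucket, so a lookup can go
// straight to that bucket instead of scanning the whole collection
int bucket = BucketFor("Jeff", 16);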

Key-value pairs are quite common in real world data, and hashtables are both reasonably efficient in storage and quite fast at lookups, offering O(1) performance in most cases. That's why hashtables are the go-to data structure for many programmers. It may not be the optimal choice, but unlike so many things in computer science, it's rarely a bad choice.

But hash tables do have one crucial weakness: they are only as good as the hash function driving them. As we add each new item to the hashtable, we compute a hash value from the key for that item, and drop the item in the bucket represented by that hash value. So how many buckets do we need? Let's consider the extremes:

  • If we had one giant bucket, everything would get piled in together. We'd have to look at each and every item in our one bucket to find the one we want, which reduces us to worst-case performance: an O(n) linear search.

  • If we had exactly the same number of buckets as items, each item is placed in its own unique, individual bucket. We know each bucket will contain one, and only one, item. That's a perfect hash function, delivering best-case performance: an O(1) lookup.

Reality, of course, lies somewhere in between these two extremes. The choice of hash function is critical, so items spread evenly across the buckets rather than piling up in a few of them. As more and more items land in the same bucket (i.e., "collisions"), you edge closer to the slow O(n) end of the performance spectrum, as the sketch below illustrates.
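
One common way to cope with collisions is chaining: each bucket holds a small list of everything that hashed there. Here's a toy sketch of the idea in C#-- my own illustration, not the internals of any real library:

using System.Collections.Generic;

// each bucket is a list of key/value pairs that hashed to it
var buckets = new List<KeyValuePair<string, string>>[8];

void Add(string key, string value)
{
    int i = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
    if (buckets[i] == null)
        buckets[i] = new List<KeyValuePair<string, string>>();
    buckets[i].Add(new KeyValuePair<string, string>(key, value));
}

string Find(string key)
{
    int i = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
    if (buckets[i] == null) return null;
    // the longer this chain grows, the closer lookup drifts toward O(n)
    foreach (var pair in buckets[i])
        if (pair.Key == key) return pair.Value;
    return null;
}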

There's something magical about these hash functions that drive the hashtable. The idea of the hash as a unique digital fingerprint for every chunk of data in the entire world is a fascinating one. It's a fingerprint that cleverly fits into a mere 32 bits of storage, yet is somehow able to uniquely identify any set of data ever created.

Of course, this is a lie, for several reasons. Let's start with the most obvious one. Consider all possible values of a 32-bit hash function:

2^32 ~= 4.3 billion

The current population of the earth is about 6.6 billion people. If we were to apply a perfect 32-bit hash function to the DNA of every man, woman, and child on the planet, we could not guarantee uniqueness-- we simply don't have enough possible hash values to represent them all!

This is known as the pigeonhole principle. It's not complicated. If you try to put 6 pigeons in 5 holes, one will inevitably be left out in the cold.

pigeonholes

You'll definitely want to use a large enough hash value so you can avoid the pigeonhole principle. How much you care about this depends on how many things you're planning to store in your hashtable, naturally.

The other reason hashes can fail as digital fingerprints is because collisions are a lot more likely than most people realize. The birthday paradox illustrates how quickly you can run into collision problems for small hash values. I distinctly remember the birthday paradox from my college calculus class, and I'll pose you the same question our TA asked us:

In a typical classroom of 30 students, what are the odds that two of the students will have the same birthday?

Don't read any further until you've taken a guess. What's your answer?

2007 chinese calendar

Everyone has unique DNA (identical twins aside), but shares one of 365* possible birthdays with the rest of us. Birthdays are, in effect, a tiny 365-value hash function. With such a small hash value, there's a 50% chance of two people sharing the same birthday in a group of a mere 23 people. With the 30 students in our hypothetical classroom, the odds of two students having a shared birthday rise to 70%. The statistics don't lie: when the question was posed in that classroom so many years ago, there were in fact two students who shared the same birthday.
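
You can check those odds with a few lines of C#. This is just the standard birthday-problem calculation, nothing from the original classroom exercise:

// odds that at least two of n people share a birthday:
// 1 - (365/365 * 364/365 * 363/365 * ...)
double SharedBirthdayOdds(int n)
{
    double allDistinct = 1.0;
    for (int i = 0; i < n; i++)
        allDistinct *= (365.0 - i) / 365.0;
    return 1.0 - allDistinct;
}

Console.WriteLine(SharedBirthdayOdds(23)); // ~0.507, just past 50%
Console.WriteLine(SharedBirthdayOdds(30)); // ~0.706, about 70%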

A rule of thumb for estimating the number of values you can enter into a hashtable before you have a 50 percent chance that a collision exists: take the square root of 1.4 times the number of possible hash values. (The 1.4 is an approximation of 2 ln 2 ~= 1.386, which falls out of the birthday-problem math.)

SQRT(1.4 * 365) = 23
SQRT(1.4 * 2^32) = 77,543
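
As a sanity check, here's a quick C# sketch using the standard approximation p ~= 1 - e^(-n^2 / 2N) for the odds of at least one collision among n items spread across N possible hash values:

// approximate odds of at least one collision among n items
// spread across N possible hash values
double CollisionOdds(double n, double N)
{
    return 1.0 - Math.Exp(-n * n / (2.0 * N));
}

Console.WriteLine(Math.Sqrt(1.4 * 365));              // ~22.6, i.e. 23 people
Console.WriteLine(Math.Sqrt(1.4 * Math.Pow(2, 32)));  // ~77,543 entries
Console.WriteLine(CollisionOdds(77543, Math.Pow(2, 32))); // ~0.50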

When using a 32-bit hash value, we have a 50% chance that a collision exists after about 77 thousand entries-- a pretty far cry from the 4 billion possible values we could store in that 32-bit value. This is not a big deal for a hashtable; so what if a few of our buckets have more than one item? But it's a huge problem if you're relying on the hash as a unique digital fingerprint.

The hashing functions behind our precious hashtables may be a lie. But they're a convenient lie. They work. Just keep the pigeonhole principle and the birthday paradox in mind as you're using them, and you'll do fine.

* No, let's forget leap years for now. And other variables like birth patterns. Yes, I know this is how programmers think. Imagine how much it would suck to have one birthday every four years, though. Ouch.


Sharing The Customer's Pain

In this interview with Werner Vogels, the CTO of Amazon, he outlines how Amazon's developers stay in touch with their users:

Remember that most of our developers are in the loop with customers, so they have a rather good understanding about what our customers like, what they do not like, and what is still missing.

We have a lot of feedback coming out of customer service. Many Amazonians have to spend some time with customer service every two years, actually listening to customer service calls, answering customer service e-mails, really understanding the impact of the kinds of things they do as technologists. This is extremely useful, because they begin to understand that our user base is not necessarily the techno-literate engineer. Rather, you may get a call from a grandma in front of a computer in a library who says she wants to buy something for her grandson who is at college and who has a Wishlist on Amazon.

Customer service statistics are also an early indicator if we are doing something wrong, or what the real pain points are for our customers. Sometimes in meetings we use a "voice of the customer," which is a realistic story from a customer about some specific part of the Amazon experience. This helps managers and engineers to connect with the fact that we build many of these technologies for real people.

It's easy to fall into a pattern of ivory tower software development. All too often, software developers are merely tourists in their own codebase. Sure, they write the code, but they don't use it on a regular basis like their customers do. They will visit from time to time, but they lack the perspective and understanding of users who-- either by choice or by corporate mandate-- live in that software as a part of their daily routine. As a result, problems and concerns are hard to communicate. They arrive as dimly heard messages from a faraway land.

When was the last time you even met a customer, much less tried to talk to them about a problem they're having with your website or software?

It's incredibly important for developers to be in the trenches with the customers-- the people who have to live with their code. Without a basic understanding of your users and customers, it's impossible to build considerate software. And nothing builds perspective like being on the front lines of support. I'm not proposing that all programmers pull double duty as support. It'd be impossible to get any work done. But a brief rotation through support would do wonders for the resulting quality and usability of your software, exactly as described by Mr. Vogels.

Dogfooding can be difficult. But manning the support desk, if only for a short while, isn't. Software developers should share the customer's pain. I know it's not glamorous. But until you've demonstrated a willingness to help the customers using the software you've built-- and more importantly, learn why they need help-- you haven't truly finished building that software.


Please Don't Steal My Focus

Has this ever happened to you? You're merrily typing away in some application, minding your own business, when-- suddenly-- a dialog pops up and steals the focus from you.

Example of a dialog stealing focus from an application

At best, your flow is interrupted. You'll have to switch back to the window that you were using, figure out where you were, and resume your work.

But it can be worse. So, so much worse. If you happen to be typing something that can be interpreted as an action by that dialog-- and remember, pressing the space bar is the same as clicking a button when it happens to have the focus -- you could suddenly and very much accidentally be in a world of pain. Like this poor, unfortunate soul, who recently posted a plaintive comment to my XP Automatic Update Nagging post.

Great news! Microsoft developed a solution to this problem! Microsoft's most talented programmer figured out how to make "Reboot later" mean "Reboot when user says reboot". It only took some tweaking to 1 line of code, 180 days for approvals from 80 managers, 80 resource files for different languages, and 18 days for testing in one of the languages. It worked.

The programmer opened a SourceSafe^H^H^H^H^H^H^H^H^H^H^H Team Foundation window in order to check in the fix. An expert programmer, she was used to using the keyboard. She didn't click her mouse on the "OK" button, she just hit the Enter key.

The "Reboot now" / "Reboot later" prompt flashed so briefly, she didn't even notice it. She thought she hadn't pounded the Enter key hard enough. Looking at Team Foundation's "OK" button still waiting there for her to hit the Enter key to check in her work, she hit the Enter key again.

The check in started. The check in got killed while her workstation rebooted. There we remain today, with the check in half-in and half-out, unusable, with no good copy of the code. So that's why the fix was never released.

It's a perfect example of how stealing the user's focus can lead to catastrophic results if the user is particularly unlucky. Unfortunately, this burden falls heaviest on us keyboarders.

Another classic example is the IE download notification window, which loves to pop up, steal the focus, and tell you the great news: your download is complete! Oh, and your newly downloaded file is copying to its destination! Hooray! Unfortunately, this very same download notification dialog also contains a "Cancel" button. Guess which button just so happens to have the focus when this pops up? Why you'd want to cancel a download after it is complete is a mystery to me, but I've inadvertently pressed the space bar on this dialog more than once.

Stealing focus from the user is never acceptable. I can't imagine any circumstance where this would be desirable or even defensible behavior. Modal dialogs are bad enough, but this is even worse-- it's almost a system modal dialog, so self-important that all work must cease as the user is forced to pay attention to whatever earth-shattering message it urgently has to deliver. It's an extreme form of stopping the proceedings with idiocy. I'm not the first person to complain about this, of course. Fellow members of the "Don't Steal My Focus" club wrote about this back in 2002, again in 2005, and a few months ago. It's not exactly an unknown or new problem. So why do we have to keep talking about it and dealing with it? What gives?

The strange thing is, there are provisions built into the operating system to protect us from badly written, focus stealing applications. The ForegroundLockTimeout registry setting is expressly designed to prevent applications from stealing focus from the user. The OS silently converts that inappropriate focus stealing behavior into friendlier, less invasive taskbar button flashing, which is the subject of the ForegroundFlashCount registry setting.
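
If you're curious what your machine is set to, both values live under HKEY_CURRENT_USER\Control Panel\Desktop and are easy to read from C#. A small read-only sketch-- the default noted in the comment is what XP reportedly ships with, so treat it as an assumption:

using Microsoft.Win32;
using System;

// read the current focus-stealing protections; both are REG_DWORD
// values under HKEY_CURRENT_USER\Control Panel\Desktop
object timeout = Registry.GetValue(
    @"HKEY_CURRENT_USER\Control Panel\Desktop",
    "ForegroundLockTimeout", null);  // milliseconds; commonly 200000
object flashes = Registry.GetValue(
    @"HKEY_CURRENT_USER\Control Panel\Desktop",
    "ForegroundFlashCount", null);   // how many times the taskbar button flashes

Console.WriteLine("ForegroundLockTimeout: {0}", timeout);
Console.WriteLine("ForegroundFlashCount: {0}", flashes);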

I've seen this work. Most of the time, it does work. This setting is enabled by default in Windows XP and Vista. And yet applications are occasionally able to steal the focus from me and screw up my flow-- I'd say it happens a few times a week, on average. It's perplexing. I wonder if it's because badly behaved programmers abuse the "Always on Top" window flag in a misguided attempt to get the user's attention. I suppose as long as there are bad programmers, they'll keep devising unorthodox ways to steal the focus from the user. At some level, sufficiently advanced incompetence is indistinguishable from malice. Maybe we'd have better luck educating programmers on the evils of focus stealing and, more generally, the futility of unnecessary notifications the user isn't going to read anyway.

But in the meantime, please don't steal my focus. I'm using it right now. Really. I am.


Shuffling

Pop quiz, hotshot. How would you write code to shuffle a deck of cards?

a complete deck of 52 cards, arranged by suit and in ascending order

I was thinking about this after reading Mike's card-shuffling algorithm woes:

Here's where the non-CS mind comes into play. My first thought was to generate an unshuffled deck as an array-like structure -- all cards in order by suit. Then I'd create a second array-like structure. I'd walk through each card in the unshuffled deck, pick a random number, and insert the card at the randomly selected spot in the second array. If the randomly chosen position in the second array was already occupied, I'd choose another random number, see if it was used, and so on until the random selection happened to land on a free spot. I'll call this the Random Insert approach.

It seemed an odd approach to me, but unlike Mike, I have the dubious benefit of a programming background. I went immediately for my old friend, the loop. Let's assume we have an array with 52 members representing the 52 cards in the deck.

var rand = new Random();
for (int i = cards.Length - 1; i > 0; i--)
{
    // pick a random slot between 0 and i, inclusive
    int n = rand.Next(i + 1);
    // swap the card at i with the card at the chosen slot
    int temp = cards[i];
    cards[i] = cards[n];
    cards[n] = temp;
}

So we loop through the deck, swapping each card with another card from a random position in the deck. Seems straightforward enough, although I do wish there were a built-in Swap command in the C# language to simplify the code a bit. It's eerily similar to the Knuth (or Fisher-Yates) shuffle, which doesn't mean I'm particularly smart-- it means shuffling is an easily solved problem.

Or is it? This looks correct; there's nothing obviously wrong here. But there are two problems with this code. Can you see them?

The first problem is right here:

new Random();

Computers are lousy random number generators. Any shuffling you do, whatever the algorithm, will only be as good as your random number generator. So if you're running, say, an online casino, you need to be very careful when you start throwing around the word "Random" in your code. If you aren't careful, there will be.. problems.

The flaw exists in the card shuffling algorithm used to generate each deck. Ironically, the code was publicly displayed at www.planetpoker.com/ppfaq.htm with the idea of showing how fair the game is to interested players (the relevant question has since been removed). In the code, a call to randomize() is included to produce a random deck before each deck is generated. The implementation, built with Delphi 4 (a Pascal IDE), seeds the random number generator with the number of milliseconds since midnight according to the system clock. That means the output of the random number generator is easily predicted. A predictable "random number generator" is a very serious security problem.

By synchronizing our clock with the clock on the online casino and hitting the "shuffle" button, our program can calculate the exact shuffle. That means we know all the cards that have yet to appear, everyone's hand, and who will win. The screen shot below shows the information displayed by our program in realtime during an actual game. Our program knows what cards are to appear in advance, before they are revealed by the online game.

To be fair, this was 1999. I'd assume most online casinos have hired competent cryptographers and statisticians by now. With the ever looming specter of insider cheating and poker bots, they'd be fools not to.
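
If you do need a shuffle you can bet money on, the fix is to replace the seeded Random with a cryptographically strong generator. Here's one way to do it in C#-- a sketch that assumes a .NET Core 3.0 or later runtime, where RandomNumberGenerator.GetInt32 is available:

using System.Security.Cryptography;

// Fisher-Yates again, but drawing indices from the operating system's
// cryptographic random number generator instead of a seeded Random
void SecureShuffle(int[] cards)
{
    for (int i = cards.Length - 1; i > 0; i--)
    {
        // unbiased index in [0, i], with no predictable seed to exploit
        int n = RandomNumberGenerator.GetInt32(i + 1);
        int temp = cards[i];
        cards[i] = cards[n];
        cards[n] = temp;
    }
}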

The second problem with this code is that it's too complicated. Eric "purplicious" Lippert explains why, in his own inimitable way:

The standard way of implementing this algorithm is: associate each card with a random real number between 0.0 and 1.0. Sort the list based on its associated number. That's O(n log n) and has no bias.

As it turns out, the easiest way to implement a shuffle is by sorting. It's not exactly faster, as the typical sort is O(n log n) compared to the O(n) of the Knuth Fisher-Yates shuffle algorithm. We'll just sort by a random number-- in this case, a GUID.

var cards = Enumerable.Range(0, 52);  // 52 values: 0 through 51
var shuffledcards = cards.OrderBy(a => Guid.NewGuid());

So we can ultimately implement an unbiased shuffle as a one-liner in a modern programming language. Just remember that Guid.NewGuid() makes no promises about cryptographic strength, so the first problem-- the quality of your randomness-- still applies.

Which proves.. well, nothing, I suppose, but like so many other programming problems, there are a lot of ways to get shuffling wrong if you're not careful.


Presentation: Be Vain

Frets on Fire is an open source clone of Guitar Hero. It's a great idea. Think of all the user-created songs we could play! My excitement quickly faded after I downloaded it and tried it out.

Frets on Fire, screenshot

I'll be first in line to champion gameplay over graphics, but the presentation in Frets on Fire is so bare-bones, so rudimentary that it's hard to muster any enthusiasm for the game whatsoever. Even equipped with a nice plastic USB guitar, there's only so much rocking you can do when you're staring at the loose OpenGL equivalent of a completely ASCII interface. It is incredibly primitive.

For comparison, here's a screenshot I captured of Guitar Hero III running on my PC.

Guitar Hero III for the PC, screenshot

These rhythm games are functionally identical. Press some combination of the five colored buttons on the USB plastic guitar and strum in time to the pre-recorded music. It's hardly guitar practice, but it is a fun new way to enjoy the music you already love.

Based on these screenshots, which fake plastic guitar rhythm game would you rather play?

Despite the universe of modding and custom songs possible with Frets on Fire, I'd much rather spend my time hacking new songs into the PC version of Guitar Hero III, even with the additional reverse engineering hurdles. The lesson I take from this is that presentation matters.

Of course, this comparison is grossly unfair to Frets on Fire. It is open source and completely free-- whereas the Mac and PC version of Guitar Hero III costs $79.99 bundled with the guitar. There's a commercial army of artists, developers, and producers behind Guitar Hero III. It's unreasonable to expect Frets on Fire to have the same production values. I'm not exactly going to march over to the SourceForge page and demand my money back or anything. I applaud what they've done.

But playing Frets on Fire makes me feel like a keyboard jockey instead of a rock god. This isn't just a personal deficiency of mine-- it's a presentation problem with real world ramifications. Better presentation would win many converts to the Frets on Fire camp, and woo them away from the alternatives. I think this presentation rule applies to all software. If you want people to get excited about your software, you have to make it look reasonably good, as Joel Spolsky points out:

I learned this lesson as a consultant, when I did a demo of a major web-based project for a client's executive team. The project was almost 100% code complete. We were still waiting for the graphic designer to choose fonts and colors and draw the cool 3-D tabs. In the meantime, we just used plain fonts and black and white, there was a bunch of ugly wasted space on the screen, basically it didn't look very good at all. But 100% of the functionality was there and was doing some pretty amazing stuff.

What happened during the demo? The clients spent the entire meeting griping about the graphical appearance of the screen. They weren't even talking about the UI. Just the graphical appearance. "It just doesn't look slick," complained their project manager. That's all they could think about. We couldn't get them to think about the actual functionality. Obviously fixing the graphic design took about one day. It was almost as if they thought they had hired painters.

I'd argue that presentation cuts a little deeper than Joel is insinuating:

"Most people make the mistake of thinking design is what it looks like," says Steve Jobs, Apple's C.E.O. "People think it's this veneer-- that the designers are handed this box and told, 'Make it look good!' That's not what we think design is. It's not just what it looks like and feels like. Design is how it works."

Avoid creating software that's beautiful on the inside but ugly on the outside. Be vain. Make something that looks as good as it works. If you pay attention to the presentation of your software, you just may find the rest of the world is a lot more willing to pay attention, too.
