Coding Horror

programming and human factors

Slaying Mighty Dragons: Competitive Ranking and Matching Systems

Attending yesterday's Halo 3 launch event at the Silicon Valley Microsoft campus -- and the large Halo 3 tournament we helped moderate -- got me thinking about player ranking and matching systems. Without a well-designed ranking and matching system in place, you'll get horribly mismatched games, where one team demolishes the other by a margin of 3x or more. This is counterproductive for both teams:

  • The higher skilled team learns absolutely nothing by defeating their underskilled opponent. They're coasting by without a challenge. I suppose it's fun, in a way, to win every match and crush your opponents. But it's ultimately an empty and meaningless style of play.

  • The lower skilled team is demoralized by a total, crushing defeat at the hands of the higher skilled team. There's no incentive to continue. And when you're consistently matched against opponents that you have no chance of winning against, why play at all?

An ideal system would always match players of approximately equal skill against each other. How do you know when you've achieved that goal? You look at the data. The record of wins, losses, and ties for each player provides the answer: if the matching and ranking system is working properly, every player or team will eventually plateau at a skill level where they are winning about 50% of the time.

There are dozens of possible ranking and rating systems we could use. Christopher Allen and Shannon Appelcline's post Collective Choice: Competitive Ranking Systems explores many of them. But perhaps the most famous ranking system is the Elo rating system, which originated in the chess world. It's a simple statistical model used to objectively rate player skill levels:

Here's how Elo works:

  • A new player is assigned a default rating, say 1200.
  • Two players compete, with one of three results: win, loss, tie.
  • The two players' ratings are fed into an algorithm along with the end state of the game. A new rating for each player is returned.

Let's say two players, both rated 1200, play a game. Player 1 wins. Player 1 will now be rated 1205 and player 2 will be rated 1195.

Players that win a lot will achieve higher ratings. But the higher rated player also starts to see diminishing returns for defeating low ranked players. In order to increase his rank, he must defeat other higher ranked players. If a high ranked player loses to a low ranked player, he loses much more of his rating than he'd gain if he won the match.

Over time the game players will end up being rated based on their Elo skill level rather than other factors.
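To make this concrete, the classic Elo update fits in a few lines of code. Here's a minimal sketch in Python; the K-factor of 10 is my assumption, chosen to reproduce the 1205/1195 example above, and chess federations typically use larger values such as 16 or 32:

    def elo_update(rating_a, rating_b, score_a, k=10):
        # score_a is 1.0 for a win, 0.5 for a tie, 0.0 for a loss.
        # Expected score for player A under the logistic rating model.
        expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
        delta = k * (score_a - expected_a)
        return rating_a + delta, rating_b - delta

    # Two evenly matched 1200 players; player 1 wins.
    print(elo_update(1200, 1200, 1.0))  # (1205.0, 1195.0)

    # A 1600 player beats a 1200 player: barely any rating changes hands.
    print(elo_update(1600, 1200, 1.0))  # (~1600.9, ~1199.1)

Note how the dragon-slaying economics fall straight out of the expected score: beating someone you were already expected to beat is worth almost nothing.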

It's simpler than it seems. Let me put this in language us geeks can understand: you gain more XP from slaying a massive, mighty dragon than you do from beating up on a common wharf rat.

Dungeons and Dragons rulebook detail

As you level up, it's entirely possible that you may become so very powerful that you'll have to move on from those massive dragons to even more intimidating opponents. Seems obvious enough to those of us who understand the statistical conventions of Dungeons and Dragons.

The Elo system, like the Dungeons and Dragons system, is proven to work. Although it was originally designed to rank chess players, it's used throughout the online and offline gaming worlds to rank and match players, in everything from Pokemon to Scrabble to World of Warcraft.

When it comes to matching players online, Blizzard Entertainment is arguably one of the most experienced companies on the planet. They pioneered online ranking systems back in 1996 with Diablo and battle.net, and extended that lead through Starcraft, Diablo II, Warcraft III, and the juggernaut that is World of Warcraft. Blizzard has collectively matched and ranked millions of players -- if anyone knows how to get this right, it's Blizzard.

Warcraft III is an excellent case study in player ranking and matching. Despite Blizzard's prior experience, and even though it uses the proven Elo ranking system, Warcraft III had a lot of problems matching players -- problems that took years for Blizzard to work out. I myself was an avid Warcraft III ranked player for about a year, so I've experienced this first hand. Warcraft III's automatic matching system was radically overhauled with Patch 1.15 in 2004, a full two years after the game was released. If you scrutinize the Blizzard support FAQs, you can read between the lines to see where the exploits and pathologies are:

1. Until a player has played a certain number of games, their performance will be highly variable, and accurately matching them is difficult.

The skill of players with low-level accounts can vary radically, from a player who has just begun playing multiplayer games of Warcraft III to a highly experienced player who has played hundreds of games but has just created a new account. Novice players would all too often find themselves in unfair matches against very skilled opponents. In an attempt to better match players of similar skill together into games, Battle.net's matchmaking system no longer uses a player's level as the only determining factor when creating games.
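Blizzard hasn't published exactly how the revamped matchmaker weighs new accounts, but a common mitigation for this problem -- FIDE uses a variant of it in chess -- is a provisional K-factor that starts high and shrinks as a player accumulates games, so a new account's rating converges quickly toward its true level. A hypothetical sketch:

    def k_factor(games_played, rating):
        # Provisional schedule, loosely modeled on FIDE's: new accounts
        # move fast, established and elite accounts move slowly.
        if games_played < 30:
            return 40
        if rating >= 2400:
            return 10
        return 20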

2. Unless players are consistently playing games, their rating should decay over time.

Players may lose XP if they do not play a minimum number of games per week, as listed on Chart 1. Failure to play the minimum game requirement during a week results in an XP penalty. For each game you don't play below the amount required for your level, you incur one loss [to an opponent of the same skill level].
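In Elo terms, "one loss to an opponent of the same skill level" is easy to quantify: against an equal-rated opponent your expected score is 0.5, so each missed game costs K × 0.5 rating points. A sketch of that decay rule, with an illustrative weekly minimum standing in for Blizzard's Chart 1:

    def apply_weekly_decay(rating, games_played, games_required, k=20):
        # Charge one synthetic loss against an equal-rated opponent
        # (expected score 0.5) for each game short of the weekly minimum.
        missed = max(0, games_required - games_played)
        return rating - missed * (k * 0.5)

    print(apply_weekly_decay(1500, games_played=1, games_required=4))  # 1470.0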

3. Players cannot cherry-pick opponents or environments to improve their ranking. They must play randomly chosen opponents on neutral ground.

Battle.net's ladder system has been revamped for Warcraft III. This new system promotes active competition and maintains the ladder's integrity. The anonymous matchmaking system (AMM) prevents win trading and ensures that high-level players face opponents of similar skill. The AMM anonymously matches players based on skill level, and randomly selects maps based on the players' preferences. Players can choose their race, but most other game options are preset by the designers, resulting in a higher level of play.

If you're not into anonymous play, arranged team games are possible, but they are isolated on a special ladder of their own. This is done to prevent any kind of player collusion from tainting the main ladder results.

You can find many of the very same dangers echoed in the "practical issues" section of the Elo rating Wikipedia entry. Chess games and real-time strategy games are completely different things, but they have strikingly similar ranking and matching pathologies.

One particular weakness of the Elo system is the overly simplified assumption that the performance of every player will follow a normal distribution. This is a more severe problem than you might think, as it disproportionately affects new and beginning players. As Blizzard noted above, "Novice players would all too often find themselves in unfair matches against very skilled opponents." For many games, a large proportion of your audience will be novices. If these novice players experience several bad mismatches, they may never come back, and pretty soon you won't have a community of gamers to match anyone against.

Most modern implementations of the Elo system make adjustments to rectify this glaring flaw:

Subsequent statistical tests have shown that chess performance is almost certainly not normally distributed. Weaker players have significantly greater winning chances than Elo's model predicts. Therefore, both the USCF and FIDE have switched to formulas based on the logistic distribution. However, in deference to Elo's contribution, both organizations are still commonly said to use "the Elo system".
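The difference between the two models is easy to see side by side. Here's a sketch comparing expected scores; the 200-point performance standard deviation for Elo's original normal model is the commonly cited value, so treat it as an assumption:

    from math import erf, sqrt

    def expected_logistic(diff):
        # USCF/FIDE-style logistic expected score for a rating advantage.
        return 1 / (1 + 10 ** (-diff / 400))

    def expected_normal(diff, sigma=200):
        # Elo's original model: each performance is normally distributed,
        # so the difference of two performances has std dev sigma * sqrt(2).
        z = diff / (sigma * sqrt(2))
        return 0.5 * (1 + erf(z / sqrt(2)))

    for diff in (0, 100, 200, 400, 600):
        print(diff, round(expected_logistic(diff), 3), round(expected_normal(diff), 3))

At a 600-point deficit, the logistic model still gives the underdog about a 3% chance, while the normal model gives well under 2% -- and real-world results side with the underdog.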

Perhaps the most modern variant of Elo is Microsoft's TrueSkill ranking system used on its Xbox Live online gaming network. The TrueSkill system has better provisions for scoring team games, whereas Elo is based on a single-player-per-game model. But TrueSkill's main innovation is incorporating the uncertainty of a player's rating deeply into the mathematical model.

Rather than assuming a single fixed skill for each player, the system characterises its belief using a bell-curve belief distribution (also referred to as Gaussian) which is uniquely described by its mean μ (mu) ("peak point") and standard deviation σ (sigma) ("spread"). An exemplary belief is shown in the figure.

TrueSkill graph

Note that the area under the skill belief distribution curve within a certain range corresponds to the belief that the player's skill will lie in that range. For example, the green area in the figure on the right is the belief that the player's skill is within level 15 and 20. As the system learns more about a player's skill, σ has the tendency to become smaller, more tightly bracketing that player's skill. Another way of thinking about the μ and σ values is to consider them as the "average player skill belief" and the "uncertainty" associated with that assessment of their skill.

The math is more complex than the relatively simple classic Elo ranking system, but TrueSkill should result in more accurate ranking and matching. Remember, that's why we're doing this: to achieve the best possible gameplay experience for all players, not just the elite players who have hundreds of games under their belt.
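To get a feel for how that uncertainty feeds into matchmaking, here's a sketch of the two-player match quality formula from the TrueSkill paper, using the system's published defaults (μ = 25, σ = 25/3, β = 25/6). Match quality is the relative probability of a draw, and it is highest when the means are close and both uncertainties are small:

    from math import exp, sqrt

    MU0, SIGMA0, BETA = 25.0, 25.0 / 3, 25.0 / 6  # TrueSkill defaults

    def match_quality(mu1, sigma1, mu2, sigma2, beta=BETA):
        # Relative draw probability for a 1v1 match, per the TrueSkill paper.
        denom = 2 * beta ** 2 + sigma1 ** 2 + sigma2 ** 2
        return sqrt(2 * beta ** 2 / denom) * exp(-((mu1 - mu2) ** 2) / (2 * denom))

    # Two brand-new players: identical means, but high uncertainty drags quality down.
    print(match_quality(MU0, SIGMA0, MU0, SIGMA0))  # ~0.45

    # Two well-established players with the same mean: a near-ideal match.
    print(match_quality(30, 1.0, 30, 1.0))          # ~0.97

Two unknown players might in fact be perfectly matched, but the system can't know that yet, so it hedges its bets.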

Our goal is to match players of all skill levels to their counterparts, and to let players reliably rise in rank until they reach a natural plateau of 50% win/loss ratio. We don't want blowouts. We want nail-biting, edge-of-your-seat cliffhangers every time, games that go down to the wire. We will know we've succeeded when each player feels like every game was the equivalent of slaying a massive, mighty dragon -- and not beating up on some puny wharf rats.

Discussion

Steve McConnell in the Doghouse

I often trot out Steve McConnell's doghouse analogy to illustrate how small projects aren't necessarily representative of the problems you'll encounter on larger projects.

People who have written a few small programs in college sometimes think that writing large, professional programs is the same kind of work -- only on a larger scale. It is not the same kind of work. I can build a beautiful doghouse in my backyard in a few hours. It might even take first prize at the county fair's doghouse competition. But that does not imply that I have the expertise to build a skyscraper. The skyscraper project requires an entirely more sophisticated kind of expertise.

There's a similar passage in Rapid Development, which I cited in Following the Instructions on the Paint Can.

What happens if you don't follow the instructions? If you're painting a doghouse on a hot Tuesday night after work, you might only have 2 hours to do the job, and Fido needs a place to sleep that night. You don't have time to follow the instructions. You might decide that you can skip steps 1 through 3 and apply a thick coat rather than a thin one in step 4. If the weather's right and Fido's house is made of wood and isn't too dirty, your approach will probably work fine.

an old doghouse

Over the next few months the paint might crack from being too thick or it might flake off from the metal surfaces of the nails where you didn't prime them, and you might have to repaint it again next year, but it really doesn't matter.

What if, instead of a doghouse, you're painting a Boeing 747? In that case, you had better follow the instructions to the letter. If you don't strip off the previous coat, you'll incur significant fuel efficiency and safety penalties: a coat of paint on a 747 weighs 400 to 800 pounds. If you don't prepare the surface adequately, wind and rain attacking the paint at 600 miles per hour will take their toll much quicker than a gentle wind and rain will on Fido's doghouse.

The underlying lesson is the same: what works for small projects may be a total disaster on a larger scale. Being a competent software engineer means choosing appropriate strategies for the size of the project you're working on. Are you working on a doghouse, a skyscraper, a jet airliner, or the space shuttle?

Perhaps that's why I was so entertained by Steve's most recent blog post. He documents building a fort for his kids. It's not exactly a doghouse, but it's close. Along the way, Steve applies his considerable software estimation and project planning skills to the project. (Remember, this is the guy who quite literally wrote the books on these subjects.) It's a small project, too, so our odds of success are about as good as they're going to get.

Whenever I do a physical construction project like this I try to pay attention to which attributes of the project are similar to software projects and which are different. The comparisons are made more challenging by the fact that my construction projects are recreational, whereas I'm trying to draw comparisons to commercial software projects. For the first half of the project, no good similarities jumped out at me. But as the project started to take much longer than I expected, I began to see more and more similarities between my estimates on the fort and problems people run into with software estimates.

How did it go?

Days 3-6 went about like Days 1 & 2 had gone, which is to say there were lots of little tasks that turned out to be medium-sized tasks, there were little tasks that I just hadn't anticipated, and most things took longer than I had planned. By the end of Day 7 (my buffer day), I was done with the tasks I had planned for Day 3 and had a tiny start on Day 4, which is to say that I'd completed the decking, hadn't started on the railings or framing, and had one wall of the fort framed, but that was all.

My original plan had called for about a week full time and then another couple of weeks of finishing up loose ends like painting, installing trim, and so on. I finished the fort about 6 weeks after I started it, so I was about 100% over my planned schedule, and I ended up at 2-3x my originally planned effort.

Steve got consumed by the unpredictable details. This completely mirrors my software project experience. Often, you can't even begin to accurately estimate how long something will take until you start doing it. At least some of it. That's why so many teams turn to agile, iterative development techniques; part of each iteration involves exploring all those unknowns and turning them into slightly-less-unknowns for the next iteration. The faster we iterate, the closer we get to an accurate estimate, and the more work we get done along the way. We plan by doing.

This is easily my favorite post in Steve's blog to date. Do read the entire post for all the gory details of how things went awry. It's a storybook example of how an avalanche of little problems can snowball into one huge project delay – even if you're Steve McConnell. And even if you're only building a doghouse fort.

Discussion

LCD Monitor Arms

Steve Olson contacted me a few weeks ago after he saw my post on ergonomic computing. Steve works for Ergotron, and offered to comp me some monitor arms.

Usually when I get offered free items related to my blog, I politely decline. I don't want a conflict of interest, and frankly most of the time what I am offered doesn't appeal to me anyway. But as a long time triple monitor enthusiast, I've been pondering the idea of using LCD monitor arms for a few months now, and Ergotron is a brand I've seen recommended in a bunch of places. So I accepted Steve's offer. Take this as a disclaimer; I was provided these monitor arms gratis, so my opinion may be biased.

Initially I was interested in their triple monitor mount, but Steve waved me away from that and recommended three Ergotron LX Desk Mount LCD Arms instead. I figured he's the expert, so I followed his advice.

I got the LX monitor arms installed earlier this week, after a few trials and tribulations, and here's the end result. What you're seeing is a triple monitor configuration, two 24" widescreen SyncMaster 245BW monitors (1920 x 1200), with a single 21.3" SyncMaster 213T in the middle (1600 x 1200).

ergotron monitor arms on 3 LCDs, front

The arms free up a ton of space under the monitors. With the large base of the 245BW, I couldn't put my mouse where I wanted to. Now I can.

All LCDs have a standard VESA mounting plate in the rear, which the arms are designed to slot into like Lego bricks. Installing the arms is a simple matter of unbolting the default base from your LCD monitor, and bolting the monitor to the arm. However, be careful – there is more than one VESA mounting plate standard, and larger monitors require larger plates. Ergotron provides a display compatibility page where you can check to see what you'll need. The 100 x 100 mm VESA mounting panel is very common, but if your monitor is 22" or larger you might have something bigger, as I did on my 24"s.

ergotron monitor arms on 3 LCDs, back closeup

Before I set the monitor arms up, I had them sitting on my desk for test purposes, and they were quite the conversation pieces. People kept stopping by my office to check them out and ask what they were, and how they worked.

ergotron monitor arms on 3 LCDs, back, angled

As you can see, I have a 2x1 and 1x1 configuration, two bases with three arms. The base clamps against the bottom of the desk, so you'll need an unobstructed edge to attach them to. We found the best results came from clamping both bases through the holes cut in the desk for cable routing. These holes were never obstructed and offered the best positioning options. There's a clever little integrated cable conduit built into the underside of the larger part of the arm, so the cables don't droop behind the monitor.

ergotron monitor arms on 3 LCDs, back

The Ergotron LX monitor arms have exceeded my expectations. They …

  • give your monitors a sleek "floating" look
  • free up a substantial amount of space on your desk
  • allow you to reposition the monitors at will

The arms are counterweighted and spring tensioned; once you adjust the tension screws for the monitor, you can (almost) effortlessly rotate, angle, and move the LCDs horizontally or vertically within the range of the arm. Turning a monitor to demo something, and then slotting it back into place when I'm done, is no problem.

When I initially considered monitor arms, I decided it was a better investment to upgrade one of my monitors than it was to buy three monitor arms that I may or may not like. Now that I've used these arms for a few days, I like them so much I'm considering buying a set for myself to use at home. Yes, with my own money. Amazon carries the Ergotron LX in black and silver for about $115. You can even use these arms to hold your laptop.

Fancy monitor arms are by no means required. The stock LCD monitor bases included by the manufacturers work just fine. But if you're willing to invest a bit and customize your setup, I think the payoff – in terms of flexibility, looks, and convenience – is well worth it.

Discussion

On Expose, Flip3D, and Switcher

I'm one of the rare people who actually likes Windows Vista. Sure, it's far from what was originally promised in terms of features, but it's still a solid quality of life improvement over the crusty old 2001 version of Windows XP. Or at least it will be, once Service Pack 1 is released.

Like anything else, there's plenty to be critical of in Vista. One particular feature of Vista that I find almost unforgivably lame is Flip3D.

Windows Vista Flip3D

It's a third-rate clone of Apple's OSX Expose feature, which itself is an exploration of zoomable UI.

OSX Expose

Vista's Flip3D certainly looks cool enough, and you can use your mouse wheel to spin the windows around, which is entertaining for a few seconds. But it fails miserably in terms of actual usability:

  • It only uses the primary monitor to show the window list, so any additional display space you have is completely wasted.
  • The windows are stacked on top of each other, partially obscuring every window except the topmost one. This also makes the target area for selecting and clicking on any stacked window very small.
  • The arbitrary switch from a 2D desktop into a 3D display space is mentally disconcerting. This change also slants and distorts the windows, so readability is lower than it should be.

In its effort to distinguish Vista from OSX, Redmond created a complete non-feature. Flip3D is barely better than nothing.

But we don't have to suffer through Flip3D when we can replace it. There are several nice alternatives. Personally, I recommend disabling Flip3D and mapping Bao Nguyen's outstanding Switcher to the Windows+Tab key combination.

Switcher screenshot

I just noticed that Bao released a new beta version of Switcher:

  • Middle-click a window to close it.
  • The first 9 windows can be selected by number; the numeric shortcut is superimposed over the window.
  • Right-click a window to open it, and minimize all other windows.
  • Windows now have a large text label superimposed in the corner with the title and icon so you can tell what they are. This is helpful if you have many similar-looking windows, or if they're thumbnailed particularly small.
  • You can perform an incremental filtering search on all open windows.

These are some killer new features. I've wanted to close windows by middle-clicking on them from the zoomed view forever. But the last item on that list is huge. It's a game changer. Instead of playing Where's Waldo with my windows, I can press Windows+Tab, then type what I want. It's exactly what I described in The Problem With Tabbed Interfaces:

So how can we fix this? How can we integrate tabs with the existing navigational features of the operating system, such as the taskbar, and Expose? I keep coming back to search as the dominant computing metaphor. The only thing I can think of is a plain-text search facility where I type "Gmail", and the OS would automatically highlight that tab (or window) and bring it to the front. That presupposes a very high level of integration between the application tabs and the operating system, however.

It looks like Bao Nguyen was reading my mind. Pressing Windows+Tab, then typing "Gmail" is the best thing ever as far as I'm concerned. No, I can't search tab contents, but I can now match by any window title, which is good enough. The way I can begin typing and watch the windows dynamically fling themselves offscreen as they fall out of my filter in real time is a huge productivity boost. I cannot overstate how significant this feature is. It redefines the way I deal with windows; I can type what I want instead of expending the mental effort to visually scan thumbnails of 20 different windows.
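The core of that interaction is simple enough to sketch: re-filter the list of window titles on every keystroke. The titles here are hypothetical, and a real implementation would enumerate live windows through the OS, but the filtering itself is this trivial:

    windows = ["Gmail - Mozilla Firefox", "notes.txt - Notepad",
               "Visual Studio", "Gmail - compose", "Task Manager"]

    def filter_windows(titles, query):
        # Keep windows whose title contains the typed text, ignoring case.
        q = query.lower()
        return [t for t in titles if q in t.lower()]

    # Each keystroke narrows the candidate set in real time.
    for typed in ("g", "gm", "gma"):
        print(typed, "->", filter_windows(windows, typed))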

Unlike Flip3D, the graphical frills of Switcher are all in service to the functionality. That's the way it should be. I highly recommend trying out the latest beta of Switcher. But be sure to bring a fast video card to the table for the best experience.

Discussion

Everything Is Fast For Small n

Let's say you're about to deploy an application. Said app has been heavily tested by your development team, who have all been infected by unit testing fever. It's also been vetted by your QA group, who spent months spelunking into every crevice of the app. You even had a beta test period with real live users, who dutifully filed bugs and put the application through its paces before finally signing off on the thing.

Your application is useful and popular. Your users love it. Your users love you. But over the next week, something curious happens. As people use the application, it gets progressively slower and slower. Soon, the complaints start filtering in. Within a few weeks, the app is well-nigh unusable due to all the insufferable delays it subjects users to – and your users turn on you.

Raise your hand if this has ever happened to a project you've worked on. If I had a buck for every time I've personally seen this, I'd have enough for a nice lunch date. Developers test with tiny toy data sets, assume all is well, and then find out the hard way that everything is fast for small n.

I remember a client-side Javascript sort routine we implemented in a rich intranet web app circa 2002. It worked great on our small test datasets, but when we deployed it to production, we were astonished to find that sorting a measly hundred items could take upwards of 5 seconds on a user's desktop machine. JavaScript isn't known for its speed, but what the heck?

Well, guess which sort algorithm we used?

animation of sorting algorithms

InsertSort is n² (worst case), ShellSort is n^(3/2), and QuickSort is n log n. But we could have done worse – we could have picked Bubble Sort, which is n² even in the best case.
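You can watch the gap between O(n²) and O(n log n) open up with a few lines of timing code. A sketch; absolute numbers will vary by machine, but the quadratic curve shows up fast:

    import random
    import time

    def insertion_sort(a):
        # Textbook O(n^2) insertion sort.
        for i in range(1, len(a)):
            key, j = a[i], i - 1
            while j >= 0 and a[j] > key:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = key
        return a

    for n in (100, 1_000, 10_000):
        data = [random.random() for _ in range(n)]
        start = time.perf_counter()
        insertion_sort(data[:])
        quadratic = time.perf_counter() - start
        start = time.perf_counter()
        sorted(data)  # Python's built-in Timsort: O(n log n)
        linearithmic = time.perf_counter() - start
        print(f"n={n:>6}: insertion {quadratic:.4f}s   built-in {linearithmic:.4f}s")

Each 10x increase in n costs the built-in sort a little more than 10x the time; it costs insertion sort about 100x.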

Friends, do not do this. Test your applications with large data sets, at least large enough to cover your most optimistic usage projections over the next year. Otherwise, you may be sorry. And your users definitely will be sorry.

Big O notation is one of those things that's easier seen than explained. But it's a fundamental building block of computer science.

Big O notation: A theoretical measure of the execution of an algorithm, usually the time or memory needed, given the problem size n, which is usually the number of items. Informally, saying some equation f(n) = O(g(n)) means it is less than some constant multiple of g(n). The notation is read, "f of n is big oh of g of n".
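A quick worked example of that definition: f(n) = 3n² + 2n is O(n²), because one constant multiple of n² eventually dominates it:

    3n² + 2n ≤ 3n² + 2n² = 5n²  for all n ≥ 1

So f(n) = O(n²) with constant c = 5. The lower-order 2n term, and any constant factors, wash out as n grows -- which is exactly why big O keeps only the dominant term.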

Developers rely on data structures and databases that have favorable big O notation performance baked in, without thinking much about it. But if you stray from those well-worn paths, you can be in a world of performance pain – and much sooner than you could have possibly imagined. The importance of big O notation is best illustrated in this graph from Programming Pearls:

runtime on TRS-80 vs. DEC Alpha

The TRS-80 algorithm is 48n, and the DEC Alpha algorithm is n³.

When n is 10, they're within a second of each other. But when n grows to 100,000, the modern DEC Alpha takes a month to do what a prehistoric TRS-80 can accomplish in a few hours. Having a big O notation bottleneck in your app is a one-way ticket in the performance wayback machine to 1997 – or worse. Much worse.
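Working backward from the chart, the unit costs I'd infer (these are my assumptions, not figures from the book) are about a nanosecond per unit of work on the Alpha and about a millisecond on the TRS-80. Any constants tell the same story, though: the fast machine with the cubic algorithm wins for small n and loses catastrophically for large n:

    def alpha_seconds(n):
        # Assumed cost model: n^3 units of work at ~1 nanosecond each.
        return n ** 3 * 1e-9

    def trs80_seconds(n):
        # Assumed cost model: 48n units of work at ~1 millisecond each.
        return 48 * n * 1e-3

    for n in (10, 1_000, 10_000, 100_000):
        print(f"n={n:>7}: Alpha {alpha_seconds(n):>14.3f}s   TRS-80 {trs80_seconds(n):>10.1f}s")

Under these assumptions the crossover sits where n³ nanoseconds equals 48n milliseconds, around n ≈ 6,900; beyond that, the cubic algorithm buries the faster hardware.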

There are friendly names for the common big O notations; saying "n squared" is equivalent to saying "quadratic":

notation       friendly name
O(1)           constant
O(log n)       logarithmic
O([log n]^c)   polylogarithmic
O(n)           linear
O(n log n)     sometimes called "linearithmic" or "supralinear"
O(n²)          quadratic
O(n^c)         polynomial, sometimes "geometric"
O(c^n)         exponential
O(n!)          factorial

Tom Niemann has handy charts that compare the growth rates of common algorithms, which I've adapted here:

n            lg n   n^(7/6)      n lg n       n²               (7/6)^n           n!
1            0      1            0            1                1                 1
16           4      25           64           256              12                20.9 trillion
256          8      645          2,048        65,536           137 quadrillion   -
4,096        12     16,384       49,152       16,777,216       -                 -
65,536       16     416,128      1,048,576    4,294,967,296    -                 -
1,048,476    20     10,567,808   20,969,520   1.09 trillion    -                 -
16,775,616   24     268,405,589  402,614,784  281.4 trillion   -                 -

Here are sample execution times, assuming one unit of execution is equal to one millisecond of CPU time. That's probably far too long for today's CPUs, but you get the idea:

n            lg n     n^(7/6)   n lg n     n²           (7/6)^n           n!
1            <1 sec   <1 sec    <1 sec     <1 sec       <1 sec            <1 sec
16           <1 sec   <1 sec    <1 sec     <1 sec       <1 sec            663 years
256          <1 sec   <1 sec    2 sec      65 sec       4.3 million yrs   -
4,096        <1 sec   16 sec    49 sec     4.6 hr       -                 -
65,536       <1 sec   7 min     17 min     50 days      -                 -
1,048,476    <1 sec   3 hr      6 hr       35 years     -                 -
16,775,616   <1 sec   3 days    4.6 days   8,923 years  -                 -

Notice how quickly we get into trouble as the number of items (n) increases. Unless you've chosen data structures that offer ideal near-logarithmic performance across the board, by the time you get to 4,096 items you're talking about some serious CPU time. Beyond that, I used a dash as shorthand for "forever". That's how bad it can get.

Everything is fast for small n. Don't fall into this trap. It's an easy enough mistake to make. Modern apps are incredibly complex, with dozens of dependencies. Neglect to index one little field in your database and you're suddenly back in TRS-80 land. The only way to truly know if you've accidentally slipped an algorithmic big O bottleneck into your app somewhere is to test it with a reasonably large volume of data.
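The forgotten-index scenario is the same big O story in disguise: an unindexed lookup is a linear scan, an indexed one is a hash or tree lookup. A toy sketch in pure Python, with a list standing in for the unindexed column and a dict standing in for the index:

    import time

    N = 1_000_000
    rows = list(range(N))                               # "table", no index
    index = {value: i for i, value in enumerate(rows)}  # the "index"

    start = time.perf_counter()
    rows.index(N - 1)          # O(n) scan, like a query on an unindexed field
    print("scan:  ", time.perf_counter() - start)

    start = time.perf_counter()
    index[N - 1]               # O(1) average-case hash lookup
    print("lookup:", time.perf_counter() - start)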

Discussion