Coding Horror

programming and human factors

Thanks For Ruining Another Game Forever, Computers

In 2006, after visiting the Computer History Museum's exhibit on Chess, I opined:

We may have reached an inflection point. The problem space of chess is so astonishingly large that incremental increases in hardware speed and algorithms are unlikely to result in meaningful gains from here on out.

So. About that. Turns out I was kinda … totally completely wrong. The number of possible moves, or "problem space", of Chess is indeed astonishingly large, estimated to be 1050:

100,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000

Deep Blue was interesting because it forecast a particular kind of future, a future where specialized hardware enabled brute force attack of the enormous chess problem space, as its purpose built chess hardware outperformed general purpose CPUs of the day by many orders of magnitude. How many orders of magnitude? In the heady days of 1997, Deep Blue could evaluate 200 million chess positions per second. And that was enough to defeat Kasparov, the highest ever ranked human player – until 2014 at least. Even though one of its best moves was the result of a bug.

200,000,000

In 2006, about ten years later, according to the Fritz Chess benchmark, my PC could evaluate only 4.5 million chess positions per second.

4,500,000

Today, about twenty years later, that very same benchmark says my PC can evaluate a mere 17.2 million chess positions per second.

17,200,000

Ten years, four times faster. Not bad! Part of that is I went from dual to quad core, and these chess calculations scale almost linearly with the number of cores. An eight core CPU, no longer particularly exotic, could probably achieve ~28 million on this benchmark today.

28,000,000

I am not sure the scaling is exactly linear, but it's fair to say that even now, twenty years later, a modern 8 core CPU is still about an order of magnitude slower at the brute force task of evaluating chess positions than what Deep Blue's specialized chess hardware achieved in 1997.

But here's the thing: none of that speedy brute forcing matters today. Greatly improved chess programs running on mere handheld devices can perform beyond grandmaster level.

In 2009 a chess engine running on slower hardware, a 528 MHz HTC Touch HD mobile phone running Pocket Fritz 4 reached the grandmaster level – it won a category 6 tournament with a performance rating of 2898. Pocket Fritz 4 searches fewer than 20,000 positions per second. This is in contrast to supercomputers such as Deep Blue that searched 200 million positions per second.

As far as chess goes, despite what I so optimistically thought in 2006, it's been game over for humans for quite a few years now. The best computer chess programs, vastly more efficient than Deep Blue, combined with modern CPUs which are now finally within an order of magnitude of what Deep Blue's specialized chess hardware could deliver, play at levels way beyond what humans can achieve.

Chess: ruined forever. Thanks, computers. You jerks.

Despite this resounding defeat, there was still hope for humans in the game of Go. The number of possible moves, or "problem space", of Go is estimated to be 10170:

1,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000

Remember that Chess had a mere fifty zeroes there? Go has more possible moves than there are atoms in the universe.

Wrap your face around that one.

Deep Blue was a statement about the inevitability of eventually being able to brute force your way around a difficult problem with the constant wind of Moore's Law at your back. If Chess is the quintessential European game, Go is the quintessential Asian game. Go requires a completely different strategy. Go means wrestling with a problem that is essentially impossible for computers to solve in any traditional way.

A simple material evaluation for chess works well – each type of piece is given a value, and each player receives a score depending on his/her remaining pieces. The player with the higher score is deemed to be 'winning' at that stage of the game.

However, Chess programmers innocently asking Go players for an evaluation function would be met with disbelief! No such simple evaluation exists. Since there is only a single type of piece, only the number each player has on the board could be used for a simple material heuristic, and there is almost no discernible correlation between the number of stones on the board and what the end result of the game will be.

Analysis of a problem this hard, with brute force completely off the table, is colloquially called "AI", though that term is a bit of a stretch to me. I prefer to think of it as building systems that can learn from experience, aka machine learning. Here's a talk which covers DeepMind learning to play classic Atari 2600 videogames. (Jump to the 10 minute mark to see what I mean.)

As impressive as this is – and it truly is – bear in mind that games as simple as Pac-Man still remain far beyond the grasp of Deep Mind. But what happens when you point a system like that at the game of Go?

DeepMind built a system, AlphaGo, designed to see how far they could get with those approaches in the game of Go. AlphaGo recently played one of the best Go players in the world, Lee Sedol, and defeated him in a stunning 4-1 display. Being the optimist that I am, I guessed that DeepMind would win one or two games, but a near total rout like this? Incredible. In the space of just 20 years, computers went from barely beating the best humans at Chess, with a problem space of 1050, to definitively beating the best humans at Go, with a problem space of 10170. How did this happen?

Well, a few things happened, but one unsung hero in this transformation is the humble video card, or GPU.

Consider this breakdown of the cost of floating point operations over time, measured in dollars per gigaflop:

1961$8,300,000,000
1984$42,780,000
1997$42,000
2000$1,300
2003$100
2007$52
2011$1.80
2012$0.73
2013$0.22
2015$0.08

What's not clear in this table is that after 2007, all the big advances in FLOPS came from gaming video cards designed for high speed real time 3D rendering, and as an incredibly beneficial side effect, they also turn out to be crazily fast at machine learning tasks.

The Google Brain project had just achieved amazing results — it learned to recognize cats and people by watching movies on YouTube. But it required 2,000 CPUs in servers powered and cooled in one of Google’s giant data centers. Few have computers of this scale. Enter NVIDIA and the GPU. Bryan Catanzaro in NVIDIA Research teamed with Andrew Ng’s team at Stanford to use GPUs for deep learning. As it turned out, 12 NVIDIA GPUs could deliver the deep-learning performance of 2,000 CPUs.

Let's consider a related case of highly parallel computation. How much faster is a GPU at password hashing?

Radeon 79708213.6 M c/s
6-core AMD CPU52.9 M c/s

Only 155 times faster right out of the gate. No big deal. On top of that, CPU performance has largely stalled in the last decade. While more and more cores are placed on each die, which is great when the problems are parallelizable – as they definitely are in this case – the actual performance improvement of any individual core over the last 5 to 10 years is rather modest.

But GPUs are still doubling in performance every few years. Consider password hash cracking expressed in the rate of hashes per second:

GTX 295200925k
GTX 690201254k
GTX 780 Ti2013100k
GTX 980 Ti2015240k

The latter video card is the one in my machine right now. It's likely the next major revision from Nvidia, due later this year, will double these rates again.

(While I'm at it, I'd like to emphasize how much it sucks to be an 8 character password in today's world. If your password is only 8 characters, that's perilously close to no password at all. That's also why why your password is (probably) too damn short. In fact, we just raised the minimum allowed password length on Discourse to 10 characters, because annoying password complexity rules are much less effective in reality than simply requiring longer passwords.)

Distributed AlphaGo used 1202 CPUs and 176 GPUs. While that doesn't sound like much, consider that as we've seen, each GPU can be up to 150 times faster at processing these kinds of highly parallel datasets — so those 176 GPUs were the equivalent of adding ~26,400 CPUs to the task. Or more!

Even if you don't care about video games, they happen to have a profound accidental impact on machine learning improvements. Every time you see a new video card release, don't think "slightly nicer looking games" think "wow, hash cracking and AI just got 2× faster … again!"

I'm certainly not making the same mistake I did when looking at Chess in 2006. (And in my defense, I totally did not see the era of GPUs as essential machine learning aids coming, even though I am a gamer.) If AlphaGo was intimidating today, having soundly beaten the best human Go player in the world, it'll be no contest after a few more years of GPUs doubling and redoubling their speeds again.

AlphaGo, broadly speaking, is the culmination of two very important trends in computing:

  1. Huge increases in parallel processing power driven by consumer GPUs and videogames, which started in 2007. So if you're a gamer, congratulations! You're part of the problem-slash-solution.

  2. We're beginning to build sophisticated (and combined) algorithmic approaches for entirely new problem spaces that are far too vast to even begin being solved by brute force methods alone. And these approaches clearly work, insofar as they mastered one of the hardest games in the world, one that many thought humans would never be defeated in.

Great. Another game ruined forever by computers. Jerks.

Based on our experience with Chess, and now Go, we know that computers will continue to beat us at virtually every game we play, in the same way that dolphins will always swim faster than we do. But what if that very same human mind was capable of not only building the dolphin, but continually refining it until they arrived at the world's fastest minnow? Where Deep Blue was the more or less inevitable end result of brute force computation, AlphaGo is the beginning of a whole new era of sophisticated problem solving against far more enormous problems. AlphaGo's victory is not a defeat of the human mind, but its greatest triumph.

(If you'd like to learn more about the powerful intersection of sophisticated machine learning algorithms and your GPU, read this excellent summary of AlphaGo and then download the DeepMind Atari learner and try it yourself.)

[advertisement] Find a better job the Stack Overflow way - what you need when you need it, no spam, and no scams.
Discussion

We Hire the Best, Just Like Everyone Else

One of the most common pieces of advice you'll get as a startup is this:

Only hire the best. The quality of the people that work at your company will be one of the biggest factors in your success – or failure.

I've heard this advice over and over and over at startup events, to the point that I got a little sick of hearing it. It's not wrong. Putting aside the fact that every single other startup in the world who heard this same advice before you is already out there frantically doing everything they can to hire all the best people out from under you and everyone else, it is superficially true. A company staffed by a bunch of people who don't care about their work and aren't good at their jobs isn't exactly poised for success. But in a room full of people giving advice to startups, nobody wants to talk about the elephant in that room:

It doesn't matter how good the people are at your company when you happen to be working on the wrong problem, at the wrong time, using the wrong approach.

Most startups, statistically speaking, are going to fail.

And they will fail regardless of whether they hired "the best" due to circumstances largely beyond their control. So in that context does maximizing for the best possible hires really make sense?

Given the risks, I think maybe "hire the nuttiest risk junkie adrenaline addicted has-ideas-so-crazy-they-will-never-work people you can find" might actually be more practical startup advice. (Actually, now that I think about it, if that describes you, and you have serious Linux, Ruby, and JavaScript chops, perhaps you should email me.)

Okay, the goal is to increase your chance of success, however small it may be, therefore you should strive to hire the best. Seems reasonable, even noble in its way. But this pursuit of the best unfortunately comes with a serious dark side. Can anyone even tell me what "best" is? By what metrics? Judged by which results? How do we measure this? Who among us is suitable to judge others as the best at … what, exactly? Best is an extreme. Not pretty good, not very good, not excellent, but aiming for the crème de la crème, the top 1% in the industry.

The real trouble with using a lot of mediocre programmers instead of a couple of good ones is that no matter how long they work, they never produce something as good as what the great programmers can produce.

Pursuit of this extreme means hiring anyone less than the best becomes unacceptable, even harmful:

In the Macintosh Division, we had a saying, “A players hire A players; B players hire C players” – meaning that great people hire great people. On the other hand, mediocre people hire candidates who are not as good as they are, so they can feel superior to them. (If you start down this slippery slope, you’ll soon end up with Z players; this is called The Bozo Explosion. It is followed by The Layoff.) — Guy Kawasaki

There is an opportunity cost to keeping someone when you could do better. At a startup, that opportunity cost may be the difference between success and failure. Do you give less than full effort to make your enterprise a success? As an entrepreneur, you sweat blood to succeed. Shouldn’t you have a team that performs like you do? Every person you hire who is not a top player is like having a leak in the hull. Eventually you will sink. — Jon Soberg

Why am I so hardnosed about this? It’s because it is much, much better to reject a good candidate than to accept a bad candidate. A bad candidate will cost a lot of money and effort and waste other people’s time fixing all their bugs. Firing someone you hired by mistake can take months and be nightmarishly difficult, especially if they decide to be litigious about it. In some situations it may be completely impossible to fire anyone. Bad employees demoralize the good employees. And they might be bad programmers but really nice people or maybe they really need this job, so you can’t bear to fire them, or you can’t fire them without pissing everybody off, or whatever. It’s just a bad scene.

On the other hand, if you reject a good candidate, I mean, I guess in some existential sense an injustice has been done, but, hey, if they’re so smart, don’t worry, they’ll get lots of good job offers. Don’t be afraid that you’re going to reject too many people and you won’t be able to find anyone to hire. During the interview, it’s not your problem. Of course, it’s important to seek out good candidates. But once you’re actually interviewing someone, pretend that you’ve got 900 more people lined up outside the door. Don’t lower your standards no matter how hard it seems to find those great candidates. — Joel Spolsky

I don't mean to be critical of anyone I've quoted. I love Joel, we founded Stack Overflow together, and his advice about interviewing and hiring remains some of the best in the industry. It's hardly unique to express these sort of opinions in the software and startup field. I could have cited two dozen different articles and treatises about hiring that say the exact same thing: aim high and set out to hire the best, or don't bother.

This risk of hiring not-the-best is so severe, so existential a crisis to the very survival of your company or startup, the hiring process has to become highly selective, even arduous. It is better to reject a good applicant every single time than accidentally accept one single mediocre applicant. If the interview process produces literally anything other than unequivocal "Oh my God, this person is unbelievably talented, we have to hire them", from every single person they interviewed with, right down the line, then it's an automatic NO HIRE. Every time.

This level of strictness always made me uncomfortable. I'm not going to lie, it starts with my own selfishness. I'm pretty sure I wouldn't get hired at big, famous companies with legendarily difficult technical interview processes because, you know, they only hire the best. I don't think I am one of the best. More like cranky, tenacious, and outspoken, to the point that I wake up most days not even wanting to work with myself.

If your hiring attitude is that it's better to be possibly wrong a hundred times so you can be absolutely right one time, you're going to be primed to throw away a lot of candidates on pretty thin evidence.

Perhaps worst of all, if the interview process is predicated on zero doubt, total confidence … maybe this candidate doesn't feel right because they don't look like you, dress like you, think like you, speak like you, or come from a similar background as you? Are you accidentally maximizing for hidden bias?

One of the best programmers I ever worked with was Susan Warren, an ex-Microsoft engineer who taught me about the People Like Us problem, way back in 2004:

I think there is a real issue around diversity in technology (and most other places in life). I tend to think of it as the PLU problem. Folk (including MVPs) tend to connect best with folks most like them ("People Like Us"). In this case, male MVPs pick other men to become MVPs. It's just human nature.

As one reply notes, diversity is good. I'd go as far as to say it's awesome, amazing, priceless. But it's hard to get to -- the classic chicken and egg problem -- if you rely on your natural tendencies alone. In that case, if you want more female MVPs to be invited you need more female MVPs. If you want more Asian-American MVPs to be invited you need more Asian-American MVPs, etc. And the (cheap) way to break a new group in is via quotas.

IMO, building diversity via quotas is bad because they are unfair. Educating folks on why diversity is awesome and how to build it is the right way to go, but also far more costly.

Susan was (and is) amazing. I learned so much working under her, and a big part of what made her awesome was that she was very much Not Like Me. But how could I have appreciated that before meeting her? The fact is that as human beings, we tend to prefer what's comfortable, and what's most comfortable of all is … well, People Like Us. The effect can be shocking because it's so subtle, so unconscious – and yet, surprisingly strong:

  • Baseball cards held by a black hand consistently sold for twenty percent less than those held by a white hand.

  • Using screens to hide the identity of auditioning musicians increased women's probability of advancing from preliminary orchestra auditions by fifty percent.

  • Denver police officers and community members were shown rapidly displayed photos of black and white men, some holding guns, some holding harmless objects like wallets, and asked to press either the "Shoot" or "Don't Shoot" button as fast as they could for each image. Both the police and community members were three times more likely to shoot black men.

It's not intentional, it's never intentional. That's the problem. I think our industry needs to shed this old idea that it's OK, even encouraged to turn away technical candidates for anything less than absolute 100% confidence at every step of the interview process. Because when you do, you are accidentally optimizing for implicit bias. Even as a white guy who probably fulfills every stereotype you can think of about programmers, and who is in fact wearing an "I Rock at Basic" t-shirt while writing this very blog post*, that's what has always bothered me about it, more than the strictness. If you care at all about diversity in programming and tech, on any level, this hiring approach is not doing anyone any favors, and hasn't been. For years.

I know what you're thinking.

Fine, Jeff, if you're so smart, and "hiring the best" isn't the right strategy for startups, and maybe even harmful to our field as a whole, what should be doing?

Well, I don't know, exactly. I may be the wrong person to ask because I'm also a big believer in geographic diversity on top of everything else. Here's what the composition of the current Discourse team looks like:

I would argue, quite strongly and at some length, that if you want better diversity in the field, perhaps a good starting point is not demanding that all your employees live within a tiny 30 mile radius of San Francisco or Palo Alto. There's a whole wide world of Internet out there, full of amazing programmers at every level of talent and ability. Maybe broaden your horizons a little, even stretch said horizons outside the United States, if you can imagine such a thing.

I know hiring people is difficult, even with the very best of intentions and under ideal conditions, so I don't mean to trivialize the challenge. I've recommended plenty of things in the past, a smorgasboard of approaches to try or leave on the table as you see fit:

… but the one thing I keep coming back to, that I believe has enduring value in almost all situations, is the audition project:

The most significant shift we’ve made is requiring every final candidate to work with us for three to eight weeks on a contract basis. Candidates do real tasks alongside the people they would actually be working with if they had the job. They can work at night or on weekends, so they don’t have to leave their current jobs; most spend 10 to 20 hours a week working with Automattic, although that’s flexible. (Some people take a week’s vacation in order to focus on the tryout, which is another viable option.) The goal is not to have them finish a product or do a set amount of work; it’s to allow us to quickly and efficiently assess whether this would be a mutually beneficial relationship. They can size up Automattic while we evaluate them.

What I like about audition projects:

  • It's real, practical work.
  • They get paid. (Ask yourself who gets "paid" for a series of intensive interviews that lasts multiple days? Certainly not the candidate.)
  • It's healthy to structure your work so that small projects like this can be taken on by outsiders. If you can't onboard a potential hire, you probably can't onboard a new hire very well either.
  • Interviews, no matter how much effort you put into them, are so hit and miss that the only way to figure out if someone is really going to work in a given position is to actually work with them.

Every company says they want to hire the best. Anyone who tells you they know how to do that is either lying to you or to themselves. But I can tell you this: the companies that really do hire the best people in the world certainly don't accomplish that by hiring from the same tired playbook every other company in Silicon Valley uses.

Try different approaches. Expand your horizons. Look beyond People Like Us and imagine what the world of programming could look like in 10, 20 or even 50 years – and help us move there by hiring to make it so.

* And for the record, I really do rock at BASIC.

[advertisement] Building out your tech team? Stack Overflow Careers helps you hire from the largest community for programmers on the planet. We built our site with developers like you in mind.
Discussion

Is Your Computer Stable?

Over the last twenty years, I've probably built around a hundred computers. It's not very difficult, and in fact, it's gotten a whole lot easier over the years as computers become more highly integrated. Consider what it would take to build something very modern like the Scooter Computer:

  1. Apply a dab of thermal compound to top of case.
  2. Place motherboard in case.
  3. Screw motherboard into case.
  4. Insert SSD stick.
  5. Insert RAM stick.
  6. Screw case closed.
  7. Plug in external power.
  8. Boot.

Bam done.

It's stupid easy. My six year old son and I have built Lego kits that were way more complex than this. Even a traditional desktop build is only a few more steps: insert CPU, install heatsink, route cables. And a server build is merely a few additional steps on top of that, maybe with some 1U or 2U space constraints. Scooter, desktop, or server, if you've built one computer, you've basically built them all.

Everyone breathes a sigh of relief when their newly built computer boots up for the first time, no matter how many times they've done it before. But booting is only the beginning of the story. Yeah, it boots, great. Color me unimpressed. What we really need to know is whether that computer is stable.

Although commodity computer parts are more reliable every year, and vendors test their parts plenty before they ship them, there's no guarantee all those parts will work reliably together, in your particular environment, under your particular workload. And there's always the possibility, however slim, of getting very, very unlucky with subtly broken components.

Because we're rational scientists, we test stuff in our native environment, and collect data to prove our computer is stable. Right? So after we boot, we test.

Memory

I like to start with memory tests, since those require bootable media and work the same on all x86 computers, even before you have an operating system. Memtest86 is the granddaddy of all memory testers. I'm not totally clear what caused the split between that and Memtest86+, but all of them work similarly. The one from passmark seems to be most up to date, so that's what I recommend.

Download the version of your choice, write it to a bootable USB drive, plug it into your newly built computer, boot and let it work its magic. It's all automatic. Just boot it up and watch it go.

(If your computer supports UEFI boot you'll get the newest version 6.x, otherwise you'll see version 4.2 as above.)

I recommend one complete pass of memtest86 at minimum, but if you want to be extra careful, let it run overnight. Also, if you have a lot of memory, memtest can take a while! For our servers with 128GB it took about three hours, and I expect that time scales linearly with the amount of memory.

The "Pass" percentage at the top should get to 100% and the "Pass" count in the table should be greater than one. If you get any errors at all, anything whatsoever other than a clean 100% pass, your computer is not stable. Time to start removing RAM sticks and figure out which one is bad.

OS

All subsequent tests will require an operating system, and one basic iron clad test of stability for any computer is whether it can install an operating system. Pick your free OS of choice, and begin a default install. I recommend Ubuntu Server LTS x64 since it assumes less about your video hardware. Download the ISO and write it to a bootable USB drive. Then boot it.

(Hey look it has a memory test option! How convenient!)

  • Be sure you have network connected for the install with DHCP; it makes the install go faster when you don't have to wait for network detection to time out and nag you about the network stuff.
  • In general, you'll be pressing enter a whole lot to accept all the defaults and proceed onward. I know, I know, we're installing Linux, but believe it or not, they've gotten the install bit down by now.
  • About all you should be prompted for is the username and password of the default account. I recommend jeff and password, because I am one of the world's preeminent computer security experts.
  • If you are installing from USB and get nagged about a missing CD, remove and reinsert the USB drive. No, I don't know why either, but it works.

If anything weird happens during your Ubuntu Server install that prevents it from finalizing the install and booting into Ubuntu Server … your computer is not stable. I know it doesn't sound like much, but this is a decent holistic test as it exercises the whole system in very repeatable ways.

We'll need an OS installed for the next tests, anyway. I'm assuming you've installed Ubuntu, but any Linux distribution should work similarly.

CPU

Next up, let's make sure the brains of the operation are in order: the CPU. To be honest, if you've gotten this far, past the RAM and OS test, the odds of you having a completely broken CPU are fairly low. But we need to be sure, and the best way to do that is to call upon our old friend, Marin Mersenne.

In mathematics, a Mersenne prime is a prime number that is one less than a power of two. That is, it is a prime number that can be written in the form Mn = 2n − 1 for some integer n. They are named after Marin Mersenne, a French Minim friar, who studied them in the early 17th century. The first four Mersenne primes are 3, 7, 31, and 127.

I've been using Prime95 and MPrime – tools that attempt to rip through as many giant numbers as fast as possible to determine if they are prime – for the last 15 years. Here's how to download and install mprime on that fresh new Ubuntu Server system you just booted up.

mkdir mprime
cd mprime
wget ftp://mersenne.org/gimps/p95v298b3.linux64.tar.gz
tar xzvf p95v298b3.linux64.tar.gz
rm p95v298b3.linux64.tar.gz

(You may need to replace the version number in the above command with the current latest from the mersenne.org download page, but as of this writing, that's the latest. Also, if you prefer an older version without the very heat intensive AVX and AVX2 instructions added in 2011 and 2014 respectively, get mprime266-linux64.tar.gz)

Now you have a copy of mprime in your user directory. Start it by typing ./mprime

Just passing through, thanks. Answer N to the GIMPS prompt.

Next you'll be prompted for the number of torture test threads to run. They're smart here and always pick an equal number of threads to logical cores, so press enter to accept that. You want a full CPU test on all cores. Next, select the test type.

  1. Small FFTs (maximum heat and FPU stress, data fits in L2 cache, RAM not tested much).
  2. In-place large FFTs (maximum power consumption, some RAM tested).
  3. Blend (tests some of everything, lots of RAM tested).

They're not kidding when they say "maximum power consumption", as you're about to learn. Select 2. Then select Y to begin the torture and watch your CPU squirm in pain.

Accept the answers above? (Y):
[Main thread Feb 14 05:48] Starting workers.
[Worker #2 Feb 14 05:48] Worker starting
[Worker #3 Feb 14 05:48] Worker starting
[Worker #3 Feb 14 05:48] Setting affinity to run worker on logical CPU #2
[Worker #4 Feb 14 05:48] Worker starting
[Worker #2 Feb 14 05:48] Setting affinity to run worker on logical CPU #3
[Worker #1 Feb 14 05:48] Worker starting
[Worker #1 Feb 14 05:48] Setting affinity to run worker on logical CPU #1
[Worker #4 Feb 14 05:48] Setting affinity to run worker on logical CPU #4
[Worker #2 Feb 14 05:48] Beginning a continuous self-test on your computer.
[Worker #4 Feb 14 05:48] Test 1, 44000 Lucas-Lehmer iterations of M7471105 using FMA3 FFT length 384K, Pass1=256, Pass2=1536.

Now's the time to break out your Kill-a-Watt or similar power consumption meter, if you have it, so you can measure the maximum CPU power draw. On most systems, unless you have an absolute beast of a gaming video card installed, the CPU is the single device that will pull the most heat and power in your system. This is full tilt, every core of your CPU burning as many cycles as possible.

I suggest running the i7z utility from another console session so you can monitor core temperatures and speeds while mprime is running its torture test.

sudo apt-get install i7z
sudo i7z

Let mprime run overnight in maximum heat torture test mode. The Mersenne calculations are meticulously checked, so if there are any mistakes the whole process will halt with an error at the console. And if mprime halts, ever … your computer is not stable.

Watch those CPU temperatures! In addition to absolute CPU temperatures, you'll also want to keep an eye on total heat dissipation in the system. The system fans (if any) should spin up, and the whole system should be kept at reasonable temperatures through this ordeal, or else you're going to have a sick, overheating computer one day.

The bad news is that it's extremely rare to have any kind of practical, real world workload remotely resembling the stress that Mersenne lays on your CPU. The good news is that if your system can survive the onslaught of Mersenne overnight, it's definitely ready for anything you can conceivably throw at it in the future.

Disk

Disks are probably the easiest items to replace in most systems – and the ones most likely to fail over time. We know the disk can't be totally broken since we just installed an OS on the thing, but let's be sure.

Start with a bad blocks test for the whole drive.

sudo badblocks -sv /dev/sda

This exercises the full extent of the disk (in safe read only fashion). Needless to say, any errors here should prompt serious concern for that drive.

Checking blocks 0 to 125034839
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)

Let's check the SMART readings for the drive next.

sudo apt-get install smartmontools
smartctl -i /dev/sda 

That will let you know if the drive supports SMART. Let's enable it, if so, and see the basic drive stats:

smartctl -s on /dev/sda
smartctl -a /dev/sda    

Now we can run some SMART tests. But first check how long the tests on offer will take:

smartctl -c /dev/sda

Run the long test if you have the time, or the short test if you don't:

smartctl -t long /dev/sda

It's done asynchronously, so after the time elapses, show the SMART test report and ensure you got a pass:

smartctl -l selftest /dev/sda 
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       100         -

Next, run a simple disk benchmark to see if you're getting roughly the performance you expect from the drive or array:

dd bs=1M count=512 if=/dev/zero of=test conv=fdatasync
hdparm -Tt /dev/sda

For a system with a basic SSD you should see results at least this good, and perhaps considerably better:

536870912 bytes (537 MB) copied, 1.52775 s, 351 MB/s
Timing cached reads:   11434 MB in  2.00 seconds = 5720.61 MB/sec
Timing buffered disk reads:  760 MB in  3.00 seconds = 253.09 MB/sec

Finally, let's try a more intensive test with bonnie++, a disk benchmark:

sudo apt-get install bonnie++
bonnie++ -f

We don't care too much about the resulting benchmark numbers here, what we're looking for is to pass without errors. And if you get errors during any of the above … your computer is not stable.

(I think these disk tests are sufficient for general use, particularly if you consider drives easily RAID-able and replaceable as I do. However, if you want to test your drives more exhaustively, a good resource is the FreeNAS "how to burn in hard drives" topic.)

Network

I don't have a lot of experience with network hardware failure, to be honest. But I do believe in the cult of bandwidth, and that's one thing we can check.

You'll need two machines for an iperf test, which makes it more complex. Here's the server, let's say it's at 10.0.0.1:

sudo apt-get install iperf
iperf -s

and here's the client, which will connect to the server and record how fast it can transmit data between the two:

sudo apt-get install iperf
iperf -c 10.0.0.1

------------------------------------------------------------
Client connecting to 10.0.0.1, TCP port 5001
TCP window size: 23.5 KByte (default)
------------------------------------------------------------
[  3] local 10.0.0.2 port 43220 connected with 10.0.0.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes    933 Mbits/sec

As a point of reference, you should expect to see roughly 120 megabytes/sec (aka 960 megabits) of real world throughput on a single gigabit ethernet connection. If you're lucky enough to have a 10 gigabit connection, well, good luck reaching that meteoric 1.2 Gigabyte/sec theoretical throughput maximum.

Video Card

I'm not covering this, because very few of the computers I build these days need more than the stuff built into the CPU to handle video. Which is getting surprisingly decent, at last.

You're a gamer, right? So you'll probably want to boot into Windows and try something like furmark. And you should test, because GPUs – especially gaming GPUs – are rather cutting edge bits of kit and burn through a lot of watts. Monitor temperatures and system heat, too.

If you have recommendations for gaming class video card stability testing, share them in the comments.

OK, Maybe It's Stable

This is the regimen I use on the machines I build and touch. And it's worked well for me. I've identified faulty CPUs (once), faulty RAM, faulty disks, and insufficient case airflow early on so that I could deal with them in the lab, before they became liabilities in the field. Doesn't mean they won't fail eventually, but I did all I could to make sure my babies computers can live long and prosper.

Who knows, with a bit of luck maybe you'll end up like the guy whose netware server had sixteen years of uptime before it was decommissioned.

These tests are just a starting point. What techniques do you use to ensure the computers you build are stable? How would you improve on these stability tests based on your real world experience?

[advertisement] At Stack Overflow, we help developers learn, share, and grow. Whether you’re looking for your next dream job or looking to build out your team, we've got your back.
Discussion

The Scooter Computer

When we initially deployed our handbuilt colocated servers for Discourse in 2013, I needed a way to provide an isolated VPN channel in for secure remote access and troubleshooting. Rather than dedicate a whole server to this task, I purchased the inexpensive, open source firmware friendly Asus RT-N16 router, flashed it with the popular TomatoUSB open source firmware, removed the antennas, turned off the WiFi and dropped it off in our colocated rack to let it act as a dedicated VPN access point.

Asus RT-N16

And that box – which was $100 then and around $70 now – worked well enough until now. Although the version of OpenSSL in the 2012 era Tomato firmware we used is not vulnerable to Heartbleed, it's still getting out of date in terms of the encryption it supports and allows. And Tomato itself is updated sporadically, chaotically at best.

Let's face it: this is just a little box that runs a chopped up version of Linux, with a bit of specialized wireless hardware and multiple antennas tacked on … that we're not even using. So when it came time to upgrade, we wondered:

Why not just go with a small box that can run a real, full Linux distro? Wouldn't that be simpler and easier to keep up to date?

After doing some research and asking on Twitter, I discovered there are a ton of amazing little Broadwell "mini-PC" boxes available on AliExpress.

The specs are kind of amazing for the price. I paid ~$350 each for the ones I selected:

  • i5-5200 Broadwell 2 core / 4 thread CPU at 2.2 Ghz - 2.7 Ghz
  • 16GB DDR3 RAM
  • 128GB M.2 SSD
  • Dual gigabit Realtek 8168 ethernet
  • front 4 USB 3.0 ports / rear 4 USB 2.0 ports
  • Dual HDMI out

(There's also optical and analog audio connectors on the front, as well as a SD card reader, which I covered with a sticker since we had no need for audio. I also stripped the WiFi out since we didn't need it, but it was included for the price, too.)

Selecting the i5-4258u, 4GB RAM, and 64GB SSD pushes the price down to $270. That's still a solid CPU, only a single generation behind Intel's latest and greatest Skylake, and carrying the midrange i5 moniker; it's no pushover. There are also many, many variants of this box from other AliExpress sellers that have slightly older, cheaper CPUs that are still plenty powerful. You can easily spec a box similar to this one for $200.

That's not a whole lot more than the $200 you'd pay for a high end router these days, and as Ars Technica notes, the average x86 box is radically faster.

Note that the above graphs, "homebrew" means an old, 1.8 Ghz Ivy Bridge dual core chip, 3 generations behind current CPUs, that doesn't even merit the i3 or i5 designation, and has no hyperthreading. Do bear that in mind as you keep reading.

Meet The Scooter Computer

This box may be small, and only 15 watt TDP, but it is mighty. I spun up a new Digital Ocean droplet and ran a quick benchmark:

sudo apt-get install sysbench
sysbench --test=cpu --cpu-max-prime=20000 run
Tie Shuttle 6
total time:           28.0707s
total num events:     10000
total time take:      28.0629
per-request stats:
     min:             2.77ms
     avg:             2.81ms
     max:             3.99ms
     ~95 percentile:  3.00ms
Digital Ocean Droplet
total time:          35.9541s
total num events:    10000
total time taken:    35.9492
per-request stats:
     min:             3.50ms
     avg:             3.59ms
     max:             13.31ms
     ~95 percentile:  3.79ms

Results will of course vary by cloud provider, but rest assured this box is just as fast as and possibly even faster than the average cloud box you could spin up right now. Of course it is "only" 2 cores / 4 threads, but the more cores you need, the slower they tend to go because of the overall TDP limits of the core package.

One thing that's not immediately obvious in photos is that this thing is indeed small but hefty, like holding a solid chunk of aluminum in your hand. That's because the box is passively cooled — the whole case is the heatsink, as the CPU on the bottom of the motherboard mates with the finned top of the case.

Opening this box you realize just how simple things are inside it; it's barely more than a highly integrated motherboard strapped to an aluminum block. This isn't a Steve Jobs truck, a Mac Mini car, or even a motorcycle. This is a scooter.

Scooters are very primitive machines; it is both their greatest strength and their greatest weakness. It's arguably the simplest personal wheeled vehicle there is. In these short distance scenarios, scooters tend to win over, say, bicycles because there's less setup and teardown necessary – you don't have to lock up a scooter, nor do you have to wear a helmet. Just hop on and go! You get almost all the benefits of gravity and wheeled efficiency with a minimum of fuss and maintenance. And yes, it's fun, too!

Passively cooled computers are paragons of simplicity and reliable consumer electronics, but passively cooling a "real" x86 PC is the holy grail. To get serious performance you usually need to feed the CPU at least 10 to 20 watts – and dissipating that kind of energy with zero fans and ambient airflow alone is not trivial. Let's see how our scooter does overnight running Mersenne Primes, which is the heaviest CPU load possible.

You can place your hand on the top of the box during this, but it's uncomfortable. And the whole box radiates heat, not just the top. Overall it was completely stable for me during overnight mprime torture testing with the 15w TDP CPU I chose, and I am comfortable with these boxes sitting in our rack in the datacenter, even under extended full load. However, I would be very careful putting a 28w TDP CPU in this box unless you are absolutely sure it won't be at full load very often. Passive cooling is hard.

Power consumption, as measured by my Kill-a-Watt, ranged from 7 watts at the Ubuntu Server 14.04 text login screen, to 8-10 watts at an idle Ubuntu 15.10 GUI login screen (the default OS it arrived with), to 14-18 watts in memory testing, to 26 watts in mprime.

I should also mention that even under extreme mprime load, both CPUs stayed at 2.5 Ghz indefinitely, which is unusual in my experience. To achieve 2.7 Ghz you need a single threaded load. Considering the base clock of the i5-5200u is 2.2 Ghz, that's quite good! Many 4-6-8 core CPUs drop all the way down to their base clock pretty fast once they have significant load, which makes the "turbo" moniker a bit of a lie.

(By the way, don't bother using burnP6, it generates way too little heat compared to mprime, which is an absolute monster. If your CPU can survive an overnight run of mprime, I can assure you it's ready for just about anything the real world can throw at it, ever.)

Disk

The machine has M.2 slots for two drives, as well as a SATA port and power cable (not pictured, but was included in the box) if you want to mate a 2.5" drive with the drive mounting holes on the bottom of the case. So if you prefer a mirrored two drive RAID array here for reliability, or a giant honking 2TB 2.5" HDD slapped in there for media storage, all of that is possible!

Be careful, as the internal M.2 slots are 2242, meaning 42mm length. There seem to be mostly (only?) lower cost SSD drives available in this size for whatever reason.

Don't worry, though, the bundled 128GB Phison S9 M.2 SSD has decent performance, roughly equal to a good SSD from a few years ago:

dd bs=1M count=512 if=/dev/zero of=test conv=fdatasync
hdparm -Tt /dev/sda

536870912 bytes (537 MB) copied, 1.52775 s, 351 MB/s
Timing cached reads:   11434 MB in  2.00 seconds = 5720.61 MB/sec
Timing buffered disk reads:  760 MB in  3.00 seconds = 253.09 MB/sec

That's respectable SSD performance and won't hold you back in most use cases, but it's not a barn-burning disk subsystem, either. If you want to upgrade, I recommend the Samsung 850 EVO which comes in the required form factor.

Of course the Samsung 850 Pro would fit fine as a traditional 2.5" SATA drive mounted to the case cover, and would perform like this:

536870912 bytes (537 MB) copied, 1.20895 s, 444 MB/s
Timing cached reads:   38608 MB in  2.00 seconds = 19330.61 MB/sec
Timing buffered disk reads: 1584 MB in  3.00 seconds = 527.92 MB/sec

RAM

Intel limits these Broadwell U class CPUs to 16GB RAM total, so maxing the box out is only going to set you back around $70. Still, that's a significant percentage of the ~$350 total cost, and you may not need that much RAM for what you have in mind.

However, do be careful that you get dual-channel RAM for lower RAM configurations; you don't want a single 4GB DIMM, you want two 2GB DIMMs. They ship from the vendor with a single DIMM, so beware. It may not matter depending on the task, as noted by AnandTech, but our boxes will be used for OpenSSL, and memory is cheap, so why not?

The Versatile Scooter

When I began looking at this, I was shocked to discover just how low-end the x86 CPUs are in a lot of "dedicated" devices, such as the official pfSense hardware:

Sure, 2.4 Ghz and 8 cores on that C2758 sounds reasonable – until you realize those are old Intel Bay Trail Atom cores. Even the current Cherry Trail Atom cores aren't so hot. Furthermore, those are probably the maximum "turbo" frequencies being quoted, which are unlikely to be sustained under any kind of real multi-core load. Also, did I mention this is being sold as a $1,400 device? Except for the lack of more than 2 dedicated gigabit ethernet ports, I'd put our scooter computer up against that C2758 any day of the week. And you know what? It'd win.

I think this logic applies to a lot of dedicated hardware these days — routers, switches, firewalls, and so on. You're often better off building up a modern high power, low TDP x86 box and slapping a regular Linux distro on there.

You can even kinda-sorta fit six of them in a 1U rack space.

(Well, except for the power bricks and cables. Vertical mounting on a 1U shelf works out a bit better, and each conveniently came with a stand for vertical operation.)

Now that I've worked with these boxes, I've become rather enamored of the Scooter Computer concept. Wherever we were thinking that we had to run either:

  • A virtual machine on big iron for some small but important utility function in our rack.

  • Dedicated, purpose built hardware for networking, firewall, or switching with a custom OS.

… we can now take advantage of cheap, reliable, flexible, totally solid state commodity x86 hardware that's spread across many machines and running standard Linux distributions, like all the rest of our 1U servers.

[advertisement] At Stack Overflow, we put developers first. We already help you find answers to your tough coding questions; now let us help you find your next job.
Discussion

Zopfli Optimization: Literally Free Bandwidth

In 2007 I wrote about using PNGout to produce amazingly small PNG images. I still refer to this topic frequently, as seven years later, the average PNG I encounter on the Internet is very unlikely to be optimized.

For example, consider this recent Perry Bible Fellowship cartoon.

Saved directly from the PBF website, this comic is a 800 × 1412, 32-bit color PNG image of 671,012 bytes. Let's save it in a few different formats to get an idea of how much space this image could take up:

BMP24-bit3,388,854
BMP8-bit1,130,678
GIF8-bit, no dither147,290
GIF8-bit, max dither283,162
PNG32-bit671,012

PNG is a win because like GIF, it has built-in compression, but unlike GIF, you aren't limited to cruddy 8-bit, 256 color images. Now what happens when we apply PNGout to this image?

Default PNG671,012
PNGout623,8597%

Take any random PNG of unknown provenance, apply PNGout, and you're likely to see around a 10% file size savings, possibly a lot more. Remember, this is lossless compression. The output is identical. It's a smaller file to send over the wire, and the smaller the file, the faster the decompression. This is free bandwidth, people! It doesn't get much better than this!

Except when it does.

In 2013 Google introduced a new, fully backwards compatible method of compression they call Zopfli.

The output generated by Zopfli is typically 3–8% smaller compared to zlib at maximum compression, and we believe that Zopfli represents the state of the art in Deflate-compatible compression. Zopfli is written in C for portability. It is a compression-only library; existing software can decompress the data. Zopfli is bit-stream compatible with compression used in gzip, Zip, PNG, HTTP requests, and others.

I apologize for being super late to this party, but let's test this bold claim. What happens to our PBF comic?

Default PNG671,012
PNGout623,8597%
ZopfliPNG585,11713%

Looking good. But that's just one image. We're big fans of Emoji at Discourse, let's try it on the original first release of the Emoji One emoji set – that's a complete set of 842 64×64 PNG files in 32-bit color:

Default PNG2,328,243
PNGout1,969,97315%
ZopfliPNG1,698,32227%

Wow. Sign me up for some of that.

In my testing, Zopfli reliably produces 3 to 8 percent smaller PNG images than even the mighty PNGout, which is an incredible feat. Furthermore, any standard gzip compressed resource can benefit from Zopfli's improved deflate, such as jQuery:

Or the standard compression corpus tests:

gzip -­9kzipZopfli
Alexa­ 10k128mb125mb124mb
Calgary1017kb979kb975kb
Canterbury731kb674kb670kb
enwik836mb35mb35mb

(Oddly enough, I had not heard of kzip – turns out that's our old friend Ken Silverman popping up again, probably using the same compression bag of tricks from his PNGout utility.)

But there is a catch, because there's always a catch – it's also 80 times slower. No, that's not a typo. Yes, you read that right.

gzip -­95.6s
7­zip ­mm=Deflate ­mx=9128s
kzip336s
Zopfli454s

Gzip compression is faster than it looks in the above comparsion, because level 9 is a bit slow for what it does:

TimeSize
gzip -111.5s40.6%
gzip -212.0s39.9%
gzip -313.7s39.3%
gzip -415.1s38.2%
gzip -518.4s37.5%
gzip -624.5s37.2%
gzip -729.4s37.1%
gzip -845.5s37.1%
gzip -966.9s37.0%

You decide if that whopping 0.1% compression ratio difference between gzip -7and gzip -9 is worth the doubling in CPU time. In related news, this is why pretty much every compression tool's so-called "Ultra" compression level or mode is generally a bad idea. You fall off an algorithmic cliff pretty fast, so stick with the middle or the optimal part of the curve, which tends to be the default compression level. They do pick those defaults for a reason.

PNGout was not exactly fast to begin with, so imagining something that's 80 times slower (at best!) to compress an image or a file is definite cause for concern. You may not notice on small images, but try running either on a larger PNG and it's basically time to go get a sandwich. Or if you have a multi-core CPU, 4 to 16 sandwiches. This is why applying Zopfli to user-uploaded images might not be the greatest idea, because the first server to try Zopfli-ing a 10k × 10k PNG image is in for a hell of a surprise.

However, remember that decompression is still the same speed, and totally safe. This means you probably only want to use Zopfli on pre-compiled resources, which are designed to be compressed once and downloaded millions of times – rather than a bunch of PNG images your users uploaded which may only be viewed a few hundred or thousand times at best, regardless of how optimized the images happen to be.

For example, at Discourse we have a default avatar renderer which produces nice looking PNG avatars for users based on the first letter of their username, plus a color scheme selected via the hash of their username. Oh yes, and the very nice Roboto open source font from Google.

We spent a lot of time optimizing the output avatar images, because these avatars can be served millions of times, and pre-rendering the whole lot of them, given the constraints of …

  • 10 numbers
  • 26 letters
  • ~250 color schemes
  • ~5 sizes

… isn't unreasonable at around 45,000 unique files. We also have a centralized https CDN we set up to to serve avatars (if desired) across all Discourse instances, to further reduce load and increase cache hits.

Because these images stick to shades of one color, I reduced the color palette to 8-bit (actually 128 colors) to save space, and of course we run PNGout on the resulting files. They're about as tiny as you can get. When I ran Zopfli on the above avatars, I was super excited to see my expected 3 to 8 percent free file size reduction and after the console commands ran, I saw that saved … 1 byte, 5 bytes, and 2 bytes respectively. Cue sad trombone.

(Yes, it is technically possible to produce strange "lossy" PNG images, but I think that's counter to the spirit of PNG which is designed for lossless images. If you want lossy images, go with JPG or another lossy format.)

The great thing about Zopfli is that, assuming you are OK with the extreme up front CPU demands, it is a "set it and forget it" optimization step that can apply anywhere and will never hurt you. Well, other than possibly burning a lot of spare CPU cycles.

If you work on a project that serves compressed assets, take a close look at Zopfli. It's not a silver bullet – as with all advice, run the tests on your files and see – but it's about as close as it gets to literally free bandwidth in our line of work.

[advertisement] Find a better job the Stack Overflow way - what you need when you need it, no spam, and no scams.
Discussion