Coding Horror

programming and human factors

Lazyweb Calling

It's hard to pin down the exact etymology of the word Lazyweb, but it seems to have one primary meaning:

  1. Asking a question of an internet audience in the hopes that they will be able to find a solution that you were too lazy or inexperienced to find yourself.

The word is attributed to old-school internet ultrageek jwz, which I could totally believe. Or it might have been designer Matt Jones. It can also have a more specific meaning in the context of writing code, as Clay Shirky notes.

  1. Waiting for someone else on the internet to write/build/design what you were thinking of.
  2. Describing something in the hopes that someone else on the internet will write/build/design it.

Of course, this kind of feature daydreaming can have amusing results.

The common theme of all Lazyweb requests is a tacit acknowledgement that none of us is as dumb as all of us.

demotivational poster: Lazyweb. None of us is as dumb as all of us.

I don't mind Lazyweb requests, within reason. Contrary to popular belief, there is such a thing as a stupid question. It's asked by people who failed to do even the most basic research before asking. I'm not expecting everyone to read a 32-page document prior to asking any questions, but at least cover the basics before casually deciding to make your problem everyone's problem.

There's nothing wrong with harnessing the collective power of the internet to solve a problem you tried-- and failed-- to solve yourself. Assuming you get an answer, once web crawlers find and index your problem, that question and answer become part of the fabric of the web and can potentially help future travellers as well. Done right, Lazyweb is part of the net public good.

Discussion

Practicing the Fundamentals: The New Turing Omnibus

While researching Classic Computer Science Puzzles, our CEO Scott Stanfield turned me on to A.K. Dewdney's The New Turing Omnibus: 66 Excursions in Computer Science.

The New Turing Omnibus: 66 Excursions in Computer Science

This is an incredibly fun little book. Sure, it's got Towers of Hanoi, but it's also got so much more:

The book is designed to appeal both to the educated layperson and to the student of computer science. But how is that possible? The answer lies in the variety of treatments as well as topics. Some of the topics are inherently easy or I have been lucky enough to stumble upon just the right expository mechanisms. Some of the topics are inherently deep or complicated and there is no way around a certain rigor, including occasional mathematical symbolism.

For students of computer science, the 66 chapters that follow will give a sneak preview of the major ideas and techniques they will encounter in their undergraduate careers and some they may only encounter as graduate students. For professors of computer science, my colleagues, the 66 chapters will amount to a sneak review. Trying to remember how the Boyer-Moore string-matching algorithm went? It's right there in Chapter 61, Searching Strings. As for your lectures, if you like to deliver your own material this book may be what you've been looking for.

At one end of its spectrum of uses, The (New) Turing Omnibus may be ideal in bringing students from diverse backgrounds "up to speed." At the other end of the spectrum, you retain creative control but draw a few (or many) of your lectures from this book. Finally, for educated laypersons, the book provides a brief roadmap of computability.

I have no idea why I hadn't heard of this book, originally published in 1988 and updated with a second edition in 1993, until now. The New Turing Omnibus is probably the single closest published equivalent to what I do on this very blog. It's a grab-bag of computing topics. Each chapter is the equivalent of a short blog post examining a particular topic, peppered with tables, diagrams, and illustrations. And the topics aren't presented in any particular order. Browse and find something you like; discard the rest. Here's a short excerpt from Chapter 33, Shannon's Theory - The Elusive Codes:

The New Turing Omnibus, page 330   The New Turing Omnibus, page 331

A complete table of contents for all 66 chapters of The New Turing Omnibus is available at Everything2. I think there's a very high probability that if you enjoy reading this blog on a regular basis, you'll also enjoy this remarkable little book. As promised, it's a great way for professionals to keep practicing the fundamentals:

Bert Bates (my co-author) is a blackbelt level go player, one of the best amateur players in the state. But when a visiting expert-- four belt levels above Bert-- showed up at the local go tournament, Bert was surprised to see the guy reading a book on fundamental go problems that Bert had read much earlier in his learning. The expert said, "I must have read this at least a hundred times. My goal each time is to see how much more quickly I can solve all the problems in the book than I did the last time."

Some of the best athletes never forget the fundamentals-- whether it's Tiger Woods practicing the basics, or a pro basketball player working on free throws. A good musician might still practice arpeggios. A programmer might... I don't know, actually. What would be the fundamentals that a good programmer might forget? I'll have to think about that one.

But it's not just a book for programmers; it's also got a broad, down-to-earth appeal. It's an intriguing collection of thought puzzles for laypeople with at least a passing interest in the field of computer science.

If you'd like to see more, you can browse through a few pages of the book at Amazon. A few more pages are available in Google Books, but beware the randomly inserted "copyrighted image" placeholders that stand in for many of the illustrations and diagrams throughout the book.

Discussion

You're Probably Storing Passwords Incorrectly

The web is nothing if not a maze of user accounts and logins. Almost everywhere you go on the web requires yet another new set of credentials. Unified login seems to elude us at the moment, so the status quo is an explosion of usernames and passwords for every user. As a consequence of all this siloed user identity data, Facebook and most other web apps encourage us to give out our credentials like Halloween candy, as Dare Obasanjo notes:

On Facebook, there is an option to import contacts from Yahoo! Mail, Hotmail, AOL and Gmail which requires me to enter my username and password from these services into their site. Every time I login to Yahoo! Mail there is a notice that asks me to import my contacts from other email services which requires me to give them my credentials from these services as well.

This is a deplorable state of affairs. We're teaching users that their credentials are of little value and should be freely handed out to any passing website that catches their fancy. It's an incredibly dangerous habit to inculcate in users; it makes them far more vulnerable to phishing:

If users get comfortable with entering their credentials in all sorts of random places then it makes them more susceptible to phishing attacks. This is one of the reasons services like Meebo are worrying to me.

The last thing we should be doing is coming up with ways to make phishing more powerful. Phishing is a social engineering exploit so timeless, so effective, and so powerful, I call it the forever hack. Rainbow table and brute force attacks can be defeated through judicious use of technology. Phishing can't.

find your friends on facebook

Users collect usernames and passwords like they do Pokemon. It's a sorry state of affairs, but for better or worse, that's the way it is. We, as software developers, are trusted with storing all these usernames and passwords in some sort of database. The minute we store a user's password, we've taken on the responsibility of securing their password, too. Let's say a hacker somehow obtains a list of all our usernames and passwords. Either it was an inside job by someone who had access to the database, or the database was accidentally exposed to the public web. Doesn't matter how. It just happened.

Even if hackers have our username and password table, we're covered, right? All they'll see is the hash values. Only the most grossly incompetent of developers would actually store passwords as plaintext in the database, right? Right?

Wrong.

Recently, the folks behind Reddit.com confessed that a backup copy of their database had been stolen. Later, spez, one of the Reddit developers, confirmed that the database contained password information for Reddit's users, and that the information was stored as plain, unprotected text. In other words, once the thief had the database, he had everyone's passwords as well.

I'm only a mouth-breathing Windows developer, not one of the elite Y-combinating Lisp, oops, Python developers working on Reddit. And even I know better than that.

You might think it's relatively unimportant if someone's forum password is exposed as plain text. After all, what's an attacker going to do with crappy forum credentials? Post angry messages on the user's behalf? But most users tend to re-use the same passwords, probably because they can't remember the two dozen unique usernames and passwords they're forced to have. So if you obtain their forum password, it's likely you also have the password to something a lot more dangerous: their online banking and PayPal.

My point here is not to drag the good names of the developers at Reddit through the mud. We're all guilty. I'm sure every developer reading this has stored passwords as plain text at some point in their career. I know I have. Forget the blame. The important thing is to teach our peers that storing plaintext passwords in the database is strictly forbidden-- that there's a better way, starting with basic hashes.

Hashing the passwords prevents plaintext exposure, but it also means you'll be vulnerable to the astonishingly effective rainbow table attack I documented last week. Hashes alone are better than plain text, but barely. It's not enough to thwart a determined attacker. Fortunately, the kryptonite for rainbow table attacks is simple enough-- add a salt value to the hashes to make them unique. I provided an example of a salted hash in my original post:

hash = md5('deliciously-salty-' + password)

But IANAC -- I Am Not A Cryptographer. I meant this only as an example, not as production code that you should copy and paste into that hugely popular enterprise banking solution you're working on. In fact, I ripped it almost directly from Rich Skrenta's excellent post on using MD5 hashes as utility functions.

Thomas Ptacek, on the other hand, is a cryptographer, and he has a bone to pick with my choice of salting techniques.

What have we learned? We learned that if it's 1975, you can set the ARPANet on fire with rainbow table attacks. If it's 2007, and rainbow table attacks set you on fire, we learned that you should go back to 1975 and wait 30 years before trying to design a password hashing scheme.*

We learned that if we had learned anything from this blog post, we should be consulting our friends and neighbors in the security field for help with our password schemes, because nobody is going to find the game-over bugs in our MD5 schemes until after my Mom's credit card number is being traded out of a curbside stall in Tallinn, Estonia.

We learned that in a password hashing scheme, speed is the enemy. We learned that MD5 was designed for speed. So, we learned that MD5 is the enemy. Also Jeff Atwood and Richard Skrenta.

Finally, we learned that if we want to store passwords securely we have three reasonable options: PHK's MD5 scheme, Provos-Maziere's Bcrypt scheme, and SRP. We learned that the correct choice is Bcrypt.

* editor's note: Not quite true. Most "modern" software does not use a modern password scheme. Windows XP/2000/NT, phpBB, the majority of custom password schemes; all vulnerable to rainbow table precomputation attacks.

In summary, if we're storing passwords, we're probably storing those passwords incorrectly. If it isn't obvious by now, cryptography is hard, and the odds of us getting it right on our own are basically nil. That's why we should rely on existing frameworks and the advice of experts like Thomas. What higher praise is there than praise from your sworn enemy?

Let's recap:

  1. Do not invent your own "clever" password storage scheme. I know, you're smart, and you grok this crypto stuff. But through this door lies madness-- and abominations like LMHash that have ongoing, worldwide security ramifications we're still dealing with today. Take advantage of whatever password storage tools your framework provides, as they're likely to be a heck of a lot better tested and more battle-proven than any crazy scheme you and your team can come up with on your own. Security vulnerabilities, unlike functionality bugs in your application, run deep and silent. They can lie dormant for years.
  2. Never store passwords as plaintext. This feels like security 101 and is completely obvious in retrospect. But not everyone knows what you know -- just ask Reddit. Store the hashes, never the actual passwords. Educate your fellow developers.
  3. Add a long, unique random salt to each password you store. The point of a salt (or nonce, if you prefer) is to make each password unique and long enough that brute force attacks are a waste of time. So, the user's password, instead of being stored as the hash of "myspace1", ends up being stored as the hash of 128 random Unicode characters plus "myspace1". You're now completely immune to rainbow table attacks.
  4. Use a cryptographically secure hash. I think Thomas hates MD5 so very much it makes him seem a little crazier than he actually is. But he's right. MD5 is vulnerable. Why pick anything remotely vulnerable, when you don't have to? SHA-2 or Bcrypt would be a better choice.
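To make the recap concrete, here's a minimal sketch of points 2 through 4 in Python. Bcrypt, the recommended choice above, isn't in the standard library, so this sketch substitutes the standard library's PBKDF2 (also a deliberately slow, salted construction). Treat it as an illustration, not production code:

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # deliberately slow: speed is the enemy

def hash_password(password, salt=None):
    """Hash a password with a long, unique, random salt."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac(
        'sha256', password.encode('utf-8'), salt, ITERATIONS)
    return salt, digest  # store both; never store the plaintext

def verify_password(password, salt, digest):
    """Recompute the hash with the stored salt and compare."""
    candidate = hashlib.pbkdf2_hmac(
        'sha256', password.encode('utf-8'), salt, ITERATIONS)
    # constant-time comparison avoids leaking timing information
    return hmac.compare_digest(candidate, digest)
```

Because every password gets its own random salt, two users who both pick "myspace1" end up with completely different hashes, and no precomputed rainbow table can cover them.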

Of course, none of this guarantees you'll be able to prevent someone from deducing that Joe User's Myspace account password is "myspace1".

But when they do, at least it won't be your fault.

Discussion

Classic Computer Science Puzzles

Software developers do have a proclivity for puzzles. Perhaps that's why books like To Mock a Mockingbird exist. It's a collection of logic puzzles which is considered an introduction to lambda calculus, one of the core concepts of Lisp.

To Mock a Mockingbird

Such puzzle questions are de rigueur for many programming interviews, though they're often abused. There is a downside to thinking of programming languages as solutions to arbitrarily difficult abstract mathematical puzzles. That's probably why Lisp has a rich reputation for being powerful but simultaneously dense and impenetrable.

I prefer to think of programming languages as utilitarian tools for real world problems. They let me accomplish pragmatic (and often prosaic) goals. PHP is about as unsexy a language as you'll ever find, but does that matter when it's the technology driving the current Boardwalk and Park Place of the web world? I'm not a fan of puzzle questions in interviews; I'd rather have potential developers give me a presentation or write a reasonably useful program in the real development environment they'll be using on the job. Solve all the puzzles you want, but the only one we're getting paid to solve is the customer's problem.

That said, many fundamental computer science concepts can be summarized well in puzzle form, which aids tremendously in teaching and learning these key concepts. Here's a quick list of the classic computer science puzzles that I remember from my university days:

Dining Philosophers
Concurrency and Deadlocks
dining philosophers problem
Five philosophers sit around a circular table. In front of each philosopher is a large plate of rice. The philosophers alternate their time between eating and thinking. There is one chopstick between each philosopher, to their immediate right and left. In order to eat, a given philosopher needs to use both chopsticks. How can you ensure all the philosophers can eat reliably without starving to death?
Travelling Salesman
P=NP
travelling salesman problem
A salesperson has a route of cities that make up his or her beat. What's the most efficient sales route that visits each city exactly once, and then returns to the home city?
Eight Queens
Algorithm Design
eight-queens.png
Given eight queens on a standard 8 x 8 chessboard, how many unique positions-- exclusive of rotations and mirror images-- can those eight queens occupy without attacking each other?
Two Generals
Communication Protocols
two generals problem
Two armies, each led by a general, are preparing to attack a city. The armies are encamped outside the city on two mountains separated by a large valley. In order to capture the city, the generals must attack at exactly the same time. The only way for the generals to communicate is by sending messengers through the valley. Unfortunately, the valley is occupied by the city's defenders, so there's a chance any given messenger will be captured. Each general has no way of knowing if their messenger arrived. How do the generals coordinate their attack?
Towers of Hanoi
Recursion
towers-of-hanoi.png
You have a stack of discs, from largest to smallest, that slide onto the first peg of a three-peg board. Your goal is to move the entire stack of discs from the first peg to the third peg. However, you can only move the topmost disc of any peg, and smaller discs must always be placed on larger discs. How many moves will it take? 
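The Towers of Hanoi puzzle above has a famously compact recursive answer: moving n discs takes 2^n - 1 moves. A minimal sketch in Python:

```python
def hanoi(n, source, spare, target, moves):
    """Move n discs from source to target, recording each move."""
    if n == 0:
        return
    hanoi(n - 1, source, target, spare, moves)  # park n-1 discs on the spare peg
    moves.append((source, target))              # move the largest disc directly
    hanoi(n - 1, spare, source, target, moves)  # stack the n-1 discs back on top

moves = []
hanoi(3, 'A', 'B', 'C', moves)
print(len(moves))  # 7, i.e. 2**3 - 1
```

Each level of recursion doubles the work plus one move, which is where the 2^n - 1 total comes from.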

I consider this the "greatest hits" of classic computer science puzzles. But I'm sure I've forgotten a few. Are there any other puzzles I've missed that express fundamental computer science concepts, the type that would be taught in a typical undergraduate computer science course?
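As a worked example of the algorithm-design flavor of these puzzles, here's a small backtracking sketch for Eight Queens. It counts all 92 placements of eight non-attacking queens; filtering out rotations and mirror images leaves the 12 unique positions the puzzle asks for:

```python
def count_solutions(n=8):
    """Count placements of n non-attacking queens via backtracking."""
    solutions = 0

    def place(row, cols, diag1, diag2):
        nonlocal solutions
        if row == n:
            solutions += 1
            return
        for col in range(n):
            # skip squares attacked by a queen in an earlier row
            if col in cols or (row - col) in diag1 or (row + col) in diag2:
                continue
            place(row + 1, cols | {col},
                  diag1 | {row - col}, diag2 | {row + col})

    place(0, frozenset(), frozenset(), frozenset())
    return solutions

print(count_solutions(8))  # 92 placements; 12 remain after removing symmetries
```

The trick that makes this fast is never examining a board with an attack on it: each queen is placed one row at a time, and the three sets prune any column or diagonal already in use.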

Discussion

Gigabyte: Decimal vs. Binary

Everyone who has ever purchased a hard drive finds out the hard way that there are two ways to define a gigabyte.  

When you buy a "500 Gigabyte" hard drive, the vendor defines it using the decimal powers of ten definition of the "Giga" prefix.

500 × 10^9 bytes = 500,000,000,000 bytes = 500 Gigabytes

But the operating system determines the size of the drive using the computer's binary powers of two definition of the "Giga" prefix:

465 × 2^30 bytes = 499,289,948,160 bytes = 465 Gigabytes

If you're wondering where 35 Gigabytes of your 500 Gigabyte drive just disappeared to, you're not alone. It's an old trick perpetuated by hard drive makers-- they intentionally use the official SI definitions of the Giga prefix so they can inflate the sizes of their hard drives, at least on paper. This was always an annoyance, but now it's much more difficult to ignore, as it results in large discrepancies with today's enormous hard drives. When is a Terabyte hard drive not a Terabyte? When it's 931 GB.
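The arithmetic behind the vanishing gigabytes is easy to verify for yourself. A quick sketch in Python:

```python
# Vendor math vs. operating system math for a "500 Gigabyte" drive.
GB_DECIMAL = 10**9   # the SI "giga": what the box says
GB_BINARY = 2**30    # the binary "giga" (a gibibyte): what the OS reports

advertised = 500 * GB_DECIMAL        # 500,000,000,000 bytes
reported = advertised / GB_BINARY    # the same bytes, in binary units
print(round(reported, 1))            # 465.7 -- the "missing" 35 GB

# The gap widens with size: a decimal terabyte in binary gigabytes.
terabyte = 10**12
print(round(terabyte / GB_BINARY))   # 931
```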

As Ned Batchelder notes, the hard drive manufacturers are technically conforming to the letter of the SI prefix definitions. It's us computer science types who are abusing the official prefix designations:

Prefix   Symbol   Year Approved   Official Definition   Informal Meaning   Difference   Derived From
giga     GB       1960            10^9                  2^30               7%           Greek root for giant
tera     TB       1960            10^12                 2^40               10%          Greek root for monster
peta     PB       1975            10^15                 2^50               13%          Greek root for five, "penta"
exa      EB       1975            10^18                 2^60               15%          Greek root for six, "hexa"
zetta    ZB       1991            10^21                 2^70               18%          Latin root for seven, "septem", first letter changed to Z to avoid confusion with other SI symbols
yotta    YB       1991            10^24                 2^80               21%          Greek root for eight, "octo", c dropped, y added to avoid a symbol resembling the zero-like letter O

As the size of the prefix grows, so does the gap between the official and informal meaning of the prefix. And yes, there are larger official SI prefixes beyond these, just in case someone needs more than 1000 yottabytes. Ned noted that one of the SI proposals is for the prefix "luma", representing 10^63.

Speaking of impossibly large numbers, if you're like most people reading this article, then you probably arrived here through Google. Google is a tragically but forever misspelled version of Googol:

A googol is 10^100, i.e. a 1 followed by 100 zeros. In official SI prefix terms, a googol is approximately a yotta squared, squared. Even larger is the googolplex, which is equal to 10 to the power of a googol (10^googol); this number is about the same size as the number of possible games of chess. Even larger numbers have been defined, such as Skewes' number, Graham's number, and the Moser, which I won't even try to describe.

But I digress. When we use gigabyte to mean 2^30 bytes, that's an inaccurate and informal usage. Instead, we're supposed to be using the more accurate and disambiguated IEC prefixes. They were introduced in 1998 and formalized with IEEE 1541 in 2000.

kibibyte   KiB   2^10
mebibyte   MiB   2^20
gibibyte   GiB   2^30
tebibyte   TiB   2^40
pebibyte   PiB   2^50
exbibyte   EiB   2^60
zebibyte   ZiB   2^70
yobibyte   YiB   2^80
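If you do want the unambiguous IEC units in your own software, the conversion is a simple loop over the prefix table above. A small sketch in Python:

```python
def format_iec(num_bytes):
    """Render a byte count using IEC binary prefixes (KiB, MiB, GiB, ...)."""
    units = ['B', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']
    value = float(num_bytes)
    for unit in units:
        # stop when the value fits under 1024, or we run out of prefixes
        if value < 1024 or unit == units[-1]:
            return f'{value:.2f} {unit}'
        value /= 1024

print(format_iec(500 * 10**9))  # 465.66 GiB -- the "500 GB" drive, honestly labeled
```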

You occasionally see these more correct prefixes used in software, but adoption has been slow at best. There are several problems:

  1. They sound ridiculous. I hear the metric system used more often in the United States than I hear the words "kibibyte" or "mebibyte" uttered by anyone with a straight face. Which is to say, never.

  2. Hard drive manufacturers won't use them. Drive manufacturers don't care about being correct. What they do care about is consumers buying their drives because they have the largest possible number plastered on the front of the box. If a big lawsuit wasn't enough to get them to mend their ways, I seriously doubt that the recommendation of an international standards body is going to sway them.

  3. Tradition rules. It's hard to give up on the rich binary history of kilobytes, megabytes, and gigabytes, particularly when the alternatives are so questionable.

It's good to keep in mind the discrepancy between the decimal and binary meanings of the SI prefixes. The difference can bite you if you're not careful. But I think we're stuck with contextual, dual-use meanings of the SI prefixes for the foreseeable future. Or perhaps we're all overthinking this, as Alan Green notes:

Whenever I try to discuss [this] with my friends, they say, "Yotta getta life".

Discussion