What’s Worse Than Crashing?

Jeff Atwood

02 Aug 2007 — 2 min read — Comments

Here’s an interesting thought question from Mike Stall: what’s worse than crashing?

Mike provides the following list of crash scenarios, in order from best to worst:

Application works as expected and never crashes.
Application crashes due to rare bugs that nobody notices or cares about.
Application crashes due to a commonly encountered bug.
Application deadlocks and stops responding due to a common bug.
Application crashes long after the original bug.
Application causes data loss and/or corruption.

Mike points out that there’s a natural tension between...

failing immediately when your program encounters a problem, e.g. “fail fast”
attempting to recover from the failure state and proceed normally

The philosophy behind “fail fast” is best explained in Jim Shore’s article (pdf).

Some people recommend making your software robust by working around problems automatically. This results in the software “failing slowly.” The program continues working right after an error but fails in strange ways later on. A system that fails fast does exactly the opposite: when a problem occurs, it fails immediately and visibly. Failing fast is a nonintuitive technique: “failing immediately and visibly” sounds like it would make your software more fragile, but it actually makes it more robust. Bugs are easier to find and fix, so fewer go into production.

Fail fast is reasonable advice – if you’re a developer. What could possibly be easier than calling everything to a screeching halt the minute you get a byte of data you don’t like? Computers are spectacularly unforgiving, so it’s only natural for developers to reflect that masochism directly back on users.

But from the user’s perspective, failing fast isn’t helpful. To them, it’s just another meaningless error dialog preventing them from getting their work done. The best software never pesters users with meaningless, trivial errors – it’s more considerate than that. Unfortunately, attempting to help the user by fixing the error could make things worse by leading to subtle and catastrophic failures down the road. As you work your way down Mike’s list, the pain grows exponentially. For both developers and users. Troubleshooting #5 is a brutal death march, and by the time you get to #6 – you’ve lost or corrupted user data – you’ll be lucky to have any users left to fix bugs for.

What’s interesting to me is that despite causing more than my share of software crashes and hardware bluescreens, I’ve never lost data, or had my data corrupted. You’d figure Murphy’s Law would force the worst possible outcome at least once a year, but it’s exceedingly rare in my experience. Maybe this is an encouraging sign for the current state of software engineering. Or maybe I’ve just been lucky.

So what can we, as software developers, do about this? If we adopt a “fail as often and as obnoxiously as possible” strategy, we’ve clearly failed our users. But if we corrupt or lose our users’ data through misguided attempts to prevent error messages – if we fail to treat our users’ data as sacrosanct – we’ve also failed our users. You have to do both at once:

If you can safely fix the problem, you should. Take responsibility for your program. Don’t slouch through the easy way out by placing the burden for dealing with every problem squarely on your users.
If you can’t safely fix the problem, always err on the side of protecting the user’s data. Protecting the user’s data is a sacred trust. If you harm that basic contract of trust between the user and your program, you’re hurting not only your credibility – but the credibility of the entire software industry as a whole. Once they’ve been burned by data loss or corruption, users don’t soon forgive.

The guiding principle here, as always, should be to respect your users. Do the right thing.

Complaint-Driven Development

If I haven’t blogged much in the last year, it’s because we’ve been busy building that civilized discourse construction kit thing I talked about. (Yes, that’s actually the name of the company. This is what happens when you put me in charge of naming things. Pinball

The Rule of Three

Every programmer ever born thinks whatever idea just popped out of their head into their editor is the most generalized, most flexible, most one-size-fits all solution that has ever been conceived. We think we’ve built software that is a general purpose solution to some set of problems,

Today is Goof Off at Work Day

When you’re hired at Google, you only have to do the job you were hired for 80% of the time. The other 20% of the time, you can work on whatever you like – provided it advances Google in some way. At least, that’s the theory. Google’s 20

Coding Horror: The Book

If I had to make a list of the top 10 things I’ve done in my life that I regret, “writing a book” would definitely be on it. I took on the book project mostly because it was an opportunity to work with a few friends whose company I