Coding Horror

programming and human factors

The Bathroom Wall of Code

In Why Isn't My Encryption.. Encrypting?, many were up in arms about the flawed function I posted. And rightfully so, as there was a huge mistake in that code that just about invalidates any so-called "encryption" it performs. But there's one small problem: I didn't write that function.

Now, I am certainly responsible for that function, in the sense that it magically appeared in our codebase one day -- and the entire project is the sum of all the code contributed by every programmer working on it. I invoke the First Rule of Programming: It's Always Your Fault. And by "your", I don't mean the particular programmer who contributed this code, who shall remain blissfully nameless. I mean us -- the entire team. The onus is on us, as a team, to vet every line of code at the time it is contributed, and constantly peer review each other's code. It's a responsibility we shoulder together. Nobody owns the code, because everybody owns the code.

Yes, I failed. Because the team failed.

Geoff Weinhold left this prophetic comment on the post:

The irony in this is that someone will inevitably end up here for sample encryption code and blindly copy/paste your flawed code.

Indeed. Heaven forbid someone copy and paste flawed code from the internet into their project! In fact, a quick search on some of the unique strings in that original Encrypt() function turns up a few ... interesting ... search results.

  • 01/2006: C# Shiznit - Library Encrypt and Decrypt Methods using TripleDES and MD5
  • 05/2006: Code Project - Encrypt and Decrypt Data with C#
  • 04/2007: Bytes - String Encryption Help
  • 06/2008: Egghead Cafe - invalid length while decrypting TripleDESCryptoServiceProvider
  • 09/2008: ASP.NET Forums - Need help on password-encrypted key used for signing
  • 11/2008: code:keep Encryption
  • 12/2008: Encrypt/Decrypt the password in C# .net
  • 05/2009: My Own Stupid Blog Post

That's just a sampling of the 131 web hits I got. To paraphrase Austin Powers, this Encrypt() function is like the village bicycle: everybody's had a ride. It's a shame this particular bicycle happens to have a crippling lack of brakes that makes it dangerous to ride, but what can you do.

Scott Hanselman coined a nice phrase for this: the internet as the bathroom wall of code.

bathroom wall graffiti

It's true. People, being people, have gone and scrawled a bunch of random code graffiti all over the damn internet. Some of it is vanity tagging. Some of it is borderline vandalism. And some of it is art. How do we tell the difference?

That's the very reason I put forth a modest proposal for the copy and paste school of code reuse. Not that it would have helped in this case, but it sure would be nice if someone could perform a sed-style search and replace ...

s/Mode = CipherMode.ECB/Mode = CipherMode.CBC/g

... on, like, the entire internet. So other projects don't absorb this critically flawed code sample.

In the meantime, until that tool is developed, I recommend that you apply extra-strength peer review to any code snippets you absorb into your project from the bathroom wall of code. That internet code snippet, the one that appears to be exactly what you need, could also be random graffiti scrawled on a bathroom wall.

It's true that some bathrooms are nicer than others. But as we've learned, it pays to be especially careful when cribbing code from the internet.


Why Isn't My Encryption.. Encrypting?

It's as true in life as it is in client-server programming: the only secret that can't be compromised is the one you never revealed.

But sometimes, it's unavoidable. If you must send a secret down to the client, you can encrypt it. The most common form of encryption is symmetric encryption, where the same key is used to both encrypt and decrypt. Most languages have relatively easy-to-use libraries for symmetric encryption. Here's how we were doing it in .NET:

public static string Encrypt(string toEncrypt, string key, bool useHashing)
{
    byte[] keyArray = UTF8Encoding.UTF8.GetBytes(key);
    byte[] toEncryptArray = UTF8Encoding.UTF8.GetBytes(toEncrypt);
    if (useHashing)
        keyArray = new MD5CryptoServiceProvider().ComputeHash(keyArray);
    var tdes = new TripleDESCryptoServiceProvider()
        { Key = keyArray, Mode = CipherMode.ECB, Padding = PaddingMode.PKCS7 };
    ICryptoTransform cTransform = tdes.CreateEncryptor();
    byte[] resultArray = cTransform.TransformFinalBlock(
        toEncryptArray, 0, toEncryptArray.Length);
    return Convert.ToBase64String(resultArray, 0, resultArray.Length);
}

This is how our symmetric encryption function works:

  1. We start with a secret string we want to protect. Let's say it is "password123".
  2. We pick a key. Let's use the key "key-m4st3r"
  3. Before encrypting, we'll prefix our secret with a salt to prevent dictionary attacks. Let's call our salt "NaCl".

We'd call the function like so:

Encrypt("NaCl" + "password123", "key-m4ast3r", true);

The output is a base64 encoded string of the TripleDES encrypted byte data. This encrypted data can now be sent to the client without any reasonable risk that the secret string will be revealed. There's always unreasonable risk, of the silent black government helicopter sort, but for all practical purposes there's no way someone could discover that your password is "password123" unless your key is revealed.
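On the receiving end, decryption is just the mirror image: the same key, the same cipher settings, and the reverse transform. Here's a sketch of what a matching Decrypt() looks like -- reconstructed to mirror the Encrypt() above, so treat it as illustrative rather than the literal code that was floating around:

// A sketch of the matching Decrypt(). Assumes the same namespaces as the
// Encrypt() above (System, System.Text, System.Security.Cryptography) and the
// same cipher settings, so whatever Encrypt() produces, this reverses.
public static string Decrypt(string toDecrypt, string key, bool useHashing)
{
    byte[] keyArray = UTF8Encoding.UTF8.GetBytes(key);
    byte[] toDecryptArray = Convert.FromBase64String(toDecrypt);
    if (useHashing)
        keyArray = new MD5CryptoServiceProvider().ComputeHash(keyArray);
    var tdes = new TripleDESCryptoServiceProvider()
        { Key = keyArray, Mode = CipherMode.ECB, Padding = PaddingMode.PKCS7 };
    ICryptoTransform cTransform = tdes.CreateDecryptor();
    byte[] resultArray = cTransform.TransformFinalBlock(
        toDecryptArray, 0, toDecryptArray.Length);
    return UTF8Encoding.UTF8.GetString(resultArray);
}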

In our case, we were using this Encrypt() method to experiment with storing some state data in web pages related to the login process. We thought it was secure, because the data was encrypted. Sure it's encrypted! It says Encrypt() right there in the method name, right?

Wrong.

There's a bug in that code. A bug that makes our encrypted state data vulnerable. Do you see it? My coding mistakes, let me show you them!

string key = "SuperSecretKey";
Debug.WriteLine(
Encrypt("try some different" +
"00000000000000000000000000000000",
key, true).Base64ToHex());
Debug.WriteLine(
Encrypt("salts" +
"00000000000000000000000000000000",
key, true).Base64ToHex());
3908024fc33b55c3
4e885c8946b80735
704cbe2a41d25f21
81bb6d726bd35152
81bb6d726bd35152
81bb6d726bd35152
1367f10f2584ace3
4ae7661295a98e46
81bb6d726bd35152
81bb6d726bd35152
81bb6d726bd35152
4ee5d23b3b5e3eb4

(I'm using strings with multiples of 8 here to make the Base64 conversions easier.)

Do you see the mistake now? It's a short trip from here to unlimited data tampering, particularly since the state data from the login process contained user entered strings. An attacker could simply submit the form as many times as she likes, chop out the encrypted attack values from the middle, and insert them into the next encrypted request -- which will happily decrypt and be processed as if our code had sent it down!

The culprit is this line of code:

{ Key = keyArray, Mode = CipherMode.ECB, Padding = PaddingMode.PKCS7 }

Which, much to our embarrassment, is an incredibly stupid parameter to use in symmetric encryption:

The Electronic Codebook (ECB) mode encrypts each block individually. This means that any blocks of plain text that are identical and are in the same message, or in a different message encrypted with the same key, will be transformed into identical cipher text blocks. If the plain text to be encrypted contains substantial repetition, it is feasible for the cipher text to be broken one block at a time. Also, it is possible for an active adversary to substitute and exchange individual blocks without detection.
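That last sentence is exactly the problem here. As a sketch of how little work "substitute and exchange individual blocks" actually takes -- the strings are purely illustrative, and Decrypt() is the sketch from earlier -- splice one 8-byte ciphertext block from one message into another message encrypted with the same key:

string key = "SuperSecretKey";

// Two messages encrypted with the same key. Under ECB, each 8-byte plaintext
// block maps to one 8-byte ciphertext block, independent of every other block.
byte[] a = Convert.FromBase64String(Encrypt("AAAAAAAAattack!!", key, true));
byte[] b = Convert.FromBase64String(Encrypt("AAAAAAAAharmless", key, true));

// Paste the second block of message 'a' over the second block of message 'b'.
Buffer.BlockCopy(a, 8, b, 8, 8);

// The spliced ciphertext decrypts cleanly as "AAAAAAAAattack!!" --
// no error, no padding failure, no sign of tampering.
Console.WriteLine(Decrypt(Convert.ToBase64String(b), key, true));

With a chaining mode, that pasted block would decrypt to garbage instead of the attacker's chosen text.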

It's fairly standard for symmetric encryption to feed the previous ciphertext block into the encryption of the next block -- that's what cipher block chaining (CBC) does. I honestly did not realize that it was possible to pick a cipher mode that did not do some kind of block chaining! CipherMode.ECB? More like CipherMode.Fail!
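For the record, here's roughly what the fix looks like in practice. This is a minimal sketch, not production-grade crypto: it switches to CipherMode.CBC and generates a fresh random initialization vector (IV) for every message, prepending the IV to the ciphertext so the other side can decrypt. The method name and structure are illustrative.

// Sketch only -- same namespaces as the Encrypt() above. CBC feeds each
// ciphertext block into the encryption of the next one, so identical plaintext
// blocks no longer produce identical ciphertext blocks, and a block pasted in
// from another message no longer decrypts to the attacker's chosen text.
public static string EncryptCbc(string toEncrypt, string key, bool useHashing)
{
    byte[] keyArray = UTF8Encoding.UTF8.GetBytes(key);
    byte[] toEncryptArray = UTF8Encoding.UTF8.GetBytes(toEncrypt);
    if (useHashing)
        keyArray = new MD5CryptoServiceProvider().ComputeHash(keyArray);

    using (var tdes = new TripleDESCryptoServiceProvider()
        { Key = keyArray, Mode = CipherMode.CBC, Padding = PaddingMode.PKCS7 })
    {
        tdes.GenerateIV(); // a fresh, random IV for every message; the IV is not secret

        byte[] cipher = tdes.CreateEncryptor()
            .TransformFinalBlock(toEncryptArray, 0, toEncryptArray.Length);

        // Prepend the IV so the decrypting side knows where to start the chain.
        byte[] result = new byte[tdes.IV.Length + cipher.Length];
        Buffer.BlockCopy(tdes.IV, 0, result, 0, tdes.IV.Length);
        Buffer.BlockCopy(cipher, 0, result, tdes.IV.Length, cipher.Length);
        return Convert.ToBase64String(result);
    }
}

Even with CBC, this sketch still derives its key by MD5-hashing a passphrase and does nothing to authenticate the ciphertext, so an attacker can still flip bits blindly; a real fix would also authenticate the message, with an HMAC for instance.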

So, what have we learned?

  1. If it doesn't have to be sent to the client, then don't! Secrets sent to the client can potentially be tampered with and compromised in various ways that aren't easy to see or even predict. In our case, we can store login state on the server and avoid transmitting any of that state to the client altogether.

  2. It isn't encryption until you've taken the time to fully understand the concepts behind the encryption code. Specifically, we didn't notice that our encryption function was using a highly questionable CipherMode that allowed block level substitution of the encrypted data.

Luckily, this was a somewhat experimental page on the site, so we were able to revert back to our standard server-side approach rather quickly once the exploit was discovered. I'm no Bruce Schneier, but I have a reasonable grasp of encryption concepts. And I still completely missed this problem.

So the next time you sit down to write some encryption code, consider the above two points carefully. Otherwise, like us, you'll be left wondering why your encryption isn't... encrypting.

(Thanks to Daniel LeCheminant for his assistance in discovering this issue.)


Why Do Computers Suck at Math?

You've probably seen this old chestnut by now.

Google calculator showing an incorrect result

Insert your own joke here. Google can't be wrong -- math is! But Google is hardly alone; this is just another example in a long and storied history of obscure little computer math errors that go way back, such as this bug report from Windows 3.0.

  1. Start Calculator.
  2. Input the largest number to subtract first (for example, 12.52).
  3. Press the MINUS SIGN (-) key on the numeric keypad.
  4. Input the smaller number that is one unit lower in the decimal portion (for example, 12.51).
  5. Press the EQUAL SIGN (=) key on the numeric keypad.

On my virtual machine, 12.52 - 12.51 on Ye Olde Windows Calculator indeed results in 0.00.

Windows 3.11 calculator incorrect

And then there was the famous Excel bug.

If you have Excel 2007 installed, try this: Multiply 850 by 77.1 in Excel.

One way to do this is to type "=850*77.1" (without the quotes) into a cell. The correct answer is 65,535. However, Excel 2007 displays a result of 100,000.

At this point, you might be a little perplexed, as computers are supposed to be pretty good at this math stuff. What gives? How is it possible to produce such blatantly incorrect results from seemingly trivial calculations? Should we even be trusting our computers to do math at all?

Well, numbers are harder to represent on computers than you might think:

A standard floating point number has roughly 16 decimal places of precision and a maximum value on the order of 10^308, a 1 followed by 308 zeros. (According to IEEE standard 754, the typical floating point implementation.)

Sixteen decimal places is a lot. Hardly any measured quantity is known to anywhere near that much precision. For example, the constant in Newton's Law of Gravity is only known to four significant figures. The charge of an electron is known to 11 significant figures, much more precision than Newton's gravitational constant, but still less than a floating point number. So when are 16 figures not enough? One problem area is subtraction. The other elementary operations -- addition, multiplication, division -- are very accurate. As long as you don't overflow or underflow, these operations often produce results that are correct to the last bit. But subtraction can be anywhere from exact to completely inaccurate. If two numbers agree to n figures, you can lose up to n figures of precision in their subtraction. This problem can show up unexpectedly in the middle of other calculations.
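You can watch these effects from any language that uses standard IEEE doubles, C# included. A quick sketch; the exact digits printed depend on your runtime's round-trip formatting, but the shape of the results doesn't:

// 0.1 and 0.2 have no exact binary representation, so their sum is not exactly 0.3.
Console.WriteLine(0.1 + 0.2 == 0.3);              // False
Console.WriteLine((0.1 + 0.2).ToString("R"));     // 0.30000000000000004

// The Excel example: the double closest to 77.1 is a hair low, so the product
// lands just below 65,535. (Excel's arithmetic was fine; its display of this
// particular value was not.)
Console.WriteLine((850 * 77.1).ToString("R"));    // 65534.999999999993

// The Windows calculator example, in raw double arithmetic:
Console.WriteLine((12.52 - 12.51).ToString("R")); // close to, but not exactly, 0.01

// Subtraction is where precision really evaporates: two numbers that agree in
// nearly every figure leave almost nothing behind when subtracted.
double a = 1.0000000000000002;  // differs from 1.0 only in the last bit
Console.WriteLine((a - 1.0).ToString("R"));       // ~2.22E-16: all the digits they agreed on are gone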

Number precision is a funny thing; did you know that an infinitely repeating sequence of 0.999.. is equal to one?

In mathematics, the repeating decimal 0.999… denotes a real number equal to one. In other words: the notations 0.999… and 1 actually represent the same real number.

0.999 infinitely repeating

This equality has long been accepted by professional mathematicians and taught in textbooks. Proofs have been formulated with varying degrees of mathematical rigour, taking into account preferred development of the real numbers, background assumptions, historical context, and target audience.

Computers are awesome, yes, but they aren't infinite.. yet. So any prospects of storing any infinitely repeating number on them are dim at best. The best we can do is work with approximations at varying levels of precision that are "good enough", where "good enough" depends on what you're doing, and how you're doing it. And it's complicated to get right.

Which brings me to What Every Computer Scientist Should Know About Floating-Point Arithmetic.

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation.

What do the Google, Windows, and Excel (pdf) math errors have in common? They're all related to number precision approximation issues. Google doesn't think it's important enough to fix. They're probably right. But some mathematical rounding errors can be a bit more serious.

Interestingly, the launch failure of the Ariane 5 rocket, which exploded 37 seconds after liftoff on June 4, 1996, occurred because of a software error that resulted from converting a 64-bit floating point number to a 16-bit integer. The value of the floating point number happened to be larger than could be represented by a 16-bit integer. The overflow wasn't handled properly, and in response, the computer cleared its memory. The memory dump was interpreted by the rocket as instructions to its rocket nozzles, and an explosion resulted.
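The same class of failure is easy to reproduce in any language. In C#, for example, a checked conversion from a 64-bit double to a 16-bit integer throws the moment the value doesn't fit, while an unchecked conversion silently produces an unspecified value. A toy sketch -- the variable name is made up, loosely echoing the Ariane report:

double horizontalBias = 65536.0;   // hypothetical sensor value, bigger than short.MaxValue (32767)
try
{
    short converted = checked((short)horizontalBias);
    Console.WriteLine(converted);
}
catch (OverflowException)
{
    Console.WriteLine("64-bit floating point value does not fit in a 16-bit integer");
}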

I'm starting to believe that it's not the computers that suck at math, but the people programming those computers. I know I'm living proof of that.


The Web Browser Address Bar is the New Command Line

Google's Chrome browser passes anything you type into the address bar that isn't an obvious URI on to the default search engine.

chrome address bar onebox

While web browsers should have some built-in smarts, they can never match the collective intelligence of a worldwide search engine. For example:

weather San Francisco
CSCO
time London
san francisco 49ers
5*9+(sqrt 10)^3=
Henry+Wadsworth+Longfellow
earthquake
10.5 cm in inches
population FL
Italian food 02138
movies 94705
homes Los Angeles
150 GBP in USD
Seattle map
Patent 5123123
650
american airlines 18
036000250015
JH4NA1157MT001832
510-525-xxxx (I'm hesitant to link a listed personal phone number here, but it does work)

I like to think of the web browser address bar as the new command line.

Oh, you wanted dozens of cryptic, obscure UNIX style command line operators and parameters? No problem!

define:defenestrate
presidents 1850...1860
"plants vs. zombies" daterange:2454955-2454955
link:experts-exchange.com sucks
filetype:pdf programming language poster
allintitle:nigerian site:www.snopes.com

Any command line worth its salt has some kind of scripting language built in, too, right? No sweat. Just try entering this in your browser's address bar.

javascript:alert('Hello, world!')

The sky's the limit from there; whatever JavaScript you can fit in the address bar is fair game. These are more commonly known as "bookmarklets".

Apparently we've spent the last 20 years reimplementing the UNIX command line in the browser. Services like yubnub make this process even more social, with collaborative group creation (and ranking!) of new commands. You can find some of the cooler ones on the golden eggs page.

gimf "carrot top"
esv Ezekiel 25:17
2g color colour

Honestly, I was never a big command-line enthusiast; even way back when on my Amiga I'd choose the GUI over the CLI whenever I could. But maybe I bet on the wrong horse. Perhaps the command prompt -- or more specifically, the search-oriented, crowdsourced, world public command prompt -- really is the future.


Pseudocode or Code?

Although I'm a huge fan of Code Complete -- it is my single most recommended programming book for good reason -- there are chapters in it that I haven't been able to digest, even after 16 years.

One of those chapters describes something called the Pseudocode Programming Process. And on paper, at least, it sounds quite sensible. Before writing a routine, you describe what that routine should do in plain English. So if we set out to write an error handling lookup routine, we'd first write it in pseudocode:

set the default status to "fail"
look up the message based on the error code
if the error code is valid
   if doing interactive processing, display the error message
      interactively and declare success
   if doing command line processing, log the error message to the
      command line and declare success
if the error code isn't valid, notify the user that an
   internal error has been detected
return status information

Then, when you're satisfied that you understand what the routine should do, you turn that pseudocode into comments that describe the code you're about to write.

// set the default status to "fail"
Status errorMessageStatus = Status_Failure;
// look up the message based on the error code
Message errorMessage = LookupErrorMessage( errorToReport );
// if the error code is valid
if ( errorMessage.ValidCode() ) {
    // determine the processing method
    ProcessingMethod errorProcessingMethod = CurrentProcessingMethod();
    // if doing interactive processing, display the error message
    // interactively and declare success
    if ( errorProcessingMethod == ProcessingMethod_Interactive ) {
        DisplayInteractiveMessage( errorMessage.Text() );
        errorMessageStatus = Status_Success;
    }

Pseudocode is sort of like the Tang of programming languages -- you hydrate the code around it.

Tang instant drink mix

But why pseudocode? Steve offers some rationales:

  • Pseudocode makes reviews easier. You can review detailed designs without examining source code. Pseudocode makes low-level design reviews easier and reduces the need to review the code itself.
  • Pseudocode supports the idea of iterative refinement. You start with a high-level design, refine the design to pseudocode, and then refine the pseudocode to source code. This successive refinement in small steps allows you to check your design as you drive it to lower levels of detail. The result is that you catch high-level errors at the highest level, mid-level errors at the middle level, and low-level errors at the lowest level -- before any of them becomes a problem or contaminates work at more detailed levels.
  • Pseudocode makes changes easier. A few lines of pseudocode are easier to change than a page of code. Would you rather change a line on a blueprint or rip out a wall and nail in the two-by-fours somewhere else? The effects aren't as physically dramatic in software, but the principle of changing the product when it's most malleable is the same. One of the keys to the success of a project is to catch errors at the "least-value stage," the stage at which the least effort has been invested. Much less has been invested at the pseudocode stage than after full coding, testing, and debugging, so it makes economic sense to catch the errors early.
  • Pseudocode minimizes commenting effort. In the typical coding scenario, you write the code and add comments afterward. In the PPP, the pseudocode statements become the comments, so it actually takes more work to remove the comments than to leave them in.
  • Pseudocode is easier to maintain than other forms of design documentation. With other approaches, design is separated from the code, and when one changes, the two fall out of agreement. With the PPP, the pseudocode statements become comments in the code. As long as the inline comments are maintained, the pseudocode's documentation of the design will be accurate.

All compelling arguments. As an acolyte of McConnell, it pains me to admit this, but every time I've tried the Pseudocode Programming Process, I've almost immediately abandoned it as impractical.

Why? Two reasons:

  1. code > pseudocode. I find it easier to think about code in code. While I'm all for describing the overall purpose of the routine in plain English before you write it -- this helps name it, which is incredibly difficult -- extending that inside the routine doesn't work well for me. There's something fundamentally.. unrealistic.. about attempting to use precise English to describe the nuts and bolts of code.
  2. Starting with the goal of adding comments to your code seems backwards. I prefer coding without comments, in that I want the code to be as self-explanatory as humanly possible. Don't get me wrong; comments do occur regularly in my code, but only because the code could not be made any clearer without them. Comments should be a method of last resort, not something you start with.

Of course, PPP is just one proposed way to code, not the perfect or ideal way. McConnell has no illusions about this, and acknowledges that refactoring, TDD, design by contract, and even plain old "hacking" are valid and alternative ways to construct code.

But still -- I have a hard time seeing pseudocode as useful in anything other than possibly job interviews. And even then, I'd prefer to sit down in front of a computer and write real code to solve whatever problem is being posed. What's your take? Is pseudocode a useful tool in your programming? Do you write pseudocode before writing code?
