Coding Horror

programming and human factors

Mixing Oil and Water: Authorship in a Wiki World

When you visit Wikipedia's entry on asphalt, you get some reasonably reliable information about asphalt. What you don't get, however, is any indication of who the author is. That's because the author is irrelevant. Wikipedia is a community effort, the result of tiny slices of effort contributed by millions of people around the world. The focus is on the value of the aggregated information, not who the individual authors are.

But who is that community? According to Jimmy Wales, most of the work on Wikipedia is done by a tightly knit Gang of 500:

Wales decided to run a simple study to find out: he counted who made the most edits to the site. “I expected to find something like an 80-20 rule: 80% of the work being done by 20% of the users, just because that seems to come up a lot. But it’s actually much, much tighter than that: it turns out over 50% of all the edits are done by just .7% of the users … 524 people. … And in fact the most active 2%, which is 1400 people, have done 73.4% of all the edits.” The remaining 25% of edits, he said, were from “people who [are] contributing … a minor change of a fact or a minor spelling fix … or something like that.” 

Stack Overflow has some wiki-like aspects, and even my limited experience with the genre tells me this claim is implausible. Aaron Swartz ran his own study and came to a very different conclusion:

I wrote a little program to go through each edit and count how much of it remained in the latest version. Instead of counting edits, as Wales did, I counted the number of letters a user actually contributed to the present article.

If you just count edits, it appears the biggest contributors to the Alan Alda article (7 of the top 10) are registered users who (all but 2) have made thousands of edits to the site. Indeed, #4 has made over 7,000 edits while #7 has over 25,000. In other words, if you use Wales’s methods, you get Wales’s results: most of the content seems to be written by heavy editors.

But when you count letters, the picture dramatically changes: few of the contributors (2 out of the top 10) are even registered and most (6 out of the top 10) have made less than 25 edits to the entire site. In fact, #9 has made exactly one edit — this one! With the more reasonable metric — indeed, the one Wales himself said he planned to use in the next revision of his study — the result completely reverses.

Insiders account for the vast majority of the edits. But it's the outsiders who provide nearly all of the content.

Satisfying the needs of these two radically different audiences – the insiders and the outsiders – is the art of wiki design. That's why, on Stack Overflow, we mix oil and water:

  1. There's a strong sense of authorship, with a reputation system and a signature block attached to every post, like traditional blogs and forums.
  2. Once the system learns to trust you, you can edit anything – and we sometimes switch into a mode where authorship is de-emphasized to focus on the resulting content, like a wiki.

I'm not sure mixing these opposing elements would work for a project on the scale of Wikipedia. But I think it works for us (and when I say us, I mean programmers) because it's analogous to the version control system baked into the DNA of every programmer. Communal ownership is all well and good, but sometimes you still need to know Who Wrote This Crap. Authorship matters, ownership matters – and yet there's still something bigger, a larger goal we're all working toward, that trumps any individual contribution we might make. Both elements are in play.

Still, we absorbed a lot of tension with this design choice, because authorship and wiki are fundamentally opposing goals. How do you balance self-interest (vote for me) with selflessness (vote for this content)? Sometimes it breaks down. There's a rough area around the edges where these two systems meet. For example, consider the Stack Overflow question titled Significant new inventions in computing since 1980.

Stack Overflow post from Alan Kay

If you knew this question was from Turing Award-winning computer scientist Alan Kay, would it change the way you reacted to it? Of course it would!

But you'd never know that, because our wiki signature block only tells you:

  1. The last editor (Out Into Space)
  2. How many revisions there have been to this question so far (5)
  3. How many users have created those revisions (4)

It's a lot of information, by typical wiki standards. Who cares who wrote the question, as long as it's a good question, right?

But that doesn't entirely work; we also need to know who the primary author is, because that information will color and influence our responses to the question. I'll grant you this is an extreme example; no disrespect to my fellow programmers, but you haven't won a Turing Award. Even in more typical cases, attaching authorship matters. It lets us know who we're talking to, what their background is, what their skills are, and so forth. Furthermore, how can you possibly form a community when everyone is a random, anonymous contributor?

So the challenge, then, is tracking authorship – strictly for informational purposes – across a series of edit revisions. Jimbo erred in tracking only edit counts. Aaron used Python's difflib.SequenceMatcher.find_longest_match to establish ownership across revisions. This is the basic technique visualized in IBM's History Flow.
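Aaron's letter-counting metric is straightforward to approximate with Python's difflib, the very module he used. Here's a minimal sketch (the function name and the exact counting rule are mine, not Aaron's actual script): it measures how many characters of a given revision survive verbatim into the latest version.

```python
from difflib import SequenceMatcher

def surviving_chars(revision, latest):
    # Find the matching blocks between the two texts and credit the
    # revision with every character that survives unchanged.
    sm = SequenceMatcher(None, revision, latest, autojunk=False)
    return sum(block.size for block in sm.get_matching_blocks())

# "abc" and "def" survive; the inserted "xyz" belongs to a later editor
print(surviving_chars("abcdef", "abcxyzdef"))  # 6
```

Run something like this over every revision, attribute the surviving characters to each revision's author, and you get a letters-contributed ranking rather than an edit-count ranking.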

Imagine a scenario where three people will make contributions to a Wiki page at different points in time. Each person edits the page and then saves their changes to what becomes the latest version of that page.

History Flow animation

History Flow connects text that has been kept the same between consecutive versions. Pieces of text that do not have correspondence in the next (or previous) version are not connected and the user sees a resulting "gap" in the visualization; this happens for deletions and insertions.

It's very cool when applied to larger inputs; see history flow visualization of the Wikipedia entry on evolution.

Now, the differencing of text is, in itself, not exactly a trivial problem. I started by examining the Levenshtein distance, but this algorithm is truly brute force. See if you can tell why, in this visualization of the Levenshtein distance between "puzzle" and "pzzel":

levenshtein distance example: puzzle and pzzel

The Levenshtein distance is a measure of how many insertions, deletions, or substitutions are required to transform string A into string B. The larger the number, the more different the strings are. We're comparing two strings essentially letter-by-letter, which means the typical cost is O(mn), where m and n are the lengths of the two strings we're comparing. That's why you typically see Levenshtein used for comparing words, nothing on the order of paragraphs or pages.
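The table in that visualization is just the classic dynamic programming recurrence filled in cell by cell. A quick Python transcription of the textbook algorithm (not any production code of ours):

```python
def levenshtein(a, b):
    # prev[j] holds the edit distance between the processed prefix of a
    # and the first j characters of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j - 1] + cost,  # substitution
                           prev[j] + 1,        # deletion
                           cur[j - 1] + 1))    # insertion
        prev = cur
    return prev[-1]

print(levenshtein("puzzle", "pzzel"))  # 3: delete "u", delete "l", insert "l"
```

Note the two nested loops: that's the O(mn) cost in action, and why it falls over on page-sized inputs.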

I played around with Levenshtein for a while, but even optimized implementations are brutally slow as the size of the input increases. I quickly realized that a line-based comparison was the only workable one. We used this C# implementation of An O(ND) Difference Algorithm and its Variations (pdf).

What I ended up implementing was nowhere near as thorough as IBM's history flow, although it's probably similar to the rough metrics Aaron used. I simply sum the total size of all line contributions (insertions or deletions) from any given author in a revision, with a small bonus multiplier of 2x for the original author. We report the highest percentage of authorship in the final revision.
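The heuristic boils down to something like the following sketch (Python, with a naive set-based line diff standing in for the O(ND) algorithm; the names and details are illustrative, not our production code):

```python
from collections import defaultdict

def primary_author(revisions):
    """revisions: (author, text) pairs, oldest first. Returns the
    author with the largest share of line-level contributions."""
    credit = defaultdict(int)
    prev = []
    for index, (author, text) in enumerate(revisions):
        lines = text.splitlines()
        prev_set, cur_set = set(prev), set(lines)
        # lines inserted or deleted in this revision
        changed = [l for l in lines if l not in prev_set]
        changed += [l for l in prev if l not in cur_set]
        weight = 2 if index == 0 else 1  # bonus for the original author
        credit[author] += weight * sum(len(l) for l in changed)
        prev = lines
    total = sum(credit.values()) or 1
    top = max(credit, key=credit.get)
    return top, credit[top] / total
```

Feed it the revision history and you get back the name to display in the wiki signature block, plus that author's percentage of the final content.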

Alan Kay stackoverflow post wiki signature

The line-based diff approach for determining authorship is far from perfect; it'd be more accurate if it were per-word or per-sentence. But it's a fairly good approximation in my testing.

And most importantly, wiki posts by Alan Kay look like they're from Alan Kay.

Discussion

Have Keyboard, Will Program

My beloved Microsoft Natural Keyboard 4000 has succumbed to the relentless pounding of my fingers.

A moment of silence, please.

OK, it still works, technically, but certain keys have become.. unreliable. In particular, the semicolon key is now infuriatingly difficult to use. I don't know if this is God's way of punishing lapsed Visual Basic programmers, or what, but it's incredibly annoying. Yes, I've tried cleaning it repeatedly with compressed air (although I didn't get to the dishwasher quite yet), but no dice. I blame Kernighan, Ritchie, and Anders, in that order. Also, Canada.

Or maybe my keyboard is just worn out. It is three years old. Some of the home row keys and the arrows are worn to a shiny blankness. Perhaps it's time to reinvest in my keyboard.

And why not? As a corollary to We Are Typists First, Programmers Second, a quality keyboard is one of the best (and cheapest) investments you can make in your career. So what makes a good programming keyboard? Well, I can point to a few things that make for a very bad one:

1. Thou Shalt Not Mangle The Home Key Cluster

keyboard with mangled page-up and page-down key cluster

2. Thou Shalt Not Use a Non-Standard Arrow Key Cluster

keyboard with non-standard arrow key cluster

3. Thou Shalt Not Remap the Function Keys

keyboard with remapped function keys and f-lock

These areas are sacrosanct for programmers. Unlike the average home or office user, we depend on our function keys, the home key cluster, and the arrow keys. We use the crap out of these keys. Move those around and you might as well cut our fingers off while you're at it.

I think all programmers can agree on these three. Beyond that, it rapidly becomes a matter of personal preference. Do you like your keyboards ...

  • Ergonomic or standard?
  • Clicky or quiet?
  • Low-profile or normal?
  • Minimalistic or extra function keys?
  • With backlights and LEDs or plain?

There are many small subtleties to key position and size that could also heavily influence your choice. Pick whatever keyboard you like, as long as it's of reasonable quality, and you're comfortable typing on it for long periods. That's the important thing. With that in mind, I'll survey a few popular programming keyboard choices.

I mentioned my beloved Microsoft Natural Keyboard 4000, which is pretty much the holy grail of keyboards to me.

MS Natural Ergonomic 4000

Some people don't care for the non-split spacebar, and the way the keys have a fair bit of resistance -- but that's never bothered me. If you're into the whole ergonomic split layout thing, as I obviously am, it's difficult to go wrong with the Natural 4000. That's why I'm replacing my old keyboard with the very same model. If you hate wires, the wireless equivalent is available -- but only with the Microsoft Natural Ergonomic Desktop 7000 bundle.

If you're into classic keyboards, the DAS Keyboard Professional is another popular choice. Here it is next to the classic IBM Model M, the granddaddy of all PC keyboards.

IBM Model M next to the Das Keyboard

These are both buckling spring keyboards, part of a long line of venerable keyboard designs going back to 1980. Dan waxes poetic:

These mainstream 'boards, all with one or another variant of the simple and quiet rubber dome switch idea, are perfectly OK for people who don't type much. They may drop dead with or without the assistance of a spilled beverage, but that's no big deal; if your computer's essential to your happiness, buy a spare cheap keyboard in case your main cheap keyboard dies, and use your nasty mushy input devices with my blessing.

If you do type a lot, though, you owe it to yourself to get a good keyboard of one kind or another, for the same reason that people who use the mouse a lot shouldn't settle for some ancient crusty serial-port optomechanical artifact.

Old mouses aren't nice to use, but old keyboards can be, because mouse technology's advanced a lot over the last 20 years, but keyswitch technology was quite mature in 1980. Modern keyboard tech advances have mainly had to do with wireless interfaces, snazzy looks, and making cheap crud cheaper.

The Das got a very favorable review at Tech Report. And it also comes in a super-hardcore blank keycaps edition, if you really want to prove to yourself (and your coworkers) that you can actually touch type. It is a bit spendy, though, particularly when excellent Model M clones can be had for fifty bucks less.

If you're more into laptop-style ultra low profile keyboards, you might prefer the Apple Keyboard.

Apple wired keyboard

Haven't tried this one myself, but I've heard good things; the layout seems solid and the quality superb, as you would expect from Apple.

I read recommendations for each of these keyboards almost daily. But of course I'm only touching the tip of the iceberg in this post. There are at least a dozen other popular contenders, along with a seemingly neverending parade of oddities and curiosities. Such as the Space Cadet keyboard.

Whatever your choice, give your keyboard the consideration it deserves; it is the one essential tool of our craft.

Discussion

The Sad Tragedy of Micro-Optimization Theater

I'll just come right out and say it: I love strings. As far as I'm concerned, there isn't a problem that I can't solve with a string and perhaps a regular expression or two. But maybe that's just my lack of math skills talking.

In all seriousness, though, the type of programming we do on Stack Overflow is intimately tied to strings. We're constantly building them, merging them, processing them, or dumping them out to a HTTP stream. Sometimes I even give them relaxing massages. Now, if you've worked with strings at all, you know that this is code you desperately want to avoid writing:

static string Shlemiel()
{
    string result = "";
    for (int i = 0; i < 314159; i++)
    {
        result += getStringData(i);
    }
    return result;
}

In most garbage collected languages, strings are immutable: when you add two strings, the contents of both are copied. As you keep adding to result in this loop, more and more memory is allocated each time. This leads directly to awful quadratic O(n²) performance, or as Joel likes to call it, Shlemiel the painter performance.

Who is Shlemiel? He's the guy in this joke:

Shlemiel gets a job as a street painter, painting the dotted lines down the middle of the road. On the first day he takes a can of paint out to the road and finishes 300 yards of the road. "That's pretty good!" says his boss, "you're a fast worker!" and pays him a kopeck.

The next day Shlemiel only gets 150 yards done. "Well, that's not nearly as good as yesterday, but you're still a fast worker. 150 yards is respectable," and pays him a kopeck.

The next day Shlemiel paints 30 yards of the road. "Only 30!" shouts his boss. "That's unacceptable! On the first day you did ten times that much work! What's going on?"

"I can't help it," says Shlemiel. "Every day I get farther and farther away from the paint can!"

This is a softball question. You all knew that. Every decent programmer knows that string concatenation, while fine in small doses, is deadly poison in loops.
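To make the contrast concrete, here's the Shlemiel loop and its standard fix, sketched in Python (a caveat: modern CPython special-cases += on strings, so you may need large inputs to observe the quadratic blowup there):

```python
def shlemiel(n):
    result = ""
    for i in range(n):
        result += str(i)  # naively, copies the whole string every pass: O(n^2)
    return result

def happy_painter(n):
    parts = []
    for i in range(n):
        parts.append(str(i))  # amortized O(1) append
    return "".join(parts)     # one final O(n) copy

assert shlemiel(10) == happy_painter(10)
```

Same output, but the painter on the right carries his paint can with him.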

But what if you're doing nothing but small bits of string concatenation, dozens to hundreds of times -- as in most web apps? Then you might develop a nagging doubt, as I did, that lots of little Shlemiels could possibly be as bad as one giant Shlemiel.

Let's say we wanted to build this HTML fragment:

<div class="user-action-time">stuff</div>
<div class="user-gravatar32">stuff</div>
<div class="user-details">stuff<br/>stuff</div>

Which might appear on a given Stack Overflow page anywhere from one to sixty times. And we're serving up hundreds of thousands of these pages per day.

Not so clear-cut, now, is it?

So, which of these methods of forming the above string do you think is fastest over a hundred thousand iterations?

1: Simple Concatenation

string s =
@"<div class=""user-action-time"">" + st() + st() + @"</div>
<div class=""user-gravatar32"">" + st() + @"</div>
<div class=""user-details"">" + st() + "<br/>" + st() + "</div>";
return s;

2: String.Format

string s =
@"<div class=""user-action-time"">{0}{1}</div>
<div class=""user-gravatar32"">{2}</div>
<div class=""user-details"">{3}<br/>{4}</div>";
return String.Format(s, st(), st(), st(), st(), st());

3: string.Concat

string s =
string.Concat(@"<div class=""user-action-time"">", st(), st(),
@"</div><div class=""user-gravatar32"">", st(),
@"</div><div class=""user-details"">", st(), "<br/>",
st(), "</div>");
return s;

4: String.Replace

string s =
@"<div class=""user-action-time"">{s1}{s2}</div>
<div class=""user-gravatar32"">{s3}</div>
<div class=""user-details"">{s4}<br/>{s5}</div>";
s = s.Replace("{s1}", st()).Replace("{s2}", st()).
Replace("{s3}", st()).Replace("{s4}", st()).
Replace("{s5}", st());
return s;

5: StringBuilder

var sb = new StringBuilder(256);
sb.Append(@"<div class=""user-action-time"">");
sb.Append(st());
sb.Append(st());
sb.Append(@"</div><div class=""user-gravatar32"">");
sb.Append(st());
sb.Append(@"</div><div class=""user-details"">");
sb.Append(st());
sb.Append("<br/>");
sb.Append(st());
sb.Append("</div>");
return sb.ToString();

Take your itchy little trigger finger off that compile key and think about this for a minute. Which one of these methods will be faster?

Got an answer? Great!

And.. drumroll please.. the correct answer:

It. Just. Doesn't. Matter!

We already know none of these operations will be performed in a loop, so we can rule out brutally poor performance characteristics of naive string concatenation. All that's left is micro-optimization, and the minute you begin worrying about tiny little optimizations, you've already gone down the wrong path.

Oh, you don't believe me? Sadly, I didn't believe it myself, which is why I got drawn into this in the first place. Here are my results -- for 100,000 iterations, on a dual core 3.5 GHz Core 2 Duo.

  1: Simple Concatenation    606 ms
  2: String.Format           665 ms
  3: string.Concat           587 ms
  4: String.Replace          979 ms
  5: StringBuilder           588 ms

Even if we went from the worst performing technique to the best one, we would have saved a lousy 392 milliseconds over a hundred thousand iterations. Not the sort of thing that I'd throw a victory party over. I guess I figured out that using .Replace is best avoided, but even that has some readability benefits that might outweigh the minuscule cost.
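If you want to reproduce the experiment without a C# compiler, here's an analogous harness in Python; absolute numbers will differ by machine and runtime, which is rather the point:

```python
import timeit

def st():
    return "stuff"  # stand-in for the real per-page data

def concat():
    return ('<div class="user-action-time">' + st() + st() + '</div>'
            '<div class="user-gravatar32">' + st() + '</div>'
            '<div class="user-details">' + st() + '<br/>' + st() + '</div>')

def fmt():
    return ('<div class="user-action-time">{0}{1}</div>'
            '<div class="user-gravatar32">{2}</div>'
            '<div class="user-details">{3}<br/>{4}</div>').format(
                st(), st(), st(), st(), st())

def joined():
    return ''.join(['<div class="user-action-time">', st(), st(),
                    '</div><div class="user-gravatar32">', st(),
                    '</div><div class="user-details">', st(), '<br/>',
                    st(), '</div>'])

for f in (concat, fmt, joined):
    print(f.__name__, timeit.timeit(f, number=100_000))
```

All three produce the identical string; on my runs the timings land within spitting distance of each other, just as the C# versions did.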

Now, you might very well ask which of these techniques has the lowest memory usage, as Rico Mariani did. I didn't get a chance to run these against CLRProfiler to see if there was a clear winner in that regard. It's a valid point, but I doubt the results would change much. In my experience, techniques that abuse memory also tend to take a lot of clock time. Memory allocations are fast on modern PCs, but they're far from free.

Opinions vary on just how many strings you have to concatenate before you should start worrying about performance. The general consensus is around 10. But you'll also read crazy stuff, like this:

Don't use += concatenating ever. Too many changes are taking place behind the scene, which aren't obvious from my code in the first place. I advise you to use String.Concat() explicitly with any overload (2 strings, 3 strings, string array). This will clearly show what your code does without any surprises, while allowing yourself to keep a check on the efficiency.

Never? Ever? Never ever ever? Not even once? Not even if it doesn't matter? Any time you see "don't ever do X", alarm bells should be going off. Like they hopefully are right now.

Yes, you should avoid the obvious beginner mistakes of string concatenation, the stuff every programmer learns their first year on the job. But after that, you should be more worried about the maintainability and readability of your code than its performance. And that is perhaps the most tragic thing about letting yourself get sucked into micro-optimization theater -- it distracts you from your real goal: writing better code.

Discussion

The Ultimate Dogfooding Story

In software circles, dogfooding refers to the practice of using your own products. It was apparently popularized by Microsoft:

The idea originated in television commercials for Alpo brand dog food; actor Lorne Greene would tout the benefits of the dog food, and then would say it's so good that he feeds it to his own dogs. In 1988, Microsoft manager Paul Maritz sent Brian Valentine, test manager for Microsoft LAN Manager, an email titled "Eating our own Dogfood" challenging him to increase internal usage of the product.

Buried deep in Eric Sink's post Yours, Mine and Ours is perhaps the ultimate example of the power of dogfooding.

The primary machine tool in any well-equipped woodshop is a table saw. Basically, it's a polished cast iron table with a slot through which protrudes a circular saw blade, ten inches in diameter. Wood is cut by sliding it across the table into the spinning blade.

A table saw is an extremely dangerous tool. My saw can cut a 2-inch thick piece of hard maple with no effort at all. Frankly, it's a tool which should only be used by someone who is a little bit afraid of it. It should be obvious what would happen if a finger ever came in contact with the spinning blade. Over 3,000 people each year lose a finger in an accident with a table saw.

A guy named Stephen Gass has come up with an amazing solution to this problem. He is a woodworker, but he also has a PhD in physics. His technology is called Sawstop. It consists of two basic inventions:

  • He has a sensor which can detect the difference in capacitance between a finger and a piece of wood.
  • He has a way to stop a spinning table saw blade within 1/100 of a second, less than a quarter turn of rotation.

The videos of this product are amazing. Slide a piece of wood into the spinning blade, and it cuts the board just like it should. Slide a hot dog into the spinning blade, and it stops instantly, leaving the frankfurter with nothing more than a nick.

Here's the spooky part: Stephen Gass tested his product on his own finger! This is a guy who really wanted to close the distance between him and his customers. No matter how much I believed in my product, I think I would find it incredibly difficult to stick my finger in a spinning table saw blade.

The creator actually did stick his own finger in a SawStop on camera, apparently on the Discovery Channel show Time Warp – and now thanks to eagle-eyed reader Andy Bassit, here it is! The action starts at around 4 minutes in.

There's also a video of the SawStop in action on YouTube, using a hot dog in place of an errant digit. Personally, I find this demonstration no less effective than an actual finger.

Does it work? Yes, but it still has unavoidable limitations based on the laws of physics:

The bottom line is that this saw cuts you about 1/16" for every foot per second that you're moving. If you hit the blade while feeding the wood you're likely to get cut about 1/16" or less. If you hit the blade while you're falling you'll likely get a 3/16" deep cut instead of multiple finger amputation. If you hit it while pitching a baseball for the major leagues the injury will be even worse.

Dogfooding your own code isn't always possible, but it's worth looking very closely at any ways you can use your own software internally. As Mr. Gass proves, nothing exudes confidence like software developers willing to stick their own extremities into the spinning blades of software they've written.

Update: I found this quote from Havoc Pennington rather illustrative.

It would be wonderful discipline for any software dev team serious about Linux 'on the desktop' (whatever that means) to ban their own use of terminals. Of course, none of us have ever done this, and that explains a lot about the resulting products.

Discussion

A Scripter at Heart

What's the difference between a programming language and a scripting language? Is there even a difference at all? Larry Wall's epic Programming is Hard, Let's Go Scripting attempts to survey the scripting landscape and identify commonalities.

When you go out to so-called primitive tribes and analyze their languages, you find that structurally they're just about as complex as any other human language. Basically, you can say pretty much anything in any human language, if you work at it long enough. Human languages are Turing complete, as it were.

Human languages therefore differ not so much in what you can say but in what you must say. In English, you are forced to differentiate singular from plural. In Japanese, you don't have to distinguish singular from plural, but you do have to pick a specific level of politeness, taking into account not only your degree of respect for the person you're talking to, but also your degree of respect for the person or thing you're talking about.

So languages differ in what you're forced to say. Obviously, if your language forces you to say something, you can't be concise in that particular dimension using your language. Which brings us back to scripting.

How many ways are there for different scripting languages to be concise?

How many recipes for borscht are there in Russia?

Larry highlights the following axes of language design in his survey:

  1. Binding: Early or Late?
  2. Dispatch: Single or Multiple?
  3. Evaluation: Eager or Lazy?
  4. Typology: Eager or Lazy?
  5. Structures: Limited or Rich?
  6. Symbolic or Wordy?
  7. Compile Time or Run Time?
  8. Declarational or Operational?
  9. Classes: Immutable or Mutable?
  10. Class-based or Prototype-based?
  11. Passive data, global consistency or Active data, local consistency?
  12. Encapsulation: by class? by time? by OS constructs? by GUI elements?
  13. Scoping: Syntactic, Semantic, or Pragmatic?

It's difficult to talk about Larry Wall without pointing out that Perl 6 has been missing in action for a very long time. In this 2002 Slashdot interview with Larry, he talks about Perl 6 casually, like it's just around the corner. Sadly, it has yet to be released. That's not quite Duke Nukem Forever vaporware territory, but it's darn close.

While interesting, I have to admit that I have a problem with all this pontificating about the nature of scripting languages, and the endlessly delayed release of Perl 6. Aren't Mr. Wall's actions, on some level, contrary to the spirit of the very thing he's discussing? The essence of a scripting language is immediate gratification. They're Show, Don't Tell in action.

In fact, my first programming experiences didn't begin with a compile and link cycle. They began something like this:

basic on the Apple // series

basic on the Atari 8-bit series

basic on the Commodore 64

As soon as you booted the computer, the first thing you were greeted with was that pesky blinking cursor. It's right there, inviting you.

C'mon. Type something. See what happens.

That's the ineffable, undeniable beauty of a scripting language. You don't need to read a giant Larry Wall article, or wait 8 years for Perl 6 to figure that out. It's right there in front of you. Literally. Try entering this in your browser's address bar:

javascript:alert('hello world');

But it's not real programming, right?

My first experience with real programming was in high school. Armed with a purchased copy of the classic K&R book and a pirated C compiler for my Amiga 1000, I knew it was finally time to put my childish AmigaBASIC programs aside.

The C Programming Language

I remember that evening only vaguely (in my defense: I am old). My mom was throwing some kind of party downstairs, and one of the guests tried to draw me out of my room and be social. She was a very nice lady, with the best of intentions. I brandished my K&R book as a shield, holding it up and explaining to her: "No. You don't understand. This is important. I need to learn what's in this book." Tonight, I become a real programmer. And so I began.

What happened next was the eight unhappiest hours of my computing life. Between the painfully slow compile cycles and the torturous, unforgiving dance of pointers and memory allocation, I was almost ready to give up programming altogether. C wasn't for me, certainly. But I couldn't shake the nagging feeling that there was something altogether wrong with this type of programming. How could C suck all the carefree joy out of my stupid little AmigaBASIC adventures? This language took what I had known as programming and contorted it beyond recognition, into something stark and cruel.

I didn't know it then, but I sure do now. I hadn't been programming at all. I had been scripting.

I don't think my revulsion for C is something I need to apologize for. In fact, I think it's the other way around. I've just been waiting for the rest of the world to catch up to what I always knew.

The reason why dynamic languages like Perl, Python, and PHP are so important is key to understanding the paradigm shift. Unlike applications from the previous paradigm, web applications are not released in one to three year cycles. They are updated every day, sometimes every hour. Rather than being finished paintings, they are sketches, continually being redrawn in response to new data.

In my talk, I compared web applications to Von Kempelen's famous hoax, the mechanical Turk, a 1770 mechanical chess playing machine with a man hidden inside. Web applications aren't a hoax, but like the mechanical Turk, they do have a programmer inside. And that programmer is sketching away madly.

Now, I do appreciate and admire the seminal influence of C. In the right hands, it's an incredibly powerful tool. Every language has its place, and every programmer should choose the language that best fits their skillset and the task at hand.

I know, I know, I'll never be a real programmer. But I've come to terms with my limitations, because I'm a scripter at heart.

Discussion