Coding Horror

programming and human factors

How Not to Advertise on the Internet

Games that run in your web browser are all the rage, and understandably so. Why not build your game for the largest audience in the world, using freely available technology, and pay zero licensing fees? One such game is Evony, formerly known as Civony – a browser-based clone of the game Civilization with a buy-in mechanism.

There are also plentiful opportunities to 'pay money' now. In the end, Civony is still a business. And to be honest, it's probably better to give the option for some elite folks to finance the game for the masses than to make everyone pay a subscription or watch in-game ads. In addition to the old $0.30 per line world chat, you can spend money to speed up resource gathering, boost stats, and buy in-game artifacts. I'm sure there are other ways to pay money that I haven't discovered yet. But whenever you see a green plus-sign (+), you know the option exists to pay money for a perk.

The game is ostensibly free, but supported by a tiny fraction of players making cash payments for optional items (sometimes referred to as "freemium"). Thus, the player base needs to be quite large for the business of running the game to be sustainible, and the game's creators regularly purchase internet ad space to promote their game. The most interesting thing about Evony isn't the game, per se, but the game's advertising. Here's one of the early ads.

evony-ad-1.jpg

Totally reasonable advertisement. Gets the idea across that this is some sort of game set in medieval times, and emphasizes the free angle.

Apparently that ad didn't perform up to expectations at Evony world HQ, because the ads got progressively ... well, take a look for yourself. These are presented in chronological order of appearance on the internet.

evony-ad-2-alt

(if this lady looks familiar, there's a reason.)

evony-ad-3.jpg

evony-ad-4.jpg

evony-ad-5.jpg

evony-ad-6.jpg

To be clear, these are real ads that were served on the internet. This is not a parody. Just to prove it, here's a screenshot of the last ad in context at The Elder Scrolls Nexus.

evony-ad-in-context.jpg

I've talked about advertising responsibly in the past. This is about as far in the opposite direction as I could possibly imagine. It's yet another way, sadly, the brilliant satire Idiocracy turned out to be right on the nose.

idiocracy-starbucks-exotic-coffee-for-men.jpg

The dystopian future of Idiocracy predicted the reduction of advertising to the inevitable lowest common denominator of all, with Starbucks Exotic Coffee for Men, H.R. Block "Adult" Tax Return (home of the gentleman's rebate), and Pollo Loco chicken advertising a Bucket of Wings with "full release".

Evony, thanks for showing us what it means to take advertising on the internet to the absolute rock bottom ... then dig a sub-basement under that, and keep on digging until you reach the white-hot molten core of the Earth. I've always wondered what that would be like. I guess now I know.

Discussion

Testing With "The Force"

Markdown was one of the humane markup languages that we evaluated and adopted for Stack Overflow. I've been pretty happy with it, overall. So much so that I wanted to implement a tiny, lightweight subset of Markdown for comments as well.

I settled on these three commonly used elements:

*italic* or _italic_
**bold** or __bold__
`code`

I loves me some regular expressions and this is exactly the stuff regex was born to do! It doesn't look very tough. So I dusted off my copy of RegexBuddy and began.

I typed some test data in the test window, and whipped up a little regex in no time at all. This isn't my first time at the disco.

regexbuddy-naive-regex.png

Bam! Yes! Done and done! By gum, I must be a genius programmer!

Despite my obvious genius, I began to have some small, nagging doubts. Is the test phrase...

I would like this to be *italic* please.

... really enough testing?

Sure it is! I can feel in my bones that this thing freakin' works! It's almost like I'm being pulled toward shipping this code by some inexorable, dark, testing ... force. It's so seductively easy!

come-to-the-dark-side-vader.jpg

But wait. I have this whole database of real world comments that people have entered on Stack Overflow. shouldn't I perhaps try my awesome regular expression on that corpus of data to see what happens? Oh, fine. If we must. Just to humor you, nagging doubt. Let's run a query and see.

select Text from PostComments
where dbo.RegexIsMatch(Text, '*(.*?)*') = 1

Which produced this list of matches, among others:

Interesting fact about math: x * 7 == x + (x * 2) + (x * 4), or x + x >> 1 + x >> 2. Integer addition is usually pretty cheap.

Thanks. What I needed was to turn on Singleline mode too, and use .*? instead of .*.

yeah, see my edit - change select * to select RESULT.* one row - are sure you have more than one row item with the same InstanceGUID?

Not your main problem, but you are mix and matching wchar_t and TCHAR. mbstowcs() converts from char * to wchar_t *.

aawwwww.... Brainf**k is not valid. :/

Thank goodness I listened to my midichlorians and let the light side of the testing force prevail here!

come-to-the-light-side-skywalker.jpg

So how do we fix this regex? We use the light side of the force -- brute force, that is, against a ton of test cases! My job here is relatively easy because I have over 20,000 test cases sitting in a database. You may not have that luxury. Maybe you'll need to go out and find a bunch of test data on the internet somewhere. Or write a function that generates random strings to feed to the routine, also known as fuzz testing.

I wanted to leave the rest of this regular expression as an exercise for the reader, as I'm a sick guy who finds that sort of thing entertaining. If you don't -- well, what the heck is wrong with you, man? But I digress. I've been criticized for not providing, you know, "the answer" in my blog posts. Let's walk through some improvements to our italic regex pattern.

First, let's make sure we have at least one non-whitespace character inside the asterisks. And more than one character in total so we don't match the ** case. We'll use positive lookahead and lookbehind to do that.

*(?=S)(.+?)(?<=S)*

That helps a lot, but we can test against our data to discover some other problems. We get into trouble when there are unexpected characters in front of or behind the asterisks, like, say, p*q*r. So let's specify that we only want certain characters outside the asterisks.

(?<=[s^,(])*(?=S)(.+?)(?<=S)*(?=[s$,.?!])

Run this third version against the data corpus, and wow, that's starting to look pretty darn good! There are undoubtedly some edge conditions, particularly since we're unlucky enough to be talking about code in a lot of our comments, which has wacky asterisk use.

This regex doesn't have to be (and probably cannot be, given the huge possible number of human inputs) perfect, but running it against a large set of input test data gives me reasonable confidence that I'm not totally screwing up.

So by all means, test your code with the force -- brute force! It's good stuff! Just be careful not to get sloppy, and let the dark side of the testing force prevail. If you think one or two simple test cases covers it, that's taking the easy (and most likely, buggy and incorrect) way out.

Discussion

Code: It's Trivial

Remember that Stack Overflow thing we've been working on? Some commenters on a recent Hacker News article questioned the pricing of Stack Exchange -- essentially, a hosted Stack Overflow:

Seems really pricey for a relatively simple software like this. Someone write an open source alternative? it looks like something that can be thrown together in a weekend.

Ah, yes, the stereotypical programmer response to most projects: it's trivial! I could write that in a week!*

It's even easier than that. Open source alternatives to Stack Overflow already exist, so you've got a head start. Gentlemen, start your compilers! Er, I mean, interpreters!

No, I don't take this claim seriously. Not enough to write a response. And fortunately for me, now I don't need to, because Benjamin Pollack -- one of the few people outside our core team who has access to the Stack Overflow source code -- already wrote a response. Even if I had written a response, I doubt it would have been half as well written as Benjamin's.

Developers think cloning a site like StackOverflow is easy for the same reason that open-source software remains such a horrible pain in the ass to use. When you put a developer in front of StackOverflow, they don't really see StackOverflow. What they actually see is this:
create table QUESTION (ID identity primary key,
TITLE varchar(255),
BODY text,
UPVOTES integer not null default 0,
DOWNVOTES integer not null default 0,
USER integer references USER(ID));
create table RESPONSE (ID identity primary key,
BODY text,
UPVOTES integer not null default 0,
DOWNVOTES integer not null default 0,
QUESTION integer references QUESTION(ID))

If you then tell a developer to replicate StackOverflow, what goes into his head are the above two SQL tables and enough HTML to display them without formatting, and that really is completely doable in a weekend. The smarter ones will realize that they need to implement login and logout, and comments, and that the votes need to be tied to a user, but that's still totally doable in a weekend; it's just a couple more tables in a SQL back-end, and the HTML to show their contents. Use a framework like Django, and you even get basic users and comments for free.

But that's not what StackOverflow is about. Regardless of what your feelings may be on StackOverflow in general, most visitors seem to agree that the user experience is smooth, from start to finish. They feel that they're interacting with a polished product. Even if I didn't know better, I would guess that very little of what actually makes StackOverflow a continuing success has to do with the database schema--and having had a chance to read through StackOverflow's source code, I know how little really does. There is a tremendous amount of spit and polish that goes into making a major website highly usable. A developer, asked how hard something will be to clone, simply does not think about the polish, because the polish is incidental to the implementation.

I have zero doubt that given enough time, open source clones will begin to approximate what we've created with Stack Overflow. It's as inevitable as evolution itself. Well, depending on what time scale you're willing to look at. With a smart, motivated team of closed-source dinosaurs, it is indeed possible to outrun those teeny tiny open-source mammals. For now, anyway. Let's say we're those speedy, clever Velociraptor types of dinosaurs -- those are cool, right?

Despite Benjamin's well reasoned protests, the source code to Stack Overflow is, in fact, actually, kind of ... well, trivial. Although there is starting to be quite a lot of it, as we've been beating on this stuff for almost a year now. That doesn't mean our source code is good, by any means; as usual, we make crappy software, with bugs. But every day, our tiny little three person team of speedy-but-doomed Velociraptors starts out with the same goal. Not to write the best Stack Overflow code possible, but to create the best Stack Overflow experience possible. That's our mission: make Stack Overflow better, in some small way, than it was the day before. We don't always succeed, but we try very, very hard not to suck -- and more importantly, we keep plugging away at it, day after day.

Building a better Stack Overflow experience does involve writing code and building cool features. But more often, it's anything but:

  1. synthesizing cleaner, saner HTML markup
  2. optimizing our pages for speed and load time efficiency
  3. simplifying or improving our site layout, CSS, and graphics
  4. responding to support and feedback emails
  5. writing a blog post explaining some aspect of the site engine or philosophy
  6. being customers of our own sites, asking our own programming questions and sysadmin questions
  7. interacting with the community on our dedicated meta-discussion site to help gauge what we should be working on, and where the rough edges are that need polishing
  8. electing community moderators and building moderation tools so the community can police and regulate itself as it scales
  9. producing Creative Commons dumps of our user-contributed questions and answers
  10. coming up with schemes for responsible advertising so we can all make a living
  11. producing the Stack Overflow podcast with Joel
  12. helping set up logistics for the Stack Overflow DevDays conferences
  13. setting up the next site in the trilogy, and figuring out where we go next

As programmers, as much as we might want to believe that

lots_of_awesome_code = success;

There's nothing particularly magical about the production of source code. In fact, writing code is a tiny proportion of what makes most businesses successful.

Code is meaningless if nobody knows about your product. Code is meaningless if the IRS comes and throws you in jail because you didn't do your taxes. Code is meaningless if you get sued because you didn't bother having a software license created by a lawyer.

Writing code is trivial. And fun. And something I continue to love doing. But if you really want your code to be successful, you'll stop coding long enough to do all that other, even more trivial stuff around the code that's necessary to make it successful.

* Although, to be fair, I really could write Twitter in a week. It's so ridiculously simple! Come on!

Discussion

Oh, You Wanted "Awesome" Edition

We recently upgraded our database server to 48 GB of memory -- because hardware is cheap, and programmers are expensive.

Imagine our surprise, then, when we rebooted the server and saw only 32 GB of memory available in Windows Server 2008. Did we install the memory wrong? No, the BIOS screen reported the full 48 GB of memory. In fact, the system information applet even reports 48 GB of memory:

sodb1-system-summary.png

But there's only 32 GB of usable memory in the system, somehow.

sodb1-taskman-memory.png

Did you feel that? A great disturbance in the Force, as if 17 billion bytes simultaneously cried out in terror and were suddenly silenced. It's so profoundly sad.

That's when I began to suspect the real culprit: weasels.

marketing-weasel.jpg

No. Not the cute weasels. I'm referring to angry, evil marketing weasels.

weasels-ripped-my-flesh.jpg

That's more like it. Those marketing weasels are vicious.

We belatedly discovered post-upgrade that we are foolishly using Windows Server 2008 Standard edition. Which has been arbitrarily limited to 32 GB of memory. Why? So the marketing weasels can segment the market.

It's sort of like if you were all set to buy that new merino wool sweater, and you thought it was going to cost $70, which is well worth it, and when you got to Banana Republic it was on sale for only $50! Now you have an extra $20 in found money that you would have been perfectly happy to give to the Banana Republicans!

Yipes!

That bothers good capitalists. Gosh darn it, if you're willing to do without it, well, give it to me! I can put it to good use, buying a SUV or condo or Mooney or yacht one of those other things capitalists buy!

In economist jargon, capitalists want to capture the consumer surplus.

Let's do this. Instead of charging $220, let's ask each of our customers if they are rich or if they are poor. If they say they're rich, we'll charge them $349. If they say they're poor, we'll charge them $220.

Now how much do we make? Back to Excel. Notice the quantities: we're still selling the same 233 copies, but the richest 42 customers, who were all willing to spend $349 or more, are being asked to spend $349. And our profits just went up! from $43K to about $48K! NICE!

Capture me some more of that consumer surplus stuff!

How many versions of WIndows Server 2008 are there? I count at least six. They're capturing some serious consumer surplus, over there in Redmond.

  • Datacenter Edition
  • Enterprise Edition
  • Standard Edition
  • Foundation
  • Web
  • HPC

Already, I'm confused. Which one of these versions allows me to use all 48 GB of my server's memory? There are no less than six individual "compare" pages to slice and dice all the different features each version contains. Just try to make sense of it all. I dare you. No, I double dog dare you! Oh, and by the way, there's zero pricing information on any of these pages. So open another browser window and factor that into your decisionmaking, too.

I don't mean to single out Microsoft here; lots of companies use this segmented pricing trick. Even Web 2.0 darlings 37 Signals.

BaseCamp pricing

Heck, our very own product segments the market.

Stack Exchange pricing

37signals just does it .. prettier, that's all. They're still asking you if you're poor or rich, and charging you more if you're rich.

Eric Sink also advocates the same "rich customer, poor customer" software pricing policy:

In an ideal world, the price would be different for every customer. The "perfect" pricing scheme would charge every customer a different amount, extracting from each one the maximum amount they are willing to pay.
  • The IT guy at Podunk Lutheran College has no money: Gratis.
  • The IT guy at a medium-sized real estate agency has some money: $500.
  • The IT guy at a Fortune 100 company has tons of money: $50,000.

You can never make your pricing "perfect," but you can do much better than simply setting one constant price for all situations. By carefully tuning all these details, you can find ways to charge more money from the people who are willing to pay more.

This sort of pricing seems exploitative, but it can also be an act of public good -- remember that the poorest customers are paying less; with a one-size-fits-all pricing policy, they might not be able to afford the product at all. Drug companies often follow the same pricing model when selling life-saving drugs to third-world countries. First-world countries end up subsidizing the massive costs of drug development, but the whole world benefits.

What I object to isn't the money involved, but the mental overhead. The whole thing runs so contrary to the spirit of Don't Make Me Think. Sure, don't make us customers think. Unless you want us to think about how much we'd like to pay you, that is.

And what are we paying for? The privilege of flipping the magic bits in the software that say "I am blah edition!" It's all so.. anticlimactic. All that effort, all that poring over complex feature charts and stressing out about pricing plans, and for what? Just to get the one simple, stupid thing I care about -- using all the memory in my server.

Perhaps these complaints, then, point to one unsung advantage of open source software:

Open source software only comes in one edition: awesome.

The money is irrelevant; the expensive resource here is my brain. If I choose open source, I don't have to think about licensing, feature matrices, or recurring billing. I know, I know, we don't use software that costs money here, but I'd almost be willing to pay for the privilege of not having to think about that stuff ever again.

Now if you'll excuse me, I'm having trouble deciding between Windows 7 Smoky Bacon Edition and Windows 7 Kenny Loggins Edition. Bacon is delicious, but I also love that Footloose song..

Discussion

All Abstractions Are Failed Abstractions

In programming, abstractions are powerful things:

Joel Spolsky has an article in which he states

All non-trivial abstractions, to some degree, are leaky.

This is overly dogmatic - for example, bignum classes are exactly the same regardless of the native integer multiplication. Ignoring that, this statement is essentially true, but rather inane and missing the point. Without abstractions, all our code would be completely interdependent and unmaintainable, and abstractions do a remarkable job of cleaning that up. It is a testament to the power of abstraction and how much we take it for granted that such a statement can be made at all, as if we always expected to be able to write large pieces of software in a maintainable manner.

But they can cause problems of their own. Let's consider a particular LINQ to SQL query, designed to retrieve the most recent 48 Stack Overflow questions.

var posts =
(from p in DB.Posts
where
p.PostTypeId == PostTypeId.Question &&
p.DeletionDate == null &&
p.Score >= minscore
orderby p.LastActivityDate descending
select p).
Take(maxposts);

The big hook here is that this is code the compiler actually understands. You get code completion, compiler errors if you rename a database field or mistype the syntax, and so forth. Perhaps best of all, you get an honest to goodness post object as output! So you can turn around and immediately do stuff like this:

foreach (var post in posts.ToList())
{
Render(post.Body);
}

Pretty cool, right?

Well, that Linq to SQL query is functionally equivalent to this old-school SQL blob. More than functionally, it is literally identical, if you examine the SQL string that LINQ generates behind the scenes:

string query =
"select top 48 * from Posts
where
PostTypeId = 1 and
DeletionDate is null and
Score >= -4
order by LastActivityDate desc";

This text blob is of course totally opaque to the compiler. Fat-finger a syntax error in here, and you won't find out about it until runtime. Even if it does run without a runtime error, processing the output of the query is awkward. It takes row level references and a lot of tedious data conversion to get at the underlying data.

var posts = DB.ExecuteQuery(query);
foreach (var post in posts.ToList());
{
Render(post["Body"].ToString());
}

So, LINQ to SQL is an abstraction -- we're abstracting away raw SQL and database access in favor of native language constructs and objects. I'd argue that Linq to SQL is a good abstraction. Heck, it's exactly what I asked for five years ago.

But even a good abstraction can break down in unexpected ways.

Consider this optimization, which is trivial in the old-school SQL blob code: instead of pulling down every single field in the post records, why not pull just the id number? Makes sense, if that's all I need. And it's faster -- much faster!

select top 48 * from Posts827 ms
select top 48 Id from Posts260 ms

Selecting all columns with the star (*) operator is expensive, and that's what LINQ to SQL always does by default. Yes, you can specify lazy loading, but not on a per-query basis. Normally, this is a non-issue, because selecting all columns for simple queries is not all that expensive. And you'd think pulling down 48 measly little post records would be squarely in the "not expensive" category!

So let's compare apples to apples. What if we got just the id numbers, then retrieved the full data for each row?

select top 48 Id from Posts260 ms
select * from Posts where Id = 123453 ms

Now, retrieving 48 individual records one by one is sort of silly, becase you could easily construct a single where Id in (1,2,3..,47,48) query that would grab all 48 posts in one go. But even if we did it in this naive way, the total execution time is still a very reasonable (48 * 3 ms) + 260 ms = 404 ms. That is half the time of the standard select-star SQL emitted by LINQ to SQL!

An extra 400 milliseconds doesn't sound like much, but slow pages lose users. And why in the world would you perform a slow database query on every single page of your website when you don't have to?

It's tempting to blame Linq, but is Linq really at fault here? These seem like identical database operations to me:

1. Give me all columns of data for the top 48 posts.

or

1. Give me just the ids for the top 48 posts.
2. Retrieve all columns of data for each of those 48 ids.

So why in the wide, wide world of sports would one of these seemingly identical operations be twice as slow as the other?

The problem isn't Linq to SQL. The problem is that we're attempting to spackle a nice, clean abstraction over a database that is full of highly irregular and unusual real world behaviors. Databases that:

  • may not have the right indexes
  • may misinterpret your query and generate an inefficient query plan
  • are trying to perform an operation that doesn't fit well in available memory
  • are paging data from disks which might be busy at that particular moment
  • might contain irregularly sized column datatypes

That's what's so frustrating. We can't just pretend all our data is formatted into neat, orderly data structures sitting there in memory, lined up in convenient little queues for us to reach out and casually scoop them up. As I've demonstrated, even trivial queries can have bizarre behavior and performance characteristics that are not at all clear.

To its credit, Linq to SQL is quite flexible: we can use strongly typed queries, or we can use SQL blob queries that we cast to the right object type. That flexibility is critical, because so much of our performance depends on these quirks of the database. We default to the built-in Linq language constructs, and drop down to hand-tuning ye olde SQL blobs where the performance traces tell us we need to.

Either way, it's clear that you've got to know what's happening in the database every step of the way to even begin understanding the performance of your application, much less troubleshoot it.

I think you could make a fairly solid case that Linq to SQL is, in fact, a leaky and failed abstraction. Exactly the kind of thing Joel was complaining about. But I'd also argue that virtually all good programming abstractions are failed abstractions. I don't think I've ever used one that didn't leak like a sieve. But I think that's an awfully architecture astronaut way of looking at things. Instead, let's ask ourselves a more pragmatic question:

Does this abstraction make our code at least a little easier to write? To understand? To troubleshoot? Are we better off with this abstraction than we were without it?

It's our job as modern programmers not to abandon abstractions due to these deficiencies, but to embrace the useful elements of them, to adapt the working parts and construct ever so slightly less leaky and broken abstractions over time. Like desperate citizens manning a dike in a category 5 storm, we programmers keep piling up these leaky abstractions, shoring up as best we can, desperately attempting to stay ahead of the endlessly rising waters of complexity.

As much as I may curse Linq to SQL as yet another failed abstraction, I'll continue to use it. Yes, I may end up soggy and irritable at times. But it sure as heck beats drowning.

Discussion