Coding Horror

programming and human factors

URL Rewriting to Prevent Duplicate URLs

As a software developer, you may be familiar with the DRY principle: don't repeat yourself. It's absolute bedrock in software engineering; it's covered beautifully in The Pragmatic Programmer, and even more succinctly in this brief IEEE Software article (pdf). If you haven't committed this to heart by now, go read those links first. We'll wait.

Scott Hanselman recently found out the hard way that the DRY principle also applies to URLs. Consider the multiple ways you could get to this very page:

  • http://codinghorror.com/blog/
  • http://www.codinghorror.com/blog/
  • http://www.codinghorror.com/blog/index.htm

It's even more problematic for Scott because he has two different domain names that reference the same content.

Having multiple URLs reference the same content is undesirable not only from a sanity-check DRY perspective, but also because it dilutes your PageRank. PageRank is calculated per URL. If 50% of your incoming backlinks use one URL, and 50% use a different URL, you aren't getting the full PageRank benefit of those backlinks. The link juice is watered down and divvied up between the two different URLs instead of being concentrated into one of them.

So the moral of this story, if there is one, is to keep your URLs simple and standard. This is something the REST crowd has been preaching for years. You can't knock simplicity. Well, you can, but you'll be crushed by simplicity's overwhelming popularity eventually, so why fight it?

Normalizing your URLs isn't difficult if you take advantage of URL rewriting. URL rewriting has been a de facto standard on Apache for years, but it has yet to reach mainstream acceptance in Microsoft's IIS. I'm not even sure if IIS 7 supports URL rewriting out of the box, although its new, highly modular architecture would make it very easy to add support. It's critical that Microsoft get a good reference implementation of an IIS 7 URL rewriter out there, preferably one that's compatible with the vast existing library of mod_rewrite rules.
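
For comparison, here's roughly what this kind of canonicalization looks like in Apache's own mod_rewrite syntax. Treat it as a sketch to adapt rather than a drop-in config, and note that example.com is a stand-in for your actual domain:

# .htaccess-style sketch: force the www. prefix on all requests
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
# remove index pages from URLs
RewriteRule ^(.*/)?(index|default)\.(htm|html|aspx)$ /$1 [R=301,NC,L]

The 301 (permanent) redirect matters here: it tells search engines to consolidate any accumulated PageRank onto the canonical URL instead of splitting it.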

But that doesn't help us today. If you're using IIS today, you have two good options for URL rewriting; they're both installable as ISAPI filters. I'll show samples for both, using a few common URL rewriting rules that I personally use on my website.

The first is ISAPI Rewrite. ISAPI Rewrite isn't quite free, but it's reasonably priced, and most importantly, it's nearly identical in syntax to the Apache mod_rewrite standard. It's also mature, having been through quite a few revisions by now.

[ISAPI_Rewrite]
# fix missing slash on folders
# note, this assumes we have no folders with periods!
RewriteCond Host: (.*)
RewriteRule ([^.?]+[^.?/]) http://$1$2/ [RP]
# remove index pages from URLs
RewriteRule (.*)/default.htm$ $1/ [I,RP]
RewriteRule (.*)/default.aspx$ $1/ [I,RP]
RewriteRule (.*)/index.htm$ $1/ [I,RP]
RewriteRule (.*)/index.html$ $1/ [I,RP]
# force proper www. prefix on all requests
RewriteCond %HTTP_HOST ^test.com [I]
RewriteRule ^/(.*) http://www.test.com/$1 [RP]
# only allow whitelisted referers to hotlink images
RewriteCond Referer: (?!http://(?:www.good.com|www.better.com)).+
RewriteRule .*.(?:gif|jpg|jpeg|png) /images/block.jpg [I,O]

The second option, Ionic's ISAPI Rewrite Filter, is completely free. This filter has improved considerably since the last time I looked at it, and it appears to be a viable choice now. However, it uses its own rewrite syntax that is similar to the Apache mod_rewrite standard, but different enough to require some rework.

# fix missing slash on folders
# note, this assumes we have no folders with periods!
RewriteRule (^[^.]+[^/]$) $1/ [I,RP]
# remove index pages from URLs
RewriteRule  (.*)/default.htm$ $1/ [I,RP]
RewriteRule  (.*)/default.aspx$ $1/ [I,RP]
RewriteRule  (.*)/index.htm$ $1/ [I,RP]
RewriteRule  (.*)/index.html$ $1/ [I,RP]
# force proper www. prefix on all requests
RewriteCond %{HTTP_HOST} ^test.com [I]
RewriteRule ^/(.*) http://www.test.com/$1 [I,RP]
# only allow whitelisted referers to hotlink images
RewriteCond %{HTTP_REFERER} ^(?!HTTP_REFERER)
RewriteCond %{HTTP_REFERER} ^(?!http://www.good.com) [I]
RewriteCond %{HTTP_REFERER} ^(?!http://www.better.com) [I]
RewriteRule .(?:gif|jpg|jpeg|png)$ /images/block.jpg [I,L]

The Ionic filter still has some quirks, but I loved its default logging capability. I could tell exactly what was happening with my rules, blow by blow, with a quick glance at the log file. However, I had a lot of difficulty getting the Ionic filter to install-- I could only get it to work in IIS 5.0 isolation mode, no matter what I tried. Clearly a work in progress, but a very promising one.

Of course, the few rewrite rules I presented above-- URL normalization and image hotlink prevention-- are merely the tip of the iceberg.

They don't call it the Swiss Army Knife of URL Manipulation for nothing. URL rewriting should be an integral part of every web developer's toolkit. It'll increase your DRYness, it'll increase your PageRank, and it's also central to the concept of REST.


Because They All Suck

The release of Windows Vista has caused an unfortunate resurgence in that eternal flame of computer religious wars, Mac vs. PC. Everywhere I go, somebody's explaining in impassioned tones why their pet platform is better than yours. It's all so tedious.

The eternal flame... wars

Personally, I had my fill of Mac versus PC arguments by 1994. I remember spending untold hours on the America Online forums endlessly debating the merits of PCs and Macs with Ross Rubin and other unsavory characters. But all that arguing never seemed to result in anything other than more arguments. Eventually, if you're more interested in using computers than endlessly arguing about them, you outgrow the arguments. And yet somehow, nearly fifteen years later, we're all happily retreading the same tired old Mac vs. PC ground.

I have a problem with this.

You might read Charles Petzold's ironically titled It Just Works as an anti-Mac diatribe. It certainly casts Apple in an unflattering light; Petzold's poor mother can't seem to catch a break.


Perhaps if my mother used lots of various Mac applications and stuck in lots of external devices, the machine would "just work" quite well. But she basically only uses email, so perhaps that's the problem. Just about every time I visit my mother in Jersey, I am called upon to boot up that dreadful machine and do something so it "just works" once again. For awhile she had a problem where certain spam emails would hang the email program upon viewing, but they couldn't be deleted without first being viewed. (Gosh, that was fun.) Presumably some patch to fix this little problem is among the 100 megabytes of updates waiting to be downloaded and installed, but my mother has a dial-up and we're forced to forego this 100 meg download. And besides, the slogan isn't "It just works with 100 megabytes of updates."

But if you read closely, as I did, you'll see that the experience wouldn't have been any better on a Windows PC. For a PC of that vintage, it's likely Petzold would have had to install the enormous Windows XP Service Pack 2 update to bring it up to date, which is certainly no less of a hassle than going from OS X 10.2 to OS X 10.4.

That's because Macs and PCs share one crucial flaw: they're both computers.

My computer frustrates and infuriates me on a daily basis, and it's been this way since I first laid my hands on a keyboard. Every computer I've ever owned-- including the ones with an Apple logo-- has been a colossal pain in the neck. Some slightly more so than others, but any device designed as a general purpose "do-everything" computing machine is destined to disappoint you eventually. It's inevitable.

The only truly sublime end-user experiences I've had have been with computers that weren't computers-- specialized devices, such as Tivo, the original Palm Pilot, the Nintendo Wii, and so forth.

General purpose computing devices are designed to be all things to all people. As a direct consequence, they will always be rife with compromises, pitfalls, and disappointments. That's the first secret of using computers: they all suck. Which makes the entire Mac vs. PC debate relative degrees of moot. I learned this lesson early in life; evidently some people are still struggling with it.

Computers do have one strong suit: they're unparalleled tools for writing, photography, programming, composing music, and creating art. It's the only reason to deal with the pain of owning one. As the Guardian's Charlie Brooker notes, the Mac vs. PC debate has an insidious side-effect that can distract you from this key benefit:


Ultimately the [Get a Mac advertising] campaign's biggest flaw is that it perpetuates the notion that consumers somehow "define themselves" with the technology they choose. If you truly believe you need to pick a mobile phone that "says something" about your personality, don't bother. You don't have a personality. A mental illness, maybe - but not a personality. Of course, that hasn't stopped me slagging off Mac owners with a series of sweeping generalisations for the past 900 words, but that is what the ads do to PCs. Besides, that's what we PC owners are like - unreliable, idiosyncratic and gleefully unfair. And if you'll excuse me now, I feel an unexpected crash coming.

That's the other problem with the Mac vs. PC debate: it completely misses the point. Computers aren't couture, they're screwdrivers. Your screwdriver rocks, and our screwdriver sucks. So what? They're screwdrivers. If you really want to convince us, stop talking about your screwdriver, and show us what you've created with it.


Everybody Loves BitTorrent

The traditional method of distributing large files is to put them on a central server. Each client then downloads the file directly from the server. It's a gratifyingly simple approach, but it doesn't scale. For every download, the server consumes bandwidth equal to the size of the file. You probably don't have enough bandwidth to serve a large file to a large audience, and even if you did, your bandwidth bill would go through the roof. The larger the file, the larger the audience, the worse your bandwidth problem gets. It's a popularity tax.
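
To make the popularity tax concrete with an invented example: serving a 350 megabyte file to 10,000 downloaders costs your server roughly 3.5 terabytes of outbound bandwidth, and doubling the audience doubles the bill.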

With BitTorrent, you also start by placing your large file on a central server. But once the downloading begins, something magical happens: as clients download the file, they share whatever parts of the file they have with each other. Clients can opportunistically connect with any other client to obtain multiple parts of the file at once. And it scales beautifully: as file size and audience size increase, the bandwidth of the BitTorrent distribution network also increases. Your server does less and less work with each additional connected client. It's an elegant, egalitarian way of sharing large files with large audiences.

BitTorrent radically shifts the economics of distribution. It's one of the most miraculous ideas ever conceived on the internet. As far as I'm concerned, there should be a Nobel prize for computing, and the inventor of BitTorrent should be its first recipient.

There's a great Processing visualization of BitTorrent in action which explains it far better than I can. The original visualization is not only down semi-permanently, but also written for an ancient version of Processing. I grabbed a cached copy of the code and updated it for the latest version of Processing.

animated visualization of BitTorrent in action

This meager little animated GIF doesn't do the highly dynamic, real-time nature of the visualization justice. I highly recommend downloading Processing along with the updated BitTorrent visualization code, so you can see the process from start to finish on your own machine. It's beautiful.

But as wonderful and clever as BitTorrent is, it isn't perfect. As an avid BitTorrent user, I've noticed the following problems:

  1. BitTorrent is a terrible Long Tail client.

    The efficiency of BitTorrent is predicated on popularity. The more people downloading, the larger the distribution network gets. But if what you want is obscure or unpopular – part of the long tail – BitTorrent is painfully, brutally slow. With only a handful of clients sharing the workload, you're better off using traditional distribution methods.

  2. BitTorrent, although distributed, is still centralized.

    Download work is shared by the clients, but how do the clients locate each other? Traditionally this is done through a centralized "tracker" server, which maintains the list of peers. This means BitTorrent is vulnerable to attacks on that centralized server. Once the tracker is out of commission, the clients have no way of locating each other, and the whole distribution network grinds to a halt. There are alternatives which allow clients to share the list of peers amongst themselves, such as distributed hash tables, but centralized tracking is more efficient.

    Also, in order to even begin a BitTorrent download, you must first know where to obtain a .torrent file. It's a chicken-and-egg problem that also implies the existence of a centralized server out there somewhere. (The sketch after this list shows just how much a torrent depends on that one hard-coded tracker address.)

  3. BitTorrent is unsuitable for small files, even if they are extremely popular.

    The BitTorrent distribution network is predicated on clients sharing pieces of the file during the download period. But if the download period is small, the opportunity window for sharing is also small; at any given time only a few users will be downloading. This is another scenario where you're unlikely to find any peers, so you're better off with traditional distribution methods.

  4. BitTorrent relies on client altruism.

    There's no rule that says clients must share bandwidth while they're downloading. Although most BitTorrent clients default to uploading the maximum amount a user's upstream connection allows, it's possible to dial the upload rate down to nothing if you're greedy. And some users may have their firewalls configured in such a way that they can't upload data, even if they wanted to. There's no way to punish bad peers for not sharing, or reward good peers for sharing more.

    Furthermore, every torrent needs a "seed" – a peer with 100% of the file downloaded – connected at all times. If there is no seed, no matter how many peers you have, none of the peers will ever be able to download the entire file. It's considered a courtesy to stay connected if you have 100% of the file downloaded and no other seeds are available. But again, this is a convention, not a requirement. It's entirely possible for a torrent to "die" when there are no seeds available.
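
To see why the .torrent file and the tracker are such central points, it helps to look at what a torrent actually contains. Here's a rough Python sketch; the tracker URL, file name, and peer id are all invented, and the tiny bencoder covers only what this example needs:

import hashlib
from urllib.parse import urlencode

def bencode(value):
    # Minimal bencoder -- just enough to serialize the metainfo below.
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, str):
        return bencode(value.encode("utf-8"))
    if isinstance(value, dict):
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in sorted(value.items())) + b"e"
    raise TypeError("unsupported type: %r" % type(value))

# A minimal .torrent ("metainfo") file. Note the single hard-coded tracker URL.
info = {
    "name": "bigfile.iso",
    "length": 700 * 1024 * 1024,
    "piece length": 256 * 1024,
    "pieces": b"",  # in a real torrent: concatenated SHA-1 hashes of every piece
}
metainfo = {"announce": "http://tracker.example.com/announce", "info": info}

# Clients identify the torrent by the SHA-1 hash of the bencoded info dict...
info_hash = hashlib.sha1(bencode(info)).digest()

# ...and must periodically announce themselves to that one tracker to find peers.
announce_url = metainfo["announce"] + "?" + urlencode({
    "info_hash": info_hash,
    "peer_id": b"-XX0001-123456789012",
    "port": 6881,
    "uploaded": 0,
    "downloaded": 0,
    "left": info["length"],
})
print(announce_url)

Take that single announce URL offline and every client in the swarm loses its standard way of finding the others, absent something like a distributed hash table.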

The BitTorrent model is innovative, but it isn't suitable for every distribution task. The centralized server model is superior in most cases. But centralized distribution is a tool for the rich. Only highly profitable organizations can afford massive amounts of bandwidth. BitTorrent, in comparison, is highly democratic. BitTorrent gives the people whatever they want, whenever they want it – by collectively leveraging the tiny trickle of upstream bandwidth doled out by most internet service providers.

But just because it's democratic doesn't mean BitTorrent has to be synonymous with intellectual piracy. BitTorrent has legitimate uses, such as distributing World of Warcraft patches. And Amazon's S3 directly supports the torrent protocol.
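
As I recall, S3's torrent support is about as simple as it gets: append ?torrent to the URL of a publicly readable object, and S3 hands back a generated .torrent file with S3 itself acting as a seed. The bucket and file names here are invented:

http://s3.amazonaws.com/my-bucket/big-file.iso           (plain HTTP download)
http://s3.amazonaws.com/my-bucket/big-file.iso?torrent   (same object, served as a torrent)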

BitTorrent, in short, puts distribution choices back in the hands of the people. And that's why everybody loves BitTorrent. Everyone, that is, except the MPAA and RIAA.


Beyond JPEG

It's surprising that the venerable JPEG image compression standard, whose roots date back to 1986, is still the best we can do for photographic image compression. I can't remember when I encountered my first JPEG image, but JPEG didn't appear to enter practical use until the early '90s.

There's nothing wrong with JPEG. It's a perfectly serviceable image compression format. But there are more modern choices these days. There's even a sequel of sorts to JPEG, known as JPEG 2000. It's the logical heir to the JPEG throne.

The promise of JPEG 2000 is higher image quality in much smaller file sizes, at the minor cost of additional CPU time. And since we always seem to have a lot more CPU time than bandwidth, that's a perfect tradeoff. You may remember my comparison of JPEG compression levels entry from last year. Let's see what happens when we take the two worst-looking images from that comparison – the ones with JPEG compression factors 40 and 50 – and use JPEG 2000 to produce images of (nearly) the exact same size:

JPEG, ~8,200 bytes vs. JPEG 2000, ~8,200 bytes (Lena test image)
JPEG, ~10,700 bytes vs. JPEG 2000, ~10,700 bytes (Lena test image)

No current web browsers can render JPEG 2000 (.jp2) images, so what you're seeing are extremely high quality JPEG versions of the JPEG 2000 images. Click on the images to download the actual JPEG 2000 files; most modern photo editing software can view them natively.
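
If you'd like to generate comparison files of your own, here's a minimal sketch using the Pillow imaging library. It assumes your Pillow build includes JPEG 2000 (OpenJPEG) support, and lena.png is a stand-in for whatever source image you use:

from PIL import Image

img = Image.open("lena.png")  # stand-in filename for your uncompressed source

# Baseline JPEG at an aggressive quality setting.
img.save("lena-q40.jpg", "JPEG", quality=40)

# JPEG 2000 at roughly a 30:1 compression ratio; adjust the ratio until the
# .jp2 file size matches the .jpg you want to compare against.
img.save("lena-30to1.jp2", "JPEG2000", quality_mode="rates", quality_layers=[30])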

JPEG 2000 not only compresses more efficiently, it also does a better job of hiding its compression artifacts. It takes a lot more bits per pixel to create a JPEG image that looks as good as a JPEG 2000 image. But if you're willing to pump up the file size, you aren't losing any fidelity by presenting JPEG images.

Microsoft, as Microsoft is wont to do, offers a closed-source alternative to JPEG 2000 known as HD Photo or Windows Media Photo. As of late 2006, Microsoft made the format 100% royalty free, and support for HD Photo is included in Windows Vista and .NET Framework 3.0. According to this Russian study, files in Microsoft's HD Photo format (.hdp, .wdp) are comparable to-- but not better than-- JPEG 2000. The study PDF has lots of comparison images, so you can decide for yourself.

Unfortunately, it doesn't really matter which next-generation image compression format is better, since nobody uses them. Microsoft neglected to include support for HD Photo in Internet Explorer 7. And Firefox doesn't currently support JPEG 2000, either. It's a bit of a mystery, because there's a seven-year-old open bug on JPEG 2000 support, and the OpenJPEG library seems like a logical fit.

Until a commonly used web browser supports JPEG 2000 or HD Photo, there's no traction. I hope the next browser releases can move us beyond the ancient JPEG image compression format.


What's In a Version Number, Anyway?

I remember when Microsoft announced that Windows 4.0 would be known as Windows 95. At the time, it seemed like a radical, unnecessary change -- naming software with years instead of version numbers? Inconceivable! How will users of Windows 3.1 possibly know what software version they should upgrade to?

In retrospect, switching away from software version numbers to years seems like one of the wisest decisions Microsoft ever made.

  • Users don't care about version numbers. Major, minor, alpha, beta, build number.. what does it all mean? What users might care about is knowing whether or not the software they're running is current. A simple date is the most direct way to communicate this to the user.

  • A model year is easy to understand. Why should it take two arbitrary numbers and a decimal point to identify what software you're using? We identify tons of consumer products using a simple model year designator. Software should be no different.

  • Version numbers don't scale. Once you get beyond ten versions, what's the point of meticulously counting every new release? Better to stamp it with a date and move on.

Microsoft Office 2003 is a far more meaningful name than Microsoft Office 11. And Firefox 2007 would be a much better name than Firefox 2.0 for all the same reasons.

But version numbers live on, at least for programmers. Here's a quick survey of version numbers for the software running on my machine at the moment:

7.0.6000.16386
8.1.0178.00
11.11
2.7.0.0
2.5.10 / build 6903
2.0 build 0930
0122.1848.2579.33475
2.0.50727.312
2.0.0.1
1.8.20061.20418

As you can see, there's not even a commonly accepted pattern for version numbers. In .NET, the version number convention is:

(Major version).(Minor version).(Build number).(Revision number)
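
For example, the "2.0.50727.312" entry in my list above -- which looks like the .NET 2.0 runtime -- breaks down as major version 2, minor version 0, build 50727, revision 312.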

But it's hardly universal. And even if it were, what does all this meticulously numbered version data get us? What does it mean? Why have version numbers at all? It's partly because version numbers are an expected software convention. And partly because programmers never met a piece of arbitrarily detailed metadata they didn't love. Personally, I like to think of version numbers as dogtags for your software. Like dogtags, they're primarily designed for use in the event of an emergency.

dogtags

In the event of a software problem-- if, on the battlefield, you hear someone screaming "medic!"-- it is useful to consult the dogtags so you know exactly what version of the software you're dealing with.

But software version numbers, even arbitrarily detailed programmer version numbers, can't seem to avoid dates, either. Jensen Harris explains the Microsoft Office version numbering scheme:

The most interesting thing to watch for is the first 4-digit number you encounter. In the examples above, 5608 and 3417. These are what we refer to as the "build number." Every few days during the development cycle, we compile all of the code in Office and turn it into a "build": essentially an installable version of all the work everyone's done up until that point. Eventually, a build becomes "final" and that is the one that ends up on CDs and in the store.

The 4-digit build number is actually an encoded date which allows you to tell when a build was born. The algorithm works like this:

  • Take the year in which a project started. For Office "12", that was 2003.
  • Call January of that year "Month 1."
  • The first two digits of the build number are the number of months since "Month 1."
  • The last two digits are the day of that month.

So, if you have build 3417, you would do the following math: "Month 1" was January 2003. "Month 13" was January 2004. "Month 25" was January 2005. Therefore, "Month 34" would be October 2005. 3417 = October 17, 2005, which was the date on which Office 12 build 3417 started. For Office 2003 and XP both, "Month 1" was January 1999. So, the final build of Office 2003, 5608, was made on August 8, 2003.
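
That decoding rule is simple enough to capture in a few lines of code. Here's a quick sketch; the function name is mine, not Microsoft's:

from datetime import date

def decode_build_number(build, month1_year):
    # First two digits: months elapsed since "Month 1" (January of month1_year).
    # Last two digits: the day of that month.
    months = build // 100
    day = build % 100
    year = month1_year + (months - 1) // 12
    month = (months - 1) % 12 + 1
    return date(year, month, day)

# Office "12" started counting from January 2003, so build 3417 decodes to:
print(decode_build_number(3417, 2003))  # 2005-10-17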

So Microsoft Office version numbers end up containing three relevant bits of data:

  1. the software generation (Office 97, Office XP, Office 2003, Office 2007), which is patently obvious to anyone using the software-- and can be directly inferred from the build date anyway.
  2. the date of the build.
  3. the number of builds done after "code freeze".

Of those three, how many are actually useful to users? How many are useful to developers?

On the whole, I encourage software developers to avoid confounding users with version numbers. That's what leads to crappy ideas like SID 6.7 and even crappier movies like Virtuosity. We brought it on ourselves by letting our geeky, meaningless little construct of major and minor version numbers spill over into pop culture. It's not worth it. Let's reel it back in.

Whenever possible, use simple dates instead of version numbers, particularly in the public names of products. And if you absolutely, positively must use version numbers internally, make them dates anyway: be sure to encode the date of the build somewhere in your version number.
