XML: The Angle Bracket Tax

Everywhere I look, programmers and programming tools seem to have standardized on XML. Configuration files, build scripts, local data storage, code comments, project files, you name it -- if it's stored in a text file and needs to be retrieved and parsed, it's probably XML. I realize that we have to use something to represent reasonably human readable data stored in a text file, but XML sometimes feels an awful lot like using an enormous sledgehammer to drive common household nails.

I'm deeply ambivalent about XML. I'm reminded of this Winston Churchill quote:

It has been said that democracy is the worst form of government except all the others that have been tried.

XML is like democracy. Sometimes it even works. On the other hand, it also means we end up with stuff like this:

<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
<SOAP-ENV:Body>
<m:GetLastTradePrice xmlns:m="Some-URI">
<symbol>DIS</symbol>
</m:GetLastTradePrice>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>

How much actual information is communicated here? Precious little, and it's buried in an astounding amount of noise. I don't mean to pick on SOAP. This blanket criticism applies to XML, in whatever form it appears. I spend a disproportionate amount of my time wading through an endless sea of angle brackets and verbose tags desperately searching for the vaguest hint of actual information. It feels wrong.

You could argue, like Derek Denny-Brown, that XML has been misappropriated and misapplied.

I find it so interesting that XML has become so popular for such things as SOAP. XML was not designed with the SOAP scenarios in mind. Other examples of popular scenarios which deviate XML's original goals are configuration files, quick-n-dirty databases, and [RSS]. I'll call these 'data' scenarios, as opposed to the 'document' scenarios for which XML was originally intended. In fact, I think it is safe to say that there is more usage of XML for 'data' scenarios than for 'document' scenarios, today.

Given its prevalence, you might decide that XML is technologically terrible, but you have to use it anyway. It sure feels like, for any given representation of data in XML, there was a better, simpler choice out there somewhere. But it wasn't pursued, because, well, XML can represent anything. Right?

Consider the following XML fragment:

<memo date="2008-02-14">
<from>
<name>The Whole World</name><email>us@world.org</email>
</from>
<to>
<name>Dawg</name><email>dawg158@aol.com</email>
</to>
<message>
Dear sir, you won the internet. http://is.gd/fh0
</message>
</memo>

Because XML purports to represent everything, it ends up representing nothing particularly well.

Wouldn't this information be easier to read and understand -- and only nominally harder to parse -- when expressed in its native format?

Date: Thu, 14 Feb 2008 16:55:03 +0800 (PST)
From: The Whole World <us@world.org>
To: Dawg <dawg158@aol.com>
Dear sir, you won the internet. http://is.gd/fh0

You might argue that XML was never intended to be human readable, that XML should be automagically generated via friendly tools behind the scenes, never exposed to a single living human eye. It's a spectacularly grand vision. I hope one day our great-grandchildren can live in a world like that. Until that glorious day arrives, I'd sure enjoy reading text files that don't make me suffer through the XML angle bracket tax.

So what, then, are the alternatives to XML? One popular choice is YAML. I could explain it, but it's easier to show you. Which, I think, is entirely the point.

<club>
<players>
<player id="kramnik"
name="Vladimir Kramnik"
rating="2700"
status="GM" />
<player id="fritz"
name="Deep Fritz"
rating="2700"
status="Computer" />
<player id="mertz"
name="David Mertz"
rating="1400"
status="Amateur" />
</players>
<matches>
<match>
<Date>2002-10-04</Date>
<White refid="fritz" />
<Black refid="kramnik" />
<Result>Draw</Result>
</match>
<match>
<Date>2002-10-06</Date>
<White refid="kramnik" />
<Black refid="fritz" />
<Result>White</Result>
</match>
</matches>
</club>
players:
Vladimir Kramnik: &kramnik
rating: 2700
status: GM
Deep Fritz: &fritz
rating: 2700
status: Computer
David Mertz: &mertz
rating: 1400
status: Amateur
matches:
-
Date: 2002-10-04
White: *fritz
Black: *kramnik
Result: Draw
-
Date: 2002-10-06
White: *kramnik
Black: *fritz
Result: White

There's also JSON notation, which some call the new, fat-free alternative to XML, though this is still hotly debated.

You could do worse than XML. It's a reasonable choice, and if you're going to use XML, then at least learn to use it correctly. But consider:

  1. Should XML be the default choice?
  2. Is XML the simplest possible thing that can work for your intended use?
  3. Do you know what the XML alternatives are?
  4. Wouldn't it be nice to have easily readable, understandable data and configuration files, without all those sharp, pointy angle brackets jabbing you directly in your ever-lovin' eyeballs?

I don't necessarily think XML sucks, but the mindless, blanket application of XML as a dessert topping and a floor wax certainly does. Like all tools, it's a question of how you use it. Please think twice before subjecting yourself, your fellow programmers, and your users to the XML angle bracket tax. <CleverEndQuote>Again.</CleverEndQuote>