Coding Horror

programming and human factors

Parsing HTML The Cthulhu Way

Among programmers of any experience, it is generally regarded as A Bad Idea™ to attempt to parse HTML with regular expressions. How bad of an idea? It apparently drove one Stack Overflow user to the brink of madness:

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.

Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions.

Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes.

(The Unicode action in the post, not shown here, is the best part of the gag.)

That's right, if you attempt to parse HTML with regular expressions, you're succumbing to the temptations of the dark god Cthulhu's … er … code.

[Image: kraken-cthulhu.jpg]

This is all good fun, but the warning here is only partially tongue in cheek, and it is born of a very real frustration.

I have heard this argument before. Usually, I hear it as justification for seeing something like the following code:
# pull out data between <td> tags
($table_data) = $html =~ m{<td>(.*?)</td>}gis;

"But, it works!" they say.

"It's easy!"

"It's quick!"

"It will do the job just fine!"

I berate them for not being lazy. You need to be lazy as a programmer. Parsing HTML is a solved problem. You do not need to solve it. You just need to be lazy. Be lazy, use CPAN and use HTML::Sanitizer. It will make your coding easier. It will leave your code more maintainable. You won't have to sit there hand-coding regular expressions. Your code will be more robust. You won't have to bug fix every time the HTML breaks your crappy regex.

For many novice programmers, there's something unusually seductive about parsing HTML the Cthulhu way instead of, y'know, using a library like a sane person. Which means this discussion gets reopened almost every single day on Stack Overflow. The above post from five years ago could be a discussion from yesterday. I think we can forgive a momentary lapse of reason under the circumstances.
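
To make the "use a library" advice concrete, here is a minimal sketch in Python rather than the Perl of the quote above, using only the standard library's html.parser module. The names CellCollector, cells_via_parser, cells_via_regex, and SAMPLE are made up for illustration, and the sample markup is an assumption, not anything from the original discussion. It shows how a naive <td> regex can silently miss cells that a real parser picks up.

```python
import re
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, whatever attributes it carries."""

    def __init__(self):
        super().__init__()
        self.cells = []    # text of each finished cell, in closing order
        self._stack = []   # one text buffer per currently open <td>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TD> and <td> look the same here.
        if tag == "td":
            self._stack.append([])

    def handle_endtag(self, tag):
        if tag == "td" and self._stack:
            self.cells.append("".join(self._stack.pop()).strip())

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._stack:
            self._stack[-1].append(data)


def cells_via_parser(html):
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    return collector.cells


def cells_via_regex(html):
    # The naive pattern from the quote: only matches a bare, attribute-free <td>.
    return re.findall(r"<td>(.*?)</td>", html, flags=re.I | re.S)


# Hypothetical markup: one cell has an attribute, the other a stray space.
SAMPLE = '<table><tr><td class="name">Ada</td><TD >Lovelace</TD></tr></table>'

print(cells_via_parser(SAMPLE))
print(cells_via_regex(SAMPLE))
```

The regex isn't wrong for the one page it was written against; it just has no idea that attributes, stray whitespace, or nesting exist, and it fails without complaint when they show up.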

Like I said, this is a well understood phenomenon in most programming circles. However, I was surprised to see a few experienced programmers in MetaFilter comments actually defend the use of regular expressions to parse HTML. I mean, they've heeded the Call of Cthulhu … and liked it.

Many programs will neither need to, nor should, anticipate the entire universe of HTML when parsing. In fact, designing a program to do so may well be a completely wrong-headed approach, if it changes a program from a few-line script to a bullet-proof commercial-grade program which takes orders of magnitude more time to properly code and support. Resource expenditure should always (oops, make that very frequently, I about overgeneralized, too) be considered when creating a programmatic solution.

In addition, hard boundaries need not always be an HTML-oriented limitation. They can be as simple as "work with these sets of web pages", "work with this data from these web pages", "work for 98% users 98% of the time", or even "OMG, we have to make this work in the next hour, do the best you can".

We live in a world full of newbie PHP developers doing the first thing that pops into their collective heads, with more born every day. What we have here is an ongoing education problem. The real enemy isn't regular expressions (or, for that matter, goto), but ignorance. The only crime being perpetrated is not knowing what the alternatives are.

So, while I may attempt to parse HTML using regular expressions in certain situations, I go in knowing that:

  • It's generally a bad idea.
  • Unless you have discipline and put very strict conditions on what you're doing, matching HTML with regular expressions rapidly devolves into madness, just how Cthulhu likes it.
  • I had what I thought to be good, rational, (semi) defensible reasons for choosing regular expressions in this specific scenario.

It's considered good form to demand that regular expressions be considered verboten, totally off limits for processing HTML, but I think that's just as wrongheaded as demanding every trivial HTML processing task be handled by a full-blown parsing engine. It's more important to understand the tools, and their strengths and weaknesses, than it is to knuckle under to knee-jerk dogmatism.

So, yes, generally speaking, it is a bad idea to use regular expressions when parsing HTML. We should be teaching neophyte developers that, absolutely. Even though it's an apparently neverending job. But we should also be teaching them the very real difference between parsing HTML and the simple expedience of processing a few strings. And how to tell which is the right approach for the task at hand.

Whatever method you choose – just don't leave the <cthulhu> tag open, for humanity's sake.

Whitespace: The Silent Killer

Ever have one of those days where everything you check into source control is wrong?

Also, how exactly is that day different from any other? But seriously.

Code that is visible is code that can be wrong. No surprise there. But did you know that even the code you can't see may be wrong, too?

These are the questions that drive young programmers to madness. Take this perfectly innocent code, for example.

[Image: code-whitespace-invisible.png]

Looks fine, doesn't it? But hold on. Wait a second. Let's take another, closer look.

[Image: code-whitespace-visible.png]

OH. MY. GOD!

If you're not a programmer, you may be looking at these two images and wondering what the big deal is. That's fine. But I humbly submit that, well, you're not one of us. You don't appreciate what it's like to spend every freaking minute of every freaking day agonizing over the tiniest details of the programs you write. Not because we want to, you understand, but because the world explodes when we don't.

I mean that literally. Well, almost. If one semicolon is out of place, everything goes sideways. That's how programming works. It's fun! Sometimes! I swear!

We got into this industry because, quite frankly, we are control freaks. It's who we are. It's what we do. Now imagine, to our dismay, that there's all this stupid, useless whitespace at the ends of our lines. Stuff that's there, but we can't see it. Well, those are the nightmares OCD horror movies are made of. I have a full-body itchiness just talking about it.

Depending on how far down the rabbit-hole you want to go, there's any number of things you could do here:

  • Have a post-build step, perhaps something with a regular expression like \s*?$ in it, that auto-cleans extra spaces checked into source control
  • Execute a local macro which removes whitespace from ends of lines
  • Have a special rule to highlight extra spaces
  • Run your IDE in whitespace-always-visible mode, or toggle it frequently
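
The first two options above can be sketched in a few lines of Python. This is only an illustration, not the tooling I actually use: the function names are invented, and the pattern is a slightly stricter cousin of the one in the list, limited to spaces and tabs so it never eats the line endings themselves.

```python
import re

# Spaces or tabs that sit directly before a line ending (LF or CRLF)
# or before the very end of the text.
TRAILING_WS = re.compile(r"[ \t]+(?=\r?\n|\Z)")


def strip_trailing_whitespace(text):
    """Return text with trailing spaces and tabs removed from every line."""
    return TRAILING_WS.sub("", text)


def lines_with_trailing_whitespace(text):
    """Return the 1-based numbers of lines that end in a space or tab."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line != line.rstrip(" \t")
    ]
```

The second function is the "highlight" flavor: it reports the offending lines without touching them, which is handy if you'd rather fail a check-in than rewrite files behind someone's back.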

OK, fine, so maybe the world won't explode if there are a few extra bits of whitespace in my code.

But all the same, I think I'll go back and make extra double plus sure no more of that pesky whitespace has accumulated in my code when I wasn't looking. Just because I can't see it doesn't mean it's not out to get me.

Preserving Our Digital Pre-History

I've spent a significant part of my life online. Not just on the internet, I mean, but on modems and early, primitive online communities. Today's internet is everything we couldn't have possibly dared to imagine twenty-five years ago, but there is a real risk of these early, tentative digital artifacts -- and for some, the beginnings of our Hacker Odyssey -- being lost forever in the relentless deluge of online progress. Sure, every single thing that happened in 2004 is documented exhaustively online. But 1994? 1984? Not so much.

That's where Jason Scott comes in.

You may know Jason Scott from BBS The Documentary. Or, perhaps you're familiar with textfiles.com, his massive (and growing) archive of what passed for blogs and forums in the earliest online era.

A wonderful thing happened in the 1980s: Life started to go online. And as the world continues this trend, everyone finding themselves drawn online should know what happened before, to see where it all really started to come together and to know what went on, before it's forgotten.

When a historian or reporter tries to capture the feelings and themes that proliferated through the BBS Scene of the early 1980's, the reader nearly always experiences a mere glimpse of what went on. This is probably true of most any third-party reporting, but when the culture is your own, and when the experiences were your own, the gap between story and reality is that much wider, and it's that much harder to sit back and let the cliche-filled summary become "The Way It Was." You want to do something, anything so that the people who stumble onto the part of history that was yours know what it was like to grow up through it, to meet the people you did, to do the things you enjoyed doing. Maybe, you hope, they might even see the broader picture and the conclusions that you yourself couldn't see at the time. This is history the way the chronicled want it to be.

Jason is nothing less than our generation's digital historian in residence. When GeoCities went permanently offline a week ago, he was there to help preserve it for posterity.

[Image: bbs-documentary.png]

BBS: The Documentary was a major milestone in his ongoing effort to document our digital pre-history. But it's only the beginning; there's also a huge documentary on text adventures, Get Lamp, that's been in the works for a few years now. Unfortunately, progress has been slow. Because while being a digital historian is great, it's not exactly something you get paid to do.

But maybe we can change that. Witness Jason's Kickstarter proposal:

Throughout all this, I had a day job - computer administration. It paid well, but I paid for it with my health. When my most recent employer and I parted ways, I decided I'd take this time to finish some of the bigger projects I've been working on.

I suddenly thought back to Kickstarter and got this crazy idea - what if I simply asked the world and fans to contribute a bit of money towards keeping me somewhat solvent, and give me the opportunity to go full-time with computer history? If I was able to get all these things done over the years, what if I just asked people to subscribe or give me some patronage and in return I fill their free time with cool stuff to look at, learn from, and enjoy?

There are so many people whose online presences I greatly admire. But very few of them will go on to become part of the permanent written history of this era. I have no doubt whatsoever that Jason Scott is one of those people who will, thanks to his tireless efforts to preserve the flotsam and jetsam of our digital past, stuff that would otherwise be overlooked by the mainstream and lost forever.

I've pledged $100. It is an honor to support his ongoing work of preserving our shared digital pre-history. His history, is my history, is our history. A history of geeks, dorks, dweebs, nerds, and generally computer-obsessed misfits, but nonetheless -- it's something we all share.

If this is something you believe in, I urge you to pledge as well.

Stack Overflow Careers: Amplifying Your Awesome

That Stack Overflow thing we launched a year ago? It's been going pretty well so far.

Of course, everyone knows you could code Stack Overflow in a long weekend. It's trivial. Assembling a worldwide community of smart, engaged software developers? That's a whole different ball of wax. Stack Overflow is a site by programmers, for programmers; it's only as good as the programmers who choose to participate.

Stack Overflow isn't about me. Or anybody else on the Stack Overflow team for that matter.

Stack Overflow is you.

This is the scary part, the great leap of faith that Stack Overflow is predicated on: trusting your fellow programmers. The programmers who choose to participate in Stack Overflow are the "secret sauce" that makes it work. You are the reason I continue to believe in developer community as the greatest source of learning and growth. You are the reason I continue to get so many positive emails and testimonials about Stack Overflow. I can't take credit for that. But you can.

I learned the collective power of my fellow programmers long ago writing on Coding Horror. The community is far, far smarter than I will ever be. All I can ask – all any of us can ask – is to help each other along the path.

I am continually humbled by the skill and expertise of the programmers who volunteer time to Stack Overflow. These programmers graciously donate tiny slivers of their day to help us -- and themselves -- become better programmers. These 5 and 10 minute slices of effort, across hundreds of thousands of questions and answers, become a permanently archived (and creative commons wiki licensed) bread crumb content trail for future programmers to follow, edit, and contribute to themselves over time.

I'm thrilled to see Stack Overflow working so well for both askers and answerers; the "pay it forward" model of programmers helping their peers is exactly what we were shooting for. We'll never change the world, but it sure is nice to be able to improve our small corner of it just a little bit. Remember: bad code that isn't written is bad code that another poor programmer won't have to debug. If we don't reach out to help new programmers and teach them the lessons we learned the hard way, who will? I'm only exaggerating a little when I say that the future of our entire profession depends on it.

If you're actively participating on Stack Overflow, we now have another way to convert those slices of effort into something that actively furthers your professional goals – Stack Overflow Careers.

Stack Overflow Careers

What is careers.stackoverflow.com? It's a few things:

  • a completely free, public CV hosting service for programmers, to share the cool stuff you've coded and created with the world.
  • a way to explicitly link your Stack Overflow profile with your CV, to provide concrete examples of your communication skills and individual expertise to anyone who is interested.
  • a better way to connect great programmers with the best programming jobs, for those who opt into the small annual listing fee.

In short, Stack Overflow Careers amplifies your awesome.

I won't lie to you. This is also a business. That's why there are nominal opt-in listing fees for those programmers interested in seeking employment, and substantial fees for hiring managers who want to tap into the smart developers who grok Stack Overflow.

Update: I apologize if I wasn't clear. It is 100% free, forever, to create a public CV, put whatever HTML content you want in it, and link it to your Stack Overflow profile.

These are of course freely indexable and searchable on the web.

Beyond the free public component, there is a private (and completely optional) subscription component. For those programmers actively seeking employment, a small annual subscription fee allows inclusion in a private employer search UI. This is also explained in the faq and about.

That said, we're also trying to do something a bit different here. Something better than the endless, mind-numbing acronym sea of monster.com, dice.com, et al. Joel and I believe current hiring practices for programmers are incredibly broken. We think we can do better.

[Image: dilbert-interview.png]

We love our work, and so should you. Our goal isn't to put warm bodies in front of interviewers. Our goal is to create love connections. Instead of avid programmers pursuing disinterested and distracted companies, it's the other way around -- savvy companies who understand the competitive advantages of having the best programmers will pursue you. We connect smart, engaged hiring managers who "get it" with top programmers who love to code.

[Image: computer-engineers-number-puzzle.jpg]

If you love to code, too, I encourage you to create your own Stack Overflow CV. Keep it private, or make it public via the URL of your choice -- it's completely free either way. If you think you might be actively looking for a job in the next 3 years, take advantage of our outrageously low promotional pricing of $29 for a 3 year filing. That way, at any point in those 3 years, you can flip a switch and become visible to hiring managers. Or not. It's totally up to you.

(also, if you're hiring, and your company appreciates top software engineers -- and you think you can convince our tough audience of that -- email us)

Revisiting "The Fold"

After I posted my blog entry on Treating User Myopia I got a lot of advice. Some useful, some not so useful. But the one bit of advice I hadn't anticipated was that we were not making good use of the area "above the fold". This surprised me. Does the fold still matter?

The fold refers to the border at the bottom of the browser window at the user's default screen resolution. Like so:

[Image: the-fold-nytimes.png]

Way back in the dark ages of 1996, it was commonly thought that users didn't know how to scroll a web page.

On the Web, the inverted pyramid becomes even more important since we know from several user studies that users don't scroll, so they will very frequently be left to read only the top part of an article.

Thus, it was critically important to cram as much content as possible above that fold, as anything below it was invisible to a huge number of users. They didn't know how to scroll, so they would never find it. Jakob Nielsen, renowned usability expert, is the author of the above quote. But he recanted his position in 2003:

In 1996, I said that "users don't scroll." This was true at the time: many, if not most, users only looked at the visible part of the page and rarely scrolled below the fold. The evolution of the Web has changed this conclusion. As users got more experience with scrolling pages, many of them started scrolling.

Scrolling is an example of usability versus learnability. It was always my belief that users quickly learned to scroll, otherwise they were permanently crippled as web citizens. If you can't learn to scroll within an hour or so of using the web, you're going to have an awfully stunted experience -- so much so that you're probably better off not using it at all. In short, if you use the web, you know how to scroll, almost by definition. It is a fundamental skill.

Even today, people will cite the ancient, irrelevant rule of The Fold as if it's still law. In fact, I was just talking to a friend of mine who expressed his frustration at dealing with a middle manager who was using the "content must be above the fold" rule as a weapon, and demanding that all page content appear above the fold. It's terribly misguided.

Although the old rule has been thoroughly debunked, the fold still poses some hidden dangers, and there is subtlety to how users react to it. As documented by a recent usability study on the fold, there are three specific pitfalls to watch out for:

  1. Don't cram everything in above the fold. Users will explore and find your content -- as long as the page "looks" scrollable.
  2. Watch out for stark, horizontal lines that happen to line up with the fold. This is the only factor that causes users to stop scrolling, because the page looks done and complete. Instead, have a small amount of content just visible, poking up above the fold. This encourages scrolling.
  3. Avoid in-page scroll bars. The standard browser scrollbar is an indicator of the amount of content on the page that users learn to rely on. Placing <iframe> and other elements with scroll bars on the page can break this convention -- and may lead to users not scrolling.

These are excellent guidelines, backed by actual eye tracking and experimental results. You know, science! But how do they apply to me? First, I established where the fold actually was. Per Google Analytics, about 25% of our users are using screen resolutions where the page fold is at about 700 or 800 pixels of height. And remember, browsers have a lot of horizontal chrome that tends to squander that height -- toolbars, status bars, tabs, etcetera. The fold is probably much closer than you think it is.
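
The arithmetic behind "closer than you think" is trivial, but it's worth writing down. Here's a back-of-the-envelope sketch in Python; the function names are invented, and the 768 pixel screen and 150 pixels of browser chrome in the usage line are assumed round numbers for illustration, not figures from our analytics.

```python
def fold_position(screen_height_px, chrome_height_px):
    """Vertical space, in pixels, left for the page once browser chrome is subtracted."""
    return max(screen_height_px - chrome_height_px, 0)


def is_above_fold(element_bottom_px, screen_height_px, chrome_height_px):
    """True if an element ending at element_bottom_px is fully visible without scrolling."""
    return element_bottom_px <= fold_position(screen_height_px, chrome_height_px)


# Assumed example values: a 768px-tall screen with 150px of toolbars, tabs, and status bar.
print(fold_position(768, 150))
```

Plug in the pixel offset of whatever you care about most on the page, and you have a rough answer to whether a given slice of your audience sees it without scrolling.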

Next, I looked at the advice I had been given regarding the top of the page. Sure enough, we had a bunch of irrelevant UI at the top that didn't really matter: things like redundant page titles, and a two-line title entry. We were wasting critical real estate at the top of the page! For the 25% of users who have a 700 or 800 pixel fold, items were pushed down far enough that they might not actually be visible. Worse still, the strong bottom border of the text entry area with the drag slider could possibly align with the page fold itself -- leading the user to believe that nothing is below it, and so never to scroll.

It's not only a basic rule of writing, it's also a basic rule of the web: put the most important content as close to the top of the page as you can. This isn't new advice, but it's so important that it never hurts to revisit it periodically in your own designs.

In treating user myopia, it's not enough to place important stuff directly in the user's eyepoint. You also need to ensure that you've placed the absolute most important stuff at the top of the page -- and haven't created any accidental barriers to scrolling, so they can find the rest of it. The fold is far less important than it used to be, but it isn't as mythical as Bigfoot and the Loch Ness Monster quite yet.
