Coding Horror

programming and human factors

The Trouble with PDFs

Adobe's Portable Document Format is so advanced it makes you wonder why anyone bothers with primitive HTML. It's a completely vector-based layout format, both display and resolution independent. With PDF, you sacrifice almost nothing compared to traditional book and magazine layouts except the obvious limitation of resolution. Here's Kevin Kelly extolling the virtues of PDFs:

A PDF is able to retain the highly evolved grammar, design and syntax that one thousand years of bookmaking has attained. Because of the idiosyncratic way web browsers work, designers do not have full control of what you as a reader see on the web. The web page, including its fonts, font sizes, and placement of material and size of the window, partly depends on the viewer's preferences. In my experience as a reader, a web designer, and a book designer, the reading experience on paper -- and PDFs -- is much more refined and elegant. As a publisher and designer I can direct the flow of attention with better tools (font choices, rules, lines, columns) and better control. The benefit to me as a reader is that this sophisticated design translates into increased clarity, smoothness, comprehension, and enjoyment.

But I have a problem with PDF files.

  1. Every time I link to a PDF, I have to tag the link (pdf) to indicate that the hyperlink will whisk you away, not to another web page as you might expect, but to a strange, otherworldly out-of-browser experience.
  2. Links to PDF files assume the user has a PDF viewer installed. Do they? And how will the link be handled? As in situ navigation, presenting the user with a weird new set of PDF controls? Or as an undesirable popup window? Browser support for PDF is so weird there are entire PDF add-ons to deal with it.
  3. The layout better be mind-blowingly good to justify the use of the PDF format. For most of the PDFs I encounter, the information could have been presented in HTML and CSS markup with almost no aesthetic loss at all. The "refined, elegant, sophisticated design" offered by PDF is often wasted.
  4. You might argue that PDFs make sense as a secondary, print-optimized version of existing HTML content. But why not stick to one version of the content? Why repeat ourselves? Do we really want to maintain two different versions of the same content?

I'm not the first person to note the usability problems of PDF, but I consider this a classic case of worse is better. The advantages of PDF rarely outweigh the many disadvantages compared to plain old HTML. I suppose relying on PDF was more defensible in 2001, when browser printing support was notoriously poor, and HTML layout was not well understood. But it's 2008. I'm surprised how many authors still reach for the safety blanket of PDF when they and their audience would be much better served with modern HTML.

The other problem with PDFs is a bit more subtle. A PDF is not merely a PDF; it's a statement. An implicit protest against the terrible limitations of the HTML used by the unwashed masses. PDF content yearns to be free of the constraints of common HTML-- this content, you see, signifies something:

It seems that the PDF format signifies something now, and it's something more than just user inconvenience. In addition to requiring the user to shift mental modes, ("I'm seeing something designed as a PDF now, this must be serious information...") the requirement that a document either be downloaded or viewed in a context that's radically different from standard web pages seems like a subtle assertion of authority by a document's creator. The decision to switch from standard HTML to PDF isn't arbitrary, but it isn't based on technical requirements either. It's based on the value that an author wants to assign to the work, and it benefits from the still-prevalent, though rapidly fading, consensus that print work is somehow more inherently valuable and authoritative than web pages and other online content.

The massive inconvenience of PDF for the user rarely outweighs the minor HTML injustices righted through the power of PDF layout. Consider Kevin Kelly's own True Films 3.0 PDF:

True Films 3.0 PDF screenshot

Kevin went to the trouble of packaging this content up as a PDF, even adding Adobe's brand new support for contextual PDF advertising. All in the name of better formatting. But I don't see any advanced formatting here! Everything in that PDF would render perfectly as HTML. And it'd be better as HTML: easier to hyperlink and search, more accessible to a wide audience, and it would certainly generate greater advertising revenue through the existing web ad ecosystem.

I don't dispute Mr. Kelly's taste in movies for a second. And I worship at the altar of his Cool Tools. But I'll never understand how the founding editor of Wired could fall prey to such shallow PDF elitism-- completely missing the obvious and inherent power of the world's HTML common denominator.

Discussion

An Inalienable Right to Privacy

Privacy has always been a concern on the internet. But as more and more people let it all hang out on the many social networking websites popping up like weeds all over the web, there's much more at risk. Every other week, it seems, I'm reading about some new privacy gaffe. Last month, it was Facebook's Beacon opt-out policy; this week, it's Google Reader sharing private data. The privacy problems just keep piling up as more people tune in and turn on.

Nearly a decade ago, Sun Microsystems CEO Scott McNealy snapped out a warning to the worriers of the Internet Age: "You don't have any privacy. Get over it." McNealy's words look more prescient every year. In 2006, AOL unwittingly divulged the personal lives of 650,000 customers by publishing their search histories as research data. Despite AOL's attempts to anonymize the info, the New York Times quickly outed a 62-year-old lady in Georgia whose searches revealed her dog was wetting the upholstery. The Justice Department has subpoenaed Google, Yahoo!, MSN, and AOL for lists of search queries. More recently, Facebook employees were caught reading the customer logs.

Nothing warms the cockles of a user's heart quite like the tender mercies of your friendly neighborhood CEO. That privacy stuff you're so worried about? Get over it! You might wonder if Mr. McNealy has the same glib attitude towards the privacy of himself and his own family. Only criminals have stuff to hide, right? Here's Bruce Schneier's take on the value of privacy:

Last week, revelation of yet another NSA surveillance effort against the American people has rekindled the privacy debate. Those in favor of these programs have trotted out the same rhetorical question we hear every time privacy advocates oppose ID checks, video cameras, massive databases, data mining, and other wholesale surveillance measures: "If you aren't doing anything wrong, what do you have to hide?"

Let's look in this closet

Some clever answers: "If I'm not doing anything wrong, then you have no cause to watch me." "Because the government gets to define what's wrong, and they keep changing the definition." "Because you might do something wrong with my information." My problem with quips like these -- as right as they are -- is that they accept the premise that privacy is about hiding a wrong. It's not. Privacy is an inherent human right, and a requirement for maintaining the human condition with dignity and respect.

I promote openness and making things public. Not everything, of course; just the good and publicly useful sections you've culled from the repertoire of your life. If you don't consider any part of your life worthy of public consumption in any form, are you really doing anything?

Even as a proponent of selectively exhibiting parts of your life in public, there's a huge part of my life that's private. I didn't realize it, but I've relied on privacy through obscurity until now. My life is so utterly mundane that I can't imagine anyone caring what I do, what I buy, what I read, and who I talk to. I thought privacy was overrated. I certainly never considered privacy a basic human right, on par with life, liberty, and the pursuit of happiness. But it is.

Too many wrongly characterize the debate as "security versus privacy." The real choice is liberty versus control. Tyranny, whether it arises under threat of foreign physical attack or under constant domestic authoritative scrutiny, is still tyranny. Liberty requires security without intrusion, security plus privacy. Widespread police surveillance is the very definition of a police state. And that's why we should champion privacy even when we have nothing to hide.

If power corrupts, then access to a pure, unfettered stream of data on every American corrupts absolutely. The default strategy of privacy through obscurity may have worked in the hodgepodge, sporadically digital worlds of the '80s and '90s. Not anymore. Now that so much of the world is online or stored in a vast database somewhere, all those tiny digital artifacts of who you are and what you do can be woven into a complete tapestry of your life. And you better believe it will be, because it makes some people a lot of money.

So what can we do about it? Is privacy possible in the digital age?

The truth is, fighting to protect privacy is a quixotic venture. Sure, there are any number of technologies, techniques and work-arounds you can employ, all in the effort to protect your privacy. But such a quest is like trying to dig a hole in the middle of a fast flowing river. The rich and powerful gain some amount of privacy only because they can afford to gird their personal lives with a kind of digital body armor.

Garfinkel says we need to rethink privacy in the 21st Century. "It's not about the man who wants to watch pornography in complete anonymity over the Internet. It's about the woman who's afraid to use the Internet to organize her community against a proposed toxic dump - afraid because the dump's investors are sure to dig through her past if she becomes too much of a nuisance."

I'm with Bruce on this one. Demand privacy even if you don't think you need it. Consider that the next time you sign up for some new social networking service, or a grocery discount card, or give out your telephone or social security number for some trivial reason. Neglecting to protect our right to privacy is, in effect, giving up on privacy altogether. And that's not a world I want to live in. Openness is important-- but so is privacy, in equal measure. I believe we can have both, but not without active effort on our part.


Modern Logo

Leon recently posted a link to a great blog entry on rediscovering Logo. You know, Logo -- the one with the turtle.

Berkeley Logo screenshot

I remember being exposed to Logo way back in high school. All I recall about Logo is the turtle graphics, and the primitive digital Etch-a-Sketch drawings you could create with it. What I didn't realize is that Logo is "an easier to read adaptation of the Lisp language... [with] significant facilities for handling lists, files, I/O, and recursion", at least if the Wikipedia entry on Logo is to be believed.
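For readers who never met the turtle, here is a minimal sketch of the idea in Python rather than Logo. The `trace` function and its command format are hypothetical, invented for this illustration; it only computes the points a turtle would visit and draws nothing.

```python
import math


def trace(commands, start=(0.0, 0.0), heading=0.0):
    """Follow Logo-style commands and return the points the turtle visits.

    `commands` is a list of ("forward", distance) or ("right", degrees) pairs.
    Heading is in degrees, with 0 pointing along +x; "right" turns clockwise.
    """
    x, y = start
    points = [(x, y)]
    for name, amount in commands:
        if name == "forward":
            x += amount * math.cos(math.radians(heading))
            y += amount * math.sin(math.radians(heading))
            # Round away floating-point noise so corners compare cleanly.
            points.append((round(x, 6), round(y, 6)))
        elif name == "right":
            heading = (heading - amount) % 360
        else:
            raise ValueError(f"unknown command: {name}")
    return points


# Logo's classic square: REPEAT 4 [FORWARD 100 RIGHT 90]
square = [("forward", 100), ("right", 90)] * 4

if __name__ == "__main__":
    print(trace(square))
```

The turtle ends where it started, which is the whole charm of the thing: position and heading are the only state, and every drawing is a list of relative moves.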

Although I was eternally fascinated with programming, Logo held no interest for me. It seemed like a toy language, only useful for silly little graphical tricks and stunts with the turtle. But apparently there was a real language lurking underneath all that turtle graphics stuff. Brian Harvey is a Berkeley professor who not only co-wrote Berkeley Logo, but authored three books that, amazingly, teach the whole of computer science using nothing but Logo.

If you have no time to skim the material, and you're still convinced Logo is a graphics language for little kids, check out a sample Logo program that Brian put together to impress us. I'm impressed, anyway.

Logo is much more than the thin wrapper over turtle graphics I thought it was in 1986. But turtle graphics still-- how shall I put this? -- suck. I took two new books with me over the holiday vacation, and both deal with something akin to the spiritual successor to Logo-- the Processing environment.

Processing: A Programming Handbook for Visual Designers and Artists   Visualizing Data

Both Processing: A Programming Handbook for Visual Designers and Artists and Visualizing Data paint a picture of the Processing environment that strongly reminds me of Logo. But Processing doesn't offer up a new Lisp syntax -- it sticks with good old-fashioned Java.

If we didn't care about speed, it might make sense to use Python, Ruby, or many other scripting languages. That is especially true on the education side. If we didn't care about making a transition to more advanced languages, we'd probably avoid a C++ or Java-style syntax. But Java is a nice starting point for a sketching language because it's far more forgiving than C/C++ and also allows users to export sketches for distribution via the Web.

The focus of the Processing environment is squarely on learning while doing, which is definitely one of the tenets of Logo.

If you're already familiar with programming, it's important to understand how Processing differs from other development environments and languages. The Processing project encourages a style of work that builds code quickly, understanding that either the code will be used as a quick sketch or that ideas are being tested before developing a final project. This could be misconstrued as software engineering heresy. Perhaps we're not far from "hacking", but this is more appropriate for the roles in which Processing is used. Why force students or casual programmers to learn about graphics contexts, threading, and event handling methods before they can show something on the screen that interacts with the mouse? The same goes for advanced developers: why should they always need to start with the same two pages of code whenever they begin a project?

In another scenario, if you're doing scientific visualization, the ability to try things out quickly is a far higher priority than sophisticated code structure. Usually you don't know what the outcome will be, so you might build something one week to try an initial hypothesis and build something new the next week based on what was learned in the first week.

It's an admirable philosophy, and it's especially appropriate for a domain-specific language. If you're interested in graphics and visualization -- if you're truly looking for a modern Logo-- leave the turtles behind and check out Processing instead.


Size Is The Enemy

Steve Yegge's latest, Code's Worst Enemy, is like all of his posts: rich, rewarding, and ridiculously freaking long. Steve doesn't write often, but when he does, it's a doozy. As I mentioned a year ago, I've started a cottage industry mining Steve's insanely great but I-hope-you-have-an-hour-to-kill writing and condensing it into its shorter form points. So let's begin:

  1. Steve began writing a multiplayer game in Java, Wyvern, around 1998. If you're curious what it looks like, see fan screenshots one and two.
  2. Over the last 9 years, Wyvern has grown to 500,000 lines of Java code.
  3. Steve realized that it is impossible for a single programmer to singlehandedly maintain and support half a million lines of code. Even if you're Steve Yegge.

There's much more, but I want to pause here for a moment. It is absolutely true that any programmer who personally maintains half a million lines of code is automatically in a pretty rarified club. Steve's right about this. Most developers will never have the superhuman privilege of personally maintaining 500k LOC or more. On any rational software development project, you'd have a team of developers working on it, or you'd open source the thing entirely to spread the effort across a community.

But here's what I don't understand:

I happen to hold a hard-won minority opinion about code bases. In particular I believe, quite staunchly I might add, that the worst thing that can happen to a code base is size.

So Steve believes the majority of developers, when encountering a code base approximately the size of the Death Star, would think:

I could totally build that.

It's a telling indicator of the impressively bearded computer scientist crowd that Steve runs with. They probably wear flip-flops to work, too. Amongst the programmers I know, the far more common-- and certainly more rational-- reaction to a code base that large would be to run away, screaming, as fast as they could. And I'd be right behind them.

I don't think you necessarily have to spend ten years writing 500k worth of fairly complicated Java code to independently reach the same conclusion. Size is the enemy. Simply going from 1k to 10k LOC-- assuming you're sufficiently self-aware as a programmer-- is more than enough of a glimpse into the maw of madness that lies beyond. Even if you've written zero lines of code, if you've ever read any Steve McConnell books, the size rule is pounded home, time and time again:

Project size is easily the most significant determinant of effort, cost and schedule [for a software project]. People naturally assume that a system that is 10 times as large as another system will require something like 10 times as much effort to build. But the effort for a 1,000,000 LOC system is more than 10 times as large as the effort for a 100,000 LOC system.
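To put a rough number on that diseconomy of scale, here is a sketch using the Basic COCOMO effort equation, effort = a × KLOC^b. This is an assumption on my part: McConnell's quote doesn't name a model, the coefficients below are the textbook "embedded" mode values, and real projects vary wildly.

```python
# Rough illustration only: Basic COCOMO (Boehm, 1981), "embedded" mode.
# These coefficients are assumed for illustration; they are not McConnell's numbers.
A = 3.6   # effort multiplier
B = 1.20  # an exponent above 1 is what makes effort grow faster than size


def estimated_effort(kloc: float) -> float:
    """Estimated effort in person-months for `kloc` thousand lines of code."""
    return A * kloc ** B


def effort_ratio(small_kloc: float, large_kloc: float) -> float:
    """How many times more effort the larger project needs than the smaller one."""
    return estimated_effort(large_kloc) / estimated_effort(small_kloc)


if __name__ == "__main__":
    ratio = effort_ratio(100, 1000)
    print(f"10x the code costs roughly {ratio:.1f}x the effort")
```

Under these assumed coefficients the ratio is 10^1.2, a bit under 16. The exact figure matters less than the shape of the curve: any exponent above 1 means every added line costs more than the one before it.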

One of the most fundamental and truly effective pieces of advice you can give a software development team-- any software development team-- is to write less code, by any means necessary. Break the project into smaller subprojects. Deliver it in complementary fragments. Try iterative development. Stop writing everything in assembly language and APL. Hire better programmers who naturally write less code. Buy code from a third party. Do absolutely whatever it takes to write as little code as possible, because the best code is no code at all.

We're not done yet. I warned you that this was a long post. Continuing from above:

  1. Because Java is a statically typed language, it requires lots of tedious, repetitive boilerplate code to get things done.
  2. That tedious, repetitive boilerplate code has been codified into Java faith as the seminal books "Design Patterns" and "Refactoring".
  3. Java developers fervently believe, almost to a man/woman, that IDEs can overcome the unavoidable LOC bloat of Java.
  4. A rewrite of Wyvern from Java into a dynamic language that runs on the JVM could reduce the raw code size by 50% to 75%.

Here's where Steve not-so-gently segues from "size is the problem" to "Java is the problem".

Bigger is just something you have to live with in Java. Growth is a fact of life. Java is like a variant of the game of Tetris in which none of the pieces can fill gaps created by the other pieces, so all you can do is pile them up endlessly.

Tetris: Game Over

Going back to our crazed Tetris game, imagine that you have a tool that lets you manage huge Tetris screens that are hundreds of stories high. In this scenario, stacking the pieces isn't a problem, so there's no need to be able to eliminate pieces. This is the cultural problem: [Java programmers] don't realize they're not actually playing the right game anymore.

Steve singles out Martin Fowler, who recently "abandoned" the static-language Java fold in favor of the dynamically typed Ruby. Fowler quite literally wrote the book on refactoring, so perhaps there's some truth to Steve's claim that the rigid architecture of classic, statically typed languages ultimately prevents you from refactoring the code down as far as you need to go. If Fowler can't refactor the Java pieces to fit, who can?

Bruce Eckel is another notable Java personality who apparently reached many of the same conclusions about Java years ago.

I can't quantify [the cost of strong typing]. I haven't been able to come up with a from-first-principles mathematical proof, probably because it depends on human factors, like how much time it takes to remember how to open a file and put the try block in the right places and remember how to read lines and then remember what you were really trying to accomplish by reading that file. In Python, I can process each line in a file by saying:

for line in open("FileName.txt"):
    pass  # Process line

I didn't have to look that up, or to even think about it, because it's so natural. I always have to look up the way to open files and read lines in Java. I suppose you could argue that Java wasn't intended to do text processing and I'd agree with you, but unfortunately it seems like Java is mostly used on servers where a very common task is to process text.
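In current Python the same loop is usually wrapped in a `with` block so the file gets closed for you. Here's a minimal sketch; the `count_nonblank_lines` helper is hypothetical, just a stand-in for "process line".

```python
import os
import tempfile


def count_nonblank_lines(path: str) -> int:
    """Count the lines that contain something other than whitespace."""
    count = 0
    # `with` closes the file even if the processing raises an exception.
    with open(path, encoding="utf-8") as handle:
        for line in handle:  # reads lazily, one line at a time
            if line.strip():
                count += 1
    return count


if __name__ == "__main__":
    # Write a small throwaway file so the example runs anywhere.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("first\n\nsecond\n")
    print(count_nonblank_lines(tmp.name))
    os.remove(tmp.name)
```

Still no lookup required, which is rather Eckel's point.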

Lines of code are, and always have been, the enemy. More lines of code means more to read, more to understand, more to troubleshoot, more to debug. But it is possible to go too far in the other direction as well. If you're not careful, you could end up playing yet another game entirely-- yes, you've cleverly avoided the trap of Java's infinitely tall Tetris, but have you slipped into Perl's Golf instead?

Perl "golf" is the pastime of reducing the number of characters used in a Perl program to the bare minimum, much as how golf players seek to take as few shots as possible in a round.

NES mario golf

It originally focused on the JAPHs used in signatures in Usenet postings and elsewhere, but the use of Perl to write a program which performed RSA encryption prompted a widespread and practical interest in this pastime. In subsequent years, code golf has been taken up as a pastime in other languages besides Perl.
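You don't need Perl to play golf. Here's a sketch in Python (my substitution, not anything from Steve's post) of the same FizzBuzz rule written twice: once golfed, once spelled out. The names `g` and `fizzbuzz` are made up for the illustration.

```python
# Golfed: short and correct, but the reader has to decode it.
g = lambda i: "Fizz" * (i % 3 < 1) + "Buzz" * (i % 5 < 1) or str(i)


def fizzbuzz(number: int) -> str:
    """Readable equivalent of `g`: the classic FizzBuzz rule for one number."""
    divisible_by_3 = number % 3 == 0
    divisible_by_5 = number % 5 == 0
    if divisible_by_3 and divisible_by_5:
        return "FizzBuzz"
    if divisible_by_3:
        return "Fizz"
    if divisible_by_5:
        return "Buzz"
    return str(number)


if __name__ == "__main__":
    # Both versions agree on every input tried; only the reading cost differs.
    assert all(g(i) == fizzbuzz(i) for i in range(1, 101))
    print([fizzbuzz(i) for i in (1, 3, 5, 15)])
```

The golfed version wins on line count by a mile. Whether it wins on anything else depends on who has to maintain it at 2 AM.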

In our war on verbosity, there's an inevitable tradeoff between verbosity and understandability. Steve acknowledges this by hinging his JVM language choice on what is "syntactically mainstream": JRuby, Groovy, Rhino (JavaScript), and Jython. I'll spoil the not-so-surprise ending for you: Steve is rewriting Wyvern in Rhino, and in the process he'll help bring Rhino up to spec with the forthcoming ECMAScript Edition 4 update to JavaScript. It's no magic bullet, but it seems like a reasonable compromise based on his goals.

So ends the epic ten year tale of Stevey and his merry band of Wyverneers. But where does that leave us? I have my opinions, naturally:

  • If you personally write 500,000 lines of code in any language, you are so totally screwed.
  • If you personally rewrite 500,000 lines of static language code into 190,000 lines of dynamic language code, you are still pretty screwed. And you'll be out a year of your life, too.
  • If you're starting a new project, consider using a dynamic language like Ruby, JavaScript, or Python. You may find you can write less code that means more. A lot of incredibly smart people like Steve present a compelling case that the grass really is greener on the dynamic side. At the very least, you'll learn how the other half lives, and maybe remove some blinders you didn't even know you were wearing.
  • If you're stuck using exclusively static languages, ask yourself this: why do we have to write so much damn code to get anything done-- and how can this be changed? Simple things should be simple, complex things should be possible. It's healthy to question authority, particularly language authorities.

Remember: size really is the enemy. Right after ourselves, of course.


Digital Certificates: Do They Work?

The most obvious badge of internet security is the "lock" icon. The lock indicates that the website is backed by a digital certificate, which is supposed to guarantee two things:

  1. This website is the real deal, not a fake set up by criminals to fool you.
  2. All data between your browser and that website is sent encrypted. Nobody in the middle can read any sensitive information you submit to that website, such as your credit card number.

Here's what PayPal looks like in Internet Explorer 7. The lock icon and green background of the address bar let us know that this website is backed by a digital certificate. Clicking on the lock provides additional detail about the certificate.

Certificate info in Internet Explorer 7

Here's PayPal in Firefox 2, which follows the same conventions. The address bar color changes, and the lock icon is present. Clicking on the lock produces a dialog with similar summary information.

Certificate info in Firefox 2.0

The summary is reasonable enough. The certificate authority institution, VeriSign, vouches that this site is indeed PayPal. One question I've always had, though, is this: who decided VeriSign is a trusted authority? There's some kind of whitelist built into IE and Firefox that blesses these certificate authorities with "root" status. According to Wikipedia, a 2007 survey identified 6 major certificate authorities:

  1. VeriSign (57.6%)
  2. Comodo (8.3%)
  3. GoDaddy (6.4%)
  4. DigiCert (2.8%)
  5. Network Solutions (1.3%)
  6. Entrust (1.1%)
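You can peek at a similar whitelist from code. The sketch below lists the root authorities that Python's `ssl` module trusts by default. Note the assumptions: this is the operating system's (or OpenSSL's) trust store, not the one inside IE or Firefox, and on some platforms the list comes back empty because roots are only loaded on demand. The `trusted_root_organizations` helper is a made-up name.

```python
import ssl


def trusted_root_organizations() -> list:
    """Return the sorted names of root CAs in the default trust store.

    May be empty on systems that load CA certificates lazily from a directory.
    """
    context = ssl.create_default_context()
    names = set()
    for cert in context.get_ca_certs():
        # The subject is a tuple of RDNs, each a tuple of (field, value) pairs.
        subject = {key: value for rdn in cert.get("subject", ()) for key, value in rdn}
        names.add(
            subject.get("organizationName")
            or subject.get("commonName")
            or "(unnamed)"
        )
    return sorted(names)


if __name__ == "__main__":
    roots = trusted_root_organizations()
    print(f"{len(roots)} trusted root organizations")
    for name in roots[:10]:
        print(" ", name)
```

Nobody asked you about any of these names. Somebody else decided they were trustworthy, and the decision shipped with your software.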

The certificate authority business has always struck me as an odd relationship, because it's completely commercial and superficial. Fork over your $300-$2,500, some nominal proof of your identity, and you're granted a certificate for a year. Does that imply trust? I'm not the only person with these concerns; Bruce Schneier has an excellent whitepaper which examines the risks of certification authorities and public-key infrastructure:

Certificates provide an attractive business model. They cost almost nothing to make, and if you can convince someone to buy a certificate each year for $5, that times the population of the Internet is a big yearly income. If you can convince someone to purchase a private CA and pay you a fee for every certificate he issues, you're also in good shape. It's no wonder so many companies are trying to cash in on this potential market. With that much money at stake, it is also no wonder that almost all the literature and lobbying on the subject is produced by PKI vendors. And this literature leaves some pretty basic questions unanswered: What good are certificates anyway? Are they secure? For what? In this essay, we hope to explore some of those questions.

The other problem with certificates is that, as an end user, it's nearly impossible to tell a good, valid certificate provided by a reputable certificate authority from a bad one. If we click through to examine the PayPal certificate details, we're presented with these three dense tabs:

Certificate dialog: General   Certificate dialog: Details   Certificate dialog: Path

I don't know about you, but none of that makes any sense to me. And I'm a programmer. Imagine the poor end user trying to make heads or tails of this. What does it all mean? Of course, most users simply won't pay attention -- it's questionable whether they'll even notice the presence of the lock icon and the color difference in the address bar.
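For the curious, here is roughly what those three tabs boil down to. This sketch uses Python's standard `ssl` module to fetch a site's certificate and keep only the facts a person might care about: who it was issued to, who issued it, and when it expires. The `fetch_certificate` and `summarize` helpers are hypothetical names, and the choice of fields is my assumption about what matters.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_certificate(host: str, port: int = 443) -> dict:
    """Connect over TLS and return the server certificate as a dict.

    Uses the platform's default trust store, so the handshake fails
    if the chain does not lead to a trusted root.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def _flatten(name) -> dict:
    """Turn getpeercert()'s nested tuples into a flat {field: value} dict."""
    return {key: value for rdn in name for key, value in rdn}


def summarize(cert: dict) -> dict:
    """Reduce a certificate to issued-to, issued-by, and expiry date (UTC)."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return {
        "issued_to": _flatten(cert["subject"]).get("commonName"),
        "issued_by": _flatten(cert["issuer"]).get("organizationName"),
        "expires": expires.date().isoformat(),
    }


if __name__ == "__main__":
    # Requires network access; the host is the PayPal example from above.
    print(summarize(fetch_certificate("www.paypal.com")))
```

Three fields. Even then, the only one most users could act on is the expiry date, and the browser already checks that for them.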

Certificates aren't just for websites; they can also be applied to executables. Here's what happens when I double-click on the Safari 3.0.4 beta installer. It's been signed by Apple using their digital certificate.

Open File - Security Warning

Clicking on the word "Apple" opens detailed information about the certificate. Again, what does all this mean? How can we tell if it is valid?

Digital Signature Details: General   Digital Signature Details: Advanced

I understand the value of digital certificates in theory-- to definitively establish the identity of a program or website before entrusting your data to it. Consider a real-world analog. What if I walked up to you on the street and told you I was a policeman? You might check to see if I'm wearing an appropriate uniform. You might ask to see my badge. You might wonder where my partner or squad car is. We use all these things to judge the authenticity of human interactions.

However, I don't understand how the current digital certificate infrastructure prevents criminals from obtaining their own certificates with ease. Even though I could potentially fake a policeman's badge and uniform in the real world, that pales compared with how trivially easy it is to obtain a digital certificate for code signing from Tucows:

  1. Create an account at Tucows
  2. Buy a Cert ($300)
  3. Email them your driver's license
  4. Download the Cert
  5. Export your certificate from the machine and store in a safe place
  6. Grab signtool.exe from the .NET 2.0 SDK
  7. Sign your binary using the certificate from step 4

If the only validation is an emailed copy of a driver's license, that doesn't exactly give me the warm fuzzies. And even if we enhance that with (more expensive, naturally) "extended validation", I fail to see how this would prevent a determined, resourceful criminal from getting whatever certificate they need.

I suppose digital certificates are better than nothing. But I also worry that they're incredibly confusing for the end user, easy to game, and ultimately provide a false sense of security-- and that's the most dangerous risk of all.
