Coding Horror

programming and human factors

The Single Most Important Virtual Machine Performance Tip

If you use virtual machines at all, you should have the single most important virtual machine performance tip committed to memory by now: always run your virtual machines from a separate physical hard drive:

[the] biggest performance win is to put the virtual hard disks on separate disk spindles from the operating system. The biggest performance hit in virtual machines is disk I/O. Making the VM fight with your OS and swap disk makes this issue much, much worse. Additionally, today's USB 2.0 and firewire external hard drives run on a fast interface bus, have large buffers and spin at 7,200 rpm, as opposed to 4,200 rpm for most laptop hard drives.

I've talked about virtualization performance penalties before, but this bears repeating. I originally read this tip at Scott's blog, and I've heard it echoed in emails directly from the Virtual PC Guy himself.

It's true that most laptop drives spin at 5,400 rpm these days, and a scant few even run at 7,200 rpm. But the tip is still as valid as ever. The primary performance bottleneck in virtual machines, by a very wide margin, is the hard drive. Although it's possible to squeeze a complete install of Windows XP into a 641 megabyte VM hard drive image file, most VM hard drive image files rapidly grow to multiple gigabytes. It's not unusual to see VMs end up 5 or 10 gigabytes in size. It shouldn't be too surprising, then, that the disk subsystem has a disproportionately large impact on overall virtual machine performance.

That's one reason why all the desktop machines I build now have two hard drives:

  1. A faster, smaller drive for the operating system and essential applications. You can't beat the 10,000 rpm Western Digital Raptor series for this role.
  2. A larger data drive for virtual machines and everything else.

This way, whenever I boot up a VM, it's running from a different physical spindle than the operating system, and thus running at optimal speed. It's also a good way to segregate your operating system and data in case you need to do a complete wipe of your operating system. And it's certainly a much safer and more practical two-drive approach than RAID-0 on the desktop.

Unfortunately, we can't drop a second drive into our laptops. But there's another solution that works almost as well: external SATA and USB 2.0 enclosures. These enclosures offer the best of both worlds: high-speed USB 2.0 interfaces for laptops, and full-speed eSATA connections for desktops.

[Images: the Icy Dock external enclosure, front and back]

The Icy Dock, pictured above, is one of the best of these new enclosures. It's a bit spendy, but it's remarkably well made, mostly of aluminum, with a clever locking tray mechanism. It also includes all the extras you need to connect it to your PC: a USB 2.0 cable, an eSATA bracket, and an eSATA cable. Just add the desktop SATA drive of your choice.

I've talked about the difference between USB 2.0 and full-blown SATA performance before. Here's a direct comparison between a modern 250 gigabyte 7,200 rpm SATA drive in the Icy Dock (connected via USB), and my laptop's internal 100 gigabyte 5,400 rpm hard drive:

[Benchmark chart: laptop internal HDD vs. Icy Dock external USB 2.0 HDD]

The USB 2.0 interface is nothing to sneeze at. With a fast 7,200 rpm desktop drive mounted, it does a little better than the internal laptop drive overall, once you factor in random access times and the constant speed across the entire drive. But it's obviously limited by the interface. That's why the option to connect a drive via its native SATA interface is so desirable in an external enclosure. Some recent motherboards even include eSATA connectors on their back panel, such as the Asus P5B that I recently built. I presume it's only a matter of time before some enterprising laptop manufacturer releases a laptop with an eSATA connector.
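
If you want a rough feel for this comparison on your own hardware, a crude sequential-read timing is enough to show the interface and spindle-speed differences. This is a simplified sketch in Python, not the benchmark tool used above; the file paths are assumptions, and you'd point each at a large file on the drive under test.

    import time

    def read_throughput(path: str, chunk_mb: int = 4) -> float:
        """Return sequential read speed in MB/s for the given file."""
        chunk = chunk_mb * 1024 * 1024
        total = 0
        start = time.perf_counter()
        with open(path, "rb") as f:
            while data := f.read(chunk):
                total += len(data)
        elapsed = time.perf_counter() - start
        return total / (1024 * 1024) / elapsed

    # Hypothetical mount points. Note that OS caching will skew repeat
    # runs, which is why real benchmark tools read the raw device instead.
    print(f"internal: {read_throughput('/mnt/internal/big.bin'):.1f} MB/s")
    print(f"external: {read_throughput('/media/external/big.bin'):.1f} MB/s")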

It's not a slam-dunk performance victory over the internal laptop drive in absolute terms. But the real-world performance improvement gained from running a VM on an external USB 2.0 drive is quite noticeable. Recommended.


The Build Server: Your Project's Heart Monitor

Although I've been dismissive of build servers in the past, I've increasingly come to believe that the build server is critical-- it's the heart monitor of your software project. It can tell you when your project is healthy, and it can give you advance warning when your project is about to flatline.

[Image: a heart monitor]

You should start out with a simple pulse-- whether or not your project builds, and how often you're building it. The build server can be so much more, though. The Zutubi article The Path to Build Enlightenment provides a great overview of what a build server can do for your project:

  • Machine independence

    Let's get past "It runs on my machine" first. The build server retrieves everything from source control, and builds on a machine untainted by developer dependencies. It forces an integration point for all the developers working on the project, in a neutral, indifferent way. You can hate your co-workers, but it's irrational to hate the build server.

  • Scripted Builds

    Your build process is now clearly defined by a script and under source control. You might say it's almost… self-documenting. Isn't that the way it should be? (For a concrete feel, there's a brief sketch of such a script after this list.)

  • Scripted tests

    Sure, maybe all the code compiles. But does the software actually work? The build server is a logical place to integrate some basic tests to see if your product is doing what it's supposed to do. Mere compilation is not enough. The more tests you accrete into the build over time, the better the feedback is from the build, and the more valuable it will be to your project. It's a positive reinforcement cycle.

  • Daily and Weekly builds

    Once you have the build server set up, you'll establish a rhythm for your project, where you're building regularly. When something breaks, you'll know, and quickly. A solid heartbeat from the build server leads to a confident development team.

  • Continuous Integration

    This is the holy grail of build server integration-- doing a complete build every time something is checked into source control. Once you've gotten your feet wet with weekly and daily builds, it's the next logical step. It also forces you to keep your test and build times reasonable so things can proceed quickly.

  • Automated releases

    The build server automates all the drudge work associated with releasing software. It…

    1. labels the source code with the build number
    2. creates a uniquely named drop folder for that particular build
    3. tags the binaries with the build number in the file metadata
    4. creates installation packages and installers
    5. publishes the installs to websites, FTP sites, or file paths

    A well-designed, fully-automated build process makes it trivially easy for anyone to get a particular release, or to go back in time to a previous release. And it's less work for you when the build machine does it.

  • Building in multiple environments

    For advanced projects only. If you have to test your code against 10 different languages, or different variants of an operating system, consider integrating those tests into the build process. It's painful, but so is that much ad-hoc testing.

  • Static and Dynamic Analysis

    There's an entire universe of analysis tools that you can run on your code during the build to produce a wall of metrics: FxCop, NDepend, LibCheck, and so forth. There are lots of metrics, and only you and your team can decide which of them are important to you. But some of these metrics are really clutch. At the very least, you'll want to know how much code churn you have for each build.
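
As promised above, here's what a build script might boil down to. Consider this a minimal, hypothetical sketch: the version control commands, the make targets, the installer path, and the drop folder location are all illustrative stand-ins, not a prescription.

    # build.py -- a hypothetical, minimal build server script.
    import shutil
    import subprocess
    import sys
    from datetime import datetime
    from pathlib import Path

    DROP_ROOT = Path("/builds")  # assumed network drop location

    def run(*cmd: str) -> None:
        """Run a command, echoing it; abort the build on any failure."""
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def main() -> int:
        build_number = datetime.now().strftime("%Y%m%d.%H%M")
        drop = DROP_ROOT / build_number             # uniquely named drop folder

        run("git", "pull", "--ff-only")             # get latest from source control
        run("make", "all")                          # scripted, repeatable build
        run("make", "test")                         # scripted tests: compiling is not enough
        run("git", "tag", f"build-{build_number}")  # label the sources with the build number

        drop.mkdir(parents=True, exist_ok=True)
        shutil.copy("dist/installer.exe", drop)     # publish the install
        print(f"build {build_number} OK")
        return 0

    if __name__ == "__main__":
        sys.exit(main())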

If you don't have a build server on your project, what are you waiting for?


CAPTCHA Effectiveness

If you've used the internet at all in the last few years, I'm sure you've seen your share of CAPTCHAs:

[Image: the Blogger CAPTCHA]

Of course, nobody wants to use CAPTCHAs. They're a necessary evil, just like the locks on the doors to your home and your car.

CAPTCHAs are designed to distinguish the automated scripts spammers use from real human beings. There's a popular misconception in technical circles that CAPTCHA has been "broken":

CAPTCHA, which stands for (C)ompletely (A)utomated (P)ublic (T)uring test to tell (C)omputers and (H)umans (A)part, works well for small sites, but on larger 'community' sites with multiple spam targets, CAPTCHA only provides a false sense of security -- it can be broken fairly easily, and serious spammers are getting more sophisticated all the time.

Some people actually believe that spammers can now "fairly easily" write scripts which use advanced optical character recognition to automatically defeat any online CAPTCHA form.

Although a number of CAPTCHA-defeating proofs of concept have been published, there is no practical evidence that these exploits actually work in the real world. And if CAPTCHA is so thoroughly defeated, why is it still in use on virtually every major website on the internet? Google, Yahoo, Hotmail, you name it: if the site is even remotely popular, its new account forms are protected by CAPTCHAs.

The comment form of my blog is protected by what I refer to as "naive CAPTCHA", where the CAPTCHA term is the same every single time. This has to be the most ineffective CAPTCHA of all time, and yet it stops 99.9% of comment spam. I can count on two hands the number of manually entered comment spams I've gotten since I implemented it. Granted, Yahoo is more popular than my blog by many orders of magnitude. But it's still strong evidence that moving the difficulty bar up even one tiny notch can be quite effective in reducing spam. I went from cleaning up comment spam every day to cleaning up one per month. Big difference.
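
The entire trick fits in a few lines. This is a hypothetical sketch rather than my actual implementation, and the fixed term shown is made up:

    # A "naive CAPTCHA" check: the challenge term is identical for every
    # visitor. Trivial by design -- the point is that even this tiny
    # hurdle defeats untargeted spam scripts.
    FIXED_TERM = "orange"  # illustrative; any constant word works

    def comment_allowed(captcha_response: str) -> bool:
        """Accept the comment only if the fixed CAPTCHA term was typed."""
        return captcha_response.strip().lower() == FIXED_TERM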

I've been experimenting with improving the rendering algorithms in my CAPTCHA server control, and it's interesting how fragile typical computer OCR really is. SimpleOCR has an online form that allows you to upload and OCR small greyscale TIF images. Here are the results of submitting a few standard 180x50 CAPTCHAs from my reworked rendering algorithm. Note that these CAPTCHAs all use the same font, Courier New.

  CAPTCHA image           OCR result   Characters correct
  Standard                CQXKN        5/5
  Low perturbation        KxT*2        3/5
  Medium perturbation     acNx4        2/5
  High perturbation       Kc           0/5
  Extreme perturbation    (blank)      0/5
  Standard, low noise     (blank)      0/5

I didn't expect it to do well, but I was frankly surprised how poorly the SimpleOCR engine actually performed. Adding a tiny bit of noise or perturbation to the CAPTCHA text was all it took to break the OCR. I'm sure there are more advanced OCR engines out there that might be able to do somewhat better than the free SimpleOCR engine. Still, it's unlikely that any OCR engine could beat high perturbation – where the characters are physically overlapping each other – plus a little background noise. And that level of CAPTCHA security is absolute overkill unless you happen to run one of the top 100 most popular sites on the internet. Furthermore, none of these are particularly difficult CAPTCHAs. The most extreme perturbation sample shown above is eminently "human solvable", at least in my opinion.
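
For a sense of what "perturbation" means mechanically, here's a rough sketch in Python using the Pillow imaging library. The actual control is an ASP.NET server control, so this is an illustrative reimplementation, and the font file name is an assumption for your system:

    import random
    from PIL import Image, ImageDraw, ImageFont

    def render_captcha(text: str, width: int = 180, height: int = 50) -> Image.Image:
        img = Image.new("L", (width, height), 255)       # white greyscale canvas
        font = ImageFont.truetype("DejaVuSans.ttf", 32)  # assumed font file

        x = 8
        for ch in text:
            # Render each character on its own tile, then rotate the tile
            # a few degrees: this is the per-character perturbation.
            tile = Image.new("L", (40, 46), 255)
            ImageDraw.Draw(tile).text((4, 2), ch, font=font, fill=0)
            tile = tile.rotate(random.uniform(-25, 25), fillcolor=255)
            img.paste(tile, (x, random.randint(0, 4)))
            # Tight spacing lets adjacent glyphs overlap at higher settings,
            # which is what defeats naive character segmentation.
            x += 28

        # Sprinkle low background noise over the finished image.
        for _ in range(int(width * height * 0.02)):
            spot = random.randrange(width), random.randrange(height)
            img.putpixel(spot, random.randrange(0, 160))

        return img

    render_captcha("KxT2").save("captcha.png")

Cranking the rotation range up and the spacing down gives you the "high perturbation" overlap; dialing both toward zero gives you the standard rendering.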

The default settings for my new and improved CAPTCHA server control, a combination of …

  • high contrast for human readability
  • medium, per-character perturbation
  • random fonts per character
  • low background noise

… should be far more protection than most websites need.

[CAPTCHA image: low noise, medium perturbation, varied fonts]

Remember, I use "naive CAPTCHA" with 99.9% effectiveness. The "low" settings will be even easier to read than the defaults and may be more appropriate for your user base.

Of course, OCR isn't the only way to attack CAPTCHA. But the other scenarios for spammers "beating" CAPTCHA are even more far-fetched. The Petmail documentation explains:

1. The Turing Farm

Let's say spammers set up a sweatshop to employ people to look at computer screens and answer CAPTCHA challenges. They get to send one message for each challenge passed. Assuming 10 seconds per challenge, and paying roughly $5 per hour, that represents $14 per thousand messages. A typical spam run of 1 million messages per day would cost $14,000 per day and require 116 people working 24/7.

This would break the economic model used by most current spammers. A recent Wired article showed one spammer earning $10 for each successful sale. At that rate, a cost of $14,000 for 1,000,000 spam emails requires a 1 in 1000 success rate just to break even, whereas current spammers are managing a 1 in 100,000 or even 1 in 1,000,000 success rate.

2. The Turing Porn Farm

A recent slashdot article described a trick in which spammers run a porn site that is gated by CAPTCHA challenges, which are actually ripped directly from Yahoo's new account creation page. The humans unwittingly solve the challenge on behalf of the spammers, who can therefore automate a process that was meant to be rate-limited to humans. This attack is simply another way of paying the workers of a Turing Farm. The economics may be infeasible because porn hosting costs money too.
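
The arithmetic in the Turing Farm scenario holds up, and it's worth seeing laid out. A quick back-of-the-envelope check, using only the numbers quoted above:

    # Back-of-the-envelope check of the Turing Farm economics.
    SECONDS_PER_CHALLENGE = 10
    WAGE_PER_HOUR = 5.00          # dollars
    MESSAGES_PER_DAY = 1_000_000

    cost_per_message = WAGE_PER_HOUR / 3600 * SECONDS_PER_CHALLENGE
    print(f"cost per 1,000 messages: ${cost_per_message * 1000:,.2f}")      # ~$13.89
    print(f"cost per day: ${cost_per_message * MESSAGES_PER_DAY:,.0f}")     # ~$13,889
    workers = MESSAGES_PER_DAY * SECONDS_PER_CHALLENGE / 86400
    print(f"workers needed 24/7: {workers:,.0f}")                           # ~116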

If you're not using CAPTCHAs because you think they're compromised, then you're too gullible for your own good. There's absolutely no concrete data supporting any of these attack scenarios happening outside laboratory (read: infinite money and time) conditions. Just ask Google:

Some captchas have been solved with more than 90% accuracy by scientists specializing in computer vision research at the University of California, Berkeley, and elsewhere. Hobbyists also regularly write code to solve captchas on commercial sites with a high degree of accuracy.

But several Internet companies say their captchas appeared to be highly effective at thwarting spammers. "Researchers are really good, and the attackers really are not," says Mr. Jeske of Google, based in Mountain View, Calif. "Having these methods in place we find extremely effective against automated malicious attackers."

The real secret to CAPTCHA is that it hits spammers where they are most vulnerable: in the pocketbook. The minute you put up a computational barrier, the entire economic model of spam comes crashing down.

Now if you'd prefer not to use CAPTCHA because it's an inconvenience for the user, I can respect that. CAPTCHA isn't the only way to block spammers. But give CAPTCHA its due: it was one of the original spam blocking measures used way back in 1997 by AltaVista. And, even more impressively, it's still one of the most effective ways to block spam at its source today.


Swiss Army Knife or Generalizing Specialist

In Does Writing Code Matter?, I proposed that developers spend less time on the technical stuff, which they're already quite good at, and more time cultivating other non-technical skills that developers tend to lack. One commenter took issue with this approach:

I don't agree with the premise of improvement of weaknesses. I like the premise of enhancing talent and being aware of weaknesses. "Know yourself" doesn't mean go learn everything and become a Swiss Army Knife.

It's easy to take my modest proposal to an absurd extreme: either you write code all day, or you become completely non-technical and never touch a compiler again. Or maybe you spend so much time pursuing related interests that you become a jack-of-all-trades, master of none. In other words, a Swiss Army Knife.

[Image: the Wenger Giant Knife]

First, a clarification. Cultivating non-technical skills is, first and foremost, about becoming a better software developer. If you wanted to be rich, famous, and get girls, then you picked the wrong profession. I'm sorry I had to be the one to break this to you.

Instead of a Swiss Army Knife, you should strive to be a generalizing specialist.

A generalizing specialist is someone with one or more technical specialties who actively seeks to gain new skills in both their existing specialties as well as in other areas, including both technical and domain areas. When you get your first job as an IT professional it is often in the role of a junior programmer or junior DBA. You will initially focus on becoming good at that role, and if you're lucky your organization will send you on training courses to pick up advanced skills in your specialty. Once you're adept at that specialty, or even when you've just reached the point of being comfortable at it, it is time to expand your horizons and learn new skills in different aspects of the software lifecycle and in your relevant business domain. When you do this you evolve from being a specialist to being a generalizing specialist. Generalizing specialists are often referred to as craftspeople, multi-disciplinary developers, cross-functional developers, polymaths, or even "renaissance developers".

A generalizing specialist is more than just a generalist. A generalist is a jack-of-all-trades but a master of none, whereas a generalizing specialist is a jack-of-all-trades and master of a few. Big difference.

Too much specialization is a pitfall of its own. Have you ever worked on projects where you had "the database guy", "the testing guy", "the web guy", and so forth? Wayne Allen refers to this as a process smell-- waiting on specialists.

A common process smell in new agile teams is more and more stories/backlog items incomplete at the end of the iteration. There are a couple of different reasons this might happen, but the one I'm interested in today can be detected by the claim "I finished my tasks". The clear implication is that "I got my stuff done, but someone else didn't".

A little additional research usually shows that people are waiting for other people with some kind of specialization such as database, QA, UI, or any other skill that isn't widely distributed among the team. On an agile team, this is where specialization gets in the way. In a more traditional project, the measure of delivery is the task; in agile projects, however, the measure of delivery is working features. Specialists typically don't deliver working features; thus the problem. Note I am not saying that specializations aren't needed, but they do need to be balanced with other needs that help deliver features.

How do you know if you're on the road to becoming a generalizing specialist? Well, as in any good exercise regimen, you should be regularly exceeding your comfort zone. Sometimes you have to stretch a little:

The generalizing specialist is willing and able to learn new skills -- to stretch as the needs of the team change. And since change is natural, this is an essential attitude for team members. However, we are usually trained, and strongly encouraged, to have a deep specialty. This approach to education and training is a natural consequence of the typical organizational model for work and society. Therefore, if a team is converting to agile work methods, people need to be coached to stretch themselves and learn new things.

I see too many software developers burrowing themselves deeper and deeper into the same skillsets and specialties. It's all too easy to fall into that rut. There's an entire universe of software engineering skills outside the narrow realm of coding. To be a better software developer, grow from a specialist to a generalizing specialist.


Does Writing Code Matter?

Ian Landsman's 10 tips for moving from programmer to entrepreneur is excellent advice. Even if you have no intention of becoming an entrepreneur.

One of the biggest issues I see is developers getting caught up in the code. Spending countless hours making a function perfect or building features which show off the latest technology. Now, you have to write code to be in the software business. It has to be high-quality code that isn't filled with bugs or security holes. However, the best code in the world is meaningless if nobody knows about your product. Code is meaningless if the IRS comes and throws you in jail because you didn't do your taxes. Code is meaningless if you get sued because you didn't bother having a software license created by a lawyer.

Software developers love code. But we're biased. And we write less of it than we think we do. We spend far more time understanding code than writing it.

Anyway, as Ian points out, the importance of the code we do write is absolutely dwarfed by everything else that goes on around it. Raise your hand if you've ever poured your heart and soul into an application that never shipped. I know I have. And that's just the tip of the iceberg; there are hundreds of reasons the code you write may have zero impact on the world. If nobody knows about your code, if nobody can understand your code, if for whatever reason your code doesn't even ship … then have you really accomplished anything by writing that code?

Maybe the best way to succeed as a programmer is to cut out the low value activities entirely and stop writing code altogether. Steve Yegge explains:

Do you have any programming heroes? I do! Oddly enough, though, I've never really seen much of their code. Most of the famous-ish programmers I respect have actually made their impact on me through writing, and it's usually just prose, with maybe a little code interspersed.

There are programmers I admire who've built things that I use a lot. But when I try to come up with a list of programmers I admire (and I specifically mean people I don't know personally), I find they almost always fall into one (or both) of just two categories:

  • People who wrote a useful programming language, an operating system, or an especially important framework.
  • People who wrote a really neat book about programming.

When someone builds a framework – any environment that we live in and actually enjoy programming in – and there's one person who's chiefly identifiable as the primary author of that framework, then I think we tend to admire that person, and unlike other programmers, the person starts to become famous.

Even if they're a crappy programmer.

Not that we'd really know, because how often do we go look at the source code for the frameworks we use? How much time have you spent examining the source code of your favorite programming language's compiler, interpreter or VM? And by the time such systems reach sufficient size and usefulness, how much of that code was actually penned by the original author?

Am I really telling developers to stop writing code? No, not really. Developers are already good at writing code. It's why they became software developers in the first place. Writing reams of code just digs you deeper into an already deeply specialized skill. What I am proposing is that we spend less time coding and more time developing skills in other areas that complement our coding skills. Become a better writer. Become a better speaker. Improve your personal skills. Participate in the community. Try to spend some time talking to people instead of the compiler. That's how you distinguish yourself from your peers. And that's ultimately how you become a better software developer, too.

Of course, this isn't a zero-sum game. You can have it both ways. Ideally, you'd write code, and then write or talk about the code in a way that inspires and illuminates other people. But we don't have an infinite amount of time, either. If you have to choose between writing code and writing about code, remember which side of the equation is more important – and balance accordingly.
