Coding Horror

programming and human factors

Creating Smaller Virtual Machines

Now that Virtual PC is finally free, I've become obsessed with producing the smallest possible Windows XP Virtual PC image. It's quite a challenge, because a default XP install can eat up well over a gigabyte. Once you factor in the swap file and other overhead, you're generally looking at 2-4 gigabytes for relatively simple configurations.

My best result so far, however, is a 641 megabyte virtual machine image of a clean, fully patched Windows XP install. Not bad. And here's how I did it.

First, start with the obvious stuff:

  1. Install Windows XP SP2. Take all default options.
  2. Connect to Windows update; install all critical updates.
  3. Install VM additions.
  4. Turn off system restore.
    • Right click My Computer; select properties
    • Click the System Restore tab
    • Click the "Turn off System Restore" checkbox
    • OK all the way back out
  5. Set Visual Effects to minimum.
    • Right click My Computer; select Properties
    • Click the Advanced tab
    • Click the Performance Settings button
    • Click the "Adjust for best performance" checkbox
    • OK all the way back out.
  6. Shut down.

Don't install anything else yet! Remember, we're trying to get to a minimal baseline install of Windows XP first. A nice, flat platform to build on.

It's critical to turn off system restore, because that eats up hundreds of megabytes of disk space. In a virtual machine environment, having a rollback path doesn't make sense anyway. And if the Windows software environment wasn't so pathological, we wouldn't need complex rollback support embedded in the OS, either, but I digress.

Now let's put together our toolkit of virtual machine optimization utilities:

  • TweakUI
  • XPlite
  • Crap Cleaner
  • Whitney Defragger
  • Microsoft Virtual PC Pre-Compactor (or Invirtus VM Optimizer)

These utilities are mostly free. And, except for Crap Cleaner, they don't even require installers. Just plop all the files for each one into a folder; I call mine VM-utils. Copy this folder to the target VM.

  1. Use TweakUI to turn on automatic login. Otherwise you have to distribute login credentials with your VM, and who wants to do that?

  2. Now, use XPlite to tear out all the annoying, unnecessary bits of Windows XP:

    XPlite Screenshot

    XPlite is easily the best utility of its type; it removes scads of useless things built into XP that have no explicit uninstall mechanism. Unfortunately, XPlite is payware. There is a free version, but it's crippled; it can only remove a fraction of the items the full version can. See the full list of items it can remove along the right-hand side of the product page.

    By default, XPlite generally shows things that are safe to remove. Note that the screenshot also shows the "Advanced Components" section, which is definitely not safe to touch unless you really know what you're doing. Anyway, here's what I consider totally safe to remove in XPlite's standard list:

    • Accessibility Options
    • Communication and Messaging
    • Server Components
    • Games
    • System Services

    The others require a bit of judicious selection.

    • Accessories - you probably want Notepad, Calc, and the other essential applets. A world without Notepad is a world I don't want to live in.
    • Internet Utilities - if you want to keep the default IE6 inside XP, I'd leave this alone. With the notable exception of MSN Explorer, which is always safe to drop.
    • Multimedia - if you have sound enabled, selectively keep some of this, otherwise dump it all. It's highly unlikely you would ever want to watch videos or listen to music inside your VM, right? Right?
    • Operating System Options - you may want to keep the core fonts if you're planning to browse the web within the VM. Also, beware of removing the service pack update files. Most of this is safe to dump, though. However, you will need the VB6 runtimes for Crap Cleaner to run!
    • System Tools & Utilities - I'd leave Dr. Watson, and possibly PerfMon, WSH and Zip folder support.

    Once you've made your selections, let XPlite do its thing. It's worth the effort, because you'll have an unbelievably squeaky clean Start menu when it's done. Who knew Windows XP could be this.. simple?

  3. Install and run Crap Cleaner. Perform the default analysis, then do a cleanup. This step is really optional; it only cleans up a couple megabytes of log files and miscellaneous junk. Be sure to uninstall Crap Cleaner when you're done, too.

  4. Now that we've cleaned everything up, we need to defragment the disk.

    whitney defragmenter screenshot

    You can use any defragmenter you like, of course, but this one is free and works quite well.

    1. Navigate to the folder where you put your VM utilities, including the Whitney Defragger.
    2. Open a command prompt
    3. Copy the defragmenting program to the Windows system folder:

      copy bootdfrg.exe c:\windows\system32
      

    4. Install the defragmenting service:

      defrag -i
      

    5. Schedule a defragmentation of the c: drive for the next boot:

      defrag -d c: -B
      

    6. Restart the virtual machine.
    7. The defragmenter will run before Windows loads. Let it run to completion. It may take a little while, but it provides lots of textual feedback on what it's doing.

  5. Now we have to zero the free space on the drive. You have your choice of the free Microsoft Virtual PC Pre-Compactor, or the inexpensive Invirtus VM Optimizer. Both do the same thing, but the Invirtus tool results in an image that's about 15 percent smaller (641 megabytes vs. 758 megabytes, in my test) than the Microsoft tool.

    Either way, you're mounting an ISO. The Microsoft Pre-Compactor is in a folder named "Virtual Machine Additions" under your Virtual PC install folder. Once mounted, the precompactor will autorun. Let it prep the drive; this doesn't take long.

    Cleanly shut down the virtual machine.

  6. Finally, shrink the virtual machine hard drive using the disk wizard available from the Virtual PC UI:

    1. Click the File | Virtual Disk Wizard drop-down menu
    2. Edit an existing virtual disk
    3. Select the correct disk image
    4. Select "Compact it"
    5. Select "replacing the original file"

    .. and prepare to marvel at the tiny size* of the resulting hard drive image!

It's really quite amazing how snappy and compact Windows XP can be, once you remove all the useless cruft from it.

* that's what she said.


Why Can't Database Tables Index Themselves?

Here's a thought question for today: why can't database tables index themselves?

Obviously, indexes are central to databases and database performance. But horror tales still abound of naive developers who "forget" to index their tables, then hit massive performance and scalability problems down the road as those tables grow. I've run into it personally, and I've read plenty of sad tales of woe from other developers who have, too. I've forgotten to build indexes on non-primary-key columns myself, many times. Why aren't databases smart enough to automatically protect themselves from this?

It always struck me as absurd that I had to go in and manually mark fields in a table to be indexed. Perhaps in the bad old file-based days of FoxPro, dBase, and Access, that was a necessary evil. But in a modern client-server database, the server sees every query flowing through the system, and knows how much each of those queries costs. Who better to decide what needs to be indexed than the database itself?

Why can't you enable an automatic indexing mode on your database server that follows some basic rules, such as..

  1. Does this query result in a table scan?
  2. If so, determine which field(s) could be indexed, for that particular query, to remove the need for a table scan.
  3. Store the potential index in a list. If the potential index already exists in the list, bump its priority.
  4. After (some configurable threshold), build the most commonly needed potential index on the target table.

Of course, for database gurus who are uncomfortable with this, the feature could be disabled. And you could certainly add more rules to make it more robust. But for most database users, it should be enabled by default; an auto-indexing feature would make most database installations almost completely self-tuning with no work at all on their part.
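
Here's a rough sketch of what those four rules might look like in code. To be clear, this is purely illustrative Python; the query_plan and database objects are hypothetical stand-ins for hooks that no database server actually exposes in this exact form:

from collections import Counter

class AutoIndexAdvisor:
    """Illustrative sketch of the four rules above; not a real database feature."""

    def __init__(self, build_threshold=100):
        # (table, column) -> number of times an index there would have helped
        self.candidates = Counter()
        self.build_threshold = build_threshold

    def observe(self, query_plan):
        # Rule 1: did this query fall back to a table scan?
        for scan in query_plan.table_scans:
            # Rule 2: which column(s) would have let us avoid the scan?
            for column in scan.filter_columns:
                # Rule 3: remember the candidate index and bump its priority
                self.candidates[(scan.table, column)] += 1

    def maybe_build_indexes(self, database):
        # Rule 4: once a candidate crosses the threshold, actually build it
        for (table, column), hits in self.candidates.most_common():
            if hits < self.build_threshold:
                break
            database.execute("CREATE INDEX ix_auto_%s_%s ON %s (%s)"
                             % (table, column, table, column))
            del self.candidates[(table, column)]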

I did some cursory web searches and I didn't see any features like this for any commercial database server. What am I missing here? Why does this seem so obvious, and yet it's not out there?


Diseconomies of Scale and Lines of Code

Steve McConnell on diseconomies of scale in software development:

Project size is easily the most significant determinant of effort, cost and schedule [for a software project].*

People naturally assume that a system that is 10 times as large as another system will require something like 10 times as much effort to build. But the effort for a 1,000,000 LOC system is more than 10 times as large as the effort for a 100,000 LOC system.

[Using software industry productivity averages], the 10,000 LOC system would require 13.5 staff months. If effort increased linearly, a 100,000 LOC system would require 135 staff months. But it actually requires 170 staff months.

Here's the single most important decision you can make on your software project if you want it to be successful: keep it small. Small may not accomplish much, but the odds of outright failure-- a disturbingly common outcome for most software projects-- are low.
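
Run the numbers and the diseconomy is easy to quantify: if effort grows roughly as size raised to some power n, then going from 13.5 staff months at 10,000 LOC to 170 staff months at 100,000 LOC implies n is about 1.1; a 10x bigger system costs about 12.6x the effort. Here's the back-of-the-envelope version, using nothing but the figures quoted above:

import math

small_effort, big_effort = 13.5, 170.0    # staff months, from McConnell's example
size_ratio = 100000 / 10000               # the second system is 10x larger

exponent = math.log(big_effort / small_effort) / math.log(size_ratio)
print("effort ~ size^%.2f" % exponent)    # ~ size^1.10, distinctly worse than linear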

I don't think the inverted, non-linear relationship between size and productivity on software projects will come as a shock to anyone; the guys at 37signals have been banging their drum on the virtues of small for over a year now. Isn't small the new big already?

But what I really want to focus on here is how you measure a project's size. What's big? What's small? McConnell is using lines of code (LOC) as his go-to measurement. Here's a table that illustrates the relationship between project size and productivity:

Project Size       Lines of code per staff per year    COCOMO average
10,000 LOC         2,000 - 25,000                      3,200
100,000 LOC        1,000 - 20,000                      2,600
1,000,000 LOC      700 - 10,000                        2,000
10,000,000 LOC     300 - 5,000                         1,600

Lines of code is a reasonable metric to determine project size, but it also has some problems, which are well-documented in the wikipedia entry on lines of code:

/* How many lines of code is this? */
for (i=0; i<100; ++i) printf("hello");

For one thing, different languages vary widely in the number of lines of code they produce. 100 lines of Perl will probably accomplish a lot more than 100 lines of C. So you have to be careful that you're really comparing apples to apples. Furthermore, skilled developers know that the less code you write, the fewer bugs you've created-- so they naturally distrust any productivity metric that weights absolute lines of code. And does code generation count?
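
Even the most naive line counter has to take a position on these questions. A throwaway sketch that counts the snippet above two different ways, just to show how much the answer depends on what you decide a "line" is:

lines = [
    '/* How many lines of code is this? */',
    'for (i=0; i<100; ++i) printf("hello");',
]

physical = len(lines)                                   # 2: every physical line counts
source = len([l for l in lines                          # 1: blank and comment-only lines don't
              if l.strip() and not l.strip().startswith(("/*", "//"))])

print(physical, source)   # 2 vs. 1: two defensible answers for the same two lines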

Even with all its problems, the LOC metric is still where you should start, according to McConnell:

My personal conclusion about using lines of code for software estimation is similar to Winston Churchill's conclusion about democracy: The LOC measure is a terrible way to measure software size, except that all the other ways to measure size are worse. For most organizations, despite its problems, the LOC measure is the workhorse technique for measuring size of past projects and for creating early-in-the-project estimates of new projects. The LOC measure is the lingua franca of software estimation, and it is normally a good place to start, as long as you keep its limitations in mind.

Your environment might be different enough from the common programming environments that lines of code are not highly correlated with project size. If that's true, find something that is more proportional to effort, count that, and base your size estimates on that instead. Try to find something that's easy to count, highly correlated with effort, and meaningful for use across multiple projects.

The wikipedia article features this chart of Windows operating system size, in lines of code, over time:

Year    Windows version    Lines of code
1993    Windows NT 3.1     6 million
1994    Windows NT 3.5     10 million
1996    Windows NT 4.0     16 million
2000    Windows 2000       29 million
2002    Windows XP         40 million
2007    Windows Vista      ~50 million
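
That works out to roughly 16 percent compound growth in the size of the codebase, every year, for fourteen years straight. The arithmetic, based only on the first and last rows of the table:

start_loc, end_loc = 6e6, 50e6      # Windows NT 3.1 (1993) vs. Windows Vista (2007, estimated)
years = 2007 - 1993

growth = (end_loc / start_loc) ** (1.0 / years) - 1
print("%.1f%% per year" % (growth * 100))    # about 16% annual growth in lines of code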

If you're wondering how much code the average programmer produces per day, I think you might be asking the wrong question. Lines of code is certainly a key metric for determining project size, but it's also easily manipulated and misinterpreted. It should never be the only data point used to make decisions; it's just one of many signposts on the road that helps you orient your project.

* what are the other most significant determinants? Number two is the type of software you're developing, and personnel factors are a very close third.


Own a Coding Horror

A few people recently pointed out that my personal branding isn't everything that it could be. Joseph Cooney even took matters into his own hands.

Well, I contacted the big man himself, Steve McConnell, and he graciously provided me with a high-resolution vector file of the original Coding Horror logo used in Code Complete (see it on the page), along with permission to sell items based on that design.

Coding Horror t-shirt front (white and black versions)

United States / Canada: buy Coding Horror shirts and merchandise at CafePress
International: buy Coding Horror shirts and merchandise at SpreadShirt

I also have custom, two color die-cut vinyl stickers based on the same high resolution vector art.

Coding Horror stickers

To give you an idea of scale, the coin in the picture is a nickel. The dimensions of the sticker are 3.55" h × 3" w.

United States: $4 for 4 stickers
International: $4 for 3 stickers

In case anyone was wondering, the actual Coding Horror font is Frutiger 75 Black, as I figured out today.


Separating Programming Sheep from Non-Programming Goats

⚠ Please note, this paper was ultimately retracted by its author (pdf) in 2014:

In 2006 I wrote an intemperate description of the results of an experiment carried out by Saeed Dehnadi. Many of the extravagant claims I made were insupportable, and I retract them. I continue to believe, however, that Dehnadi had uncovered the first evidence of an important phenomenon in programming learners. Later research seems to confirm that belief.

A bunch of people have linked to this academic paper, which proposes a way to separate programming sheep from non-programming goats in computer science classes – long before the students have ever touched a program or a programming language:

All teachers of programming find that their results display a 'double hump'. It is as if there are two populations: those who can [program], and those who cannot [program], each with its own independent bell curve. Almost all research into programming teaching and learning have concentrated on teaching: change the language, change the application area, use an IDE and work on motivation. None of it works, and the double hump persists. We have a test which picks out the population that can program, before the course begins. We can pick apart the double hump. You probably don't believe this, but you will after you hear the talk. We don't know exactly how/why it works, but we have some good theories.

I wasn't aware that the dichotomy between programmers and non-programmers was so pronounced at this early stage. Dan Bricklin touched on this topic in his essay, Why Johnny Can't Program. But evidently it's common knowledge amongst those who teach computer science:

Despite the enormous changes which have taken place since electronic computing was invented in the 1950s, some things remain stubbornly the same. In particular, most people can't learn to program: between 30% and 60% of every university computer science department's intake fail the first programming course. Experienced teachers are weary but never oblivious of this fact; bright-eyed beginners who believe that the old ones must have been doing it wrong learn the truth from bitter experience; and so it has been for almost two generations, ever since the subject began in the 1960s.

You may think the test they're proposing to determine programming aptitude is complex, but it's not. Here's question one, verbatim:

Read the following statements and tick the box next to the correct answer.

int a = 10;
int b = 20;
a = b;

The new values of a and b are:
[ ] a = 20 b = 0
[ ] a = 20 b = 20
[ ] a = 0 b = 10
[ ] a = 10 b = 10
[ ] a = 30 b = 20
[ ] a = 30 b = 0
[ ] a = 10 b = 30
[ ] a = 0 b = 30
[ ] a = 10 b = 20
[ ] a = 20 b = 10

This test seems trivial to professional programmers, but remember, it's intended for students who have never looked at a line of code in their lives. The other 12 questions are all variations on the same assignment theme.
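
For the record, the consistent and correct model here is "copy the value on the right into the variable on the left": after a = b, both variables hold 20, and a's original value is simply gone. The same three lines in Python (not the language of the test, but the assignment semantics are identical):

a = 10
b = 20
a = b
print(a, b)   # 20 20 -- b is unchanged; a's old value of 10 is overwritten
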
The authors of the paper posit that the primary hurdles in computer science are..

  • assignment and sequence
  • recursion / iteration
  • concurrency*

.. in that order. Thus, we start by testing the very first hurdle novice programmers will encounter: assignment. The test results divided the students cleanly into three groups:

  1. 44% of students formed a consistent mental model of how assignment works (even if incorrect!)
  2. 39% of students never formed a consistent model of how assignment works.
  3. 8% of students didn't give a damn and left the answers blank.

The test was administered twice; once at the beginning, before any instruction at all, and again after three weeks of class. The striking thing is that there was virtually no movement at all between the groups from the first to second test. Either you had a consistent model in your mind immediately upon first exposure to assignment, the first hurdle in programming – or else you never developed one!

The authors found an extremely high level of correlation between success at programming and forming a consistent mental model:

Clearly, Dehnahdi's test is not a perfect divider of programming sheep from non-programming goats. Nevertheless, if it were used as an admissions barrier, and only those who scored consistently were admitted, the pass/fail statistics would be transformed. In the total population 32 out of 61 (52%) failed; in the first-test consistent group only 6 out of 27 (22%). We believe that we can claim that we have a predictive test which can be taken prior to the course to determine, with a very high degree of accuracy, which students will be successful. This is, so far as we are aware, the first test to be able to claim any degree of predictive success.

I highly recommend reading through the draft paper (pdf), which was remarkably entertaining for what I thought was going to be a dry, academic paper. Instead, it reads like a blog entry. It's filled with interesting insights like this one:

It has taken us some time to dare to believe in our own results. It now seems to us, although we are aware that at this point we do not have sufficient data, and so it must remain a speculation, that what distinguishes the three groups in the first test is their different attitudes to meaninglessness.

Formal logical proofs, and therefore programs – formal logical proofs that particular computations are possible, expressed in a formal system called a programming language – are utterly meaningless. To write a computer program you have to come to terms with this, to accept that whatever you might want the program to mean, the machine will blindly follow its meaningless rules and come to some meaningless conclusion. In the test the consistent group showed a pre-acceptance of this fact: they are capable of seeing mathematical calculation problems in terms of rules, and can follow those rules wheresoever they may lead. The inconsistent group, on the other hand, looks for meaning where it is not. The blank group knows that it is looking at meaninglessness, and refuses to deal with it.

Everyone should know how to use a computer, but not everyone needs to be a programmer. It's still a little disturbing that the act of programming seems literally unteachable to a sizable subset of incoming computer science students. Evidently not everyone is as fascinated by meaningless rules and meaningless conclusions as we are; I can't imagine why not.

* which I hope to master sometime between now and my death
