Coding Horror

programming and human factors

Filesystem Paths: How Long is Too Long?

I recently imported some source code for a customer that exceeded the maximum path limit of 256 characters. The paths in question weren't particularly meaningful, just pathologically* long, with redundant subfolders. To complete the migration, I renamed some of the parent folders to single character values.

This made me wonder: is 256 characters a reasonable limit for a path? And what's the longest path in my filesystem, anyway? I whipped up this little C# console app to loop through all the paths on my drive and report the longest one.

using System;
using System.IO;

class Program
{
  static string _MaxPath = "";

  static void Main(string[] args)
  {
    RecursePath(@"c:\");
    Console.WriteLine("Maximum path length is " + _MaxPath.Length);
    Console.WriteLine(_MaxPath);
    Console.ReadLine();
  }

  // Walk every directory under p, remembering the longest full file path we see.
  static void RecursePath(string p)
  {
    foreach (string d in Directory.GetDirectories(p))
    {
      if (IsValidPath(d))
      {
        foreach (string f in Directory.GetFiles(d))
        {
          if (f.Length > _MaxPath.Length)
          {
            _MaxPath = f;
          }
        }
        RecursePath(d);
      }
    }
  }

  // Skip reparse points (junctions and symbolic links) and anything we can't read.
  static bool IsValidPath(string p)
  {
    if ((File.GetAttributes(p) & FileAttributes.ReparsePoint) == FileAttributes.ReparsePoint)
    {
      Console.WriteLine("'" + p + "' is a reparse point. Skipped");
      return false;
    }
    if (!IsReadable(p))
    {
      Console.WriteLine("'" + p + "' *ACCESS DENIED*. Skipped");
      return false;
    }
    return true;
  }

  static bool IsReadable(string p)
  {
    try
    {
      // We only care whether enumeration succeeds; the result is discarded.
      Directory.GetDirectories(p);
    }
    catch (UnauthorizedAccessException)
    {
      return false;
    }
    return true;
  }
}

It works, but it's a bit more complicated than I wanted it to be, because:

  1. There are a few folders we don't have permission to access.
  2. Vista makes heavy use of reparse points (junctions and symbolic links) to remap old XP folder locations to their new homes.

The longest path on a clean install of Windows XP is 152 characters.

c:\Documents and Settings\All Users\Application Data\Microsoft\Crypto\RSA\S-1-5-18\d42cc0c3858a58db2db37658219e6400_89e7e133-abee-4041-a1a7-406d7effde91

This is followed closely by a bunch of stuff in c:\WINDOWS\assembly\GAC_MSIL, which is a side-effect of .NET 2.0 being installed.

The longest path on a semi-clean install of Windows 8.1 is 273 characters:

c:\Users\wumpus-home\AppData\Local\Packages\WinStore_cw5n1h2txyewy\AC\Microsoft\Windows Store\Cache\0\0-Namespace-https???services.apps.microsoft.com?browse?6.2.9200-1?615?en-US?c?US?Namespace?pc?00000000-0000-0000-0000-000000000000?00000000-0000-0000-0000-000000000000.

That's the longest path Microsoft itself created. But what's the longest path I can create in Windows Explorer?

Windows Vista path limit screenshot

The best I could do was 239 characters of folder names, plus 11 characters for the filename. Add in 3 characters for the inevitable "c:\" and 6 more backslash separators, and that's a grand total of 259 characters. Anything longer and I got a "destination path too long" error.

Windows Vista destination path too long error

The 259 character path limit I ran into jibes with the documented MAX_PATH limitation of the Windows shell:

The maximum length path (in characters) that can be used by the [Windows] shell is MAX_PATH (defined as 260). Therefore, you should create buffers that you will pass to SHFILEOPSTRUCT to be of length MAX_PATH + 1 to account for these NULLs.

If 259 characters plus a null seems like an unusually restrictive path limit for a modern filesystem like NTFS, you're right. The NTFS filesystem supports paths of roughly 32,000 characters, but that's largely irrelevant, because the majority of Windows APIs you'd use to get at those paths only accept paths of MAX_PATH or smaller. There is a wonky Unicode workaround to the MAX_PATH limitation, according to MSDN:

In the Windows API (with some exceptions discussed in the following paragraphs), the maximum length for a path is MAX_PATH, which is defined as 260 characters. A local path is structured in the following order: drive letter, colon, backslash, name components separated by backslashes, and a terminating null character. For example, the maximum path on drive D is "D:\some 256-character path string<NUL>" where "<NUL>" represents the invisible terminating null character for the current system codepage. (The characters < > are used here for visual clarity and cannot be part of a valid path string.)

The Windows API has many functions that also have Unicode versions to permit an extended-length path for a maximum total path length of 32,767 characters. This type of path is composed of components separated by backslashes, each up to the value returned in the lpMaximumComponentLength parameter of the GetVolumeInformation function (this value is commonly 255 characters). To specify an extended-length path, use the "\\?\" prefix. For example, "\\?\D:\very long path".

The "\?" prefix can also be used with paths constructed according to the universal naming convention (UNC). To specify such a path using UNC, use the "\?\UNC" prefix. For example, "\?\UNC\server\share", where "server" is the name of the computer and "share" is the name of the shared folder. These prefixes are not used as part of the path itself. They indicate that the path should be passed to the system with minimal modification, which means that you cannot use forward slashes to represent path separators, or a period to represent the current directory, or double dots to represent the parent directory. Because you cannot use the "\?" prefix with a relative path, relative paths are always limited to a total of MAX_PATH characters.

The shell and the file system may have different requirements. It is possible to create a path with the API that the shell UI cannot handle.
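To make that concrete, here's a minimal sketch of the workaround in C#. It prepends the "\\?\" prefix and calls the Unicode Win32 functions directly via P/Invoke, since older .NET System.IO classes reject the prefix outright. The constants are standard Win32 values; the c:\temp location and the 200-character component names are purely illustrative assumptions.

using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

class ExtendedPathDemo
{
  // Standard Win32 constants for CreateFileW.
  const uint GENERIC_WRITE = 0x40000000;
  const uint CREATE_ALWAYS = 2;
  const uint FILE_ATTRIBUTE_NORMAL = 0x80;

  [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
  static extern bool CreateDirectoryW(string lpPathName, IntPtr lpSecurityAttributes);

  [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
  static extern SafeFileHandle CreateFileW(string lpFileName, uint dwDesiredAccess,
    uint dwShareMode, IntPtr lpSecurityAttributes, uint dwCreationDisposition,
    uint dwFlagsAndAttributes, IntPtr hTemplateFile);

  static void Main()
  {
    // Each component must still stay under the ~255 character component limit;
    // only the *total* length gets to exceed MAX_PATH. Assumes c:\temp exists.
    string parent = @"\\?\c:\temp\" + new string('a', 200);
    string dir = parent + @"\" + new string('b', 200);

    CreateDirectoryW(parent, IntPtr.Zero);
    CreateDirectoryW(dir, IntPtr.Zero);

    string file = dir + @"\longer-than-max-path.txt"; // well past 260 characters total
    using (SafeFileHandle h = CreateFileW(file, GENERIC_WRITE, 0, IntPtr.Zero,
      CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, IntPtr.Zero))
    {
      if (h.IsInvalid)
        throw new IOException("CreateFileW failed, error " + Marshal.GetLastWin32Error());
      // Wrap the raw handle in a FileStream and write through it normally.
      using (FileStream fs = new FileStream(h, FileAccess.Write))
      using (StreamWriter sw = new StreamWriter(fs))
        sw.WriteLine("hello from beyond MAX_PATH");
    }
    Console.WriteLine("Created a " + file.Length + " character path.");
  }
}

As the excerpt warns, the shell may not be able to handle the result, so be prepared to delete it the same way you created it.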

Still, I wonder if the world really needs 32,000 character paths. Is a 260 character path really that much of a limitation? Do we need hierarchies that deep? Martin Hardee has an amusing anecdote on this topic:

We were very proud of our user interface and the fact that we had a way to browse 16,000 (!!) pages of documentation on a CD-ROM. But browsing the hierarchy felt a little complicated to us. So we asked Tufte to come in and have a look, and were hoping perhaps for a pat on the head or some free advice.

He played with our AnswerBook for about 90 seconds, turned around, and pronounced his review:

"Dr Spock's Baby Care is a best-selling owner's manual for the most complicated 'product' imaginable – and it only has two levels of headings. You people have 8 levels of hierarchy and I haven't even stopped counting yet. No wonder you think it's complicated."

I think 260 characters of path is more than enough rope to hang ourselves with. If you're running into path length limitations, the real problem isn't the operating system, or even the computers. The problem is the deep, dark pit of hierarchies that human beings have dug themselves into.

* ouch


Computers are Lousy Random Number Generators

The .NET framework provides two random number generators. The first is System.Random. But is it really random?

Pseudo-random numbers are chosen with equal probability from a finite set of numbers. The chosen numbers are not completely random because a definite mathematical algorithm is used to select them, but they are sufficiently random for practical purposes. The current implementation of the Random class is based on Donald E. Knuth's subtractive random number generator algorithm, from The Art of Computer Programming, volume 2: Seminumerical Algorithms.

These cannot be random numbers because they're produced by a computer algorithm; computers are physically incapable of randomness. But perhaps sufficiently random for practical purposes is enough.
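For the curious, here's a toy sketch of a Knuth-style subtractive (lagged Fibonacci) generator of the kind the documentation refers to. The lags (55 and 24), the quick-and-dirty seeding pass, and the modulus below are textbook choices, not the actual System.Random internals.

using System;

class SubtractiveRandom
{
  const long Modulus = int.MaxValue;   // keep values in [0, 2^31 - 1)
  readonly int[] _ring = new int[55];  // circular buffer of the last 55 outputs
  int _i = 0;                          // points at X(n-55)
  int _j = 31;                         // points at X(n-24), which sits 55 - 24 = 31 slots ahead

  public SubtractiveRandom(int seed)
  {
    // Spread the seed over the buffer with a simple LCG, then discard some
    // early output; real implementations use a more careful warm-up.
    long value = (uint)seed % Modulus;
    for (int k = 0; k < 55; k++)
    {
      value = (value * 48271 + 12345) % Modulus;
      _ring[k] = (int)value;
    }
    for (int k = 0; k < 220; k++) Next();
  }

  // X(n) = (X(n-55) - X(n-24)) mod m, computed in place in the circular buffer.
  public int Next()
  {
    int result = _ring[_i] - _ring[_j];
    if (result < 0) result += int.MaxValue;
    _ring[_i] = result;                // the new value replaces the oldest one
    _i = (_i + 1) % 55;
    _j = (_j + 1) % 55;
    return result;
  }
}

Seed it the same way twice and it produces the same sequence twice -- which is exactly the point: the output only looks random.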

The second method is System.Security.Cryptography.RandomNumberGenerator. It's more than an algorithm. It also incorporates the following environmental factors in its calculations:

  • The current process ID
  • The current thread ID
  • The tick count since boot time
  • The current time
  • Various high precision CPU performance counters
  • An MD4 hash of the user's environment (username, computer name, search path, etc)

Good cryptography requires high quality random data. In fact, a perfect set of encrypted data is indistinguishable from random data.

I wondered what randomness looks like, so I wrote the following program, which compares the two random number methods available in the .NET framework: in blue, System.Random, and in red, System.Security.Cryptography.RandomNumberGenerator.

// Requires: using System; using System.IO; using System.Security.Cryptography;
const int maxlen = 3000;
Random r = new Random();
RandomNumberGenerator rng = RandomNumberGenerator.Create();
byte[] b = new byte[4];
using (StreamWriter sw = new StreamWriter("random.csv"))
{
  for (int i = 0; i < maxlen; i++)
  {
    // column 1: a value from System.Random
    sw.Write(r.Next());
    sw.Write(",");
    // column 2: a cryptographically strong value, masked to a non-negative int
    // (masking avoids the rare Math.Abs(int.MinValue) overflow)
    rng.GetBytes(b);
    sw.WriteLine(BitConverter.ToInt32(b, 0) & int.MaxValue);
  }
}

3,000 random numbers
Graph of 3000 random numbers

10,000 random numbers
Graph of 10,000 random numbers

30,000 random numbers
Graph of 30,000 random numbers

I have no idea how to test for true randomness. The math is far beyond me. But I don't see any obvious patterns in the resulting data. It's utterly random noise to my eye. Although both of these methods produce reasonable randomness, they're ultimately still pseudo-random number generators. Computers are great number crunchers, but they're lousy random number generators.

To have any hope of producing truly random data, you must reach outside the computer and sample the analog world. For example, WASTE samples user mouse movements to generate randomness:

WASTE random number generator screenshot
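Conceptually, that kind of harvesting looks something like the sketch below: a small hash-based pool that gets a little of the user's mouse movement stirred in with every sample. This is only an illustration of the idea, not WASTE's actual code.

using System;
using System.Security.Cryptography;

class MouseEntropyPool
{
  byte[] _pool = new byte[32];  // current state of the pool

  // Call this from a mouse-move event handler with the current cursor position.
  public void AddSample(int x, int y)
  {
    byte[] sample = new byte[_pool.Length + 16];
    Buffer.BlockCopy(_pool, 0, sample, 0, _pool.Length);
    Buffer.BlockCopy(BitConverter.GetBytes(x), 0, sample, _pool.Length, 4);
    Buffer.BlockCopy(BitConverter.GetBytes(y), 0, sample, _pool.Length + 4, 4);
    Buffer.BlockCopy(BitConverter.GetBytes(DateTime.UtcNow.Ticks), 0, sample, _pool.Length + 8, 8);

    using (SHA256 sha = SHA256.Create())
      _pool = sha.ComputeHash(sample);  // mix the old pool and the new sample together
  }

  // Hand out bytes derived from the pool. A real design would track how much
  // entropy has actually been collected before allowing this.
  public byte[] GetBytes()
  {
    using (SHA256 sha = SHA256.Create())
      return sha.ComputeHash(_pool);
  }
}

Wire AddSample up to mouse events and let it run for a while before drawing anything out.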

But even something as seemingly random as user input can be predictable; not all environmental sources are suitably random:

True random numbers are typically generated by sampling and processing a source of entropy outside the computer. A source of entropy can be very simple, like the little variations in somebody's mouse movements or in the amount of time between keystrokes. In practice, however, it can be tricky to use user input as a source of entropy. Keystrokes, for example, are often buffered by the computer's operating system, meaning that several keystrokes are collected before they are sent to the program waiting for them. To the program, it will seem as though the keys were pressed almost simultaneously.

A better source of entropy is a radioactive source. The points in time at which a radioactive source decays are completely unpredictable, and can be sampled and fed into a computer, avoiding any buffering mechanisms in the operating system. In fact, this is what the HotBits people at Fourmilab in Switzerland are doing. Another source of entropy could be atmospheric noise from a radio, like that used here at random.org, or even just background noise from an office or laboratory. The lavarand people at Silicon Graphics have been clever enough to use lava lamps to generate random numbers, so their entropy source not only gives them entropy, it also looks good! The latest random number generator to come online is EntropyPool which gathers random bits from a variety of sources including HotBits and random.org, but also from web page hits received by the EntropyPool's web server.

Carl Ellison has an excellent summary of many popular environmental sources of randomness and their strengths and weaknesses. But environmental sources have their limits, too -- unlike pseudo-random algorithms, they have to be harvested over time. Not all environmental sources can provide enough random data for a server under heavy load, for example. And some encryption methods require more random data than others; one particularly secure algorithm, the one-time pad, requires one bit of random data for each bit of encrypted data.

Computers may be lousy random number generators, but we've still come a long way:

As recently as 100 years ago, people who needed random numbers for scientific work still tossed coins, rolled dice, dealt cards, picked numbers out of hats, or browsed census records for lists of digits. In 1927, statistician L.H.C. Tippett published a table of 41,600 random numbers obtained by taking the middle digits from area measurements of English churches. In 1955, the Rand Corporation published A Million Random Numbers With 100,000 Normal Deviates, a massive tome filled with tables of random numbers. To remove slight biases discovered in the course of testing, the million digits were further randomized by adding all pairs and retaining only the last digit. The Rand book became a standard reference, still used today in low-level applications such as picking precincts to poll.

The world is random. Computers aren't. Randomness is really, really hard for computers. It's important to understand the ramifications of this big divide between the analog and digital worlds; otherwise, you're likely to make the same rookie mistakes Netscape did.


It's Never Been Built Before

In Microsoft Project and the Gantt Waterfall, many commenters wondered why software projects can't be treated like any other construction or engineering project:

I am not sure why it is so difficult to estimate software development? Is it a mystery, magic, is there a man behind the curtain that every project depends on?

I mean it's simple, come on! It's like a designer and general contractor coming over and estimating how long it will take to remodel your kitchen.

But software projects truly aren't like other engineering projects. I don't say this out of a sense of entitlement, or out of some misguided attempt to obtain special treatment for software developers. I say it because the only kind of software we ever build is unproven, experimental software. Sam Guckenheimer explains:

To overcome the gap, you must recognize that software engineering is not like other engineering. When you build a bridge, road, or house, for example, you can safely study hundreds of very similar examples. Indeed, most of the time, economics dictate that you build the current one almost exactly like the last to take the risk out of the project.

With software, if someone has built a system just like you need, or close to what you need, then chances are you can license it commercially (or even find it as freeware). No sane business is going to spend money on building software that it can buy more economically. With thousands of software products available for commercial license, it is almost always cheaper to buy. Because the decision to build software must be based on sound return on investment and risk analysis, the software projects that get built will almost invariably be those that are not available commercially.

This business context has a profound effect on the nature of software projects. It means that software projects that are easy and low risk, because they've been done before, don't get funded. The only new software development projects undertaken are those that haven't been done before or those whose predecessors are not publicly available. This business reality, more than any other factor, is what makes software development so hard and risky, which makes attention to process so important.

One kitchen remodeling project is much like another, and the next airplane you build will be nearly identical to the last five airplanes you built. Sure, there are some variables, some tweaks to the process over time, but it's a glorified factory assembly line. In software development, if you're repeating the same project over and over, you won't have a job for very long. At least not on this continent. If all you need is a stock airplane, you buy one. We're paid to build high-risk, experimental airplanes.

The Spruce Goose

The appropriately named happy-go-lucky left a comment on the same post that explains the distinction quite succinctly:

The problem originates in the fact that "Software Project" shares the word "project" with "engineering project" or "construction project."

"Ah-ha!" our hapless software managers cry, as they roll-out the money for a copy of MS Project. And the software developers cry all the way 'til the end of the "project."

Software invention is just that. It has more in common with Mr. Edison's Menlo Park laboratory than the local hi-rise construction site.

Of course, there are far fewer variables on a work site than in software invention. Gravity remains the same, elements don't change and concrete has known properties. Things aren't that stable when operating systems, hardware, software and business processes come together in "the perfect speculation."

In software engineering projects, you aren't subject to God's rules. You get to play God: define, build, and control an entire universe in a box. It's a process of invention and evolution, not an assembly line. Can software development be managed? Sure. But it's a mistake to try managing it the same way you'd manage a kitchen remodel.


Simplicity as a Force

Simplicity isn't easy to achieve, and John Maeda's short book, The Laws of Simplicity, provokes a lot of thought on the topic.

The Laws of Simplicity book cover

Programmers swim in a sea of unending complexity. We get so used to complexity as an ambient norm that we begin, consciously or unconsciously, projecting it into our work. Simplicity is tough in any field, but in ours, we exacerbate the problem. Impossibly complex applications are the default deliverable for new programmers. Only seasoned software development veterans are capable of producing applications that are easy to understand and troubleshoot. Simplicity isn't achievable as a passive goal; it's a force that must be actively applied.

You can read most of the book online via John's excellent blog, including abbreviated versions of the ten laws of simplicity. One of my favorite sections is about the evolution of the iPod's controls.

Keep it Simple, Stupid. If only it were that easy. It feels more like back-breaking work to keep things from inevitably devolving into complexity.


Microsoft Project and the Gantt Waterfall

I've been using Microsoft Project quite a bit recently with a certain customer of ours. They bleed Gantt. I hadn't used Project in years, and after being exposed to it again, it really struck me how deeply the waterfall model is ingrained into the product. Take a look. What do you see?

Microsoft Project 2003 Gantt chart screenshot

Every Microsoft Project file I open is a giant waterfall, inexorably flowing across the page from left to right like Twain's mighty Mississippi river.

But like Twain's fictional Mississippi, project management with Microsoft Project is largely built on tall tales:

When you work in IT, you deal with the consensual hallucination of Project Management. There is an almost universal belief that it is possible to predict ahead of time how long a project will take, how much it will cost and what will happen along the way. In the broadest sense, this is possible. If you have enough experience you can come up with ballpark figures; last time we did something similar it took this long and cost this much.

But some people believe Project Management should tell you these things down to the day and the dollar. A project plan should tell you every task that needs to be completed. A project plan should be flawless and leave nothing to chance. And a project plan should be completed before ANY work is done on the project.

Despite the fact this is clearly insanity, it is a terrifyingly common mindset in management ranks. Project planning and goals are obviously important at some level (otherwise how the hell would you know what you are doing?) but how did we move from "let's have a clearly defined set of project goals and a strategy for how we'll get there," to "this is 100% accurate, carved in stone and will never change"?

What are the alternatives? Well, anything but waterfall. And a dose of McConnell's Software Estimation: Demystifying the Black Art, which manages to cover the difficult topic of software estimation without a single mention of Microsoft Project.

Amazing.
