Compression and Cliffs

I set up a number of Windows XP SP2 Virtual PC base images today. A WinXP SP2 clean install, after visiting Windows Update, is 1.70 gigabytes. Building up a few baseline images like this can chew up a substantial amount of disk space and network bandwidth. So, taking a page from Jon Galloway's book, I decided to see what I'd get if I compressed the virtual hard drive file. My results?

App              Size           Time taken (approx.)
WinZip 9.0 SR-1  880 megabytes  3 minutes
7Zip 4.20        739 megabytes  22 minutes

(All apps were used with their out-of-the-box defaults.) I do end up with a file that is about 16% smaller, but it takes 7.3 times longer. That sure doesn't seem like a very good deal to me. Now, in fairness to Jon, his only goal was to squeeze a largish 10 GB VHD image onto a single 4.7 GB DVD-R; compression time wasn't a criterion.

Although this is my first exposure to 7zip, I've run these kinds of comparisons before with ZIP and RAR and reached the same conclusion. Algorithms certainly differ in efficiency, but no matter which compression algorithm you choose, beyond a certain optimal compression level, performance falls off a cliff. After you hit that point, you'll spend obscene amounts of time for rapidly diminishing benefits. Nowhere is this better illustrated than in Wim Heirman's Practical Compressor Test results:

GIMP source compression results, compression ratio vs. time

Note that the scale on the bottom of the graph is logarithmic. This is the only comparison I could find that properly expresses compression as the zero-sum game it really is: you can either have efficiency, or you can have speed. That's why, except for the truly obsolete algorithms, you see the "diagonal line" effect on this graph: better compression algorithms always take longer. Sometimes a lot longer. If you're holding out for Hyper Turbo Extreme Compression X algorithm, you may be waiting a while. Consider RAR, which offers the best blend of compression and speed currently available:

Level  Time (secs)  Compression ratio  Time factor  Gain
-m1    5.7          22.1%              1x           -
-m2    28.3         14.5%              5x longer    7.6% smaller
-m3    40.2         13.4%              7x longer    8.7% smaller
-m4    40.2         13.1%              7x longer    9.0% smaller
-m5    46.7         12.5%              8x longer    9.6% smaller

When it takes 5 times longer for barely 8% more compression, you've fallen off a cliff. But it still might be worth the extreme cost, depending on your goals. For most uses, it boils down to these three questions:

  1. How often will your data be compressed?
  2. How many times will it be transferred?
  3. How fast is the network?

Decompression time in this scenario is usually a tiny, relatively constant fraction of the compression time, so it's not a factor. Wim provides a helpful calculator to assist in making this decision:

total time = compression time + n * (compressed file size / network speed + decompression time)

For instance, if you compress a file to send it over a network once, n equals one and compression time will have a big influence. If you post a file to be downloaded many times, n is big, so long compression times will weigh less in the final decision. Finally, slow networks will do best with a slow but efficient algorithm, while fast networks call for a speedy, possibly less efficient algorithm.
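
To make that concrete, here's a minimal sketch of Wim's formula as a Python function. The function and variable names are mine; the sizes and compression times are the ones from my VHD test above, but the decompression time, network speed, and download count are made-up inputs.

    # A minimal sketch of Wim's formula, not his actual calculator.
    # Function and variable names are my own.
    def total_time(compression_time_s, compressed_size_mb, decompression_time_s,
                   network_speed_mbps, downloads):
        """Seconds spent overall: compress once, then transfer and decompress
        the archive `downloads` times."""
        transfer_time_s = compressed_size_mb * 8 / network_speed_mbps  # MB -> megabits
        return compression_time_s + downloads * (transfer_time_s + decompression_time_s)

    # Sizes and compression times are from the VHD test above; the 30 second
    # decompression time, 10 megabit/s link, and 50 downloads are made up.
    for name, comp_s, size_mb in [("WinZip", 3 * 60, 880), ("7zip", 22 * 60, 739)]:
        print(name, round(total_time(comp_s, size_mb, 30, 10, 50) / 60), "minutes")

With those made-up numbers, the slower but tighter 7zip archive comes out ahead; drop the download count to one, or speed the link up to 100 megabits, and WinZip wins. That's exactly the tradeoff Wim describes.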

Of course, there are still a few variables Wim's page hasn't considered. Most notably, he only compresses a single file (the GIMP source TAR file), which has two consequences:

  1. Filetype-specific compression can perform far better than the generic compression considered here. Compression tailored to file contents (e.g., lossless audio compression) is generally a huge win.
  2. When compressing groups of files, programs that can do "solid" archiving will always outperform those that can't. Solid archiving means that the files are compressed as one giant binary blob rather than as a bunch of individual files. This provides a much higher level of overall compression due to (generally) repeated data between files; see the sketch after this list.
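
Here's a rough sketch of that solid-archive effect using Python's zlib. zlib isn't a solid archiver per se, and the sample "files" are synthetic, deliberately sharing most of their content, but it makes the point: compressing them as one blob beats compressing them one at a time.

    # A rough illustration of the solid-archive effect; the "files" are
    # synthetic and deliberately share most of their content.
    import zlib

    shared_header = b"Copyright (c) 2005 Example Corp. All rights reserved.\n" * 5
    files = [shared_header + b"unique body of file %d\n" % i for i in range(200)]

    # Non-solid: each file compressed on its own, so the shared header is
    # re-encoded 200 times.
    individual = sum(len(zlib.compress(f, 9)) for f in files)

    # Solid: everything concatenated and compressed as one blob, so data
    # repeated between files is only encoded once.
    solid = len(zlib.compress(b"".join(files), 9))

    print("individual:", individual, "bytes")
    print("solid:     ", solid, "bytes")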

No one set of benchmarks offers a complete picture. Most other compression benchmark pages tend to focus on absolute compression ratios to the detriment of all other variables, which is a little crazy once you've fallen off the cliff. On Wim's page, the slowest three times are 198 (7zip), 47 (rar), and 43 (bzip2) seconds respectively. Some of the more extreme space-optimized compression algorithms can take several hours to compress the same file!
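
If you want to get a feel for where the cliff sits for your own data, a quick sweep over zlib's compression levels will show the same shape. This is just an illustrative sketch, not Wim's methodology: one run, wall-clock timing, a single algorithm.

    # Quick-and-dirty sweep of zlib's compression levels to see the
    # time-vs-ratio cliff on a file of your choosing. Not a rigorous
    # benchmark: one run, wall-clock timing, one algorithm.
    import sys
    import time
    import zlib

    data = open(sys.argv[1], "rb").read()

    for level in range(1, 10):
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        ratio = len(compressed) / len(data) * 100
        print(f"level {level}: {ratio:5.1f}% of original, {elapsed:6.2f} seconds")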
