Stefan Seidel

Log file compression

Here's something which has bothered me for a long time: I knew that xz is capable of very good compression. I knew that bzip2 is better and slower than gzip. But I never knew how much. So I decided to take one single application and test it. My use case is compressing log files of a server, in this case Java application server logs. The system spits out up to 6GB per day, so good compression is very important.

For the test, I took one of the smaller files, containing 1094 MiB of pure text data, lots of repetitions, lots of numbers.

And I ran a couple of tests - compressing it with bzip2, gzip and xz in all available presets.
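A minimal sketch of such a benchmark, using Python's stdlib bindings for all three formats instead of the command-line tools (the repeated log line is just a stand-in for real data; absolute times will differ from the external `gzip`/`bzip2`/`xz` binaries):

```python
import bz2
import gzip
import lzma
import time

# Stand-in for a real log file: highly repetitive ASCII text
data = b"2023-01-01 12:00:00 INFO request handled in 13 ms\n" * 20000

def bench(name, compress):
    """Time one compressor call and report size ratio and wall time."""
    t0 = time.perf_counter()
    out = compress(data)
    dt = time.perf_counter() - t0
    print(f"{name}: ratio {len(out) / len(data):.2%}, {dt:.3f}s")

for level in (1, 9):
    bench(f"gzip -{level}", lambda d, l=level: gzip.compress(d, compresslevel=l))
    bench(f"bzip2 -{level}", lambda d, l=level: bz2.compress(d, compresslevel=l))
    bench(f"xz -{level}", lambda d, l=level: lzma.compress(d, preset=l))
```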

Here are some observations beforehand:

  • gzip is always faster than any other compressor
  • gzip -9 compresses better than bzip2 -1 but worse than bzip2 -2
  • bzip2 is very slow compared to the compression ratio achieved
  • xz has a distinct drop in performance between levels 3 and 4; level 4 also compresses worse than level 3 - explanation later

Now keep in mind that this is a special use case - a big plain ASCII text file. It's not an academic one though, since log files are quite relevant in many applications.

Now let's look at some graphs, here's compression ratio and time, sorted by ratio ascending.

[Chart: Compression time and ratio, sorted by ratio]

We notice some patterns here:

  • gzip and bzip2's execution times and compression ratios are coherent with the compression level: higher level equals better compression and slower execution
  • same goes for xz level 0 to 3 and xz level 4 to 9

Here is an overview of the maximum and minimum times and ratios:

compressor   min ratio   max ratio   min time   max time
gzip            7.01%      10.67%      14.4s      31.4s
bzip2           4.85%       8.01%       295s       592s
xz 0-3          4.68%       5.52%        48s       133s
xz 4-9          4.54%       5.42%       193s       755s

Ok, so why did I separate the xz levels like this? It becomes obvious if you take a look at this table of the preset parameters (normal presets on the left, extreme presets on the right):

level  dict_size  mode    mf   nice  depth | level  mode    mf   nice  depth
0        262144   fast    hc3   128      4 |  0e    normal  bt4   273    512
1       1048576   fast    hc4   128      8 |  1e    normal  bt4   273    512
2       2097152   fast    hc4   273     24 |  2e    normal  bt4   273    512
3       4194304   fast    hc4   273     48 |  3e    normal  bt4   192      0
4       4194304   normal  bt4    16      0 |  4e    normal  bt4   273    512
5       8388608   normal  bt4    32      0 |  5e    normal  bt4   192      0
6       8388608   normal  bt4    64      0 |  6e    normal  bt4   273    512
7      16777216   normal  bt4    64      0 |  7e    normal  bt4   273    512
8      33554432   normal  bt4    64      0 |  8e    normal  bt4   273    512
9      67108864   normal  bt4    64      0 |  9e    normal  bt4   273    512

Now, the extreme modes aside, the notable difference is that between levels 3 and 4, the mode changes from "fast" to "normal" and the match finder changes from hash chains to binary trees; the man page also states that "Hash Chains are usually used together with mode=fast and Binary Trees with mode=normal." So we can conclude that, obviously, the fast mode really is fast. The question is: are hash chains generally better for text compression, or is the drop in compression ratio from level 3 to 4 a side effect of the much lower "nice" value or of the depth value (depth 0 means the encoder determines a reasonable depth from mf and nice)? I don't have a definite answer to that question, but I do know that with identical settings, bt4 compresses the log file in question more slowly than hc4 (by a factor ranging from 2 to more than 5) while achieving roughly the same compression ratio.
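The hc4-vs-bt4 comparison is easy to reproduce with Python's lzma module, which exposes the same match-finder settings as liblzma (a sketch; the data and the 4 MiB dictionary are placeholders, and absolute timings depend on the machine):

```python
import lzma
import time

# Repetitive ASCII stand-in for a log file
data = b"2023-05-01 00:00:01 WARN retrying connection to db-01\n" * 40000

for mf, name in ((lzma.MF_HC4, "hc4"), (lzma.MF_BT4, "bt4")):
    # Identical settings except for the match finder
    filters = [{"id": lzma.FILTER_LZMA2, "mode": lzma.MODE_NORMAL,
                "mf": mf, "nice_len": 273, "dict_size": 1 << 22}]
    t0 = time.perf_counter()
    out = lzma.compress(data, filters=filters)
    print(f"{name}: {len(out)} bytes in {time.perf_counter() - t0:.3f}s")
```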

Here's another presentation of the same data:

[Chart: Compression time and ratio, sorted by time]

Again, it's obvious that gzip is the fastest, although level 9 is almost indistinguishable from level 8. The xz 0-3 levels show a significant improvement in compression; and while the processing time increases almost linearly across the sorted modes, the compression ratios vary greatly. Again, all the more evidence to avoid bzip2 for this kind of file.

Finally, I wanted a "tangible" value for the "goodness" of a certain compressor mode. Since compression will always be a tradeoff between time and size, the value of a compressor can be expressed in terms of these two factors. The lower each is, the better, but you need to weigh them. I decided to use this formula:

value = sqrt(time in seconds) * (compression ratio)^2
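As a sketch, this rating can be computed for the corner values from the min/max table above (the weighting is of course arbitrary; lower is better):

```python
import math

def rating(seconds: float, ratio: float) -> float:
    """Lower is better: sqrt(time) weighs speed, ratio^2 weighs size."""
    return math.sqrt(seconds) * ratio ** 2

# Best-ratio corner points from the overview table above
print(f"gzip  (7.01% in 31.4s): {rating(31.4, 0.0701):.5f}")
print(f"xz -3 (4.68% in  133s): {rating(133, 0.0468):.5f}")
print(f"bzip2 (4.85% in  592s): {rating(592, 0.0485):.5f}")
```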

[Chart: Compression ratio and rating, sorted by rating]

To my surprise, xz -0 wins; the lower xz levels are followed by gzip, with bzip2 being the worst of the bunch. I would have expected the rating to improve towards xz -3; that bzip2 is at the low end of the spectrum was to be expected.

Moral of the story

Do you have large ASCII text files to compress?

  • no: either you don't have a problem, or you need to do similar research for your own data
  • yes: how much CPU power do you want to spend on it?
      • very little: use gzip
      • it depends: use xz -0 to xz -3, or manually fine-tuned settings
      • I don't care: use xz -9, or manually fine-tuned settings
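The decision above boils down to a simple lookup; here is a hypothetical helper encoding it (the function and its keys are mine, not from any library):

```python
def pick_compressor(cpu_budget: str) -> str:
    """Map a CPU budget for large ASCII text files to a compressor choice."""
    return {
        "very little": "gzip",
        "it depends": "xz -0 to xz -3, or manually fine-tuned",
        "I don't care": "xz -9, or manually fine-tuned",
    }[cpu_budget]

print(pick_compressor("very little"))  # → gzip
```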

Fine-tuning xz

There is something to be said for fine-tuning the parameters of xz. Although it seems overwhelming at first, it's certainly worth playing with them a little if you want to fine-tune your compression. Here's an overview:

  • mode (fast, normal): match-finder mode; fast is only really fast with mf=hc. Speed: normal is a factor 2 slower with mf=bt and a factor >5 slower with mf=hc. Compression: minimal impact in this test.
  • mf (hc3, hc4, bt2, bt3, bt4): match finder - hash chain or binary tree. Speed: bt is much slower than hc. Compression: in this test, with identical other parameters, bt fared better than hc, but by a very small margin.
  • dict (4k - 1.5G): dictionary size, larger is better. Speed: the larger the slower; RAM usage also increases. Compression: the larger the better, although it doesn't make sense to make it larger than the input file.
  • nice (2-273): length of a "nice" match. Speed: with mode=fast, higher values are somewhat slower; with mode=normal, a lot slower. Compression: the higher the better, but no huge gains.
  • depth (4-100 for mf=hc, 16-1000 for mf=bt): maximum search depth for the match finder. Speed: CPU impact slightly lower than dict, but definitely noticeable - this is a good second parameter to balance speed vs. compression. Compression: higher values improve it, even more so with large input files and a large dict size.
  • pb (0-4): position bits; for text files this should be set to 0. Speed: almost no impact. Compression: pb=0 gives slightly better compression for this use case.

Using this knowledge, I was able to compress the tested file down to 36.4MiB (down from 49.7MiB at xz -9, total compression ratio is 1:30 vs. 1:22 at xz -9), although it took almost 51 minutes and a lot of RAM, using this command line:

xz --lzma2=mode=normal,mf=hc4,nice=273,depth=960,dict=512M,pb=0

A more balanced approach would be

xz --lzma2=mode=fast,mf=hc4,nice=273,depth=64,dict=512M,pb=0

which compressed it to 47.7MiB in just 4:14 minutes (still faster and much smaller than any of the bzip2 compression levels, and faster and smaller than xz levels 5 to 9), but of course with the huge 512MiB dictionary, the memory usage is also enormous.
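For scripting, the same kind of tuned filter chain can be expressed through Python's lzma bindings. This is a sketch mirroring the balanced command line above, with the dictionary shrunk to 1 MiB and placeholder data so it runs in modest memory:

```python
import lzma

# Approximation of mode=fast,mf=hc4,nice=273,depth=64,dict=512M,pb=0;
# dict_size reduced to 1 MiB (the real run used 1 << 29 = 512 MiB)
filters = [{
    "id": lzma.FILTER_LZMA2,
    "mode": lzma.MODE_FAST,
    "mf": lzma.MF_HC4,
    "nice_len": 273,
    "depth": 64,
    "dict_size": 1 << 20,
    "pb": 0,
}]

log = b"GET /api/v1/items 200 17ms\n" * 50000  # stand-in for a real log
packed = lzma.compress(log, format=lzma.FORMAT_XZ, filters=filters)
print(f"{len(log)} -> {len(packed)} bytes ({len(packed) / len(log):.2%})")
```

Note that whoever decompresses the file needs enough RAM to hold the dictionary, so a 512 MiB dict also raises the decompression memory requirement.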

All tests were performed on an AMD Athlon 64 X2 4600+ (2.4GHz) with 8GB of RAM, so all the tests ran almost entirely in RAM; and the xz version used was 5.0.0 from Debian Squeeze.