Log file compression
For the test, I took one of the smaller files, containing 1094 MiB of pure text data, lots of repetitions, lots of numbers.
And I ran a series of tests, compressing it with bzip2, gzip, and xz at every available preset.
Here are some observations beforehand:
- gzip is always faster than any other compressor
- gzip -9 compresses better than bzip2 -1 but worse than bzip2 -2
- bzip2 is very slow compared to the compression ratio achieved
- xz shows a distinct drop in speed between levels 3 and 4; level 4 also compresses worse than level 3 - explanation later
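The benchmark itself is trivial to reproduce. Here's a sketch of the kind of loop I mean - not the original script; "big.log" is a small generated stand-in for the real 1094 MiB file so the snippet is self-contained:

```shell
# Generate a small, repetitive stand-in for the real log file.
seq 1 5000 | sed 's/^/2011-01-01 12:00:00 INFO request /;s/$/ handled in 3 ms/' > big.log

# Time each compressor at a couple of presets and record the output size.
for level in 1 9; do
  for tool in gzip bzip2 xz; do
    start=$(date +%s)
    "$tool" -c -"$level" big.log > "big.log.$level.$tool"
    end=$(date +%s)
    size=$(wc -c < "big.log.$level.$tool")
    echo "$tool -$level: $size bytes, $((end - start)) s"
  done
done
```

For real measurements you'd of course use the actual file and all presets, and average over several runs.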
Now keep in mind that this is a special use case - a big plain ASCII text file. It's not an academic one though, since log files are quite relevant in many applications.
Now let's look at some graphs. Here are compression ratio and time, sorted by ratio in ascending order.
We notice some patterns here:
- gzip and bzip2's execution times and compression ratios are coherent with the compression level: higher level equals better compression and slower execution
- the same goes for xz levels 0 to 3 and, separately, xz levels 4 to 9
Here's an overview of the maximum and minimum times and ratios:

| compressor | min ratio | max ratio | min time | max time |
|---|---|---|---|---|
OK, so why did I separate the xz levels like this? It becomes obvious if you take a look at this chart:
Now, the extreme modes aside, the notable difference is that between levels 3 and 4, the mode changes from "fast" to "normal" and the match finder algorithm changes from hash chains to binary trees; the man page also states that "Hash Chains are usually used together with mode=fast and Binary Trees with mode=normal." So we can conclude that, obviously, the fast mode really is fast. The question is: are hash chains generally better for text compression, or is the decreased compression ratio from level 3 to 4 a side effect of the much lower "nice" value or of the depth value (depth 0 means the encoder determines a reasonable depth from mf and nice)? I don't have a definite answer to that question, but I do know that with identical settings, bt4 compresses the log file in question more slowly than hc4 (by a factor ranging from 2 to more than 5) while achieving roughly the same compression ratio.
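You can isolate the match finder yourself by forcing hc4 and bt4 at otherwise identical settings via `--lzma2`. The file and the parameter values here are illustrative, not my original test:

```shell
# Generated stand-in file for the experiment.
seq 1 5000 > sample.log

# Same preset, same nice value, only the match finder differs.
for mf in hc4 bt4; do
  start=$(date +%s)
  xz -c --lzma2=preset=3,mf=$mf,nice=273,depth=0 sample.log > "sample.log.$mf.xz"
  end=$(date +%s)
  echo "$mf: $(wc -c < "sample.log.$mf.xz") bytes, $((end - start)) s"
done
```

On a tiny sample the timing difference is lost in the noise; the effect only shows on large inputs.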
Here's another presentation of the same data. Again, it's obvious that gzip is the fastest, although level 9 is almost indistinguishable from level 8. There is a significant improvement in compression across the xz 0-3 levels, with processing time increasing almost linearly while compression ratios vary greatly. Again, all the more evidence to avoid bzip2 for this kind of file.
Finally, I wanted a "tangible" value for the "goodness" of a certain compressor mode. Since compression will always be a tradeoff between time and size, the value of a compressor can be expressed in terms of these two factors. The lower each is, the better, but you need to weigh them against each other. I decided to use this formula:
value = sqrt(time in seconds) * (compression ratio)^2
To my surprise, xz -0 wins; the lower xz levels are followed by gzip, with bzip2 being the worst of the bunch. I would have expected the value to increase towards xz -3; that bzip2 is at the low end of the spectrum was to be expected.
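For reference, the score can be computed like this, assuming "compression ratio" here means compressed size over original size (so lower is better for both factors). The input numbers are made up for illustration, not measured results:

```shell
# value = sqrt(time in seconds) * ratio^2; lower is better.
# Arguments: time in seconds, original size, compressed size (same unit).
score() {
  awk -v t="$1" -v orig="$2" -v comp="$3" \
    'BEGIN { r = comp / orig; printf "%.4f\n", sqrt(t) * r * r }'
}

score 30 1094 100    # hypothetical run: 30 s, 1094 MiB -> 100 MiB; prints 0.0458
```

The square on the ratio weights output size more heavily than time, which is exactly the tradeoff the formula is meant to encode.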
Moral of the story
You have large ASCII text files to compress? If not, either you don't have a problem or you need to do similar research. If yes, the question is how much CPU power you want to spend on it:

| how much CPU power do you want to spend on it? | recommendation |
|---|---|
| very little | use gzip |
| it depends | use xz -0 to xz -3, or manually fine-tuned |
| I don't care | use xz -9 or manually fine-tuned |
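The decision above boils down to a one-liner; this little helper is purely an illustration of the logic, not anything you'd actually need:

```shell
# Pick a compressor command for large ASCII text files based on how
# much CPU time you are willing to spend. Purely illustrative.
pick_compressor() {
  case "$1" in
    "very little") echo "gzip" ;;
    "it depends")  echo "xz -2" ;;   # anywhere in xz -0 .. xz -3, or fine-tuned
    *)             echo "xz -9" ;;   # CPU time doesn't matter
  esac
}

pick_compressor "very little"   # prints gzip
```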
There is something to be said for fine-tuning the parameters of xz. Although it seems overwhelming at first, it's certainly worth playing with them a little bit if you want to fine-tune your compression. Here's an overview:
| parameter | values | explanation | impact on speed | impact on compression |
|---|---|---|---|---|
| mode | fast, normal | match finder mode; fast is only really fast with mf=hc | normal is a factor of 2 slower with mf=bt and a factor of >5 slower with mf=hc | in this test, minimal |
| mf | hc3, hc4, bt2, bt3, bt4 | match finder - hash chain or binary tree | bt is much slower than hc | in this test, with identical other parameters, bt fared better than hc, but by a very small margin |
| dict | 4k - 1.5G | dictionary size, larger is better | the larger, the slower; RAM usage also increases | the larger, the better, although it doesn't make sense to make it larger than the input file |
| nice | 2-273 | length of a "nice" match | with mode=fast, higher values are somewhat slower; with mode=normal, higher values are a lot slower | the higher, the better, but no huge gains |
| depth | 4-100 for mf=hc, 16-1000 for mf=bt | maximum search depth for the match finder | CPU impact slightly lower than dict, but definitely noticeable - this is a good second parameter to balance speed vs. compression | higher values improve compression, even more so with large input files and a large dict size |
| pb | 0-4 | position bits; for text files this should be set to 0 | almost none | pb=0 gives slightly better compression for this use case |
Using this knowledge, I was able to compress the tested file down to 36.4MiB (down from 49.7MiB at xz -9, total compression ratio is 1:30 vs. 1:22 at xz -9), although it took almost 51 minutes and a lot of RAM, using this command line:
A more balanced approach would be
which compressed it to 47.7MiB in just 4:14 minutes (which is still faster and much smaller than any of the bzip2 compression levels, as well as faster and smaller than xz levels 5 to 9), but of course with the huge 512MiB dictionary, the memory usage is also enormous.
All tests were performed on an AMD Athlon 64 X2 4600+ (2.4GHz) with 8GB of RAM, so all the tests ran almost entirely in RAM; and the xz version used was 5.0.0 from Debian Squeeze.