From the getting your cores off the sofa department
Every now and then there are really nifty multi-core applications that help with some of the more mundane Linux HPC chores. The -j option for make is one such example. I recently stumbled upon two other applications that take advantage of multi-core for file compression.
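For example, on a quad-core machine a build can be spread across the cores by telling make how many jobs to run at once (the job count here is just an example):
$ make -j 4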
The two packages are pigz (parallel gzip) and pbzip2 (parallel bzip2). As you might surmise, each application uses multiple cores to speed up the compression of large files. I checked to see if these were in the Scientific Linux Yum repository, but they did not show up when I tried to install them. Thus, I decided to build them myself.
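The check was something like the following (assuming the package names pigz and pbzip2):
$ yum list pigz pbzip2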
Downloading and building was simple (make sure you have the bzip2-devel and zlib-devel libraries installed). I actually created spec files for them and produced my own RPMs for use in my collection of cluster packages.
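For reference, the build boiled down to something like the following (version numbers are placeholders, and the spec file name is just an example; check the project pages for current releases):
$ tar xzf pigz-X.Y.tar.gz
$ cd pigz-X.Y && make            # needs zlib-devel
$ cd ..
$ tar xzf pbzip2-X.Y.Z.tar.gz
$ cd pbzip2-X.Y.Z && make        # needs bzip2-devel
$ rpmbuild -ba pigz.spec         # optional: roll your own RPM from a spec file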
Of course, building packages is only half the fun. I decided to test the packages on my Limulus personal cluster. The head node is currently a quad-core Intel Q6600 running Scientific Linux 6.0. In order to create a large file, I used tar and archived /usr/share (which should contain a mix of file types):
$ tar cvf usr.share.tar /usr/share
The resulting file was 1.1GB, which was big enough for my simple tests. I then tried both the serial and the parallel versions on the file, both compressing and decompressing. Note that decompression is mostly a serial task and thus does not benefit much from multiple cores. To be a bit more formal, I created a simple script (see below) that runs all the tests and checks to see if the file survived. I then placed the results in the following table.
Version | Time (seconds) | Speed-up | Compressed Size | % of Original |
---|---|---|---|---|
Sequential gzip | 97 | 1 | 461MB | 42 |
Parallel pigz | 27 | 3.6 | 460MB | 42 |
Sequential bzip2 | 233 | 1 | 413MB | 38 |
Parallel pbzip2 | 69 | 3.4 | 414MB | 38 |
Table One: Results for sequential and parallel compression of a 1.1GB file
The results were impressive. Basically, I saw a ~3.5 times speed-up for either method. Sequential bzip2 compression takes much longer than sequential gzip, but the compression can be much better. In this case, the extra compression was not all that great, but in cases where the file is more "compressible," bzip2 will create much smaller files. For its part, pbzip2 can make bzip2 compression times much more tolerable. Overall, a nice pay-off for less than an hour of compiling and testing. Now I'll get back to my regular work, which quite interestingly involves creating some disk images with dd. Now all I need is a fast compression tool ...
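As a sketch of what that might look like (the device and file names here are purely illustrative), pigz drops right into a dd pipeline:
$ dd if=/dev/sdb bs=1M | pigz > disk-image.gz
$ gunzip -c disk-image.gz | dd of=/dev/sdb bs=1M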
Simple Test Script
#!/bin/bash

# Create file
tar cvf usr.share.tar /usr/share/

# Initial file size
echo "Initial size and md5"
md5sum usr.share.tar
ls -lh usr.share.tar

# gzip
echo "Sequential gzip"
time gzip usr.share.tar
ls -lh usr.share.tar.gz
gunzip usr.share.tar.gz

# pigz
echo "Parallel pigz"
time pigz usr.share.tar
ls -lh usr.share.tar.gz
unpigz usr.share.tar.gz

# bzip2
echo "Sequential bzip2"
time bzip2 usr.share.tar
ls -lh usr.share.tar.bz2
bunzip2 usr.share.tar.bz2

# pbzip2
echo "Parallel pbzip2"
time pbzip2 usr.share.tar
ls -lh usr.share.tar.bz2
bunzip2 usr.share.tar.bz2

# Final file size
echo "Final size and md5"
md5sum usr.share.tar
ls -lh usr.share.tar