BASH – GZIP or BZIP2?

BASH is a free software Unix shell written for the GNU Project. Its name is an acronym which stands for Bourne-again shell. The name is a pun on the name of the Bourne shell (sh), an early and important UNIX shell written by Stephen Bourne and distributed with Version 7 Unix circa 1978, and the concept of being born again. BASH was created in 1987 by Brian Fox. In 1990 Chet Ramey became the primary maintainer. BASH is the default shell on most GNU/Linux systems as well as on Mac OS X and it can be run on most UNIX-like operating systems. It has also been ported to Microsoft Windows using the POSIX emulation provided by Cygwin, to MS-DOS by the DJGPP project and to Novell NetWare.

AWK is a general purpose programming language that is designed for processing text-based data, either in files or data streams, and was created at Bell Labs in the 1970s. The name AWK is derived from the family names of its authors — Alfred Aho, Peter Weinberger, and Brian Kernighan; however, it is not commonly pronounced as a string of separate letters but rather to sound the same as the name of the bird, auk. awk, when written in all lowercase letters, refers to the UNIX or Plan 9 program that runs other programs written in the AWK programming language. AWK is an example of a programming language that extensively uses the string data type, associative arrays (that is, arrays indexed by key strings), and regular expressions. The power, terseness, and limitations of AWK programs and sed scripts inspired Larry Wall to write PERL. Because of their dense notation, all these languages are often used for writing one-liner programs. AWK is one of the early tools to appear in Version 7 UNIX and gained popularity as a way to add computational features to a UNIX pipeline. A version of the AWK language is a standard feature of nearly every modern UNIX-like operating system.
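
As a small illustration of those features, the sketch below uses a regular expression and an associative array in a short awk program wrapped in a shell function. The function name count_extensions and the sample input are hypothetical and serve only as an example.

```shell
#!/bin/bash
# Hypothetical helper: read file names on stdin and count how many end in
# each extension.
count_extensions() {
  awk '
    /\.[A-Za-z0-9]+$/ {        # regular expression: name ends in ".ext"
      ext = $0
      sub(/.*\./, "", ext)     # strip everything up to the last dot
      count[ext]++             # associative array keyed by the extension
    }
    END { for (ext in count) print ext, count[ext] }
  ' | sort                     # "for (ext in count)" has no fixed order
}
```

For example, printf 'a.gz\nb.bz2\nc.gz\nREADME\n' | count_extensions prints "bz2 1" followed by "gz 2"; README is skipped because it has no extension.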

The Script

It is often necessary to compress files to save (or make) space. With gzip and bzip2 both available on most Linux distributions, one is often left to ponder: which is better for a given file? Although it is commonly understood that bzip2 performs better as the file size grows, the following script settles the question for a specific file. It compresses the given file with both tools, compares the resulting file sizes with that of the original (uncompressed) file, and then retains the smallest of the three (uncompressed, gzipped or bzipped).
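
Before reading the full script, the comparison can be tried by hand. The following is a minimal sketch, assuming gzip and bzip2 are on the PATH; the function name compressed_sizes is hypothetical, and nothing is written to disk because the compressed output is only counted.

```shell
#!/bin/bash
# Hypothetical helper: print the size in bytes of a file as-is, gzipped and
# bzipped. The compressed data is piped to 'wc -c' and then discarded.
compressed_sizes() {
  local file=$1
  printf 'original %s\n' "$(wc -c < "$file" | tr -d ' ')"
  printf 'gzip %s\n'     "$(gzip  -c < "$file" | wc -c | tr -d ' ')"
  printf 'bzip2 %s\n'    "$(bzip2 -c < "$file" | wc -c | tr -d ' ')"
}
```

Piping the result through sort -k2,2n | head -1 shows which of the three is smallest for that file.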

#! /bin/bash
 
# BASH script that takes ONE filename as a (mandatory) argument
# and compresses it using both 'GZIP' and 'BZIP2'. It then compares
# the file sizes and retains whichever version is smallest.
# 
# Usage: ./optimal_compression.sh FILENAME
#
# 01 September, 2006
# Tue, 14 Oct 2008 12:53:11 -0400
# Sun, 19 Oct 2008 12:08:36 -0400
 
if [[ $# -ne 1 ]];
then
  # Display an error message when no (or more than one) file is specified as an argument
  echo >&2
  echo " You must specify a filename" >&2
  echo >&2
  exit 1
else
 
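  # Assign the supplied filename to a variable and make sure it exists
  FILENAME=$1
  if [[ ! -f "$FILENAME" ]]; then
    echo " $FILENAME is not a regular file" >&2
    exit 1
  fi

  # Other useful variables
  TODAY=$(date +"%Y%m%d_%H%M%S")
  BZIP=$(command -v bzip2)
  GZIP=$(command -v gzip)
  if [[ -z "$BZIP" || -z "$GZIP" ]]; then
    # An empty command would leave an empty "compressed" file behind
    echo " Both bzip2 and gzip must be installed" >&2
    exit 1
  fi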
 
  # Temporary files that contain compressed data
  BZIP_FILE="/tmp/tmp_$TODAY.bz2"
  GZIP_FILE="/tmp/tmp_$TODAY.gz"
 
 
  # Compress the file with bzip2 and gzip.
  # One may add '&' at the end of the two lines below and
  # uncomment the 'wait' line if really big files are being
  # compressed; running both at once may save time. However, the
  # definition of OPTIMUM might need some modification if done so.
  "$BZIP" < "$FILENAME" > "$BZIP_FILE"
  "$GZIP" < "$FILENAME" > "$GZIP_FILE"
  # wait
 
  # Compare file sizes of the original, bzipped and gzipped files
  # and determine which one has the smallest (optimal) value.
  # 'wc -c' reports sizes in argument order, so OPTIMUM is
  # 1 (original), 2 (bzip2) or 3 (gzip); ties keep the lower number.
  OPTIMUM=$(wc -c "$FILENAME" "$BZIP_FILE" "$GZIP_FILE" | head -3 | \
            awk '{print $1":"NR}' | sort -t ':' -k1,1n -k2,2n | \
            awk -F ':' '{ print $2 }' | head -1)
 
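  case "$OPTIMUM" in
    1 ) echo
        echo " $FILENAME not compressed."
        echo
        ;;
    2 ) echo
        echo " $FILENAME compressed with $BZIP => $FILENAME.bz2"
        echo
        # Delete the original only after the compressed copy is in place
        mv "$BZIP_FILE" "$FILENAME.bz2" && rm -f "$FILENAME"
        ;;
    3 ) echo
        echo " $FILENAME compressed with $GZIP => $FILENAME.gz"
        echo
        mv "$GZIP_FILE" "$FILENAME.gz" && rm -f "$FILENAME"
        ;;
  esac

  # Remove leftover temporary files; the original file is kept
  # unless a compressed copy replaced it above
  rm -f "$BZIP_FILE" "$GZIP_FILE"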
fi
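
The comment in the script about appending '&' and uncommenting 'wait' can be sketched as a small function. This is only an illustration and not part of the script above: the function name compress_both and its argument order are assumptions, and it calls bzip2 and gzip directly from the PATH.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): compress one file with
# bzip2 and gzip at the same time, writing to the two given output paths.
compress_both() {
  local file=$1 bz_out=$2 gz_out=$3
  local bz_pid gz_pid status=0

  bzip2 -c < "$file" > "$bz_out" &   # start bzip2 in the background
  bz_pid=$!
  gzip -c < "$file" > "$gz_out" &    # start gzip alongside it
  gz_pid=$!

  # Wait for both jobs; report failure if either compressor failed
  wait "$bz_pid" || status=1
  wait "$gz_pid" || status=1
  return "$status"
}
```

Because both jobs finish in an unpredictable order, any size comparison that follows should identify the files by name or argument position (for example with wc -c) and not by modification time.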


[Screenshot: Optimal Compression]

2 Replies to “BASH – GZIP or BZIP2?”

  1. A jpeg file is probably a bad example – jpeg images are already compressed, and so you’re really testing how well they do on /already compressed/ data – which is going to be up to random chance, really.

  2. @bd_:

    Thanks for the info about JPEG images. I did notice that – one JPEG gave me BZIP2 while this one gave me GZIP; but had no idea why it was so :(

    For screenshot purposes, I couldn’t easily find a file on my computer that would get compressed via GZIP.
