
Huffman Compression

Huffman coding is the most widespread way of replacing a set of fixed-size code words with an optimal set of variable-sized code words, based on the statistics of the input data. A Huffman code is constructed by first building a frequency distribution of the symbols; this is then used to decide the new compressed representation for each symbol. The easiest way to see this is to consider compressing text whose symbols are drawn from an alphabet of 256 characters. If all characters are equally likely to occur, then the data is hard to compress. However, if typical data (texts in this alphabet) shows a severe skew in the frequency distribution, then we can use fewer bits to represent the most frequently occurring characters/codeword values, and more bits for the less commonly occurring ones, with some compression gain.

So, how do we build this new coding scheme for the alphabet? The classic scheme is to construct a tree from the frequency distribution: take the two least frequently occurring characters/codeword values in the frequency chart, and add a bit position to their string representations, with value 1 for one and 0 for the other; then treat the pair as a single combined value and move on to the next least commonly occurring values. Eventually, the two most commonly occurring values take only two bits each: one to say that it is one of these two values rather than one of the 254 other values, and one to say which of the two it is. And so on...
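
For concreteness, here is a minimal sketch (in Python; not part of the original notes) of this bottom-up construction using a priority queue. The function name and the sample text are purely illustrative.

# A minimal sketch of Huffman code construction from a frequency
# distribution, using a priority queue of subtrees.
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a {symbol: bitstring} code from a {symbol: count} mapping."""
    # Each heap entry is (frequency, tie-breaker, tree), where a tree is
    # either a bare symbol or a (left, right) pair of subtrees.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least frequent subtrees into one.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):        # internal node: one more bit per branch
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                              # leaf: record the accumulated bits
            code[tree] = prefix or "0"
    walk(heap[0][2], "")
    return code

freqs = Counter("this is an example of huffman coding")
print(huffman_code(freqs))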

This scheme is optimal in only one case: when the probabilities of the codeword values are distributed as powers of 1/2, i.e. 1/2, 1/4, 1/8, 1/16, etc. The further the distribution departs from this, the less efficient the code becomes, because each value must still be represented by a whole number of bits.
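
A small worked example (the symbols and code are hypothetical) shows why: with probabilities that are powers of 1/2, the Huffman code lengths equal -log2(p), so the average code length exactly matches the entropy of the source.

# With probabilities 1/2, 1/4, 1/8, 1/8 a valid Huffman code assigns
# 1, 2, 3 and 3 bits, so average length and entropy both come to 1.75 bits.
from math import log2

probs = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
codes = {"a": "0", "b": "10", "c": "110", "d": "111"}   # one valid Huffman code

entropy = -sum(p * log2(p) for p in probs.values())
avg_len = sum(probs[s] * len(codes[s]) for s in probs)
print(entropy, avg_len)   # both print 1.75 bits per symbol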

A generalization of Huffman coding that avoids this latter problem is arithmetic coding, which encodes the message as a binary fraction, so that symbols are no longer restricted to a whole number of bits each. In practice, this can be computationally expensive.
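
The following is a minimal sketch of the interval-narrowing idea behind arithmetic coding, using Python floats for clarity; the probability model and message are assumptions, and real coders use fixed-point arithmetic with rescaling rather than floats.

# Each symbol narrows the current interval in proportion to its probability;
# the final message is identified by any number inside the final interval.
def arithmetic_encode(message, probs):
    """Return a fraction in [0,1) identifying the message under the model."""
    # Cumulative probability sub-intervals for each symbol.
    cum, low_bound = {}, 0.0
    for sym, p in probs.items():
        cum[sym] = (low_bound, low_bound + p)
        low_bound += p
    low, high = 0.0, 1.0
    for sym in message:
        # Narrow the current interval to this symbol's sub-interval.
        span = high - low
        s_low, s_high = cum[sym]
        low, high = low + span * s_low, low + span * s_high
    return (low + high) / 2   # any value inside the final interval will do

probs = {"a": 0.6, "b": 0.3, "c": 0.1}
print(arithmetic_encode("aabac", probs))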

If one is transmitting Huffman (or arithmetic) compressed data, the sender and receiver must also share the same codebook: the list of codeword values and their compressed representations must be the same at both ends. This can be done on a case-by-case basis, but is usually based on long-term statistics of the data (e.g. the frequency of occurrence of the letter `e' in written English is a well-known example of this sort of statistic).
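
As a sketch of the second approach (reusing the huffman_code function from the sketch above, with an illustrative, made-up frequency table): if both ends build their code from the same agreed long-term statistics, no codebook needs to accompany each message.

# Both ends derive the same codebook deterministically from shared statistics.
english_freqs = {"e": 127, "t": 91, "a": 82, "o": 75, "i": 70, "n": 67}  # illustrative counts only

sender_code = huffman_code(english_freqs)     # built at the sender
receiver_code = huffman_code(english_freqs)   # identical at the receiver
assert sender_code == receiver_code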

