Algorithms Tick 2 2023

Algorithms Tick 2: Huffman compression

TL;DR

Write a Huffman compression and decompression module in Python (version 3.7 or higher), providing Huffman encoding and decoding of a sequence of fixed-size symbols, complying with all of the following constraints. You will find useful explanations of Huffman coding in the CLRS4 textbook, the lecture notes and the following video https://youtu.be/lhvWBasgULA, but the following constraints are specific to this problem.

Constraints

  1. The symbol size is fixed at one byte, thus the Huffman code has 256 codewords.

  2. Symbol frequencies must be floats between 0.0 and 1.0 included and must all add up to 1.

  3. To ensure that we all generate exactly the same Huffman code given the same symbol frequencies (important for the autotester to work), you must follow these two conventions about priorities. While building the Huffman code, there is a crucial step in the core algorithm where you extract two subtrees from a priority queue, combine them into one by making them children of a new root, and re-insert the new tree into the priority queue. When you do that, you must put the higher-priority tree onto the 0 branch and the other onto the 1 branch. The priority of a tree is given by the aggregate frequency of the symbols in its leaves (highest priority for lowest frequency, hence the use of a min-heap). If two trees have the same aggregate frequency, you must take as the highest-priority tree the one containing the leaf with the smallest symbol (byte).

  4. Provided you respect the above convention, you may use the Python library’s heapq module, if you wish, to implement the priority queue; but you may roll your own min-heap if you prefer.

  5. To generate a Huffman code for a certain distribution of symbol frequencies, feed the frequencies to the HuffmanCode.__init__ constructor. The created HuffmanCode object contains the bijection between symbols and codewords and may thereafter be used to encode sequences of fixed-length symbols into sequences of variable-length codewords, and vice versa.

  6. Implementation-wise, symbols and sequences of symbols are represented as Python bytes, while codewords and sequences of codewords are represented as bitstrings from the bitstring module (not in the standard library: see https://bitstring.readthedocs.io/en/stable/index.html and install with pip install bitstring). Note that this module requires Python 3.7 or above.

  7. The HuffmanCode.encode method takes a Python bytes object (an immutable sequence of bytes) and returns a bitstring obtained by substituting each byte with its corresponding codeword. Special padding that may be removed without ambiguity, described next, is added at the end to ensure the encoded file fits in an integer number of bytes.

  8. To pad a bit string of length n bits, add a 1 bit and then as many 0 bits as necessary to reach a multiple of 8. To remove padding, remove all the trailing 0 bits and then one 1 bit. Strings where n was already divisible by 8 acquire an extra byte through this padding. All other strings occupy the same number of bytes as before.

  9. The HuffmanCode.decode method takes a validly encoded and padded bitstring and returns the original sequence of bytes by first stripping the padding and then decoding each codeword into its corresponding byte.

  10. You must copy the supplied huffman_template.py to huffman.py and edit the latter to provide implementations of the following methods: __init__, encode, decode, codewordFor, paddingSuitableFor, removePadding, with the APIs defined in the corresponding docstrings. Wherever appropriate, your higher level methods should make good use of the lower level methods that you were required to implement. You must also implement the additional helper methods makeOccurrencesTable and occurrences2frequencies that generate a table of frequencies from a sample sequence of bytes. (Hint: it is probably a good idea to plan the structure of your code on paper before starting to type away in the editor.)

  11. In your final submission to Moodle, you may not import any other Python modules than the ones already imported by the supplied huffman_template.py.

Code that violates any of the above constraints may not earn a tick, even if it passes the autotester. Failing the autotester may help you discover some bugs in your code but passing the autotester is not a guarantee that your code complies with the above constraints nor that it is correct.

Do not cheat. Do not collaborate with others on the tick. Do not copy other people’s code. Do not share your code with others. Any of these actions is liable to attract a severe penalty.

What to do

Hints

Before starting to code, construct a non-trivial test case, work out the correct answer on paper and then go through all the steps that your code, once written, would have to execute.

You may construct a frequencyTable (the parameter required by the huffman.HuffmanCode.__init__ constructor) from a string of symbols by calling the makeOccurrencesTable and then the occurrences2frequencies helper methods. Of course a HuffmanCode object generated from the frequencies of a sample sequence may be used to encode any other sequence as well, even though it will perform better on sequences with symbol frequencies close to the one used to generate the code.

Ensuring the validity of a precondition is the job of the caller, not of the callee, so the callee is not responsible for checking that a precondition is met on entry. However, during development, adding such a check may help you debug your implementation by pointing out what the caller did incorrectly, rather than just leaving you to figure out why the callee crashed.

If you get stuck and find the tick too hard, try this huffman_template_with_hints.py instead of the regular huffman_template.py. It has comments with the additional classes, methods, signatures and docstrings (but not implementations) that I wrote for my own solution. They might give you ideas.

Tiny cheat sheet for the bitstring module

The bitstring module provides four classes. You might for simplicity use only the most versatile of them, bitstring.BitStream, which includes all the features of the others. Look up its methods in the docs for the parent classes. The “stream” part of the class name refers to its ability to extract the individual bits from the bitstring, which may come in useful when implementing HuffmanCode.removePadding. You may create bitstrings from various sources such as hex strings, binary strings, raw bytes and so forth.