July 5, 2016

Huffman Tutorial Part 3

  1. Huffman Headers
    1. Header Format
      1. Limitations
  2. Assignment for This Part

In this part, we'll figure out how to sketch headers for your Huffman program!

Huffman Headers

Ha! Alliteration!

There are 2 options for storing the tree information:

  1. Use character frequencies
  2. Use the actual tree

If you use character frequencies, the header should be smaller than the other option, which is the actual tree. Of course, you would have to completely regenerate the tree from the character frequencies when you decode it. This is an example of sacrificing speed for space.

If you use the actual tree, the header should be a lot bigger, since there are lots of nodes (branches, leaves) on the tree. On the other hand, you won't need to regenerate the tree. This is an example of sacrificing space for speed.

For the headers, you won't want them to be too big - they are already big enough and take up 70-80% of small, compressed files. You also don't want them to be too small - it would limit you. Which is why I chose option #1.

Header Format

Here is the header format that I use:

NameSizeNote
Size2 bytesThe total size of the remaining header
Data????The character and frequencies

Here is the format for an entry in the data:

NameSize (bytes)Note
Character1
Frequency4The number of times a character appears

The total size of one entry is 5 bytes.

Limitations

Because I set the letter frequency to be 4 bytes in length, this system can safely handle ~4GB of any character. According to the standard distribution of the English language, having the letter 'e' appear 12.702% means that the average maximum file size this program can handle is ~31GB. That, of course, is not accounting for the space () character, which appears more often. Counting spaces, the system can handle ~21GB of plain-text.

Now, where can I find a 21GB plain-text file to test it on?

Assignment for This Part

  • :star: find/create a header format that best suits your purpose
  • document said format with tons of details
  • implement said format (no need for the binary data, just the header) by writing the header to a file
  • check the correctness of the format using binary/hex viewing tools such as xxd (debugging, basically)

(:star: denotes a challenging task. :star2: denotes an even more challenging task.)

Tags: compression Huffman encoder c++ data headers