Let's start with a string of text to encode with Huffman encoding:

HELLO YOU HOTTIE

Begin by counting each letter and the number of times it occurs:

H = 2
E = 2
L = 2
O = 3
Y = 1
U = 1
T = 2
I = 1

Now organize these letters based on the number of times they occur, with the highest occurrence at the top and the lowest occurrence at the bottom:

O = 3
H = 2
E = 2
L = 2
T = 2
Y = 1
U = 1
I = 1

Before we begin our Huffman tree we need to take the two smallest values, as these will go on the bottom of our tree. It doesn't matter which way round they go, but try to keep this branch down in the far left corner for now. Each time we do this we add what's called a parent node above the two nodes we have grouped, so we can use our parent node to access either of the two nodes added to it.
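The counting and sorting steps above can be sketched in a few lines of Python (the variable names here are just for illustration):

```python
from collections import Counter

# Spaces aren't part of the letter counts in this walkthrough
text = "HELLO YOU HOTTIE".replace(" ", "")

# Count each letter, then list them highest occurrence first
counts = Counter(text)
for letter, count in counts.most_common():
    print(letter, "=", count)
```

Running it prints the same frequencies as the sorted list above, starting with O = 3.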
Take the parent node (whatever you decide to call it) and add it to our list, giving it a value equal to the sum of the values found inside the node, so U = 1 and I = 1, therefore U + I = 2. Reorganize the list, taking away the original 'U' and 'I' we have now combined into a parent node. It should look like this:

O = 3
H = 2
E = 2
L = 2
T = 2
U + I = 2
Y = 1

Repeat the step again with the new table and this will become our second step in our Huffman tree. Here's our new list following that:

O = 3
U + I + Y = 3
H = 2
E = 2
L = 2
T = 2

You can see that already we are able to connect this up. (NOTE: make sure you put the older value to the left and the newer value branching off to the right; the reason will become apparent later.)
Repeat the same process to the list again. It should look like this:

T + L = 4
O = 3
U + I + Y = 3
H = 2
E = 2

Notice now when creating the Huffman tree we can't connect it up, but no worries, just continue like so, remembering new values to the right and old values to the left.
And again.

T + L = 4
H + E = 4
O = 3
U + I + Y = 3
And again.

U + I + Y + O = 6
T + L = 4
H + E = 4
You can see as this list gets smaller the tree comes closer to a point. This gives us the last two values before we connect it all up:

T + L + H + E = 8
U + I + Y + O = 6
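The repeated take-the-two-smallest-and-merge step is exactly what a priority queue automates. Here's one way to sketch it in Python using heapq (this is an illustration of the process above, not code from the original article):

```python
import heapq
from collections import Counter

text = "HELLOYOUHOTTIE"
counts = Counter(text)

# Each heap entry is (frequency, tie-breaker, node).
# A node is either a single letter (leaf) or a (left, right) pair (parent).
# The tie-breaker is just a unique number so equal frequencies never
# force Python to compare the nodes themselves.
heap = [(freq, i, letter) for i, (letter, freq) in enumerate(counts.items())]
heapq.heapify(heap)
next_id = len(heap)

while len(heap) > 1:
    # Take the two smallest values, just as in the steps above
    f1, _, left = heapq.heappop(heap)
    f2, _, right = heapq.heappop(heap)
    # The parent node's value is the sum of its two children
    heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
    next_id += 1

root_freq, _, root = heap[0]  # the finished Huffman tree
print(root_freq)              # 14, the total letter count
```

The final parent node's value always equals the length of the text, which is a handy sanity check.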
Finally we can connect it up with one final parent node. If you look, any part of the tree can be accessed by following a line representing a 1 or a 0 in binary: just follow to the right for a 1, and to the left for a 0. I like the analogy that it's like a computer system in which you open your root hard drive (the C drive, for example) and you have folders (parent nodes). The folders nearer the top of this tree hold the information you access most regularly, maybe on a daily basis, whilst folders nearer the bottom hold files rarely accessed, maybe on a monthly basis. In the same way you wouldn't have to click through folder after folder to access a file you use daily, here the program doesn't have to use so many bits to access the most common data within the file being compressed.
Let's have a look at our original chart again and see what binary codes have been assigned to each letter:

O = 3 = 01
H = 2 = 110
E = 2 = 111
L = 2 = 101
T = 2 = 100
Y = 1 = 001
U = 1 = 0000
I = 1 = 0001

Notice how the letters with the highest frequency of occurrence have the shortest binary codes, or bits, and every letter has a unique code with no code forming the start of another, so that when you put the binary together no two codes will accidentally run together to resemble another code from the generated codes. When using Huffman encoding a lookup table like the one above will be included with the file, and our text would look like this (without the spaces):
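Reading the codes off the tree is just a walk from the root: append a 0 going left and a 1 going right until you hit a letter. A small Python sketch, with the tree written out by hand as nested pairs to match the article's layout (left = 0, right = 1):

```python
def assign_codes(node, prefix="", table=None):
    # Walk the tree: the left branch appends a 0, the right branch a 1
    if table is None:
        table = {}
    if isinstance(node, str):        # leaf: a single letter
        table[node] = prefix
    else:                            # parent: a (left, right) pair
        left, right = node
        assign_codes(left, prefix + "0", table)
        assign_codes(right, prefix + "1", table)
    return table

# The finished tree from the walkthrough, as nested (left, right) pairs
tree = (((("U", "I"), "Y"), "O"), (("T", "L"), ("H", "E")))
codes = assign_codes(tree)
print(codes["O"], codes["U"])  # 01 0000
```

This reproduces the chart above: O gets the short code 01 while the rare letters U and I get four bits each.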
HELLO YOU HOTTIE

110 111 101 101 01 001 01 0000 110 01 100 100 0001 111

You might be asking yourself, "How does this save space?" It saves space because usually, to represent something, we give every value the same fixed bit depth. Text like the sample above in its raw uncompressed format is ASCII, which relies on 8-bit values, so any one letter takes eight 0s or 1s. That's 8 times the number of characters, which is 8 x 14 = 112, so in ASCII the above text would require 112 bits, but with our Huffman encoding it requires only 41 bits! Impressive considering that's a simple sentence just 3 words long, and we have achieved a compression ratio of nearly 3:1.
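You can check those numbers yourself by encoding the text with the lookup table and counting bits:

```python
# The lookup table from the chart above
codes = {"O": "01", "H": "110", "E": "111", "L": "101",
         "T": "100", "Y": "001", "U": "0000", "I": "0001"}

text = "HELLOYOUHOTTIE"
encoded = "".join(codes[letter] for letter in text)

huffman_bits = len(encoded)   # 41 bits
ascii_bits = 8 * len(text)    # 8 bits per letter = 112 bits
print(huffman_bits, ascii_bits)
```

The encoded string is the same run of 1s and 0s shown above, and the bit counts confirm the 41-versus-112 comparison.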