Hello, Nuggets, I’m Silent King 2.

Huffman Coding is an entropy-coding algorithm for lossless data compression. It was developed in 1952 by David Huffman, an American computer scientist.

To be honest, I had heard about Huffman coding for a long time before I understood it. It shows up in general-purpose compression formats such as GZIP, BZIP2, and PKZIP, and it is especially good at compressing character data with high repetition rates.

Think about it: English builds endless combinations out of just 26 letters, so the repetition rate is bound to be high! And there aren't that many commonly used Chinese characters, about 2,500 or so. Don't ask me how I know; I asked a search engine.

The more frequently characters repeat, the more efficiently Huffman coding works!

While we're at it, let's take a look at how Huffman coding actually works, because a good programmer needs to know what's going on under the hood.

Suppose the string "BCAADDDCCACACAC" is sent over the network.

You should know that each character normally takes 8 bits, and this string has 15 characters in total, so it takes up 15*8=120 bits. Any doubts so far? If so, feel free to raise them in the comments.

If we use Huffman coding, we can compress this string to a smaller size. How?

Huffman coding first uses the frequency of each character to build a tree, and then uses the tree's structure to generate a specific code for each character: high-frequency characters get shorter codes, low-frequency characters get longer ones. This reduces the average length of the encoded string, achieving lossless compression.

Let's use the string above to illustrate the working steps of Huffman coding.

The first step is to calculate the frequency of each character in the string.

B appears 1 time, C appears 6 times, A appears 5 times, D appears 3 times.
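The tally in step one can be sketched in a few lines of Java (the class name `FrequencyCount` is just for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FrequencyCount {
    public static void main(String[] args) {
        String s = "BCAADDDCCACACAC";
        // Tally each character, keeping first-seen order for readability
        Map<Character, Integer> freq = new LinkedHashMap<>();
        for (char c : s.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        System.out.println(freq); // {B=1, C=6, A=5, D=3}
    }
}
```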

The second step is to form a queue Q by sorting characters according to their frequency of occurrence.

The lowest-frequency characters go at the front; the highest-frequency ones go at the back.

The third step is to start building a tree with these characters as leaf nodes. We first create an empty node Z, assign the lowest-frequency character to Z's left child and the second-lowest to Z's right child, then set Z's value to the sum of the two frequencies.

B has the lowest frequency (1), so it goes on the left, and D, with frequency 3, goes on the right; their parent's value is then set to 4, the sum of the children's frequencies.

Then remove B and D from queue Q and add their sum, 4, back into the queue at the position marked * in the figure above. Next, a new empty node Z is created with the 4-node as its left child, A (frequency 5) as its right child, and 4 + 5 = 9 as the parent's value.

Continue building the tree along the same lines until all characters appear in the nodes of the tree.
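If we only track the node weights, the repeated "merge the two smallest" step boils down to a tiny loop over a min-priority queue. This is just a sketch of the merging order; the real tree also has to keep the child links, as the full example later in this post does:

```java
import java.util.PriorityQueue;

public class MergeWeights {
    public static void main(String[] args) {
        // The four character frequencies: B=1, D=3, A=5, C=6
        PriorityQueue<Integer> q = new PriorityQueue<>();
        for (int w : new int[] { 1, 3, 5, 6 }) q.add(w);

        // Repeatedly merge the two smallest weights, as the tree-building step does
        while (q.size() > 1) {
            int x = q.poll();
            int y = q.poll();
            System.out.println(x + " + " + y + " = " + (x + y));
            q.add(x + y);
        }
        // Merges happen in the order (1,3)->4, (4,5)->9, (6,9)->15
    }
}
```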

**Step 4: for each non-leaf node, label the edge to its left child 0 and the edge to its right child 1.** At this point, the Huffman tree is built. The Huffman tree is also known as an optimal binary tree: the binary tree with the shortest weighted path length.

When the tree is built, we count the number of bits to send.

1) Look at the character column. Each of the characters A, B, C, and D takes one byte (8 bits), so together they take 4*8=32 bits.

2) Look at the frequency column. A appears 5 times, B 1 time, C 6 times, and D 3 times, 15 bits in total.

3) Look at the code column. A's code is 11, corresponding to the path 15 → 9 → 5 in the Huffman tree; that is, going from the root to leaf A yields 11. Similarly, B's path gives 100, D's gives 101, and C's path is just 0.

4) Look at the length column. A's code is 11 and A occurs 5 times, so it takes up 10 bits: 1111111111. B's code is 100 and occurs once, so it takes up 3 bits: 100. C's code is 0 and occurs 6 times, so it takes up 6 bits: 000000. D's code is 101 and occurs 3 times, so it takes up 9 bits: 101101101.
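You can check the arithmetic in the length column by multiplying each code length by its frequency (a tiny sketch; the class name is just for illustration):

```java
public class EncodedLength {
    public static void main(String[] args) {
        // Code lengths and frequencies from the table above
        int[] codeLen = { 2, 3, 1, 3 }; // A=11, B=100, C=0, D=101
        int[] freq    = { 5, 1, 6, 3 };

        int total = 0;
        for (int i = 0; i < codeLen.length; i++) {
            total += codeLen[i] * freq[i]; // 10 + 3 + 6 + 9
        }
        System.out.println(total); // 28
    }
}
```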

Huffman coding is, in essence, assigning the most valuable resource (the shortest code) to the data with the highest probability of occurrence. In the example above, C is the most frequent, and its code is 0, which saves a lot of space.

Think about similar situations in everyday life: we keep the most frequently used things closest at hand to improve efficiency and save time. So, my wild guess is that this intuition is how Huffman arrived at the optimal code.

Before Huffman encoding, the binary of the string “BCAADDDCCACACAC” is:

010000100100001101000001010000010100010001000100010001000100001101000011010000010100001101000001010000110100000101000011

So it’s 120 bits.

After encoding, character by character (B=100, C=0, A=11, D=101):

1000111110110110100110110110

It’s 28 bits.
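Once the code table is known, encoding itself is just a lookup per character. A minimal sketch, using the table derived above:

```java
import java.util.HashMap;
import java.util.Map;

public class Encode {
    public static void main(String[] args) {
        // The code table derived from the Huffman tree above
        Map<Character, String> codes = new HashMap<>();
        codes.put('A', "11");
        codes.put('B', "100");
        codes.put('C', "0");
        codes.put('D', "101");

        StringBuilder bits = new StringBuilder();
        for (char c : "BCAADDDCCACACAC".toCharArray()) {
            bits.append(codes.get(c));
        }
        System.out.println(bits);          // 1000111110110110100110110110
        System.out.println(bits.length()); // 28
    }
}
```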

But for decoding, the structure of the Huffman tree needs to be sent as well, so the 32 bits of characters and the 15 bits of frequencies go along with the data. Overall, the encoded size is 32 + 15 + 28 = 75 bits, 45 bits fewer than 120, which is very efficient.
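Decoding works because Huffman codes are prefix-free: reading the bitstream left to right, the moment the bits read so far match a code, that can only be one character. Here is a sketch that decodes by prefix matching against the code table (in practice you would walk the tree itself, as the receiver rebuilds it from the transmitted frequencies):

```java
import java.util.HashMap;
import java.util.Map;

public class Decode {
    public static void main(String[] args) {
        // Invert the code table: bit pattern -> character
        Map<String, Character> inverse = new HashMap<>();
        inverse.put("11", 'A');
        inverse.put("100", 'B');
        inverse.put("0", 'C');
        inverse.put("101", 'D');

        String bits = "1000111110110110100110110110";
        StringBuilder out = new StringBuilder();
        StringBuilder cur = new StringBuilder();
        for (char b : bits.toCharArray()) {
            cur.append(b);
            Character c = inverse.get(cur.toString());
            if (c != null) {        // a complete code matched: emit and reset
                out.append(c);
                cur.setLength(0);
            }
        }
        System.out.println(out); // BCAADDDCCACACAC
    }
}
```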

Here is a full Java example of Huffman coding for your reference.

import java.util.Comparator;
import java.util.PriorityQueue;

class HuffmanNode {
    int item;
    char c;
    HuffmanNode left;
    HuffmanNode right;
}

class ImplementComparator implements Comparator<HuffmanNode> {
    public int compare(HuffmanNode x, HuffmanNode y) {
        return x.item - y.item;
    }
}

public class Huffman {
    public static void printCode(HuffmanNode root, String s) {
        if (root.left == null && root.right == null && Character.isLetter(root.c)) {

            System.out.println(root.c + "|" + s);

            return;
        }
        printCode(root.left, s + "0");
        printCode(root.right, s + "1");
    }

    public static void main(String[] args) {
        int n = 4;
        char[] charArray = { 'A', 'B', 'C', 'D' };
        int[] charfreq = { 5, 1, 6, 3 };

        PriorityQueue<HuffmanNode> q = new PriorityQueue<HuffmanNode>(n, new ImplementComparator());

        for (int i = 0; i < n; i++) {
            HuffmanNode hn = new HuffmanNode();

            hn.c = charArray[i];
            hn.item = charfreq[i];

            hn.left = null;
            hn.right = null;

            q.add(hn);
        }

        HuffmanNode root = null;

        while (q.size() > 1) {

            HuffmanNode x = q.poll();
            HuffmanNode y = q.poll();

            HuffmanNode f = new HuffmanNode();

            f.item = x.item + y.item;
            f.c = '-'; // placeholder character for internal nodes
            f.left = x;
            f.right = y;
            root = f;

            q.add(f);
        }
        System.out.println("Character | Huffman code");
        System.out.println("--------------------");
        printCode(root, "");
    }
}

The output of this example is as follows:

Character | Huffman code
--------------------
C|0
B|100
D|101
A|11

I'll leave you a homework problem: what is the time complexity of Huffman coding? You can give your answer in the comments.

PS: by the way, I'd like to recommend a Java version of my LeetCode notes. I've seen many awesome sets of notes, including Go and C++ versions, but no Java version. So now your favorites folder has one more reason to gather dust!

After chewing through 300 LeetCode problems in Java, I'm so stuffed I'm about to burst!

I'm Silent King 2, learning and sharing as I go. If this article was useful to you, please give it a thumbs up, and I'll see you next time.