One: Huffman tree and Huffman coding

Do you know how file compression works?

Suppose we have a file containing only five characters, A, B, C, D, and E, which appear five times, four times, three times, two times, and one time, respectively.

We know that each letter takes up one byte, or eight bits, so the size of the original file is 15 bytes, or 120 bits.

Now, we need to design an algorithm to compress this file. What would you do?

Fixed length coding

If there are only the five characters A, B, C, D, and E in this file, then A, B, C, D, and E can be represented in binary as 000, 001, 010, 011, and 100 respectively. Instead of eight bits per letter, we now use only three bits per letter.

First, create a dictionary:

K V
A 000
B 001
C 010
D 011
E 100

Each character read from the original file is converted, via the dictionary, to its binary code and written to the compressed file.

After this conversion, the size of the compressed file is 15 × 3 = 45 bits.
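As a minimal sketch (the dictionary is the one above, and the file contents are hard-coded here for illustration), the encoding step can be written in Java like this:

import java.util.HashMap;
import java.util.Map;

public class FixedLengthEncoder {
    public static void main(String[] args) {
        // The fixed-length dictionary: every character maps to a 3-bit code
        Map<Character, String> dict = new HashMap<>();
        dict.put('A', "000");
        dict.put('B', "001");
        dict.put('C', "010");
        dict.put('D', "011");
        dict.put('E', "100");

        // 15 characters: 5 As, 4 Bs, 3 Cs, 2 Ds, 1 E (120 bits as raw bytes)
        String original = "AAAAABBBBCCCDDE";
        StringBuilder encoded = new StringBuilder();
        for (char c : original.toCharArray()) {
            encoded.append(dict.get(c)); // each 8-bit character becomes a 3-bit code
        }
        System.out.println(encoded.length()); // prints 45
    }
}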

Decompressing the file is just as simple. We only need a dictionary that is the inverse of the one used for compression:

K V
000 A
001 B
010 C
011 D
100 E

The compressed file is then read three bits at a time; each code's corresponding character is looked up in the dictionary and written to the new file. This restores the original file.
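A matching decoding sketch, reading the 0/1 string three bits at a time with the reverse dictionary (String.repeat requires Java 11+):

import java.util.HashMap;
import java.util.Map;

public class FixedLengthDecoder {
    public static void main(String[] args) {
        // The reverse dictionary: each 3-bit code maps back to its character
        Map<String, Character> dict = new HashMap<>();
        dict.put("000", 'A');
        dict.put("001", 'B');
        dict.put("010", 'C');
        dict.put("011", 'D');
        dict.put("100", 'E');

        // The 45-bit output of the encoding step above
        String encoded = "000".repeat(5) + "001".repeat(4) + "010".repeat(3)
                       + "011".repeat(2) + "100";
        StringBuilder decoded = new StringBuilder();
        for (int i = 0; i < encoded.length(); i += 3) {
            decoded.append(dict.get(encoded.substring(i, i + 3))); // read 3 bits at a time
        }
        System.out.println(decoded); // prints AAAAABBBBCCCDDE
    }
}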

We defined an “encoding format” for the original file and used it to compress and decompress. This coding method is called Fixed Length Code (FLC), and this compression algorithm is called fixed bit length packing.

The fixed bit length algorithm is one of the common compression algorithms.

Huffman tree and Huffman coding

Huffman coding, a kind of entropy coding for lossless data compression, was invented by the American computer scientist David Albert Huffman in 1952.

In 1951, Huffman was a doctoral student at the Massachusetts Institute of Technology (MIT) when Robert Fano assigned him a term paper on finding the most efficient binary code. Unable to prove which of the existing codes was the most efficient, Huffman abandoned the study of existing codes and turned to new explorations, eventually discovering the idea of coding based on frequency-ordered binary trees (Huffman trees), which he soon proved to be the most efficient method.

Compared with fixed-length coding, Huffman coding is a Variable Length Code (VLC). The idea is to collect statistics on the weights (occurrence counts) of the characters in the file: letters with a high probability of occurrence get shorter codes, whereas letters with a low probability get longer codes. In this way, the expected length of the encoded output is reduced, achieving lossless compression of the data.

For example, suppose again that the file contains only the five characters A, B, C, D, and E, with frequencies 5, 4, 3, 2, and 1 respectively:

First, we count the weight (frequency of occurrence) of each character in the file and add the characters to a priority queue maintained in ascending order of weight.

Then, we take the two elements with the lowest weights as the left and right subtrees and construct a new tree whose root node's weight is the sum of those two weights.

Next, we enqueue the new tree's root node back into the priority queue.

Repeat this process until only one node remains in the queue; that node is the root of the finished Huffman tree.
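To make the process concrete, here is the merge sequence for our example, writing each node as its weight and showing the queue in ascending order:

(1, 2, 3, 4, 5) → merge 1 and 2 → (3, 3, 4, 5)
(3, 3, 4, 5) → merge 3 and 3 → (4, 5, 6)
(4, 5, 6) → merge 4 and 5 → (6, 9)
(6, 9) → merge 6 and 9 → (15)

The single remaining node, with weight 15 (the total character count), is the root of the Huffman tree.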

We mark every “left path” in this Huffman tree with 0 and every “right path” with 1:

This gives us an encoding dictionary for the characters: the path from the root node to a character's leaf is that character's encoding. The codes are as follows:

K V
A 11
B 10
C 00
D 011
E 010

What is the size of the file compressed with Huffman coding? It is 5×2 + 4×2 + 3×2 + 2×3 + 1×3 = 33 bits. Huffman compression saves even more space than fixed-length compression!

Two: Constructing the Huffman tree

Now that I’ve shown you how to build a Huffman tree, let’s look at the code:

First comes the definition of the Huffman tree node. Each node carries a weight; you can think of a node's weight as the number of times the node's value appears in the file. Since nodes must be inserted into and removed from a priority queue, they must be comparable, and the comparison strategy is based on the node's weight.

The Java code for the node

public class Node<E> implements Comparable<Node<E>> {
    public E e;
    public double weight;
    public Node<E> left;
    public Node<E> right;

    public Node(E e, double weight, Node<E> left, Node<E> right) {
        this.e = e;
        this.weight = weight;
        this.left = left;
        this.right = right;
    }

    public Node(E e, Node<E> left, Node<E> right) {
        this(e, 0.0, left, right);
    }

    public Node(E e, double weight) {
        this(e, weight, null, null);
    }

    public Node() {
        this(null, 0.0);
    }

    @Override
    public String toString() {
        return "Node [e = " + e + ", weight = " + weight + "]";
    }

    @Override
    public int compareTo(Node<E> node) {
        return Double.compare(this.weight, node.weight);
    }
}

Next, let's look at the createHuffmanTree() method, which constructs the Huffman tree.

HuffmanTree Java code

import java.util.*;

public class HuffmanTree {
    /**
     * @param list a list containing all nodes
     * @return the root node of the Huffman tree
     */
    public static <E> Node<E> createHuffmanTree(List<Node<E>> list) {
        PriorityQueue<Node<E>> queue = new PriorityQueue<>();
        for (int i = 0; i < list.size(); i++) {
            queue.offer(list.get(i));
        }
        return createHuffmanTree(queue);
    }

    /**
     * @param queue a priority queue maintained in ascending order of node weight
     * @return the root node of the Huffman tree
     */
    public static <E> Node<E> createHuffmanTree(PriorityQueue<Node<E>> queue) {
        // Repeatedly merge the two lowest-weight nodes until one node remains
        while (queue.size() > 1) {
            Node<E> left = queue.poll();
            Node<E> right = queue.poll();
            Node<E> parent = new Node<>(null, left.weight + right.weight, left, right);
            queue.offer(parent);
        }
        return queue.poll();
    }
}
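As a quick sanity check (this demo class is my own, not part of the article's repository), we can build the tree for the five example characters:

import java.util.ArrayList;
import java.util.List;

public class HuffmanTreeDemo {
    public static void main(String[] args) {
        List<Node<Character>> list = new ArrayList<>();
        list.add(new Node<>('A', 5));
        list.add(new Node<>('B', 4));
        list.add(new Node<>('C', 3));
        list.add(new Node<>('D', 2));
        list.add(new Node<>('E', 1));

        Node<Character> root = HuffmanTree.createHuffmanTree(list);
        System.out.println(root.weight); // prints 15.0, the sum of all weights
    }
}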

Three: Implementing file compression and decompression

Code repository: github.com/jinrunheng/…

File compression

The file compression and decompression demo I implemented is only valid for English text.

File compression process (a sketch of step 3 follows this list):

  1. Count the frequency of every character in the file
  2. Build the Huffman tree, using the characters as node values and the character frequencies as node weights
  3. Store each character in the Huffman tree's leaf nodes together with its 0/1 code as a mapping in a dictionary (HashMap)
  4. Read the file again and convert each character into its 0/1 code string
  5. Convert the file's 0/1 code string into real bits and write them out, completing the compression
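For step 3, here is a minimal sketch of building the dictionary by walking the tree (the class and method names are my own, not necessarily those used in the repository):

import java.util.HashMap;
import java.util.Map;

public class HuffmanDict {
    public static Map<Character, String> buildDict(Node<Character> root) {
        Map<Character, String> dict = new HashMap<>();
        buildDict(root, "", dict);
        return dict;
    }

    // Going left appends '0' to the path; going right appends '1'.
    // Characters live only in leaf nodes, so a leaf's path is its code.
    private static void buildDict(Node<Character> node, String path, Map<Character, String> dict) {
        if (node == null) {
            return;
        }
        if (node.left == null && node.right == null) {
            dict.put(node.e, path);
            return;
        }
        buildDict(node.left, path + "0", dict);
        buildDict(node.right, path + "1", dict);
    }
}

For our example tree this produces a dictionary equivalent to the one shown earlier (codes of length 2 for A, B, C and length 3 for D, E); the exact 0/1 assignment can vary with how ties are broken in the priority queue.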

Decompression process (a sketch of the matching in step 3 follows this list):

  1. Read the compressed file and convert it back into a string of 0s and 1s
  2. Store the 0/1 code and character value of each leaf node of the Huffman tree as a mapping in a dictionary (HashMap), this time keyed by code
  3. Traverse the string converted from the compressed file using two pointers i and j. Pointer i marks the start of the current code, and pointer j scans from left to right starting at i. If the substring from i to j has a matching key in the dictionary, the corresponding value is written to the restored file and i is moved up to j; if there is no matching key, j keeps moving until there is a match. When i reaches the end of the string, the traversal is finished and decompression is complete.
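Here is a sketch of that two-pointer matching (again, the names are my own). Because Huffman codes are prefix-free, meaning no code is a prefix of another, the first dictionary match starting at i is always the correct one:

import java.util.Map;

public class HuffmanDecoder {
    // bits: the 0/1 string recovered from the compressed file
    // dict: maps each 0/1 code back to its character
    public static String decode(String bits, Map<String, Character> dict) {
        StringBuilder result = new StringBuilder();
        int i = 0; // start of the code currently being matched
        for (int j = 1; j <= bits.length(); j++) {
            String candidate = bits.substring(i, j);
            if (dict.containsKey(candidate)) {
                result.append(dict.get(candidate)); // matched a complete code
                i = j;                              // move i past the matched code
            } // otherwise j keeps scanning right until a match appears
        }
        return result.toString();
    }
}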

I tested the demo with the text of Harry Potter and the Sorcerer’s Stone:

  • The size of the original text is 450KB
  • The compressed file test.huffman is 260KB in size
  • Decompressing the compressed file successfully restores the original text

You can click on the link above to view the specific code.

Four: The optimality proof of Huffman coding

The following draws on content from Algorithms, Fourth Edition.

The Huffman tree is also called the “optimal binary tree.”

“Optimal” here means: among all binary trees built from n weighted nodes, the tree that minimizes the weighted path length (WPL) is the optimal binary tree.

WPL, the weighted path length of a tree, is the sum of the weighted path lengths of all the leaf nodes in the tree (each leaf's weight times its distance from the root).
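Written as a formula, for a tree whose leaves have weights $w_1, \ldots, w_n$ and path lengths (depths) $l_1, \ldots, l_n$:

$$WPL = \sum_{k=1}^{n} w_k \cdot l_k$$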

For example, in the Huffman tree shown in the figure above, the WPL value is:

7×1 + 5×2 + 2×3 + 4×3 = 35

So how can we prove that, given a set of r symbols and their frequencies, the WPL of the encoding built from a Huffman tree is minimal?

Proof:

We use mathematical induction. Assume that Huffman coding is optimal for any set of fewer than $r$ symbols. Let $T_H$ be the trie computed by the Huffman algorithm for the symbol set with frequencies $(s_1, f_1), \ldots, (s_r, f_r)$, and write $W(T_H)$ for the total length of the output (the WPL).

Suppose $(s_i, f_i)$ and $(s_j, f_j)$ are the first two symbols selected. The algorithm then computes the encoding $T_H^*$ for the set of $r - 1$ symbols obtained by replacing $(s_i, f_i)$ and $(s_j, f_j)$ with $(s^*, f_i + f_j)$, where the new symbol $s^*$ sits in a leaf node at some depth $d$. Note that:


$$W(T_H) = W(T_H^*) - d(f_i + f_j) + (d + 1)(f_i + f_j) = W(T_H^*) + (f_i + f_j)$$

Now, suppose $(s_1, f_1), \ldots, (s_r, f_r)$ has an optimal trie $T$ of height $h$. Note that $(s_i, f_i)$ and $(s_j, f_j)$ must both be at depth $h$ (otherwise swapping them with nodes at depth $h$ would yield a trie with smaller WPL). In addition, by swapping $(s_j, f_j)$ with the sibling of $(s_i, f_i)$, we may assume that $(s_i, f_i)$ and $(s_j, f_j)$ are siblings. Now consider the tree $T^*$ obtained by replacing them with their parent node $(s^*, f_i + f_j)$. Note that, by the same reasoning as above, $W(T) = W(T^*) + (f_i + f_j)$.

By the induction hypothesis, $T_H^*$ is optimal, i.e. $W(T_H^*) \le W(T^*)$. Thus:


$$W(T_H) = W(T_H^*) + (f_i + f_j) \le W(T^*) + (f_i + f_j) = W(T)$$

Since $T$ is optimal, equality must hold throughout, and therefore $T_H$ is also optimal.

This proves that Huffman coding is optimal.

Five: Summary

This article introduced the Huffman tree and Huffman coding, and used a demo to verify that a compression algorithm based on Huffman coding is effective. At the end, we proved that Huffman coding is optimal; the idea behind Huffman coding is a typical greedy strategy. I'll introduce greedy algorithms in a later article, so stay tuned.

OK, that wraps up this article on Huffman coding and file compression ~ you are welcome to follow my official account [KIM_talk], where I hope you can pick up more knowledge. See you next time!