Remember the Huffman codes we learned in data structures class? Today we'll revisit Huffman coding and the core idea we can take away from it. Before we begin, let's review what Huffman coding is. Huffman coding is a data compression method that can save storage space very effectively. For example, suppose we need to transmit a 200-character string that contains only six distinct characters: A, B, C, D, E, and F. Sent as plain text, each character takes one byte, so the whole string takes 200 bytes. To save space, we can instead represent each character with three bits, as follows:
A (000), B (001), C (010), D (011), E (100), F (101)
With this scheme we transmit 200 × 3 = 600 bits, whereas plain text takes 1600 bits (1 byte = 8 bits), so we save space. Is there any way to save even more? That is where Huffman coding comes in.
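The arithmetic above can be checked with a few lines of Python (all numbers come from the example in the text: a 200-character message over 6 distinct symbols):

```python
message_length = 200  # characters in the message to transmit

# Plain text: one byte (8 bits) per character.
plain_bits = message_length * 8

# Fixed-length code: 3 bits suffice for 6 symbols, since 2**3 = 8 >= 6.
fixed_bits = message_length * 3

print(plain_bits)  # 1600
print(fixed_bits)  # 600
```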
Huffman coding
Huffman coding takes into account not only the number of distinct characters but also how frequently each one occurs, and it assigns codes of different lengths based on that frequency. Since Huffman codes have unequal lengths, how many bits should we read at a time when decompressing? This makes decompression a bit more complicated. To avoid ambiguity during decompression, Huffman coding requires that no character's code be a prefix of another character's code.
Let's look at how Huffman coding works. First, assume the six characters occur with the frequencies shown in the figure below:
We treat each character as a node and insert all of them into a priority queue ordered by frequency, smallest first. Then we remove the two lowest-frequency nodes, A and B, from the queue, create a new node X whose frequency is the sum of theirs, and make X the parent of A and B. Finally, we put X back into the priority queue. We repeat this process until only one node, the root, remains in the queue. As shown in the figure below:
Then we add a weight to each edge: 0 for an edge to a left child and 1 for an edge to a right child. The path from the root to a leaf then spells out the Huffman code of that leaf's character.
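The procedure above can be sketched in Python with the standard `heapq` module as the priority queue. Note that the article's frequency figure is not shown, so the frequencies below are made-up numbers purely for illustration:

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Build Huffman codes from a {symbol: frequency} map.

    A minimal sketch of the procedure described above: repeatedly
    merge the two lowest-frequency nodes until one tree remains,
    then label left edges 0 and right edges 1.
    """
    tiebreak = count()  # unique tiebreaker so tree nodes are never compared
    heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # lowest-frequency node
        f2, _, right = heapq.heappop(heap)  # second-lowest node
        # New parent node with the combined frequency, pushed back in.
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):        # internal node: (left, right)
            walk(node[0], prefix + "0")    # left edge weighted 0
            walk(node[1], prefix + "1")    # right edge weighted 1
        else:
            codes[node] = prefix or "0"    # single-symbol edge case
    walk(heap[0][2], "")
    return codes

# Hypothetical frequencies for A..F (not from the article's figure):
freq = {"A": 45, "B": 13, "C": 12, "D": 16, "E": 9, "F": 5}
codes = huffman_codes(freq)
print(codes)
```

With these frequencies the most common character, A, receives a one-bit code, while the rare E and F receive four-bit codes, and by construction no code is a prefix of another.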
That covers how Huffman coding works. Now let's abstract the problem. The goal of Huffman coding is to store a string of a given length in as little space as possible. How should we encode the string's characters so that the total space is minimized? The intuitive idea is to give slightly shorter codes to characters that occur more often, and slightly longer codes to characters that occur less often. That is exactly the idea behind greedy algorithms. So what kinds of problems can a greedy algorithm solve?
Greedy algorithm model
Suppose you have a set of data, a limit (a constraint), and an expected value, and you want to pick some of the items so that the expected value is maximized while the limit is satisfied. When a problem fits this shape, the first thing to consider is a greedy algorithm. Going back to the Huffman example: the expected value is the minimum total length after encoding, and the limit is the given string that must be encoded in full. The data set is the distinct characters in the string; we assign each character a code so that the encoded string is as short as possible.
Then check whether the problem can actually be solved greedily: among items that contribute equally toward the limit, choose the one that contributes the most to the expected value. In Huffman coding, every character contributes the same amount toward the limit, exactly one character, and the one with the shortest code contributes the most to the expected value. So we give the most frequent characters the shortest codes.
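A toy calculation makes the greedy choice concrete. The frequencies and code lengths below are hypothetical (both length assignments satisfy the prefix-free Kraft condition, 1/2 + 1/4 + 1/8 + 1/8 = 1); giving the shortest codes to the most frequent symbols shrinks the total encoded length:

```python
freq = {"A": 100, "B": 50, "C": 30, "D": 20}    # hypothetical occurrence counts

# Code lengths in bits: frequent symbols get short codes...
short_first = {"A": 1, "B": 2, "C": 3, "D": 3}
# ...versus the reverse assignment.
long_first = {"A": 3, "B": 3, "C": 2, "D": 1}

def cost(lengths):
    """Total encoded length in bits: sum of frequency x code length."""
    return sum(freq[s] * lengths[s] for s in freq)

print(cost(short_first))  # 350
print(cost(long_first))   # 530
```

The greedy assignment wins because each frequent symbol multiplies its (short) code length many times over.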
In the future, when we encounter a problem like this, we can consider whether it can be abstracted into this greedy-algorithm model; if it fits the model, the problem can likely be solved greedily.