Rust implements Huffman encoding and decoding compression algorithms

Huffman Coding is a tree-like Coding format that is often used for data compression. There are many kinds of C/C++ / Java implementations on the Internet, this article takes Rust as a perspective to see how to use Huffman coding to achieve file compression and decompression.

The code has been synchronized to github: github.com/oloshe/rust…

The compression

Calculates the CharWeightMap

Count the number of occurrences of each character as a weight

type Weight = u64;
/// character weights
pub struct CharWeightMap {
    pub inner: HashMap<char, Weight>
}

impl CharWeightMap {
    pub fn build(input: &String) - >Self {
        let mut map = HashMap::new();
        for (_, c) in input.char_indices() {
            map.entry(c).or_insert(0).add_assign(1);
        }
        Self { inner: map }
    }
    pub fn len(&self) - >usize {
        self.inner.len()
    }
    pub fn iter(&self) -> Iter<char, Weight> {
        self.inner.iter()  
    }
}
Copy the code

Build HuffmanTree

type RefHuffmanTree = Rc<RefCell<HuffmanTree>>;
type Weight = u64;

/// Huffman tree
pub struct HuffmanTree {
    pub value: Option<char>,
    pub weight: Weight,
    pub parent: Option<RefHuffmanTree>,
    pub left: Option<RefHuffmanTree>,
    pub right: Option<RefHuffmanTree>,
}

impl HuffmanTree {
    pub fn new() - >Self {
        Self {
            value: None,
            weight: 0,
            parent: None,
            left: None,
            right: None,}}pub fn build(char_weight: CharWeightMap) -> RefHuffmanTree
    {
        // The number of original nodes
        let n = char_weight.len();
        // The total number of nodes needed to build a complete Huffman tree
        let total = 2 * n - 1;
        // Initialize all nodes
        let vec = (0..total)
            .map(|_| Rc::new(RefCell::new(Self::new())))
            .collect::<Vec<Rc<RefCell<HuffmanTree>>>>();

        // Character node assignment
        char_weight.iter()
            .enumerate()
            .into_iter()
            .for_each(|(index, (ch, weight))| {
                // println! ("{}: {} ({})", index, &weight, ch);
                vec[index].borrow_mut().value = Some(*ch);
                vec[index].borrow_mut().weight = *weight;
            });

        for index inn.. total {// find the node with the least weight in [0, index-1]
            let m1 = Self::find_min(&vec[..index]).unwrap();
            // mark the parent as a node on index. This will not be found next time
            m1.borrow_mut().parent = Some(vec[index].clone());
            // find the node with the second lowest weight in [0, index-1]
            let m2 = Self::find_min(&vec[..index]).unwrap();
            // mark the parent of this node as a node on index.
            m2.borrow_mut().parent = Some(vec[index].clone());

            let w1 = m1.as_ref().borrow().weight;
            let w2 = m2.as_ref().borrow().weight;
            let weight = w1 + w2;

            vec[index].borrow_mut().weight = weight;
            vec[index].borrow_mut().left = Some(m1.clone());
            vec[index].borrow_mut().right = Some(m2.clone());
        }
        // The last node is the completed Huffman tree
        vec.last().unwrap().clone()
    }

    /// get the minimum value
    fn find_min(tree_slice: &[Rc<RefCell<HuffmanTree>>]) -> Option<Rc<RefCell<HuffmanTree>>> {
        let mut min = Weight::MAX;
        let mut result = None;
        for tree in tree_slice {
            let tree_cell = tree.as_ref();
            if tree_cell.borrow().parent.is_none() && tree_cell.borrow().weight < min {
                min = tree_cell.borrow().weight;
                result = Some(tree.clone());
            }
        }
        result
    }
}
Copy the code

Binary mapping huffmanaryMap

Binary is denoted by Vec

. Tree_dfs is a depth-first traversal Huffman tree.

/// binary mapping of characters, representing the corresponding binary bits of characters, can be replaced by bitvec
pub struct HuffmanBinaryMap {
    pub inner: HashMap<char.Vec<bool>>
}

impl HuffmanBinaryMap {
    pub fn build(huffman_tree: RefHuffmanTree) -> Self {
        let mut map = HashMap::new();
        Self::tree_dfs(&Some(huffman_tree), &mut map, &mut vec![]);
        Self { inner: map }
    }
    fn tree_dfs(
        tree: &Option<RefHuffmanTree>, 
        map: &mut HashMap<char.Vec<bool>>,
        vec: &mut Vec<bool>) {if let Some(tree) = tree {
            let tree = tree.as_ref().borrow();
            if let Some(ch) = tree.value {
                map.insert(ch, vec.clone());
            }
            vec.push(false);
            Self::tree_dfs(&tree.left, map, vec);
            let last = vec.last_mut().unwrap();
            *last = true; Self::tree_dfs(&tree.right, map, vec); vec.pop(); }}}/// Used to write configuration files
impl Display for HuffmanBinaryMap {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        let mut buf = String::new();
        self.inner.iter()
            .for_each(|(c, vec)| {
                let mut bit_str = String::new();
                vec.iter().for_each(|b| {
                    bit_str += if *b { "1" } else { "0"}}); buf +=format!("{}:{}\n", *c as u32, bit_str).as_str();
            });
        f.write_str(buf.as_str())
    }
}
Copy the code

Space is the number of eight-bit binary numbers that complement the end; Capacity is used to optimize memory allocation during decoding.

The left side of the colon is the u32 code corresponding to the character, and the right side is the binary number corresponding to it.

(Why u32? There are special character problems if you use characters, such as carriage returns and colons.

Bit_map uses the Display trait implemented above.

Coding HuffmanCodec: : encode

The encoding returns a tuple of compressed byte arrays and configuration file contents. The configuration file describes how to decompress.

pub struct HuffmanCodec;

impl HuffmanCodec {
    /// Huffman code
    pub fn encode(source: &String) - > (Vec<u8>, String) {
        // Build the character weight map
        let weight_map = CharWeightMap::build(&source);
        // Build the Huffman tree
        let tree = HuffmanTree::build(weight_map);
        // Huffman binary mapping table
        let bit_map = HuffmanBinaryMap::build(tree);
        // println! ("{}", bit_map);
        let mut result: Vec<u8> = vec![];
        let (mut buf, mut count) = (0.0);
        for (_, ch) in source.char_indices() {
            let vec = bit_map.inner.get(&ch).unwrap();
            vec.iter().for_each(|b| {
                buf <<= 1;
                if *b { buf |= 1 }
                count += 1;
                if count >= 8 {
                    result.push(buf);
                    buf = 0;
                    count = 0; }})}// The number of complementary bits at the end
        let mut space = 0u8;
        ifcount ! =0 {
            space = 8 - count;
            buf <<= space;
            result.push(buf);
        }
        // The result returned
        (
            result, // A compressed array of bytes
            format!("space:{}\ncapacity:{}\n{}", space, source.len(), bit_map), // Configuration file contents)}}Copy the code

You can think about why we need to fill out 8 places

The compressed file

fn hfm_compress(file: &str) {
    let filename = {
        let mut arr = file.split(".").collect::<VecThe < &str> > (); arr.pop(); arr.join(".")};// Open the file and read the string into memory
    let mut input_file = File::open(file).expect("File not found");
    let mut source_buf = String::new();
    input_file.read_to_string(&mut source_buf).expect("Failed to read file");

    // Huffman code
    let (u8_arr, config) = HuffmanCodec::encode(&source_buf);

    // Create a compressed file
    let output_file_name = format!("{}.hfm", filename);
    let mut output_file = File::create(&output_file_name).unwrap();
    output_file.write(&u8_arr).unwrap();

    // Create a compressed profile
    let output_cfg_file_name = format!("{}.hfm.config", filename);
    let mut output_cfg_file = File::create(&output_cfg_file_name).unwrap();
    output_cfg_file.write(config.as_bytes()).unwrap();
    println!("\n Compression succeeded! The \n file is saved as: {}\n configuration file: {}", output_file_name, output_cfg_file_name);
    let size_before = source_buf.as_bytes().len();
    let size_after = u8_arr.len();
    println!("Pre-compression size: {} bytes", size_before);
    println!("Compressed size: {} bytes", size_after);
    println!("Compression ratio: {:.2}%", ((size_after as f64 / size_before as f64) * 100.0));
}
Copy the code

Unpack the

Decompression configuration DecodeConfig

The reading of the configuration file is related to the previous writing of the configuration file and can be defined according to its own structure.

/// Configure the configuration file
pub struct DecodeConfig {
    pub inner: HashMap<String.char>,
    pub space: u8.pub capacity: usize,}impl DecodeConfig {
    /// read from configuration file
    pub fn build(source: &String) - >Self {
        let mut map = HashMap::default();
        let (mut space, mut capacity) = (0u8.0usize);
        let arr = source.split("\n");
        for s in arr {
            let pair: VecThe < &str> = s.split(":").collect();
            ifpair.len() ! =2 {
                continue;
            }
            let (ch, bit) = (pair[0], pair[1]);
            match ch {
                "space" => {
                    space = u8::from_str_radix(bit, 10).unwrap();
                    continue;
                },
                "capacity" => {
                    capacity = usize::from_str_radix(bit, 10).unwrap();
                    continue;
                },
                _ => (),
            }
            map.insert(bit.to_owned(), char::from_u32(u32::from_str_radix(ch, 10).unwrap()).unwrap());
        };
        Self { inner: map, space, capacity }
    }
    pub fn get(&self, k: &String) - >OptionThe < &char> {
        self.inner.get(k)
    }
}
Copy the code

Decoding HuffmanCodec: : decode

The following format. (“{u8:>0width$b}”, u8=num, width=8)

pub struct HuffmanCodec;

impl HuffmanCodec {
    pub fn decode(source: &[u8], decode_map: &DecodeConfig) -> String {
        // To prevent frequent memory allocation, define the capacity directly
        let mut result = String::with_capacity(decode_map.capacity);
        let bit_str = source.iter()
            .map(|num| {
                format!("{u8:>0width$b}".u8=num, width=8)
            })
            .collect::<Vec<String>>()
            .join("");
        // println! (" binary sequence: {}", bit_str);

        let mut tmp_str = String::with_capacity(20);
        let last_idx = bit_str.len() - decode_map.space as usize;
        for (i, ch) in bit_str.char_indices() {
            if i >= last_idx {
                break;
            }
            tmp_str.push(ch);
            if let Some(mch) = decode_map.get(&tmp_str) {
                result.push(*mch);
                tmp_str.clear();
            }
        }
        result
    }
}
Copy the code

Unzip the files

fn hfm_decompress(file: &str, config_file: &str, save_file: &str) {
    // Compress the file
    let mut encodede_file = File::open(file).expect(format!("File not found :{}",file).as_str());
    let mut buf = vec![];
    encodede_file.read_to_end(&mut buf).unwrap();

    // Read the configuration file
    let mut config = File::open(config_file).expect(format!("Configuration file not found :{}", config_file).as_str());
    let mut buf2 = String::new();
    config.read_to_string(&mut buf2).unwrap();
    
    // Build the configuration
    let char_map = DecodeConfig::build(&buf2);

    / / decoding
    let result = HuffmanCodec::decode(&buf, &char_map);

    let mut savef = File::create(save_file).unwrap();
    savef.write(result.as_bytes()).unwrap();

    println!("\n Unzipped successfully! \n File saved to: {}", save_file);
}
Copy the code

conclusion

As you can see, Rust implementations still lag behind those of other languages. Generally speaking, the logic and structure are clearer, which is also one of the advantages of Rust compared with C in my opinion.

This article has completely gone through the process of Huffman compression and decompression, and the deficiencies are also expected to give more advice and communication. Specific to the details of a lot of implementation methods can be in accordance with their own ideas to achieve, this article for reference only.

The code has been synchronously pushed to Github, interested students welcome star & like.

Github.com/oloshe/rust…