Preface

Prior knowledge of Unicode and UTF-8 is required to read this article. If you are not familiar with them, I recommend reading Ruan Yifeng's related articles (also the source of this article):

  1. Character encoding notes: ASCII, Unicode and UTF-8
  2. Unicode and JavaScript in detail

UTF-8 encoding rules

UTF-8 is a variable-length encoding, using 1 to 4 bytes per character: the more common a character is, the fewer bytes it takes. The first 128 characters are represented by a single byte, exactly matching ASCII.
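You can see this quickly with the standard TextEncoder API (available in modern browsers and in Node.js):

// An ASCII character encodes to a single UTF-8 byte equal to its code point:
new TextEncoder().encode('A'); // Uint8Array [ 65 ]
'A'.charCodeAt(0);             // 65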

The UTF-8 encoding rules are:

  1. For a single-byte symbol, the first bit is set to 0 and the following 7 bits hold the symbol's Unicode code point. So for English letters, UTF-8 encoding is identical to ASCII.
  2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, the (n + 1)-th bit is set to 0, and the first two bits of every following byte are set to 10. All the remaining, unmentioned bits are filled with the symbol's Unicode code point.

The following table summarizes the encoding rules, with the letter x representing the available code bits.

Unicode symbol range (hexadecimal)   UTF-8 encoding (binary)               Bytes
0000 0000 - 0000 007F                0xxxxxxx                              1
0000 0080 - 0000 07FF                110xxxxx 10xxxxxx                     2
0000 0800 - 0000 FFFF                1110xxxx 10xxxxxx 10xxxxxx            3
0001 0000 - 0010 FFFF                11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   4

Now, let's interpret UTF-8 encoding according to the table above. If the first bit of a byte is 0, that byte is a complete single-byte character on its own. If the first bit is 1, the number of consecutive leading 1s tells us how many bytes the current character occupies.
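As a minimal sketch of this interpretation rule (utf8Length is a hypothetical helper written for this article, not a standard API):

// Derive the byte length of a UTF-8 encoded character from its first byte
// by testing the leading bits.
function utf8Length(firstByte) {
    if ((firstByte & 0x80) === 0x00) return 1; // 0xxxxxxx
    if ((firstByte & 0xe0) === 0xc0) return 2; // 110xxxxx
    if ((firstByte & 0xf0) === 0xe0) return 3; // 1110xxxx
    if ((firstByte & 0xf8) === 0xf0) return 4; // 11110xxx
    throw new Error('Not a valid UTF-8 leading byte');
}

utf8Length(0x41); // 1 (an ASCII byte, leading bit 0)
utf8Length(0xe5); // 3 (leading bits 1110)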

With that interpretation in hand, you should have a working understanding of the UTF-8 encoding rules. Here, we take the Chinese character 好 ("good") as an example to demonstrate how UTF-8 encoding is carried out.

  • The Unicode code point of 好 is 597D (hexadecimal).
  • 597D in binary is 101100101111101; in decimal notation, it is 22909.

As you can see from the table above, 597D falls in the range 0000 0800 - 0000 FFFF (decimal 2048 - 65535), so the UTF-8 encoding of 好 takes 3 bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx.

Then, starting from the last binary bit of 好 and working from back to front, we fill in the x positions of the format, padding the leftover x bits with zeros. This gives the UTF-8 encoding of 好: 11100101 10100101 10111101, or E5A5BD in hexadecimal.
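We can sanity-check the whole derivation in any modern JavaScript runtime (TextEncoder is a standard API):

'好'.charCodeAt(0).toString(16); // '597d'
'好'.charCodeAt(0).toString(2);  // '101100101111101'
new TextEncoder().encode('好');  // Uint8Array [ 229, 165, 189 ], i.e. 0xE5 0xA5 0xBD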

Code implementation

The following is a code implementation that converts a string's Unicode code units to UTF-8 bytes. It follows the UTF-8 encoding rules directly, so it is best read alongside the table above.

function toByte(data) {

    let parsedData = [];

    for (let i = 0, l = data.length; i < l; i++) {
        let byteArray = [];
        // The charCodeAt() method returns the UTF-16 code unit at the given
        // index, an integer between 0 and 65535. Characters outside that range
        // arrive as surrogate pairs, so with charCodeAt() the 4-byte branch
        // below is never actually reached; it is kept to mirror the rules.
        let code = data.charCodeAt(i);

        // Hexadecimal to decimal: 0x10000 ==> 65536, 0x800 ==> 2048, 0x80 ==> 128
        if (code >= 0x10000) { // 4 bytes
            // 0xf0 ==> 11110000
            // 0x80 ==> 10000000
            byteArray[0] = 0xf0 | ((code & 0x1c0000) >>> 18); // the first byte
            byteArray[1] = 0x80 | ((code & 0x3f000) >>> 12);  // the second byte
            byteArray[2] = 0x80 | ((code & 0xfc0) >>> 6);     // the third byte
            byteArray[3] = 0x80 | (code & 0x3f);              // the fourth byte

        } else if (code >= 0x800) { // 3 bytes
            // 0xe0 ==> 11100000
            // 0x80 ==> 10000000
            byteArray[0] = 0xe0 | ((code & 0xf000) >>> 12);
            byteArray[1] = 0x80 | ((code & 0xfc0) >>> 6);
            byteArray[2] = 0x80 | (code & 0x3f);

        } else if (code >= 0x80) { // 2 bytes
            // 0xc0 ==> 11000000
            // 0x80 ==> 10000000
            byteArray[0] = 0xc0 | ((code & 0x7c0) >>> 6);
            byteArray[1] = 0x80 | (code & 0x3f);

        } else { // 1 byte
            byteArray[0] = code;
        }

        parsedData.push(byteArray);
    }

    // Flatten the per-character byte arrays into a single array of bytes.
    parsedData = Array.prototype.concat.apply([], parsedData);

    console.log('Output result:', parsedData);
    console.log('To binary:', parsedData.map(byte => byte.toString(2)).join(' '));

    return parsedData;
}
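For example, running it on 好 should print the bytes derived above (a sketch of the expected console output):

toByte('好');
// Output result: [ 229, 165, 189 ]
// To binary: 11100101 10100101 10111101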

To make the code easier to understand, we'll walk through it with 好 as the example.

  • The Unicode code point of 好 is 597D (hexadecimal); in binary it is 101100101111101, and in decimal it is 22909.
  • 好 falls in the range 0000 0800 - 0000 FFFF (decimal 2048 - 65535), so its UTF-8 encoding requires 3 bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx.

The following is the branch of the code that 好 goes through.

// Read this code together with the UTF-8 rules above, especially how the
// code bits are distributed across the bytes.
// Format: 1110xxxx 10xxxxxx 10xxxxxx
// code in binary ==> 101100101111101
// Code bits that belong to each byte:
// First byte:  101 (padded to 0101 to fill the four x slots)
// Second byte: 100101
// Third byte:  111101

// Masks used to extract those bits:
// 0xf000 ==> 11110000 00000000
// 0xfc0  ==> 00001111 11000000
// 0x3f   ==> 00000000 00111111

// Leading-bit templates:
// 0xe0 ==> 11100000
// 0x80 ==> 10000000

// Bitwise AND is typically used to clear or keep certain bits;
// bitwise OR can be used to set specified bits to 1.

byteArray[0] = 0xe0 | ((code & 0xf000) >>> 12);
byteArray[1] = 0x80 | ((code & 0xfc0) >>> 6);
byteArray[2] = 0x80 | (code & 0x3f);

Analysis of each operation:

  • byteArray[0] corresponds to the first byte, 1110xxxx in the format. The mask for its 4 code bits is 1111; the 12 code bits belonging to the following bytes are filled with 0s, giving 11110000 00000000, that is, 0xf000.

    Bitwise AND: code & 0xf000 gives 101 0000 0000 0000; >>> 12 (unsigned right shift by 12) gives 101; finally, bitwise OR: 0xe0 | 101 gives 11100101.

  • byteArray[1] corresponds to the second byte, 10xxxxxx in the format. The mask for its 6 code bits is 111111; the 6 code bits of the third byte that follow are filled with 0s, giving 1111 11000000, that is, 0xfc0.

    Bitwise AND: code & 0xfc0 gives 1001 01000000; >>> 6 (unsigned right shift by 6) gives 100101; finally, bitwise OR: 0x80 | 100101 gives 10100101.

  • byteArray[2] corresponds to the third byte, 10xxxxxx in the format. The mask for its 6 code bits is 111111, and no code bits follow it, so the mask is simply 111111, that is, 0x3f.

    Bitwise AND: code & 0x3f gives 111101; finally, bitwise OR: 0x80 | 111101 gives 10111101.

A note on the results:

The results of the bitwise operations are plain numbers, which are displayed in decimal; we can convert them back to binary strings with the toString(2) method of the Number object.

Also, did you notice that if you concatenate the three values 101, 100101, and 111101, you get 101100101111101, which is exactly the binary of code?
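You can confirm this in the console (a quick sketch; the byte values are the ones produced above):

[229, 165, 189].map(byte => byte.toString(2));
// [ '11100101', '10100101', '10111101' ]
// Strip the leading 1110 / 10 / 10 templates and concatenate the rest:
// 0101 + 100101 + 111101 ==> 0101100101111101 ==> 0x597D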

Conclusion

As always, if you find something wrong in this article, please point it out in the comments so we can all improve together. Finally, one more word: when reading the code examples, be sure to keep the UTF-8 encoding rules in view.

Rules matter!

Rules matter!

Rules matter!