In Node.js, Buffer objects are used to work with binary data, such as the data transferred over a network. Here's an example:

console.log(Buffer.from('abcde'))

Will output:

<Buffer 61 62 63 64 65>

You may wonder what the numbers 61, 62, and so on up to 65 mean: they are the hexadecimal ASCII codes of the characters a through e. Here is the standard ASCII table:

(standard ASCII table)

You can see that standard ASCII uses 7-bit binary numbers to represent upper- and lower-case letters, digits, punctuation marks, and control characters, for a total of 2^7 = 128 characters. In principle anyone can agree on their own mapping from binary numbers to symbols; such a mapping is called an encoding. ASCII is the standardized answer: it specifies which binary numbers represent the common symbols mentioned above. In ASCII, the codes 0 to 31 plus 127, a total of 33 codes, are control characters such as LF (line feed), CR (carriage return), FF (form feed), DEL (delete), and BS (backspace).

The rest of the codes are displayable, or printable, characters.
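
If you want to double-check these values yourself, plain string methods are enough; here is a minimal sketch using the standard charCodeAt and String.fromCharCode APIs:

// Look up ASCII codes from characters and back again.
console.log('a'.charCodeAt(0))                  // 97, i.e. 0x61
console.log('a'.charCodeAt(0).toString(16))     // '61'
console.log(String.fromCharCode(0x41))          // 'A'
console.log(String.fromCharCode(0x0a) === '\n') // true: 0x0A is LF (line feed)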

Now that we know how ASCII encodes characters, let's return to our original question:

console.log(Buffer.from('abcde'))
// <Buffer 61 62 63 64 65>

The answer is now pretty obvious: looking up a through e in the ASCII table, their hexadecimal codes turn out to be exactly 61 through 65. Besides a string, the Buffer.from method can also accept an array of bytes, written here in hexadecimal:

console.log(Buffer.from([0x61, 0x62, 0x63, 0x64, 0x65]))
// <Buffer 61 62 63 64 65>

Note that 61 to 65 here are hexadecimal; writing the same digits as decimal numbers gives completely different bytes:

console.log(Buffer.from([61, 62, 63, 64, 65]))
// <Buffer 3d 3e 3f 40 41>

This no longer spells abcde; it spells =>?@A, because decimal 61 to 65 are the codes for =, >, ?, @ and A. Also note that when using an array, each entry must be a number between 0x00 and 0xFF, i.e. 0 to 255 in decimal, because one byte can represent at most 256 distinct values.
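
If you really want to write the same bytes in decimal, the correct values are 97 to 101, the decimal equivalents of 0x61 to 0x65; a quick sketch to confirm:

console.log(Buffer.from([97, 98, 99, 100, 101]))
// <Buffer 61 62 63 64 65>
console.log(Buffer.from([97, 98, 99, 100, 101]).toString())
// abcde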

Unicode character set and UTF-8 encoding

As mentioned above, ASCII can only represent 128 characters. What about accented Latin characters or Chinese characters? Let's try:

console.log(Buffer.from('abcde')) // <Buffer 61 62 63 64 65>
console.log(Buffer.from('abcdé')) // <Buffer 61 62 63 64 c3 a9>
console.log(Buffer.from('abcd易')) // <Buffer 61 62 63 64 e6 98 93>

We can see that the English letter e occupies one byte, the French letter é occupies two bytes (0xC3 0xA9), and the Chinese character 易 (yì, "easy") occupies three bytes (0xE6 0x98 0x93). If we explicitly specify ASCII encoding, the result is as follows:

console.log(Buffer.from("Abcde".'ascii')) // <Buffer 61 62 63 64 e9>
console.log(Buffer.from('abcd easy'.'ascii')) // <Buffer 61 62 63 64 13>

Now é is represented by the single byte 0xE9 and 易 by the single byte 0x13. How does this work?

This is where the Unicode character set comes in. Because ASCII can represent so few characters, the idea arose of putting all the characters of all the world's languages into one set. That set is Unicode: it starts at 0 and assigns every symbol a number called a code point. Because there are so many symbols, Unicode divides them into 17 planes (plane 0 to plane 16), with code points ranging from 0x0000 to 0x10FFFF. Each plane holds 65,536 code points, so together the planes can represent more than a million characters. Plane 0 is called the Basic Multilingual Plane (BMP); the remaining 16 planes are supplementary planes, each with its own code point range and purpose.

The most important plane is plane 0, the BMP, which contains the 65,536 lowest code points. Most commonly used characters live in this plane, including the ASCII characters and common Chinese characters. Looking them up, the Unicode code point of é is U+00E9 and that of 易 is U+6613. Note that a code point is just a number assigned to a character: it defines the ordering of Unicode characters but says nothing yet about how they are encoded as bytes. Since the largest code point is U+10FFFF, at least 3 bytes are needed to hold a code point. The most straightforward encoding simply stores each code point in four bytes whose value equals the code point; this encoding is called UTF-32. Its advantages and disadvantages are obvious:

  • Advantage: it maps one-to-one to Unicode code points, so finding the n-th character is O(1).
  • Disadvantage: it wastes space; text that ASCII could store in 1 byte per character becomes 4 times larger (a sketch of the idea follows).
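
Node's Buffer has no built-in 'utf32' encoding, but the idea is simple enough to sketch by hand. The helper below is an illustration only (the name toUTF32BE and the big-endian byte order are my own choices), writing each code point into 4 bytes:

// A minimal UTF-32 (big-endian) encoder sketch: 4 bytes per code point.
function toUTF32BE(str) {
  const codePoints = [...str].map(ch => ch.codePointAt(0))
  const buf = Buffer.alloc(codePoints.length * 4)
  codePoints.forEach((cp, i) => buf.writeUInt32BE(cp, i * 4))
  return buf
}

console.log(toUTF32BE('aé易'))
// <Buffer 00 00 00 61 00 00 00 e9 00 00 66 13>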

So my guess is that when a string is forced through the ASCII encoding, only the lowest byte of each code point, which is also the last byte of its UTF-32 encoding, is kept.
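
That guess is easy to test: taking only the lowest byte of each code point reproduces exactly the bytes we saw from the 'ascii' encoding above. A small sketch (an illustration of the observation, not a statement about Node's internals):

console.log('é'.codePointAt(0).toString(16))  // 'e9'
console.log('易'.codePointAt(0).toString(16)) // '6613'
console.log((0x6613 & 0xff).toString(16))     // '13', only the low byte survives
console.log(Buffer.from('é易', 'ascii'))      // <Buffer e9 13>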

So why is é encoded as 0xC3 0xA9 and 易 as 0xE6 0x98 0x93? This is due to UTF-8, a variable-length encoding. As mentioned earlier, Unicode is just a character set and can be encoded in many different ways. UTF-8 is the default encoding in Node.js, and in UTF-8 a character may occupy 1 to 4 bytes. The rule is very simple: if the first byte starts with a 0, the character takes 1 byte and is compatible with ASCII; if it starts with 1s, the number of leading 1s equals the number of bytes, for example:

  • The UTF-8 binary representation of é is 11000011 10101001; the first byte starts with two 1s (110), so é occupies 2 bytes
  • The UTF-8 binary representation of 易 is 11100110 10011000 10010011; the first byte starts with three 1s (1110), so 易 occupies 3 bytes
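
You can print those leading bits yourself by dumping every UTF-8 byte as an 8-bit binary string; a short sketch:

// Show each UTF-8 byte of a string in binary.
const toBits = str => [...Buffer.from(str, 'utf8')]
  .map(b => b.toString(2).padStart(8, '0'))
  .join(' ')

console.log(toBits('é'))  // 11000011 10101001
console.log(toBits('易')) // 11100110 10011000 10010011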

The following table shows how many bytes UTF-8 uses for code points in different ranges:

Code point range      Bytes
0x0000 – 0x007F       1
0x0080 – 0x07FF       2
0x0800 – 0xFFFF       3
0x010000 – 0x10FFFF   4
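
The same table can be turned into a tiny helper, and Buffer.byteLength gives the same answer directly; the function name utf8Bytes below is just for illustration:

// How many bytes UTF-8 needs for one code point, following the table above.
function utf8Bytes(codePoint) {
  if (codePoint <= 0x7f) return 1
  if (codePoint <= 0x7ff) return 2
  if (codePoint <= 0xffff) return 3
  return 4
}

console.log(utf8Bytes('e'.codePointAt(0)))   // 1
console.log(utf8Bytes('é'.codePointAt(0)))   // 2
console.log(utf8Bytes('易'.codePointAt(0)))  // 3
console.log(Buffer.byteLength('易', 'utf8')) // 3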