ASCII

All information inside a computer is ultimately a binary value. Each bit has two states, 0 and 1, and 8 bits make up 1 byte, giving 2^8 = 256 combinations. That is, one byte can represent 256 different states, from 00000000 to 11111111.

ASCII is a character encoding that defines the relationship between English characters and binary values. It specifies 128 characters, which occupy only the last 7 bits of a byte; the first bit is always 0. For example, SPACE is 32 (binary 00100000).

(Figure: the ASCII code table)
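
You can see these mappings from JS, using the standard charCodeAt and fromCharCode methods:

'A'.charCodeAt(0) // 65 (binary 01000001)
' '.charCodeAt(0) // 32 (binary 00100000)
String.fromCharCode(65) // 'A'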

128 characters is enough for English, but not for other languages, so some countries used the unused eighth bit to encode new symbols. As a result, different countries ended up with different character sets: codes 0 to 127 are the same everywhere, but codes 128 to 255 differ from one encoding to another.

Chinese has far more symbols than that. One byte can only represent 256 symbols, which is certainly not enough, so more than one byte must be used to represent a single symbol.

Unicode

Because codes 128 to 255 may differ, there are many encoding formats in the world, and the same binary value can be interpreted as different symbols. So to open a text file you have to know its encoding; read it with the wrong one and the text comes out garbled. If there were one encoding that included every symbol in the world and gave each one a unique code, the garbling problem would disappear. That is Unicode: a code for all symbols.

Unicode is a very large character set, partitioned into groups of 65536 (2^16) characters, each called a plane; so far there are 17 planes. The first 65536 code points form the basic plane (abbreviated BMP), ranging from 0 to 2^16 - 1, written in hexadecimal as U+0000 to U+FFFF. All the most common characters are placed on this plane, which was the first plane Unicode defined and published. The remaining characters are placed in the supplementary planes (SMP), with code points ranging from U+010000 to U+10FFFF.
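
Since each plane holds 0x10000 code points, finding a code point's plane is a simple division; a small sketch (planeOf is just an illustrative name):

const planeOf = codePoint => Math.floor(codePoint / 0x10000);

planeOf(0x4E25)   // 0 (the basic plane, BMP)
planeOf(0x1D306)  // 1 (a supplementary plane)
planeOf(0x10FFFF) // 16 (the last of the 17 planes)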

Each symbol has its own code. For the specific symbol-to-code mapping, you can consult the Unicode charts, such as the Unicode table for Chinese, Japanese, and Korean characters.

Unicode starts at 0 and assigns each symbol a number, called a code point. For example, the symbol at code point 0 is NULL (all binary bits are 0).

U+0000 = null

The hexadecimal number immediately following U+ is a Unicode code point.

It is important to note that Unicode is just a set of symbols: it only specifies the code point of each symbol, not how it is stored. For example, the Unicode code point of the Chinese character 严 is hexadecimal 4E25, which is 100111000100101 in binary (15 bits), so at least two bytes are needed to represent this symbol, and symbols with larger code points may need three or four bytes.
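
You can verify this in JS with the standard charCodeAt and toString methods:

'严'.charCodeAt(0).toString(16) // '4e25'
(0x4E25).toString(2) // '100111000100101' (15 bits, so at least 2 bytes)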

This raises two problems. First, how does the computer tell Unicode apart from ASCII? That is, how does it know whether three bytes represent three symbols or one? Second, ASCII represents every letter in a single byte; if Unicode uniformly used three or four bytes per character, every English letter would carry two or three bytes of zeros, a huge waste of storage.

As a result, Unicode is stored in multiple ways, meaning that there are many different binary formats for storing code points.

UTF-32 and UTF-8

UTF-32 uses four bytes to represent every code point, with the bytes corresponding exactly to the code point value.

U+0000 = 0x0000 0000
U+597D = 0x0000 597D

The advantage is that the conversion rules are simple and intuitive and lookup is fast. The disadvantage is that it wastes a lot of space: the same English text is four times larger than in ASCII (ASCII stores each character in 1 byte, UTF-32 in 4 bytes).

UTF-8 is a variable-length encoding, using 1 to 4 bytes per character: the more common a character is, the fewer bytes it takes. The first 128 characters are represented in 1 byte, exactly the same as ASCII.
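
The variable lengths are easy to observe with the standard TextEncoder API (available in modern browsers and Node), which always encodes to UTF-8:

const enc = new TextEncoder();
enc.encode('a').length  // 1 byte  (ASCII range)
enc.encode('严').length // 3 bytes (basic plane, CJK)
enc.encode('𝌆').length  // 4 bytes (supplementary plane)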

UTF-8 encoding rules:

  1. For single-byte symbols, the first bit is set to 0 and the following 7 bits are the symbol's Unicode code point. So for English letters, UTF-8 encoding is the same as ASCII.
  2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, the (n + 1)-th bit is set to 0, and the first two bits of every following byte are set to 10. All remaining bits are filled with the symbol's Unicode code point (see the sketch after this list).
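
A minimal sketch of these two rules in JS (for illustration only; the function name utf8Encode is hypothetical, and real code would just use the standard TextEncoder):

// Sketch of the UTF-8 rules above; returns an array of byte values.
function utf8Encode(codePoint) {
  if (codePoint < 0x80) {
    // Rule 1: one byte, high bit 0, low 7 bits carry the code point
    return [codePoint];
  }
  // Rule 2: pick the byte count from the code point's magnitude
  let byteCount;
  if (codePoint < 0x800) byteCount = 2;
  else if (codePoint < 0x10000) byteCount = 3;
  else byteCount = 4;

  const bytes = [];
  // Continuation bytes start with 10 and carry 6 bits each
  for (let i = byteCount - 1; i > 0; i--) {
    bytes.unshift(0x80 | (codePoint & 0x3f));
    codePoint >>= 6;
  }
  // Leading byte: byteCount ones, a zero, then the remaining high bits
  const prefix = (0xff00 >> byteCount) & 0xff; // e.g. 0xE0 when byteCount is 3
  bytes.unshift(prefix | codePoint);
  return bytes;
}

utf8Encode(0x4e25).map(b => b.toString(16)) // ['e4', 'b8', 'a5']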

The advantage of fixed-length encoding is that characters can be located quickly, which suits methods like string.charAt(index). With UTF-8 you have to parse character by character from the start, which is slower. In practice, however, sequential processing is far more common than positional lookup, so the difference is rarely noticeable.

UTF-16

UTF-16 sits between UTF-32 and UTF-8, combining the characteristics of both fixed-length and variable-length encodings.

Encoding rules: characters in the basic plane take 2 bytes and characters in the supplementary planes take 4 bytes. That is, UTF-16's encoding length is either 2 bytes (U+0000 to U+FFFF) or 4 bytes (U+010000 to U+10FFFF).
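
Those 4 bytes come from the standard Unicode surrogate-pair formula; a minimal sketch (the function name toSurrogatePair is just for illustration):

// Split a supplementary-plane code point (>= 0x10000) into a
// UTF-16 surrogate pair, per the standard Unicode formula.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;     // 20 bits remain
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

toSurrogatePair(0x1D306).map(n => n.toString(16)) // ['d834', 'df06']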

UCS-2, the encoding used by JS

UCS-2 uses two bytes to represent every character that had a code point at the time (when UCS-2 was published there was only one plane, the basic plane, so two bytes were enough).

So why does JS use UCS-2 rather than the more capable UTF-16? Because when JS was created, UTF-16 did not exist yet.

JS can only handle UCS-2, so every character in the language is 2 bytes. A 4-byte character is treated as two double-byte characters, so JS's character functions are affected by this and cannot return the correct result.

console.log('𝌆'.length) // 2
console.log('𝌆' === '\uD834\uDF06') // true
'𝌆'.charCodeAt(0) // 55348
parseInt('D834', 16) // 55348

ES6 added better Unicode support, automatically recognizing 4-byte characters; the details are beyond the scope of this article.
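
As a brief taste, these standard ES6 APIs handle 4-byte characters correctly:

[...'𝌆'].length               // 1 (the string iterator walks code points, not code units)
'𝌆'.codePointAt(0)            // 119558 (0x1D306)
String.fromCodePoint(0x1D306) // '𝌆'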