As we all know, all the information inside a computer can be represented in binary, with each binary digit being either 0 or 1. A binary digit is called a bit, and it is the smallest unit of information; for example, the binary number 0101 is four bits long. A byte is the unit used to measure storage capacity, regardless of data type, and one byte is equal to eight bits (eight binary digits).
Understanding character sets
A character is a general name for all kinds of letters and symbols, including letters, digits, punctuation marks, operators, and so on. Different characters grouped together form a character set, such as the ASCII character set, the GB18030 character set, the Unicode character set, and so on.
Since a computer only understands the binary digits 0 and 1, character encoding is required for it to process the characters contained in the various character sets correctly. A character encoding is a set of rules for converting characters into corresponding binary numbers, which makes it possible to store text in a computer and to transmit it over a network.
Characters are converted to binary numbers when they are entered into and stored in the computer. A byte has eight bits, so it can take 256 (2^8) different values, from 00000000 to 11111111. If each value corresponds to one character, a single byte can represent 256 characters. That is plenty for the English-oriented ASCII character set (128 characters in total), but nowhere near enough for the character sets of other languages; Chinese, for example, has roughly 100,000 characters.
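As a small illustration (using JavaScript, the language this article turns to later), the mapping between a character and its numeric code can be inspected directly; the letter 'A' here is just an arbitrary example:

```js
const code = 'A'.charCodeAt(0);        // 65, the numeric code of 'A' in ASCII/Unicode
console.log(code.toString(2));         // "1000001", its binary form, which fits in one byte
console.log(String.fromCharCode(65));  // "A", decoding the number back into a character
```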
Conversely, when characters are output, the binary numbers have to be converted back into the corresponding characters. Because most early character sets were designed for a single language, you had to know a file's encoding before opening it; opening a file with an inconsistent encoding produced garbled characters.
Unicode
To deal with the incompatibilities that arise from using different encodings, a single character set was needed that contains every character in the world: the Unicode character set. With it, garbled characters can no longer arise from mismatched character sets.
Unicode is an industry standard in computer science. It organizes and encodes most of the world's writing systems so that computers can display and process text more easily. Unicode is continually being improved, with more characters added in each new version. The latest version at the time of writing, 13.0.0, released in March 2020, already contains more than 130,000 characters.
Unicode defines a unique code point for each character, written as U+[hexadecimal number]. For example, the code point of the Chinese character 一 ("one") is U+4E00, and that of 上 ("up") is U+4E0A. Because there are so many characters, Unicode arranges them into 17 groups called planes, each of which contains 65536 (2^16) code points. The most common characters are all placed in plane 0, called the Basic Multilingual Plane (BMP), whose code points range from U+0000 to U+FFFF, 65536 characters in total. The remaining 16 planes are called supplementary planes, with code points ranging from U+010000 to U+10FFFF.
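These code points can be checked directly in JavaScript (the codePointAt method is covered in more detail later in this article):

```js
console.log('一'.codePointAt(0).toString(16)); // "4e00", i.e. code point U+4E00
console.log('上'.codePointAt(0).toString(16)); // "4e0a", i.e. code point U+4E0A
console.log('\u{4E0A}');                       // "上", a character written by its code point
```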
A character's Unicode code point is fixed, but in actual transmission the code point can be implemented in different ways, because different platforms are designed differently and because of the desire to save space. For example, the code point U+4E0A is 15 bits in binary (100 1110 0000 1010) and, depending on the implementation, may be stored in two, three, or four bytes. An implementation of Unicode is called a Unicode Transformation Format, or UTF for short; it converts Unicode code points into specific byte sequences. Common implementations are UTF-8, UTF-16, and UTF-32.
UTF-32
UTF-32 is a fixed-length implementation of Unicode that uses four bytes (32 bits) to represent each character, and the byte values are exactly the numeric value of the code point. For example, the code point of 一 is U+4E00, and padding it with two leading zero bytes gives its UTF-32 encoding, 0x00004E00. The main advantage of UTF-32 is that characters can be indexed directly by code point, which makes lookups efficient. The disadvantage is that every character takes four bytes, which wastes a lot of space; for the same English text, UTF-32 output is four times the size of UTF-8, so UTF-32 is rarely used.
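Because the bytes are just the code point itself, UTF-32 (big-endian) can be sketched in a few lines. The helper name toUTF32BE below is hypothetical, used only for illustration:

```js
// A minimal sketch of UTF-32 big-endian: four bytes, equal to the zero-padded code point.
function toUTF32BE(codePoint) {
  return [
    (codePoint >>> 24) & 0xFF,
    (codePoint >>> 16) & 0xFF,
    (codePoint >>> 8) & 0xFF,
    codePoint & 0xFF,
  ];
}

console.log(toUTF32BE(0x4E00).map(b => b.toString(16).padStart(2, '0')));
// ["00", "00", "4e", "00"], i.e. 0x00004E00
```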
UTF-8
UTF-8 is a variable-length implementation of Unicode that uses one to four bytes to represent a character. The encoding rules are as follows:
- For single-byte characters, the first bit of the byte is set to 0, and the remaining seven bits hold the binary value of the character's Unicode code point.
- For a character that takes n bytes (n > 1), the first n bits of the first byte are set to 1, the (n+1)th bit is set to 0, and the first two bits of each following byte are set to 10. The remaining bits are filled with the binary value of the character's Unicode code point, padded with leading zeros where necessary.
The rules are summarized in the following table, where x represents the bits occupied by the binary number of Unicode code points:
Bytes | Code point bits | Code point range (hexadecimal) | Byte 1 | Byte 2 | Byte 3 | Byte 4
---|---|---|---|---|---|---
1 | 7 | U+0000 ~ U+007F | 0xxxxxxx | | |
2 | 11 | U+0080 ~ U+07FF | 110xxxxx | 10xxxxxx | |
3 | 16 | U+0800 ~ U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx |
4 | 21 | U+10000 ~ U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
UTF-8 determines how many bytes a character occupies from the leading bits of its first byte: if the first bit is 0, the character is a single byte; if the first byte starts with 1, the number of consecutive leading 1s equals the number of bytes in the character.
Below, the character 上 (U+4E0A) is used to illustrate the encoding and decoding process.
The code point U+4E0A falls in the range of the third row of the table above, so the encoding process is as follows:
```
4E0A                         // hexadecimal code point
0100 1110 0000 1010          // binary value of code point 4E0A
---------------------------
1110xxxx 10xxxxxx 10xxxxxx   // three-byte template from row 3 of the table
11100100 10111000 10001010   // the code point bits filled into the template
E4 B8 8A                     // the resulting UTF-8 encoding in hexadecimal
```
So the UTF-8 encoding in hexadecimal is E4 B8 8A. The decoding process is as follows:
```
E4 B8 8A                     // UTF-8 bytes in hexadecimal
11100100 10111000 10001010   // the corresponding binary
1110xxxx 10xxxxxx 10xxxxxx   // matches the three-byte template in row 3 of the table
---------------------------
0100 1110 0000 1010          // extract the x bits to get the code point in binary
4E0A                         // the code point in hexadecimal: U+4E0A
```
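The same result can be reproduced in JavaScript. The codePointToUTF8 function below is a rough sketch of the table's rules written for this article (not a standard API), and the built-in TextEncoder, which always produces UTF-8, is used to cross-check it:

```js
// A minimal sketch of the UTF-8 rules in the table above (no validation or error handling).
function codePointToUTF8(cp) {
  if (cp <= 0x7F)  return [cp];                                   // 1 byte: 0xxxxxxx
  if (cp <= 0x7FF) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]; // 2 bytes
  if (cp <= 0xFFFF)
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]; // 3 bytes
  return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),          // 4 bytes
          0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
}

console.log(codePointToUTF8(0x4E0A).map(b => b.toString(16).toUpperCase())); // ["E4", "B8", "8A"]

// Cross-check with the built-in TextEncoder:
console.log([...new TextEncoder().encode('上')].map(b => b.toString(16).toUpperCase())); // ["E4", "B8", "8A"]
```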
UTF-8 has become the dominant encoding on the Internet because it saves space and is compatible with ASCII.
UTF-16
UTF-16 is also a variable-length implementation of Unicode, using either two or four bytes to represent a character. The UTF-16 rule is: characters in the Basic Multilingual Plane are represented with two bytes, and characters in the supplementary planes are represented with four bytes. In other words, a UTF-16 encoded character is either two bytes or four bytes long.
This raises a question: when the computer encounters two bytes, should it treat them as one character on their own, or combine them with the next two bytes and treat all four as one character? The solution is that, within the Basic Multilingual Plane, U+D800 to U+DFFF is an empty range: these code points do not correspond to any characters. This empty range can therefore be used to map characters of the supplementary planes. Specifically, 0x10000 is subtracted from the supplementary-plane character's code point, the result is written as 20 binary bits, the first 10 bits are mapped into U+D800 to U+DBFF, and the last 10 bits are mapped into U+DC00 to U+DFFF. In other words, the code point of a supplementary-plane character is split into two BMP code units (a surrogate pair).
Therefore, when two bytes are read whose value lies between U+D800 and U+DBFF, the next two bytes must lie between U+DC00 and U+DFFF, and the four bytes must be read together as one character. The following table shows the UTF-16 encoding ranges:
Bytes | Code point range (hexadecimal)
---|---
2 | U+0000 ~ U+D7FF
2 | U+E000 ~ U+FFFF
4 | U+10000 ~ U+10FFFF
As the table shows, for characters in the Basic Multilingual Plane the UTF-16 encoding is numerically identical to the Unicode code point. For example, the code point of 一 is U+4E00, and its UTF-16 encoding is 0x4E00.
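This can be observed in JavaScript, whose strings (as discussed later in this article) are made of UTF-16/UCS-2 code units:

```js
console.log('一'.charCodeAt(0).toString(16)); // "4e00", the UTF-16 code unit equals the code point
console.log('\u4E00' === '一');               // true
```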
Characters in the supplementary planes are converted as follows:
- Subtract 0x10000 from the code point and write the result as 20 binary bits.
- Add the first 10 bits to 0xD800 to get the lead (high) surrogate.
- Add the last 10 bits to 0xDC00 to get the trail (low) surrogate.
Take the ancient Greek numeral 𐅀, whose Unicode code point is U+10140, as an example. The conversion goes as follows:
- Subtract 0x10000 from the code point 0x10140; the result, written as 20 binary bits, is 00000000000101000000.
- The first 10 bits (0000000000) plus 0xD800 give 0xD800.
- The last 10 bits (0101000000) plus 0xDC00 give 0xDD40.
Thus the UTF-16 encoding of 𐅀 is 0xD800 0xDD40.
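The same surrogate-pair calculation can be written as a short JavaScript sketch; toSurrogatePair is a hypothetical helper name used only for this illustration:

```js
// A minimal sketch of the UTF-16 surrogate-pair rule above (supplementary-plane code points only).
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;      // subtract 0x10000
  const high = 0xD800 + (offset >> 10);    // first 10 bits + 0xD800: lead surrogate
  const low  = 0xDC00 + (offset & 0x3FF);  // last 10 bits + 0xDC00: trail surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x10140); // U+10140, the example above
console.log(high.toString(16), low.toString(16)); // "d800" "dd40"
```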
Byte order (endianness)
When a character is stored across multiple bytes, the question arises of which byte comes first. There are two byte orders: big endian and little endian. For example, the code point of the character 且 is U+4E14, which needs two bytes of storage, 4E and 14. If 4E is stored first and 14 second, the order is big endian; if 14 comes first and 4E second, it is little endian.
If two sides understand the byte order differently, the same bytes may be interpreted as different characters. For example, the character 乙, whose code is U+4E59, is stored as the two bytes 4E and 59. A system that reads them as little endian treats 4E59 as 594E and finds the character 奎, while a system that reads them as big endian finds the character 乙 (U+4E59).
So how does a computer know which byte order a file uses? Unicode specifies that U+FEFF (the byte order mark, BOM) be used to identify the byte order, and it may only appear at the start of the byte stream. If the first two bytes of a file are FE FF, the file is big endian; if they are FF FE, the file is little endian.
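The effect of byte order is easy to see with the standard DataView API, which lets you choose the endianness explicitly when reading or writing multi-byte values:

```js
const buffer = new ArrayBuffer(2);
const view = new DataView(buffer);

view.setUint16(0, 0x4E14, false);           // write U+4E14 big-endian
console.log(view.getUint8(0).toString(16)); // "4e", the high byte comes first

view.setUint16(0, 0x4E14, true);            // write the same value little-endian
console.log(view.getUint8(0).toString(16)); // "14", the low byte comes first
```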
Which encoding JavaScript uses
We all know that JavaScript uses the Unicode character set, but which Unicode implementation does it use? The answer is that JavaScript was born with UCS-2. A little historical context is needed here.
Historically, there were two independent attempts to create a single universal character set. One was the ISO/IEC JTC1/SC2/WG2 working group, created by the International Organization for Standardization (ISO) in 1984, which began building the Universal Character Set (UCS) in 1989. The other was the Unicode Consortium, formed in 1988 by Xerox, Apple, and other software manufacturers. Around 1991, the participants in both projects realized that the world did not need two incompatible character sets. They began merging their work and revising the previously published character sets to bring the UCS code points into full alignment with Unicode.
UCS moved faster than Unicode, and its first encoding, UCS-2, was published in 1990. It used two bytes to represent every character that had a code point at the time (there was only one plane then, the Basic Multilingual Plane, so two bytes were sufficient). UTF-16 was not released until July 1996, and it was explicitly declared a superset of UCS-2: characters in the Basic Multilingual Plane are encoded identically (two bytes), while characters in the supplementary planes are represented with four bytes. In short, UTF-16 superseded UCS-2; today only UTF-16 remains, and UCS-2 is obsolete.
In May 1995, Brendan Eich designed the JavaScript language in 10 days. In October of that year the first interpreter shipped, and in November of the following year Netscape formally submitted the language standard to ECMA. Comparing these dates with the release dates of UCS-2 and UTF-16, it is clear that when JavaScript appeared, UCS-2 was the only option available.
The shortcomings of JavaScript's character handling
Because JavaScript uses UCS-2, operations on supplementary-plane characters (those that need four bytes) behave unexpectedly. Take the emoji 😂 as an example: its code point is U+1F602 and its UTF-16 encoding is 0xD83D 0xDE02. Because it lives in a supplementary plane, JavaScript does not recognize it as a single character and instead treats it as the two separate characters U+D83D and U+DE02, as shown below:
```js
let emoji = '😂';
console.log('\u1F602' === '😂');  // false; \uXXXX takes only four hex digits, so this is '\u1F60' followed by '2'
console.log(emoji.length);        // 2; the length is wrong, this is one emoji and should be 1
console.log(emoji.charAt(0));     // �; code points between U+D800 and U+DFFF are an empty range and correspond
                                  // to no character, so the replacement character U+FFFD (�) is shown
console.log(emoji.charAt(1));     // �
console.log(emoji.charCodeAt(0)); // 55357, which is 0xD83D in hexadecimal
console.log(emoji.charCodeAt(1)); // 56834, which is 0xDE02 in hexadecimal
```
None of the character operations above gives the expected result, and other string functions have similar problems.
ES6 character improvements
ES6 adds broader and stronger support for strings, including the following:
- Characters with code points beyond the \u0000 ~ \uFFFF range can be written as \u{XXXXX}:

  ```js
  console.log('\u{1F602}'); // 😂
  ```
- ES6 adds an iterator interface for strings that recognizes code points greater than 0xFFFF, which an index-based for loop does not:

  ```js
  let text = '😂 is an emoji';

  // for...of iterates by code point and yields 😂 as a single character:
  for (let character of text) {
    console.log(character); // 😂, then ' ', 'i', 's', ...
  }

  // Index-based access works on code units and splits 😂 into two lone surrogates:
  for (let i = 0; i < text.length; i++) {
    console.log(text[i]); // �, �, then ' ', 'i', 's', ...
  }
  ```
- Array.from(string).length can be used to get the correct length of a string containing four-byte characters:

  ```js
  let emoji = '😂';
  console.log(Array.from(emoji).length); // 1
  ```
- Several new functions that can handle four-byte characters were added:

  - String.prototype.codePointAt() returns the code point (in decimal) of the character at a given position in a string:

    ```js
    let emoji = '😂'; // JavaScript treats the emoji as two code units

    console.log(emoji.codePointAt(0)); // 128514; correctly recognizes 😂 and returns its code point (0x1F602) in decimal
    console.log(emoji.codePointAt(1)); // 56834, the decimal value of U+DE02
    console.log(emoji.charCodeAt(0));  // 55357; charCodeAt cannot recognize four-byte characters
    console.log(emoji.charCodeAt(1));  // 56834
    ```

  - String.fromCodePoint() returns the character corresponding to a Unicode code point:

    ```js
    // Producing 😂 from its code point:
    console.log(String.fromCharCode(0x1F602));  // wrong character; fromCharCode truncates code points above 0xFFFF
    console.log(String.fromCodePoint(0x1F602)); // 😂
    ```

  - String.prototype.at() returns the character at a given position (at the time of writing, an experimental proposal without wide support).
- ES6 adds the u flag, which makes regular expressions handle four-byte characters correctly:

  ```js
  let emoji = '😂';
  console.log(/^.$/.test(emoji));  // false; without the u flag, . matches only a single code unit
  console.log(/^.$/u.test(emoji)); // true
  ```
- ES6 provides the normalize() method to normalize a string to a specified Unicode normalization form. Some characters consist of a letter plus a diacritic, such as Ǒ, and Unicode offers two ways to represent them. One is a single precomposed character that already carries the accent, Ǒ (\u01D1). The other is a combining sequence, in which the base letter and the diacritic are two characters combined into one, for example O (\u004F) followed by ˇ (\u030C), giving Ǒ (\u004F\u030C):

  ```js
  // Precomposed character
  console.log('\u01D1');       // Ǒ
  // Combining sequence
  console.log('\u004F\u030C'); // Ǒ
  ```
  These two representations are supposed to be equivalent, but JavaScript does not treat them as equal:
  ```js
  console.log('\u01D1' === '\u004F\u030C'); // false
  ```
  The normalize() method provided by ES6 solves this problem:
  ```js
  console.log('\u01D1'.normalize() === '\u004F\u030C'.normalize()); // true
  ```