Over the Chinese New Year holiday I read some material on front-end encryption and decryption. It mentioned charCodeAt(), which can be used to get the ASCII code points of the characters in a string. But the first sentence of the MDN documentation for this method says:

Returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index

So what does charCodeAt() actually return? What is the connection between ASCII and UTF-16?



So what exactly is character encoding? And what is UTF-8? Let's start our investigation from these questions.

ASCII

Every piece of code we write is really just text, which the computer must somehow convert into binary numbers like 010101 that it can process directly. When Americans first built computers, the text they needed covered the 26 English letters (52 counting both cases), the 10 digits 0 through 9, and the common English punctuation marks (including the space): three groups of visible characters, plus control characters such as the null character, line feed, and carriage return.



So in 1963 the American National Standards Institute (ANSI) published ASCII (American Standard Code for Information Interchange) as the text character encoding standard for computers and other equipment. ASCII numbers all of the visible and invisible characters mentioned above one by one: for example, 0 represents the null character, 48 represents the digit "0", 65 represents the uppercase letter "A", and 97 represents the lowercase letter "a", for a total of 128 characters. The numbers 0, 48, and 65 are called code points. The set of code points together with the characters they represent is called a character set; these 128 characters form the ASCII character set. Americans converted the code points directly into binary (the ASCII code) and stored them in the computer. One detail here: the largest code point is 127, which is 1111111 in binary, only 7 bits. However, computers generally read and write in units of 8 bits (1 byte), so a leading 0 is added when these code points are converted to binary; 127, for example, is stored as 01111111.
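
In JavaScript we can look up these code points and their 8-bit storage form directly; here is a minimal sketch (charCodeAt() technically returns UTF-16 code units, as we will see later, but for these characters the values coincide with the ASCII code points):

```js
// Look up ASCII code points and their 8-bit binary storage form.
for (const ch of ["A", "a", "0", " "]) {
  const codePoint = ch.charCodeAt(0);                     // e.g. "A" -> 65
  const binary = codePoint.toString(2).padStart(8, "0");  // pad to 1 byte
  console.log(ch, codePoint, binary);                     // A 65 01000001
}
```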



Note that characters are not necessarily stored in the computer as their code points. The mapping between characters and what the computer actually stores is called an encoding; the simplest encoding is to store each character's code point directly in binary.

GB2312 and GBK

ASCII uses 8 bits per character, and 8 bits can represent at most 256 characters. That is enough for Europe and the United States, but nowhere near enough for Chinese; it would not even cover the oracle bone script. So what to do? To solve the problem of storing Chinese, China designed a character set that uses 16 bits per character and manages its characters by zones: the GB2312 character set, whose code points converted to binary form the GB2312 code. Zone management allocates a total of 8,836 code points across 94 zones, each containing 94 positions. The zones are laid out as follows:

  • Zones 01-09: 682 characters other than Chinese characters;
  • Zones 10-15: blank, no characters assigned;
  • Zones 16-55: 3,755 level-1 Chinese characters, i.e. common characters, sorted by pinyin;
  • Zones 56-87: 3,008 level-2 Chinese characters, mostly characters we rarely see, sorted by radical/stroke;
  • Zones 88-94: also blank, no characters assigned;

If you look carefully, it is not hard to notice that the GB2312 character set only includes Chinese characters in zones 16-87, a total of 6,763 characters, which is far from enough for a writing system as vast and long-lived as Chinese. Hence GBK, an extension of GB2312, which adds more simplified and traditional Chinese characters as well as characters used in Japanese and Korean, among others.
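
As a quick check, TextDecoder accepts the "gbk" label in browsers and in Node.js builds with full ICU, which makes it easy to see that each Chinese character occupies 2 bytes. A minimal sketch (the byte values below are the GBK encoding of "你好", used here only for illustration):

```js
// Decode a GBK byte sequence: every Chinese character takes 2 bytes.
const gbkBytes = new Uint8Array([0xc4, 0xe3, 0xba, 0xc3]); // "你好" in GBK
const text = new TextDecoder("gbk").decode(gbkBytes);
console.log(text);                           // 你好
console.log(gbkBytes.length / text.length);  // 2 bytes per character
```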

Unicode

The United States had ASCII, mainland China had GBK, and every country designed its own encoding. The same number stored in a computer could represent different characters in different character sets, and so the garbled-text problem appeared. Thus Unicode, the "universal code", was born: a standard that contains both a character set and the corresponding encoding rules. Its purpose can be quoted from the homepage of its official website:

Everyone in the world should be able to use their own language on phones and computers.

In other words, it collects all the characters in the world and gives each one its own number (its code point). Unicode today even includes emoji. In JS, we can use the codePointAt() method to query the Unicode code point of a character.
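
For instance (a small sketch; the specific characters are just examples):

```js
// Query Unicode code points with codePointAt().
console.log("A".codePointAt(0));   // 65      (same as ASCII)
console.log("中".codePointAt(0));  // 20013   (U+4E2D)
console.log("😀".codePointAt(0));  // 128512  (U+1F600, an emoji)
```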

UTF-32, UTF-8, and UTF-16

In the beginning, Unicode used the UCS-2 character set, a 16-bit (2-byte) scheme that could represent 65,536 characters. The characters were arranged in order and assigned code points one after another, and, just as with ASCII, the code points were converted directly into binary and stored in the computer. Later it turned out that 65,536 characters were not enough to hold all the characters in the world, so the UCS-4 character set was created, encoded as UTF-32. It uses 32 bits, i.e. four bytes, per character, padding with leading zeros when there are not enough digits, and can represent nearly 4.3 billion characters. But spending 4 bytes on every character is very space-inefficient: each ASCII character needs only 1 byte, so the same English text takes 4 times as much space in UTF-32 as in ASCII. To solve UTF-32's space problem, UTF-8 came into being. UTF-8 is a variable-length encoding rule for Unicode: it divides the Unicode code points into the following four ranges and encodes the characters in each range with 8, 16, 24, or 32 bits respectively:

  • Code points U+0000 to U+007F (1 byte): 0xxxxxxx
  • Code points U+0080 to U+07FF (2 bytes): 110xxxxx 10yyyyyy
  • Code points U+0800 to U+FFFF (3 bytes): 1110xxxx 10yyyyyy 10zzzzzz
  • Code points U+10000 to U+10FFFF (4 bytes): 11110xxx 10yyyyyy 10zzzzzz 10aaaaaa

Notes on the table:

  • The fixed prefixes are what let the computer know where each character's bytes begin and end. For example, a byte starting with 1110 means that it and the two bytes that follow (each starting with 10) together represent one character.
  • x, y, z, and a each stand for a 0 or a 1; different letters are used to show how the bits of a code point are distributed across the bytes when it is converted to UTF-8.

UTF-8 is compatible with ASCII: the first 128 characters of Unicode, and their code points, are exactly the same as ASCII's, and when encoded in UTF-8 each of them takes 1 byte with exactly the same binary representation.
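
We can observe this variable-length behaviour directly with TextEncoder, which always produces UTF-8. A small sketch (the sample characters are arbitrary):

```js
// UTF-8 is variable-length: 1 byte for ASCII, more for other characters.
const encoder = new TextEncoder();  // TextEncoder always encodes to UTF-8
console.log(encoder.encode("A"));   // [ 65 ]                 1 byte, same as ASCII
console.log(encoder.encode("中"));  // [ 228, 184, 173 ]      3 bytes (U+4E2D)
console.log(encoder.encode("😀")); // [ 240, 159, 152, 128 ] 4 bytes (U+1F600)
```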

Finally, let's talk about UTF-16. UTF is short for Unicode Transformation Format. Unicode is responsible for numbering characters, whether UCS-2's 65,536 characters or UCS-4's much larger repertoire, so that every character has its own code point; how those code points are turned into strings of 0s and 1s and stored in the computer is the job of UTF-8, UTF-16, or UTF-32. UTF-16 encodes every character in the original UCS-2 range (the Basic Multilingual Plane) as one 16-bit (2-byte) code unit, and every character beyond it as a pair of 16-bit code units known as a surrogate pair, so each character takes at least 2 bytes and UTF-16 is not compatible with ASCII.
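
For code points above U+FFFF, the surrogate pair is derived with a fixed formula from the Unicode standard. Here is a small sketch of it in JS (the helper name is my own):

```js
// Split a code point above U+FFFF into a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;     // 20 bits remain
  const high = 0xd800 + (offset >> 10);   // high (leading) surrogate
  const low = 0xdc00 + (offset & 0x3ff);  // low (trailing) surrogate
  return [high, low];
}

console.log(toSurrogatePair(0x1f600).map(n => n.toString(16))); // [ 'd83d', 'de00' ]
console.log(String.fromCharCode(...toSurrogatePair(0x1f600)));  // 😀
```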

Finally, back to the original question: what does charCodeAt() return? It returns a UTF-16 code unit. For the first 128 characters of the Unicode character set this value is identical to the ASCII code point, so for those characters it effectively returns both at once.
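
A quick comparison makes the difference visible (a small sketch; the characters are just examples):

```js
// For ASCII characters, the UTF-16 code unit equals the ASCII code point.
console.log("A".charCodeAt(0), "A".codePointAt(0));  // 65 65

// For characters outside the BMP, charCodeAt() only sees one surrogate,
// while codePointAt() returns the full code point.
console.log("😀".charCodeAt(0));   // 55357  (0xD83D, the high surrogate)
console.log("😀".codePointAt(0));  // 128512 (0x1F600)
```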