The origin

Recently, while studying the source code of Babel, I came across this piece of logic in Acorn's lexer:

pp.fullCharCodeAtPos = function() {
  let code = this.input.charCodeAt(this.pos)
  if (code <= 0xd7ff || code >= 0xdc00) return code
  let next = this.input.charCodeAt(this.pos + 1)
  return next <= 0xdbff || next >= 0xe000 ? code : (code << 10) + next - 0x35fdc00
}

At first glance this code looks inscrutable: it checks whether a code falls in certain ranges and then recomputes a value from two codes. Digging a little deeper into character encoding, I discovered a long history behind it.

Remember how, a few years ago, people kept asking why some emoji have a length of 2 in JavaScript? The question about ‘𠮷’ seemed to be even more popular:

The early history of character encoding

Start with the telegraph

Before the invention of the telegraph, long-distance communication could only be carried out by means of post stations, carrier pigeons and beacon smoke, which required a very high cost. In the eighteenth century, people began to study the properties of electricity and the possibility of using it to transmit information.

Morse code — the earliest form of digital communication

American inventor Samuel Morse invented Morse code in 1836, ushering in the age of the telegraph. The first operational telegraph line appeared in England in 1839, installed between two stations of the Great Western Railway.

Morse code is a code made up of dots (·) and dashes (-). By timing the signals and the gaps between them, a message can be decoded with a lookup table:

The earliest Chinese telegrams used four Arabic numerals as codes, running from 0001 to 9999, so four digits could represent up to 10,000 Chinese characters, letters and symbols.

ASCII – the era of computer character coding

In 1946, the world's first general-purpose computer, ENIAC, was born. It was an electronic numerical integrator and computer and could not represent characters. It was not until 1963 that the American National Standards Institute (ANSI) promulgated the ASCII encoding scheme.

We all know that inside a computer all data is stored and computed in binary (computers use high and low voltage levels to represent 1 and 0). The 52 letters a, b, c, d and so on (counting upper and lower case), the digits, and common symbols such as *, # and @ also have to be stored as binary numbers. Which binary number stands for which symbol is, in principle, up to whoever defines the mapping (such a mapping is called an encoding). But for everyone to communicate without confusion, everyone has to use the same rules, so the American standards body produced ASCII, which fixes the binary numbers used for common symbols.

ASCII uses specified combinations of 7 or 8 binary digits to represent 128 or 256 possible characters. Standard ASCII, also known as basic ASCII, uses seven bits (the remaining bit is zero) to represent all upper- and lower-case letters, the digits 0 through 9, punctuation marks, and the special control characters used in American English.
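
As a quick illustration (using JavaScript, since that is where this article ends up; the engine's first 128 code points coincide with ASCII):

'A'.charCodeAt(0)        // 65 (0x41), fits comfortably in 7 bits
'a'.charCodeAt(0)        // 97 (0x61)
String.fromCharCode(64)  // '@'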

The chaotic age of character coding in non-English speaking countries

ISO/IEC 646 – the last gasp of 7 bits

ISO/IEC 646 is a standard developed in 1972 by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). Apart from the alphanumeric part, which is the same for all countries, ISO 646 lets countries that use the Latin alphabet modify certain code points as needed to define their own national character standards.

ISO 646 lets several punctuation marks double as diacritical marks for various European languages, since there was no code point space left to encode the diacritics directly:

  • The apostrophe doubles as an acute accent;
  • The backquote (backtick, opening quote mark) doubles as a grave accent;
  • The double quotation mark doubles as a diaeresis or umlaut;
  • The caret doubles as a circumflex accent;
  • The swung dash doubles as a tilde;
  • The comma doubles as a cedilla.

As you can see, many countries ran out of room at 7 bits and replaced some of the original ASCII characters with their own:

ISO 2022 – an 8-bit scheme compatible with ASCII and with large character sets

It soon became clear that seven bits were not enough for most Latin-based languages. ASCII had essentially grown out of communications, where protocols used the eighth bit for error checking and correction; for computer memory, that error checking is unnecessary, so 8-bit character encodings able to represent more characters than ASCII gradually emerged. For this reason the ECMA-35 standard, published in 1971, set out common rules that 7- and 8-bit character encodings should follow. ECMA-35 was subsequently adopted as ISO 2022.

ISO 2022 is compatible with the 7-bit encoding space: 0x00-0x1F is reserved for control characters, and 0x20-0x7F represents graphic characters. Thus a 7-bit encoding space holds a total of 94 graphic characters (space occupies 0x20 and DEL occupies 0x7F).

For a double-byte 7-bit encoding space, there can be 94 x 94 = 8,836 graphic characters. This became the basis of the encoding schemes for Chinese, Japanese and Korean characters.

The control characters specified in ISO 2022 are divided into two blocks, C0 and C1; the printing (graphic) characters are divided into four blocks: G0, G1, G2 and G3. For 7-bit encodings, byte values 0x00-0x1F are reserved for the C0 control block, and byte values 0x20-0x7F are used for the G0, G1, G2 and G3 blocks. For single-byte coded character sets, a graphic block may contain 94 or 96 characters; for double-byte coded character sets, a graphic block may contain 94 x 94 characters. Switching between G0, G1, G2 and G3 is done with control-character escape sequences.

Under ISO 2022, development went in two major directions: 8-bit character set schemes for the Western Latin-based languages, and double-byte 8-bit encoding schemes for China, Japan and Korea:

ISO/IEC 8859 – the 8-bit standard for Latin character sets

ASCII contains the space and 94 “printable characters”, which is enough for English. However, other languages that use the Latin alphabet (mainly those of European countries) have a number of additional accented letters, which can be stored and represented using the area outside ASCII and the control characters.

In addition to languages using the Latin alphabet, eastern European languages using the Cyrillic alphabet, Greek, Thai, modern Arabic, Hebrew, etc., can all use this form for storage and representation.

ANSI began this work in collaboration with ECMA in 1982. In 1985, ECMA-94, later known as ISO/IEC 8859 parts 1, 2, 3 and 4, was published. The remaining parts (5 through 16) were published between 1987 and 2001; part 12 was officially abandoned in 1997.

ISO 8859 builds on ISO 2022: the G0 code point area holds the 95 printable characters of ISO 646; the C0 and C1 control areas hold the control characters defined by ISO 6429; and the G1 area holds the extended printable characters defined by each of the 16 parts of ISO 8859. ISO 8859 is therefore fully compatible with 7-bit ASCII. It does not use the G2 and G3 areas of ISO 2022, nor the control-character escape sequences that ISO 2022 defines for switching between different coded character sets or between the G0, G1, G2 and G3 areas of the same set.

GB 2312 – a double-byte 8-bit scheme based on EUC storage

GB/T 2312, also written GB/T 2312-80 or GB/T 2312-1980, is the national standard Simplified Chinese character set of the People's Republic of China, formally titled “Chinese Coded Character Set for Information Interchange – Basic Set”. It is usually called GB (“guobiao”, the pinyin initials of “national standard”), and also GB0. It was promulgated by the State Administration of Standards of China in 1980 and took effect on May 1, 1981. GB/T 2312 is widely used in mainland China, and Singapore and some other regions also use it; almost all Chinese-language systems and international software in mainland China support it.

The GB/T 2312 standard contains 6,763 Chinese characters, of which 3,755 are first-level characters and 3,008 are second-level characters. It also contains 682 other characters, including Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters.

The appearance of GB/T 2312 basically met the needs of computer processing of Chinese characters: the characters it includes cover 99.75% of everyday usage in mainland China. However, it cannot handle the rare and traditional characters found in people's names, classical Chinese and other contexts, so the GBK and GB 18030 character sets appeared later to solve these problems.

In GB/T 2312, the included characters are divided into “areas”, each containing 94 characters/symbols, for a total of 94 areas. A character (actually a code point) is identified by its area and its position within the area, which is why these are called location codes:

  • Areas 01~09 (682 characters): special symbols, numbers, English characters, tab marks and so on, including Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters, all as full-width characters;
  • Areas 10~15: empty, reserved for expansion;
  • Areas 16~55 (3,755 characters): common Chinese characters (first-level characters), sorted by pinyin;
  • Areas 56~87 (3,008 characters): less common Chinese characters (second-level characters), sorted by radical/stroke count;
  • Areas 88~94: empty, reserved for expansion.

To avoid the C0 non-printing characters and the space character of ASCII, the national code (also known as the interchange or exchange code) specifies that the double-byte encoding of a character ranges from (33, 33) to (126, 126): 32 is added to the area number and to the position number respectively, so that the result does not collide with the non-printing and space characters 0 to 32 of ASCII.

But the national code still conflicts with ordinary ASCII text, so the highest bit of each byte is changed from 0 to 1, that is, 128 is added to each byte. The result is the “machine code” representation of the GB code, usually called the internal code.

Stored this way, the internal code of a GB2312 character follows the EUC storage convention: 0xA0 is added to the raw area and position numbers so that the bytes cannot collide with ASCII.
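
As an illustration (my own sketch, not part of any standard library), here are the two conversions just described in JavaScript, using the well-known fact that ‘啊’ sits at area 16, position 1:

// gb2312Codes is a hypothetical helper for this article
function gb2312Codes(area, position) {
  // National (interchange) code: add 32 (0x20) to stay clear of the C0 controls and the space
  const national = [area + 0x20, position + 0x20]
  // Internal (EUC-CN) code: also set the high bit, i.e. add 0xA0 to the raw area/position
  const internal = [area + 0xa0, position + 0xa0]
  return { national, internal }
}

gb2312Codes(16, 1).national.map(b => b.toString(16)) // ['30', '21']
gb2312Codes(16, 1).internal.map(b => b.toString(16)) // ['b0', 'a1'], the EUC-CN bytes of '啊'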

Entanglements and disputes in the era of unification

With ISO 2022, each country defined its own character set, but incompatibilities between countries were common. A general problem with these traditional encodings is that they let computers handle bilingual environments (usually Latin letters plus the local language) but not multilingual environments, with several languages mixed in the same text.

The advent of the Internet, in particular, has made uniform coding an even more pressing need.

But two groups set out to do this unification work:

  • The ISO/IEC working group, set up by the International Organization for Standardization (ISO) in 1984, which produced ISO/IEC 10646
  • The Unicode Consortium, formed in 1988 by Xerox, Apple and other software manufacturers

ISO 10646 standard – a standard caught in a political vortex

The original ISO 10646 character set was UCS-4, which encodes each character in four octets and was meant to hold all of the world's languages. In theory UCS-4 has 4,294,967,296 code points, enough for every writing system on earth; some joked it would not be exhausted even by the time the galaxy dies.

But it was this very forward-looking decision that put ISO 10646 at a disadvantage. Back in the early 1980s, at the height of Xerox's influence, Xerox was promoting an international character set (which later became Unicode) and had rallied a group of supporters. With Joe Becker, Lee Collins (later at Taligent), Eric Mader and Dave Opstad (Apple) already working on Unicode, participation expanded to a broad community of industry representatives, including Bill English (Sun Microsystems), Asmus Freytag (Microsoft), Mark Kernigan (Metaphor), Rick McGowan (NeXT), Isai Scheinberg (IBM), Karen Smith-Yoshimura (Research Libraries Group), Ken Whistler (University of California, Berkeley; Metaphor) and others.

At that time Unicode used a 16-bit encoding and supported 65,536 characters. The computer manufacturers, led by American companies, did not want to waste 4 bytes to store one character: for them 65,536 was enough, and they believed all of the world's scripts could be mapped into that character set.

After several years of struggle, the ISO side finally compromised: in 1993 ISO 10646-1 was released, adopting UCS-2 and aligning with Unicode, which became Unicode 1.1. This shaped the programming languages created at the time (such as JavaScript, the focus of this article, and the older Java).

Starting with Unicode 2.0, Unicode uses the same character repertoire and code assignments as ISO 10646-1, and ISO promised that ISO 10646 would never assign code points beyond U+10FFFF, to keep the two consistent. The two projects still exist independently and publish their standards independently, but the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the two standards compatible and to coordinate any future extensions closely. (In their published code charts, Unicode generally uses the most common glyph for each character, while ISO 10646 uses Century-style fonts wherever possible.)

Unicode – The ultimate winner

In the race to become the industry standard, Unicode was launched by computer hardware and software vendors. Unicode corresponds to the ISO 10646 concept of a universal character set. Early versions of Unicode corresponded to UCS-2 and used a 16-bit coding space, that is, 2 bytes per character, which can in theory represent up to 2^16 (65,536) characters, basically enough for the major languages. In practice those versions did not fully use the 16-bit space, reserving a lot of room for special uses and future extension.

The 16-bit Unicode characters above form the Basic Multilingual Plane. More recent versions of Unicode define 16 supplementary planes, which together require at least 21 bits of coding space, slightly less than 3 bytes; in practice supplementary-plane characters still occupy 4 bytes of encoding space, which is consistent with UCS-4. Later versions were to expand to ISO 10646-1 implementation level 3, covering all UCS-4 characters. UCS-4 is a larger, not fully assigned character set: its first bit is always zero, so it effectively has 31 usable bits while occupying 32 bits (4 bytes), and can in theory represent up to 2^31 characters, enough to cover the symbols of every language.

Unicode characters are currently arranged into 17 groups called planes, each with 65,536 (2^16) code points. At present, however, only a few planes are used.

The Basic Multilingual Plane (BMP), also known as Plane 0, is the first encoding block in Unicode, covering the range U+0000 to U+FFFF.
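
A tiny sketch of how a code point maps to a plane (planeOf is my own helper, not a standard function):

const planeOf = codePoint => codePoint >> 16 // each plane holds 0x10000 code points

planeOf(0x2764)  // 0, U+2764 ❤ lives in the BMP
planeOf(0x1F602) // 1, U+1F602 😂 lives in Plane 1 (the SMP)
planeOf(0x20BB7) // 2, U+20BB7 𠮷 lives in Plane 2 (the SIP)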

CJK Unified Ideographs – the entanglement of Chinese characters

CJK Unified Ideographs, also known as unified Han characters or Unihan, is the project in ISO 10646 and the Unicode standard to assign a single encoding to ideographs from the Chinese, Japanese, Korean, Vietnamese, Zhuang and Ryukyuan languages that share the same origin and meaning and are identical or nearly identical in shape.

In 1991, the national bodies rejected the first draft of ISO/IEC 10646, hoping for a consistent way of handling Han characters. Following a proposal from China and the Unicode Consortium, ISO 10646 and Unicode set up the CJK Joint Research Group, which would define the unified encoding on the basis of each country's existing Han character standards. By the end of the year it had completed the Unified Repertoire and Ordering (URO). In 1992 the URO was incorporated into the second draft of ISO 10646; some deficiencies were later found and corrected.

In May 1993, the original block of “CJK Unified Ideographs” was formally defined, containing 20,902 characters in the range U+4E00 to U+9FFF. There is also one Chinese character, “○” (code point U+3007), which was placed in the symbols and punctuation block because it is a numeral. A month later, Unicode 1.1 was released.

However, the unification of Chinese characters still has its critics:

Although merging variant characters helps reduce the number of characters to be included, in academic work such as the study of ancient texts, history and philology, some documents need to show the different variant forms side by side, because variants that have been merged carry different meanings in those documents. When using Unicode, scholars may have to display the same character in different computer fonts, create their own characters, or fall back on encodings other than Unicode. Finding and switching fonts is inconvenient; it defeats Unicode's purpose of recording every character; such text can no longer be exchanged as plain text; and font licensing makes exchanging fonts difficult. In effect, such documents cannot be recorded accurately in Unicode at all, which hinders the computerization of texts.

If retrieval is based on glyphs, merging characters whose glyphs differ makes searching confusing and difficult: the same character may be written with different numbers of strokes in mainland China, Japan and traditional-Chinese usage, yet Unicode gives them a single code point, so the nominal stroke count may not match the glyph actually displayed even when the character is found. Critics therefore argue that Unicode's merging of regional variants is undesirable.

Unicode has also included many ghost characters whose provenance is hard to trace and which almost never occur in real life. Some are administrative coinages, or appear only in one person's name; that person may not be famous and may already have died, yet the character is permanently enshrined in the standard and occupies a code point. One example is the Taiwanese lawyer Lu Qiu𧽚: the character “𧽚” in his name should have been the character meaning “far”, but the household-registration clerk misheard the radical his grandfather named in Taiwanese, and the grandfather did not dare to correct it. Only as an adult did the man confirm that it was a mistaken character which “no one has written in 5,000 years” [26], yet it is now permanently part of Unicode. Similarly, scholars have pointed out that many characters used for personal names in the Hong Kong Supplementary Character Set (HKSCS) are miswritten or invented characters of unknown origin that appear in no authoritative dictionary, and have warned that including them in the character database does permanent damage [27]. Li Xiang, a member of the Chinese information-processing community, criticized the authorities in his column for “failing to solve the pronunciation problems of the thousands of wrong, mistaken and invented characters in the supplementary character set” and urged them “not to tie the HKSCS to a mandatory ISO application” [28]. These names are nevertheless included in Unicode, which constitutes the controversy over accepting too many characters.

Other critics argue that Unicode's inclusion of large numbers of erroneous characters, and of nearly identical glyph variants of the same character, is itself inappropriate. Computer text can never fully and losslessly record the source documents, which themselves differ slightly because of copying and typesetting; encoding every variant writing of every character simply wastes space. Truly lossless research and preservation can only be done from the original or a facsimile, and it is a mistake to shift the burden of lossless preservation onto the encoding.

At the same time, Unicode currently encodes some variant characters separately, which makes retrieval difficult: as soon as the written form differs slightly, a search misses it, so users have to search repeatedly with every variant writing, duplicating effort and hindering research. For example, Unicode places the character for “son” and its variant “𠒇” at different code points, so a document written with one form cannot be found by searching for the other.

Unicode implementation – UTF conversion

How Unicode is implemented is a separate question from how it is encoded: a character's Unicode code point is fixed, but in actual transmission and storage, different platforms are designed differently and want to save space, so the code points are represented in different ways. The different implementations of Unicode are called Unicode Transformation Formats (UTF).

UTF-32

UTF-32 is the simplest implementation: it uses 32 bits to encode each Unicode code point. The length of a UTF-32 encoding is fixed, and each 32-bit value represents exactly one Unicode code point, with the same numeric value as that code point.

The main advantage of UTF-32 is that it can be indexed directly by code point: finding the Nth character in a sequence is a constant-time operation, whereas variable-length encodings need a sequential scan to reach the Nth character. This makes it possible, in programming, to treat the position of a character in a string as an integer and get the next character by adding one, just as with an ASCII string.

The main disadvantage of UTF-32 is that it uses four bytes per code point and therefore wastes a lot of space. Characters outside the Basic Multilingual Plane are rare in most texts, so UTF-32 typically takes nearly twice the space of UTF-16 and up to four times the space of UTF-8, depending on the proportion of ASCII characters in the text.
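
A rough JavaScript sketch (toUTF32 is my own helper) of the constant-time indexing UTF-32 gives you, compared with the UTF-16 code units that JavaScript strings expose:

// One 32-bit unit per code point (assumes the string has no lone surrogates)
const toUTF32 = str => Uint32Array.from(str, ch => ch.codePointAt(0))

const units = toUTF32('a😂b')
units[1].toString(16) // '1f602', the 2nd character, found by direct indexing
'a😂b'[1]             // '\ud83d', indexing UTF-16 code units lands on a lone surrogate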

UTF-32 and UCS-4

In November 2003, RFC 3629 limited Unicode to code points up to U+10FFFF because of the constraints of the UTF-16 encoding form (the range U+D800 to U+DFFF is also reserved). Although the ranges 0xE00000 to 0xFFFFFF and 0x60000000 to 0x7FFFFFFF had been set aside as “reserved for private use” in the earlier ISO standard (before Unicode 2.1, 1998), these areas were removed in later versions. The ISO/IEC JTC 1/SC 2 WG2 specification states that all future character assignments in UCS-4 will be restricted to the Unicode range, so UTF-32 and UCS-4 can represent the same characters.

UTF-16

UTF-16 maps the abstract code points of the Unicode character set to a sequence of 16-bit integers (code units) for storage or transmission. A Unicode character needs either one or two 16-bit code units, so UTF-16 is a variable-length encoding.

Many people mistake UTF-16 for a fixed-length 2-byte encoding; that is actually a confusion with UCS-2:

UTF-16 can be regarded as a superset of UCS-2. In the absence of surrogate code points, UTF-16 and UCS-2 mean the same thing; once supplementary-plane characters are involved, it is UTF-16. Any software that today claims to support only UCS-2 is implying that it cannot handle characters that need more than 2 bytes in UTF-16. For code points below 0x10000, the UTF-16 encoding equals the UCS code value.

Let's see how UTF-16 maps the non-basic (supplementary) planes:

Code points in the Supplementary Planes are encoded in UTF-16 as two 16-bit code units (32 bits, 4 bytes), called a surrogate pair:

  1. Subtract 0x10000 from the code point, leaving a 20-bit value in the range 0x00000 to 0xFFFFF.
  2. Add the high 10 bits (a value in the range 0x000 to 0x3FF) to 0xD800 to get the first code unit, the high surrogate, which falls in the range 0xD800 to 0xDBFF. Because high surrogates always have smaller values than low surrogates, the Unicode standard now calls them lead surrogates to avoid confusion.
  3. Add the low 10 bits (also a value in the range 0x000 to 0x3FF) to 0xDC00 to get the second code unit, the low surrogate, which falls in the range 0xDC00 to 0xDFFF. Because low surrogates always have larger values than high surrogates, the Unicode standard now calls them trail surrogates to avoid confusion.

The algorithm can be understood as follows: code points in the supplementary planes range from U+10000 to U+10FFFF, a total of 2^20 = 1,048,576 code points, which need 20 bits. Represented as two 16-bit integers, the first (the lead surrogate) carries the high 10 of those 20 bits and the second (the trail surrogate) carries the low 10 bits. The value of a 16-bit unit alone is enough to tell whether it is a lead surrogate (one of 2^10 = 1,024 values) or a trail surrogate (another 1,024 values). Therefore 2,048 code points that correspond to no character must be reserved in the Basic Multilingual Plane to provide the space the lead and trail surrogates need; that is only 3.125% of the 65,536 code points in the BMP.
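
Put into code, the algorithm above looks roughly like this (a sketch of the standard formulas with hypothetical helper names, not engine source):

// Encode a supplementary-plane code point as a surrogate pair
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000       // 20-bit value
  const lead = 0xD800 + (offset >> 10)     // high 10 bits -> 0xD800..0xDBFF
  const trail = 0xDC00 + (offset & 0x3FF)  // low 10 bits  -> 0xDC00..0xDFFF
  return [lead, trail]
}

// Decode a surrogate pair back to the code point
function fromSurrogatePair(lead, trail) {
  return (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000
}

toSurrogatePair(0x1F602).map(n => n.toString(16)) // ['d83d', 'de02']
fromSurrogatePair(0xD83D, 0xDE02).toString(16)    // '1f602'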

Note that the Unicode standard specifies that the values U+D800 through U+DFFF do not correspond to any character; this is what makes UTF-16 surrogate pairs possible.

UTF-16 has seen very wide use over the years:

UTF-16 is used for text in the OS APIs of all currently supported Microsoft Windows versions (at least since Windows CE / 2000 / XP / 2003 / Vista / 7, and including Windows 10). In Windows XP, no European-language font shipped with Windows covered code points above U+FFFF. Older Windows NT systems (before Windows 2000) support only UCS-2. Files and network data tend to be a mix of UTF-16, UTF-8 and legacy byte encodings.

The Python language officially used only UCS-2 internally from version 2.0, although its UTF-8 decoder to “Unicode” produces correct UTF-16. From Python 2.2 a “wide” Unicode build using UTF-32 is supported [24]; these were mainly used on Linux. Python 3.3 no longer uses UTF-16; instead it chooses, for each string, whichever of ASCII/Latin-1, UCS-2 or UTF-32 gives the most compact representation.

Java originally used UCS-2 and added UTF-16 supplementary character support in J2SE 5.0.

JavaScript may use UCS-2 or UTF-16. Since ES2015, string methods and regular expression flags have been added that allow strings to be handled from an encoding-agnostic perspective.

UCS-2 is also supported by the PHP language and by MySQL.

Swift 5, Apple's preferred application language, switched its preferred string encoding from UTF-16 to UTF-8.

UTF-8

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode and a prefix code. It can encode every valid Unicode code point in one to four bytes. It is part of the Unicode standard and was originally proposed by Ken Thompson and Rob Pike. Because code points with smaller values tend to be used more frequently, encoding everything with a fixed-width form is inefficient and wastes memory. UTF-8 was designed to be backward compatible with ASCII: the first 128 Unicode characters are encoded as a single byte whose value is the same as in ASCII, corresponding one to one, so existing ASCII software can keep handling such text with little or no modification. As a result, UTF-8 gradually became the preferred encoding for e-mail, web pages and other kinds of stored or transmitted text.

UTF-8's encoding rules are very simple, using the leading bits of the first byte to distinguish the different Unicode ranges:

The original design varied from 1 to 6 bytes per code point. UTF-8 encodes UCS in 8-bit units and has no big-endian or little-endian variants. Every byte of a multi-byte character except the first starts with the bits “10”, so a text processor can quickly find the start of each character.
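
As an illustration of those rules, restricted to the 1 to 4 byte range used today (toUTF8Bytes is my own sketch, not a library function):

function toUTF8Bytes(codePoint) {
  if (codePoint < 0x80)    return [codePoint]                  // 0xxxxxxx
  if (codePoint < 0x800)   return [0xC0 | (codePoint >> 6),    // 110xxxxx 10xxxxxx
                                   0x80 | (codePoint & 0x3F)]
  if (codePoint < 0x10000) return [0xE0 | (codePoint >> 12),   // 1110xxxx 10xxxxxx 10xxxxxx
                                   0x80 | ((codePoint >> 6) & 0x3F),
                                   0x80 | (codePoint & 0x3F)]
  return [0xF0 | (codePoint >> 18),                            // 11110xxx plus three continuation bytes
          0x80 | ((codePoint >> 12) & 0x3F),
          0x80 | ((codePoint >> 6) & 0x3F),
          0x80 | (codePoint & 0x3F)]
}

toUTF8Bytes(0x41).map(b => b.toString(16))    // ['41'], ASCII stays ASCII
toUTF8Bytes(0x20AC).map(b => b.toString(16))  // ['e2', '82', 'ac'], '€'
toUTF8Bytes(0x1F602).map(b => b.toString(16)) // ['f0', '9f', '98', '82'], '😂'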

On the Mac, files default to UTF-8 encoding. MySQL's character set list contains two implementations of UTF-8: utf8 and utf8mb4. utf8 is an implementation limited to at most 3 bytes per character; utf8mb4 is the full implementation of UTF-8, using up to 4 bytes per character. The reason is historical: MySQL added UTF-8 support in version 4.1 in 2003 (based on the then-current draft, RFC 2279) and in September of that year restricted its implementation to at most 3 bytes; UTF-8 was only later finalized as a standards-track document (RFC 3629). Limiting the space used by the UTF-8 implementation was generally seen as a compatibility and read-optimization decision in the database file design, but it did not achieve that goal, and it meant MySQL's utf8 could not store Unicode characters outside the Basic Multilingual Plane (such as emoji), since a 3-byte implementation can only hold BMP characters. Only in 2010 did version 5.5 introduce utf8mb4 as a replacement; utf8 was renamed utf8mb3, with “utf8” kept as an alias for utf8mb3, and the old utf8 encoding is no longer recommended.

What character encoding does JavaScript use?

The ECMAScript 5.1 specification says:

A conforming implementation of this International standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it is presumed to be the UTF-16 encoding form.

As you can see, the JavaScript engine is free to use UCS-2 or UTF-16 internally. Most current engine implementations are UTF-16.

This explains the question at the beginning: why, in JS, ‘❤’ has length 1 while ‘😂’ has length 2. It is very simple:

  • The Unicode code point of ‘❤’ is U+2764
  • The Unicode code point of ‘😂’ is U+1F602
  • The Unicode code point of ‘吉’ is U+5409
  • The Unicode code point of ‘𠮷’ is U+20BB7

So ‘❤’ and ‘吉’ are in the Unicode BMP, while ‘😂’ and ‘𠮷’ are in the supplementary planes. Shaped by its UCS-2 heritage, JavaScript's length counts UTF-16 code units and does not treat a supplementary-plane character as a single character.
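
A quick check in the console (assuming a UTF-16 based engine, which is what the mainstream engines use):

'❤'.length  // 1, U+2764 is a single UTF-16 code unit
'吉'.length // 1, U+5409 is also in the BMP
'😂'.length // 2, U+1F602 is stored as a surrogate pair
'𠮷'.length // 2, U+20BB7 is also outside the BMP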

Inside the engine, supplementary-plane characters are stored as UTF-16 surrogate pairs. Using ‘😂’ as an example, let's do the calculation:

The Unicode code point of ‘😂’ is U+1F602. To represent it as a surrogate pair:

  1. Subtract 0x10000 from the code point: 0x1F602 - 0x10000 = 0xF602.
  2. Add the high 10 bits to 0xD800 to get the lead (high) surrogate: 0xF602 >> 10 = 0x3D, and 0xD800 + 0x3D = 0xD83D.
  3. Add the low 10 bits to 0xDC00 to get the trail (low) surrogate: 0xF602 & 0x3FF = 0x202, and 0xDC00 + 0x202 = 0xDE02.

Finally, let's verify the surrogate pair (0xD83D, 0xDE02):
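
In the console, the pair checks out (these are standard String methods, nothing exotic):

String.fromCharCode(0xD83D, 0xDE02) === '😂' // true
'😂'.charCodeAt(0).toString(16)  // 'd83d', the lead (high) surrogate
'😂'.charCodeAt(1).toString(16)  // 'de02', the trail (low) surrogate
'😂'.codePointAt(0).toString(16) // '1f602', the real code point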

So it is easy to understand why JavaScript behaves this way, and TC39 has in fact addressed the issue:

So ES6 provides access to Unicode code points:
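
A non-exhaustive sketch of what that looks like in practice:

'😂'.codePointAt(0).toString(16) // '1f602', reads through the surrogate pair
String.fromCodePoint(0x1F602)    // '😂'
'\u{1F602}'                      // '😂', code point escape syntax
[...'😂'].length                 // 1, the string iterator yields whole code points
/^.$/u.test('😂')                // true, the u flag makes '.' match a full code point
Array.from('𠮷😂').length        // 2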

Incidentally, the newer generation of programming languages has adopted UTF-8 for strings (I have always felt that surrogate pairs were a clumsy patch for the shortcomings of UCS-2):

  • Golang uses UTF-8 to implement string
  • Rust uses UTF-8 to implement string and UTF-32 to implement char

Back to Acorn's implementation: what it does is simply decode a supplementary-plane character from its surrogate pair, it just writes the arithmetic in a non-obvious way, with the constants folded together. Interested readers can work through the math; here is a more intuitive version, the codePointAt polyfill from MDN:

/*! http://mths.be/codepointat v0.1.0 by @mathias */
if (!String.prototype.codePointAt) {
  (function() {
    'use strict'; // needed to support `apply`/`call` with `undefined`/`null`
    var codePointAt = function(position) {
      if (this == null) {
        throw TypeError();
      }
      var string = String(this);
      var size = string.length;
      // `ToInteger`
      var index = position ? Number(position) : 0;
      if (index != index) { // better `isNaN`
        index = 0;
      }
      // Account for out-of-bounds indices:
      if (index < 0 || index >= size) {
        return undefined;
      }
      // Get the first code unit
      var first = string.charCodeAt(index);
      var second;
      if ( // check if it's the start of a surrogate pair
        first >= 0xD800 && first <= 0xDBFF && // high surrogate
        size > index + 1 // there is a next code unit
      ) {
        second = string.charCodeAt(index + 1);
        if (second >= 0xDC00 && second <= 0xDFFF) { // low surrogate
          // http://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae
          return (first - 0xD800) * 0x400 + second - 0xDC00 + 0x10000;
        }
      }
      return first;
    };
    if (Object.defineProperty) {
      Object.defineProperty(String.prototype, 'codePointAt', {
        'value': codePointAt,
        'configurable': true,
        'writable': true
      });
    } else {
      String.prototype.codePointAt = codePointAt;
    }
  }());
}

The most intuitive core conversion:

C = (H - 0xD800) * 0x400 + L - 0xDC00 + 0x10000
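
And we can check that this formula and Acorn's folded-constant version agree (a quick sanity check of my own, not Acorn's test code):

const H = 0xD83D, L = 0xDE02
((H - 0xD800) * 0x400 + L - 0xDC00 + 0x10000).toString(16) // '1f602'
((H << 10) + L - 0x35FDC00).toString(16)                   // '1f602'
// 0x35FDC00 is simply (0xD800 << 10) + 0xDC00 - 0x10000 folded into one constant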

Closing words

Character encoding is entangled with the culture, politics and standardization efforts of many countries; it is not a purely technical issue. Its history therefore carries all sorts of baggage and technical debt, which is why we run into so many strange problems in programming.

Every programmer should know something about character encoding. With the basic concepts in place, we can understand programming languages and text handling more deeply. I spent a lot of time researching and checking material for this article; I hope it helps, and I welcome discussion!

References

  • The telegraph
  • Character encoding
  • ASCII
  • ISO/IEC 646
  • ISO/IEC 2022
  • ISO/IEC 8859
  • GB 2312
  • Universal coded character set
  • Unicode Revisited
  • Unicode character plane mapping
  • UTF-32
  • UTF-16
  • UTF-8
  • Annotated ECMAScript 5.1
  • Full Unicode in ECMAScript
  • JavaScript’s internal character encoding: UCS-2 or UTF-16?