Strings are a fundamental concept in every programming language, and also a surprisingly complex one.

What is encoding and decoding?

As we all know, computers store and process data internally as binary code, while people are far better at reading text and graphics. To store human-readable text or graphics on a computer, and to display stored content in a form people can understand, the two representations must be converted into each other. Encoding is the process of converting data from one form or format into another according to a set of rules; put simply, it is a kind of translation. For example, converting the character A to 01000001 is encoding, and decoding is the reverse process.
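The A to 01000001 conversion can be tried directly in JavaScript. This is only a minimal sketch: the helper names toBinary and fromBinary are hypothetical, and it assumes a single character whose code fits in 8 bits.

```javascript
// Encode: character -> numeric code -> 8-bit binary string (hypothetical helper).
function toBinary(ch) {
  return ch.charCodeAt(0).toString(2).padStart(8, '0');
}

// Decode: 8-bit binary string -> numeric code -> character (hypothetical helper).
function fromBinary(bits) {
  return String.fromCharCode(parseInt(bits, 2));
}

console.log(toBinary('A'));          // 01000001
console.log(fromBinary('01000001')); // A
```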

Character set

A character set is the collection of all abstract characters supported by a system. The Xinhua Dictionary is a useful analogy: it works like a character set in which every Chinese character can be located by its page and line number. Simply put, a character set is a mapping table that maps code values to specific characters in an abstract character table. Common character sets include ASCII, GBK, and Unicode. Each character set covers a limited repertoire: ASCII contains only basic Latin letters, digits, punctuation, and control characters; GBK adds Chinese characters; and Unicode aims to cover the characters of all the world's writing systems.

ASCII code

For a long time after computers were invented, they were used mainly in the developed countries of Europe and North America, so the only characters programs needed to convert were Latin letters and Arabic numerals. This led to ASCII (American Standard Code for Information Interchange). ASCII defines 128 characters and their binary codes: the 26 letters in upper and lower case, the 10 digits, punctuation marks, and special control characters, which together cover the characters commonly used in English. One byte (8 bits) is more than enough for these 128 characters: a byte can represent 256 values, so only the low 7 bits are used and the highest bit is always 0. For example, lowercase a is 01100001 and uppercase A is 01000001.

GB2312 encoding

As computers spread to more countries, it became clear that ASCII did not account for other languages, so each country extended ASCII for its own language. Chinese was a particular challenge: there are about 3,500 commonly used Chinese characters alone, far more than even EASCII (the extension of ASCII that also uses the highest bit) can hold. GB2312 was issued in 1980 by the General Administration of National Standards of China and took effect on May 1, 1981. It encodes each Chinese character in two bytes. Two bytes could in theory represent 65,536 characters, but GB2312 includes only 6,763 Chinese characters; see the GB2312 Simplified Chinese code table for details.

GBK code

GBK, whose full name is “Chinese Character Code Extension Specification (GBK)” version 1.0, was formulated by the National Information Technology Standardization Technical Committee on December 1, 1995. GBK extends the byte range of GB2312-80 while still using two bytes per Chinese character, mainly by using the AA-AF and F8-FE regions that GB2312 leaves unused. After the extension, 21,886 characters can be represented. The additions are mainly characters missing from GB2312, such as rare characters used in Chinese surnames and traditional Chinese characters.

GB2312 and GBK are not the focus of this article, and neither is GB 18030, a variable-length multi-byte encoding that is compatible with both, so they are not discussed in depth here. Interested readers can consult the relevant references.

The Unicode character set

Because each country defined its own character set and encoding, exchanging text internationally required frequent conversion between encodings. Otherwise, a document from country A would turn into garbled text when it was sent to country B and decoded with country B's rules. There are two solutions:

  1. Every internationalized program or system installs many character sets and encoding rules to handle decoding and conversion between languages.
  2. All the world's written characters are put into a single character set that uses the same encoding and decoding method.

Obviously, the first approach is a real pain. Around 1991, the International Organization for Standardization (ISO) and the Unicode Consortium were separately working on ISO/IEC 10646 (UCS) and Unicode, respectively. Both aimed to unify the world's characters in one standard character set. To avoid diverging encodings, they decided to align their work: the two projects remain independent but are compatible with each other. The name Unicode is the more widely used of the two because it is easier to remember.

Unicode organizes and encodes most of the world's writing systems, which makes it easier for computers to display and process text. Unicode is still evolving: version 13.0.0, released in March 2020 and the latest at the time of writing, contains more than 130,000 characters. The standard covers visual glyphs, encoding methods, standard character encodings, and character properties such as upper and lower case.

Code points

In Unicode, each character is assigned a unique value called a code point. A code point is written as U+[XX]XXXX, where each X is a hexadecimal digit, and ranges from U+0000 to U+10FFFF. For example, the code point of the character 'a' is U+0061 and that of 'A' is U+0041.

'哈'.charCodeAt(0) // 21704 => 0x54C8
'\u54c8' // => '哈'

By code point, the Unicode character set is divided into 17 groups called planes, each containing 65,536 (0x10000 = 16^4) code points:

| Plane | Code point range | Name | Abbreviation |
|---|---|---|---|
| 0 | U+0000 – U+FFFF | Basic Multilingual Plane | BMP |
| 1 | U+10000 – U+1FFFF | Supplementary Multilingual Plane | SMP |
| 2 | U+20000 – U+2FFFF | Supplementary Ideographic Plane | SIP |
| 3 | U+30000 – U+3FFFF | Tertiary Ideographic Plane | TIP |
| 4 – 13 | U+40000 – U+DFFFF | (not in use) | |
| 14 | U+E0000 – U+EFFFF | Supplementary Special-purpose Plane | SSP |
| 15 | U+F0000 – U+FFFFF | Private Use Area-A | PUA-A |
| 16 | U+100000 – U+10FFFF | Private Use Area-B | PUA-B |
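Because every plane holds 0x10000 code points, the plane number is simply the code point divided by 0x10000. A minimal sketch, assuming a hypothetical helper named planeOf:

```javascript
// Return the plane number (0 to 16) of a Unicode code point (hypothetical helper).
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('code point must be an integer in U+0000..U+10FFFF');
  }
  // Shifting right by 16 bits is integer division by 0x10000.
  return codePoint >> 16;
}

console.log(planeOf(0x54C8));   // 0 (Basic Multilingual Plane)
console.log(planeOf(0x1F602));  // 1 (Supplementary Multilingual Plane)
console.log(planeOf(0x2F800));  // 2 (Supplementary Ideographic Plane)
console.log(planeOf(0x10FFFF)); // 16 (Private Use Area-B)
```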

The Unicode character set defines the mapping between characters and code points, but it does not specify how code points are stored. If code points were stored directly, every character would need at least 3 bytes (enough for the largest code point, U+10FFFF), yet most commonly used characters lie in plane 0 (U+0000 ~ U+FFFF), which needs only 2 bytes. Each such character would waste at least one byte of storage. To balance storage space, compatibility, and ease of decoding for characters in all planes, the Unicode standard defines UTF-8, UTF-16, and UTF-32 (UTF stands for Unicode Transformation Format).

UTF-8 encoding

The encoding rules are as follows:

  1. For a single-byte character, the highest bit of the byte is 0 and the remaining 7 bits hold the character's Unicode code point. The result is identical to the ASCII code.
  2. For an n-byte character (n > 1), the first n bits of the first byte are 1 and bit n + 1 is 0; every following byte starts with 10. The remaining bits (marked x in the table below) hold the character's Unicode code point.

| Unicode code point range (hexadecimal) | UTF-8 encoding (binary) | Number of bytes |
|---|---|---|
| U+0000 ~ U+007F (0~127) | 0xxx xxxx | 1 |
| U+0080 ~ U+07FF (128~2047) | 110x xxxx 10xx xxxx | 2 |
| U+0800 ~ U+FFFF (2048~65535) | 1110 xxxx 10xx xxxx 10xx xxxx | 3 |
| U+10000 ~ U+10FFFF (65536~1114111) | 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx | 4 |

Using this table, encoding a code point is straightforward. Take the character ⻥ (the fish radical, code point U+2EE5) as an example. The code point falls in the range U+0800 ~ U+FFFF, so it takes three bytes. Write 2EE5 in binary, 0010 1110 1110 0101, and fill these bits into the x positions of 1110 xxxx 10xx xxxx 10xx xxxx from left to right (padding with leading zeros if there are fewer bits than x positions). The result is 1110 0010 1011 1011 1010 0101, which is E2 BB A5 in hexadecimal.
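The same bit-filling procedure can be sketched in JavaScript. This is an illustration rather than a production encoder: the names utf8Encode and toHex are hypothetical, and the sketch rejects the reserved range D800 ~ DFFF, which is not assigned to any character.

```javascript
// Encode one Unicode code point as an array of UTF-8 bytes (hypothetical helper).
function utf8Encode(cp) {
  if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    throw new RangeError('not an encodable code point');
  }
  if (cp <= 0x7F) return [cp]; // 0xxx xxxx
  if (cp <= 0x7FF) {
    // 110x xxxx 10xx xxxx
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  }
  if (cp <= 0xFFFF) {
    // 1110 xxxx 10xx xxxx 10xx xxxx
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  // 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

// Format bytes as upper-case hexadecimal for display (hypothetical helper).
function toHex(bytes) {
  return bytes.map(b => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');
}

console.log(toHex(utf8Encode(0x2EE5))); // E2 BB A5
```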

Decoding is just as simple. If the highest bit of a byte is 0, the byte is a complete character on its own and its low 7 bits are the Unicode code point. If the highest bit of the first byte is 1, the number of consecutive leading 1s (at most 4) is the number of bytes the character occupies; extract the payload bits (x) from those bytes and concatenate them to get the code point. For example, the result from above, E2 BB A5 => 1110 0010 1011 1011 1010 0101, starts with three 1s and so occupies three bytes; its payload bits 0010 1110 1110 0101 => 2EE5 are the code point.
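The decoding steps can be sketched as follows. The function name utf8DecodeOne is hypothetical, and for brevity the sketch does not reject overlong encodings or values above U+10FFFF, which a strict decoder would have to do.

```javascript
// Decode the first character of a UTF-8 byte array (hypothetical helper).
// Returns the code point and the number of bytes consumed.
function utf8DecodeOne(bytes) {
  const first = bytes[0];
  let length;
  let cp;
  if ((first & 0x80) === 0x00) { length = 1; cp = first; }             // 0xxx xxxx
  else if ((first & 0xE0) === 0xC0) { length = 2; cp = first & 0x1F; } // 110x xxxx
  else if ((first & 0xF0) === 0xE0) { length = 3; cp = first & 0x0F; } // 1110 xxxx
  else if ((first & 0xF8) === 0xF0) { length = 4; cp = first & 0x07; } // 1111 0xxx
  else throw new RangeError('invalid lead byte');

  if (bytes.length < length) throw new RangeError('truncated sequence');

  for (let i = 1; i < length; i++) {
    // Every following byte must look like 10xx xxxx.
    if ((bytes[i] & 0xC0) !== 0x80) throw new RangeError('invalid continuation byte');
    // Append its 6 payload bits to the code point.
    cp = (cp << 6) | (bytes[i] & 0x3F);
  }
  return { codePoint: cp, length };
}

const decoded = utf8DecodeOne([0xE2, 0xBB, 0xA5]);
console.log(decoded.codePoint.toString(16), decoded.length); // 2ee5 3
```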

In a single-byte UTF-8 character, every bit except the highest carries data, so the space utilization is 7/8 (87.5%). In a two-byte character, five bits (110 and 10) are markers, so the utilization is 11/16 (68.75%). By the same reasoning, a three-byte character has a utilization of 16/24 (66.67%) and a four-byte character 21/32 (65.63%). In addition, the two-byte range starts at U+0080, which is encoded as 1100 0010 1000 0000, so the two-byte values from 1100 0000 1000 0000 up to but not including 1100 0010 1000 0000 (whose payload would be smaller than the starting value) are never used and are wasted.

UTF-16 encoding

Unlike UTF-8, UTF-16 uses 2 or 4 bytes for encoding.

UTF-16 originated from UCS-2 (2-byte Universal Character Set, that is, a Universal Character Set coded in 2 octets), which was designed as a fixed 16-bit (two-byte) encoding covering U+0000 ~ U+FFFF (0~65535). Later, when Unicode and UCS were aligned, coverage had to be extended to U+10000 ~ U+10FFFF, so a new mechanism, the surrogate mechanism, was introduced in a way that stays compatible with the older encoding.

Code points in the range U+0000 ~ U+FFFF form the Basic Multilingual Plane (BMP), and the extended code points in the range U+10000 ~ U+10FFFF form the supplementary planes. Within the BMP, the 2,048 code points D800 ~ DFFF are reserved as the surrogate area: they are not assigned to any character, which keeps encoding and decoding compatible with the older scheme. The surrogate area is divided into the high surrogate area (U+D800 ~ U+DBFF) and the low surrogate area (U+DC00 ~ U+DFFF).

The encoding rules are as follows:

  1. If the code point is in the range U+0000 ~ U+FFFF, its binary value is stored directly in two bytes, padded with zeros on the left if necessary.
  2. If the code point is in the range U+10000 ~ U+10FFFF, first subtract 0x10000. The result (0x00000 ~ 0xFFFFF) fits in 20 bits (5 hexadecimal digits, 5×4 bits); write it as yyyy yyyy yyxx xxxx xxxx, padding with zeros on the left if necessary. Add the high 10 bits y (yy yyyy yyyy) to 1101 1000 0000 0000 (D800) to form 1101 10yy yyyy yyyy, the high surrogate. Add the low 10 bits x (xx xxxx xxxx) to 1101 1100 0000 0000 (DC00) to form 1101 11xx xxxx xxxx, the low surrogate. The encoded result is the high surrogate followed by the low surrogate (four bytes in total).

| Unicode code point range (hexadecimal) | UTF-16 encoding (binary) | Number of bytes |
|---|---|---|
| U+0000 ~ U+FFFF (0~65535) | xxxx xxxx xxxx xxxx | 2 |
| U+10000 ~ U+10FFFF (65536~1114111) | 1101 10yy yyyy yyyy 1101 11xx xxxx xxxx | 4 |

High surrogate = ((code point - 0x10000) >>> 10) + 0xD800 // shift right by 10 bits to keep the high 10 bits

Or: high surrogate = floor((code point - 0x10000) ÷ 0x400) + 0xD800 // shifting right by 10 bits is integer division by 100 0000 0000 => 0x400

Low surrogate = ((code point - 0x10000) & 0x3FF) + 0xDC00 // 0x3FF has the high 10 bits set to 0 and the low 10 bits set to 1, so AND keeps the low 10 bits

Or: low surrogate = (code point - 0x10000) % 0x400 + 0xDC00
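The formulas above can be sketched in JavaScript as follows. The function names toSurrogatePair and utf16Encode are hypothetical, and the sketch returns 16-bit units as numbers rather than bytes, so byte order is left out of the picture.

```javascript
// Split a supplementary-plane code point into [high, low] surrogates (hypothetical helper).
function toSurrogatePair(cp) {
  if (cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError('expected a code point in U+10000..U+10FFFF');
  }
  const offset = cp - 0x10000;                 // 20-bit value
  const high = (offset >>> 10) + 0xD800;       // high 10 bits
  const low = (offset & 0x3FF) + 0xDC00;       // low 10 bits
  return [high, low];
}

// Encode any code point as one or two 16-bit units (hypothetical helper).
function utf16Encode(cp) {
  return cp <= 0xFFFF ? [cp] : toSurrogatePair(cp);
}

// U+1F602 is the emoji that appears in the JSON example later in this article.
console.log(utf16Encode(0x1F602).map(u => u.toString(16))); // [ 'd83d', 'de02' ]
console.log(utf16Encode(0x54C8).map(u => u.toString(16)));  // [ '54c8' ]
```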

Take the character U+2F800 as an example (note that this is a CJK compatibility ideograph, not the common Chinese character 丽, U+4E3D):

  1. 0x2F800 - 0x10000 = 0x1F800 = 0001 1111 1000 0000 0000
  2. High surrogate = ((0x2F800 - 0x10000) >>> 10) + 0xD800 = 0x7E + 0xD800 = 1101 1000 0111 1110 = 0xD87E
  3. Low surrogate = (0x2F800 - 0x10000) % 0x400 + 0xDC00 = 0x000 + 0xDC00 = 1101 1100 0000 0000 = 0xDC00

So the encoding result is 0xD87E 0xDC00. The decoding process is as follows:

  1. If the current two-byte unit is not in D800 ~ DBFF (the high surrogate range), it is decoded directly as the code point.
  2. If the current two-byte unit is in D800 ~ DBFF, it is decoded together with the next two-byte unit. Subtract 0xD800 from the current unit (1101 10yy yyyy yyyy) to get 0000 00yy yyyy yyyy and shift it left by 10 bits (yyyy yyyy yy00 0000 0000). Subtract 0xDC00 from the next unit (1101 11xx xxxx xxxx) to get 0000 00xx xxxx xxxx. Add the two to get yyyy yyyy yyxx xxxx xxxx, then add 0x10000 to obtain the code point.

For example, for 0xD87E 0xDC00:

  1. The first unit, D87E, is in the range D800 ~ DBFF. Subtracting 0xD800 gives 0000 0000 0111 1110; shifting left by 10 bits gives 0001 1111 1000 0000 0000.
  2. The second unit, DC00, minus 0xDC00 is 0000 0000 0000 0000.
  3. The sum of the two is 0001 1111 1000 0000 0000, that is 0x1F800; adding 0x10000 gives 0x2F800.
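The decoding steps can be sketched in the same way. The function name utf16DecodeOne is hypothetical; as an assumption of this sketch, a high surrogate that is not followed by a low surrogate is returned unchanged instead of raising an error.

```javascript
// Decode the character starting at units[index] (hypothetical helper).
// `units` is an array of 16-bit values; returns the code point and units consumed.
function utf16DecodeOne(units, index = 0) {
  const first = units[index];
  if (first >= 0xD800 && first <= 0xDBFF) {        // high surrogate
    const second = units[index + 1];
    if (second >= 0xDC00 && second <= 0xDFFF) {    // followed by a low surrogate
      const codePoint = ((first - 0xD800) << 10) + (second - 0xDC00) + 0x10000;
      return { codePoint, length: 2 };
    }
  }
  // BMP character (or an unpaired surrogate, passed through as-is).
  return { codePoint: first, length: 1 };
}

const result = utf16DecodeOne([0xD87E, 0xDC00]);
console.log(result.codePoint.toString(16), result.length); // 2f800 2
```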

UTF-16 relies on the 2,048 reserved code points D800 ~ DFFF. The high surrogate area (U+D800 ~ U+DBFF) can carry 10 bits (2^10 values), and so can the low surrogate area (U+DC00 ~ U+DFFF). After subtracting 0x10000, the code points U+10000 ~ U+10FFFF fit in exactly 20 bits, so splitting them into the high 10 bits and the low 10 bits for the high and low surrogates solves the problem neatly.

UTF-32 encoding

UTF-32 is very simple: every character takes a fixed four bytes, padded with zeros on the left. The largest Unicode code point, U+10FFFF, needs only three bytes, so it is unclear why UTF-32 uses four bytes rather than a three-byte "UTF-24". UTF-8 and UTF-16 need bit manipulation to encode and decode but use storage efficiently; UTF-32 sacrifices space but makes encoding and decoding simpler and faster.
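A fixed-width encoding needs almost no logic, as this minimal sketch suggests. The names utf32Encode and utf32CodePointAt are hypothetical, and big-endian byte order is an assumption made here; the article does not discuss byte order.

```javascript
// Encode an array of code points as UTF-32 bytes, big-endian (assumed byte order).
function utf32Encode(codePoints) {
  const buffer = new ArrayBuffer(codePoints.length * 4);
  const view = new DataView(buffer);
  codePoints.forEach((cp, i) => view.setUint32(i * 4, cp, false));
  return new Uint8Array(buffer);
}

// Read the n-th character (zero-based) directly at byte offset 4 * n.
function utf32CodePointAt(bytes, n) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return view.getUint32(n * 4, false);
}

const bytes = utf32Encode([0x41, 0x2F800]);
console.log(bytes.length);                              // 8
console.log(utf32CodePointAt(bytes, 1).toString(16));   // 2f800
```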

Conclusion

Both UTF-8 and UTF-16 are variable-length encodings optimized for storage space. The difference is that UTF-16 is easier to index, while UTF-8 is weaker in this respect. In UTF-8 a character occupies anywhere from one to four bytes, so for an arbitrary byte such as 10xx xxxx you have to scan the neighboring bytes to find where its character starts, and finding the Nth character means parsing from the beginning. In UTF-16, most characters are a single two-byte unit, and for any unit one range check is enough: if it is in U+D800 ~ U+DBFF (high surrogate), it forms a character with the next unit; if it is in U+DC00 ~ U+DFFF (low surrogate), it forms a character with the previous unit; otherwise it is a complete character.

UTF-32 wastes a lot of storage, but indexing is fast (every character is four bytes, so the Nth character, counting from zero, starts at byte offset 4 × N) and so is computing the character count (byte length / 4).
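To make the storage trade-off concrete, the sketch below compares how many bytes the same string needs in each encoding. The helper name encodedSizes is hypothetical; it relies on the standard TextEncoder API (available in browsers and recent Node.js versions) for UTF-8 and derives the other two sizes from the string itself.

```javascript
// Return the byte size of a string in UTF-8, UTF-16 and UTF-32 (hypothetical helper).
function encodedSizes(str) {
  return {
    utf8: new TextEncoder().encode(str).length, // TextEncoder always produces UTF-8
    utf16: str.length * 2,                      // length counts 16-bit units
    utf32: Array.from(str).length * 4,          // Array.from splits by code point
  };
}

console.log(encodedSizes('A'));         // utf8: 1, utf16: 2, utf32: 4
console.log(encodedSizes('\u54c8'));    // utf8: 3, utf16: 2, utf32: 4
console.log(encodedSizes('\u{1F602}')); // utf8: 4, utf16: 4, utf32: 4
```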

JavaScript encoding

The JavaScript language uses the Unicode character set, but for historical reasons its encoding was originally neither UTF-8, UTF-16, nor UTF-32, but UCS-2. JavaScript appeared before UTF-16 was published. UCS (the ISO/IEC 10646 project mentioned above) moved faster than Unicode and had already published its first encoding, UCS-2 (UTF-16 followed six years later), which uses 2 bytes for each character that had a code point (only the basic plane existed at the time, so two bytes were enough). When UTF-16 was published, it was defined as a superset of UCS-2, so today only UTF-16 remains in use and UCS-2 is obsolete.

Because JavaScript strings are handled as two-byte (UCS-2) units, a supplementary-plane character (four bytes) is treated as two characters. String functions and properties are subject to this limitation and sometimes return surprising results. Take the character U+2F800 mentioned above as an example:

// The character U+2F800, written as a code point escape (ES6 syntax, covered in the ES6 section).
// Copying the literal character may turn it into the common Chinese character 丽 (U+4E3D) in some editors; it can be copied from https://unicode-table.com/cn/2F800/
var str = '\u{2F800}'
str.length // 2
str.charAt(0) // '\uD87E'
str.charAt(1) // '\uDC00'
str.substr(0, 1) // '\uD87E'
str.substring(0, 1) // '\uD87E'
str.slice(-1) // '\uDC00'
str.split('') // ['\uD87E', '\uDC00']
str === '\u2F800' // false ('\u2F800' is '\u2F80' followed by '0')
str === '\uD87E\uDC00' // true
str.charCodeAt(0) // 55422 (0xD87E)
str.replace('\uD87E', '0') // '0\uDC00'
str.indexOf('\uDC00') // 1
// Every result above treats the single character as two

// A compatible way to process the string one character at a time:
var index = 0
var len = str.length
while (index < len) {
  var charCode = str.charCodeAt(index)
  if (charCode >= 0xD800 && charCode <= 0xDBFF && index + 1 < len) { // high surrogate
    // Output the surrogate pair as one character and skip both units
    console.log(str.charAt(index) + str.charAt(index + 1))
    index += 2
  } else {
    console.log(str.charAt(index))
    index += 1
  }
}

ES6

ES6 adds a lot of support for supplementary-plane characters, which makes up for the shortcomings of earlier versions.

  1. Unicode representation of characters

    ES6 still allows \uXXXX (where XXXX is a Unicode code point) to represent a character. A supplementary-plane (four-byte) character can be written either as \uYYYY\uXXXX (YYYY is the high surrogate and XXXX the low surrogate) or as \u{XXXXX}.

  2. String traversal interface

    for...of correctly recognizes supplementary-plane (four-byte) characters when iterating over a string. The same iterator can be used to build an array with [...str] or Array.from(str) (both are essentially wrappers around the iterator).

  3. JSON.stringify changes

    According to the standard, JSON data must be UTF-8 encoded. The allowed escape sequences are \", \\, \/, \b, \f, \n, \r, \t, and two-byte Unicode code points (\uXXXX); a supplementary-plane (four-byte) character must be escaped as a UTF-16 surrogate pair.

    { "face": "😂" }
    // or
    { "face": "\uD83D\uDE02" }
  4. The regular expression u flag

    Since strings support the new Unicode notation, regular expressions need matching support:

    // With the u flag, \uD83D\uDC2A is one character, so the lone \uD83D does not match
    /^\uD83D/u.test('\uD83D\uDC2A') // false
    // Without the u flag, \uD83D\uDC2A is treated as two characters, so it matches
    /^\uD83D/.test('\uD83D\uDC2A') // true
  5. String.prototype.includes

    In theory, ES6 should be aware of supplementary-plane (four-byte) characters here too, but unfortunately:

    var str = '\u{2F800}' // the character U+2F800
    // The character is still treated as two units
    str.includes('\uD87E') // true
    // Confusingly, both this and the lone surrogate above are found
    str.includes('\u{2F800}') // true
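The ES6 features listed above can be tried together in a short sketch. The helper name codePointLength is hypothetical, and the character U+2F800 is written as an escape so that the example does not depend on how an editor handles the literal character.

```javascript
// Count characters by code point instead of by 16-bit unit (hypothetical helper).
function codePointLength(str) {
  return Array.from(str).length;
}

const text = 'a\u{2F800}b'; // U+2F800 written as a code point escape

console.log(text.length);                      // 4 (16-bit units)
console.log(codePointLength(text));            // 3 (characters)
console.log(text.codePointAt(1).toString(16)); // 2f800
console.log(String.fromCodePoint(0x2F800) === '\uD87E\uDC00'); // true

// for...of walks the string one code point at a time.
for (const ch of text) {
  console.log(ch.codePointAt(0).toString(16)); // 61, then 2f800, then 62
}

// Without the u flag "." matches one 16-bit unit; with it, one code point.
console.log(/^.$/.test('\u{2F800}'));  // false
console.log(/^.$/u.test('\u{2F800}')); // true
```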

Conclusion

In everyday string handling there are many scenarios, such as input fields and rich text editors, where supplementary-plane (four-byte) characters must be handled, and emoji in particular need care.
