Full series:

  • Character encoding (I: Terms and the origin of character encoding)
  • Character encoding (II: Simplified Chinese character encoding and ANSI encoding)
  • Character encoding (III: The Unicode encoding system and byte order)
  • Character encoding (IV: UTF series encoding details)
  • Character encoding (V: Network transmission encodings: Base64 and percent-encoding)

1.3 Evolution of simplified Chinese character coding

Development overview

English letters plus digits and punctuation number fewer than 256 characters, so a single byte (2^8 = 256 values) is enough to represent each one. Other scripts have far more characters: Chinese, for example, has more than 100,000. One byte can distinguish only 256 values, which is clearly not enough, so multiple bytes must be used to represent one character.

So when computers were introduced to China, the authorities devised a series of encodings called GB ("GB" abbreviates "Guobiao", the pinyin for "national standard", i.e. the national standards of the People's Republic of China).

PS: Within the GB series, only GB 2312 conforms to ISO 2022.

Under the GB series encoding schemes, a byte in the range 0 to 127 has the same meaning as in ASCII; any other byte combines with the following byte to form a Chinese character (or another character defined by the GB encoding).

Therefore, the GB series encodings are fully and directly backward compatible with ASCII. That is, if every character in a GB-encoded text is defined in ASCII (the text consists entirely of ASCII characters), the GB-encoded bytes are exactly the same as the ASCII-encoded bytes.
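This compatibility can be checked with a minimal sketch. It assumes that Python's built-in gb2312, gbk and gb18030 codecs behave like the schemes described here; the sample strings are arbitrary.

```python
# ASCII-only text encodes to identical bytes under ASCII and the GB codecs.
text = "Hello, GB!"
ascii_bytes = text.encode("ascii")
same = {codec: text.encode(codec) == ascii_bytes
        for codec in ("gb2312", "gbk", "gb18030")}
print(same)

# Mixed text: the ASCII letter stays one byte, the Chinese character takes two.
mixed = "A万".encode("gb2312")
print(mixed.hex(" "))  # 41 cd f2
```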

The earliest GB encoding is GB 2312, which contains fewer than 10,000 Chinese characters. This basically meets everyday needs but omits some rare characters, so later standards extended GB 2312.

The extension of GB 2312 is called GBK (K is the pinyin initial of "Kuozhan", meaning "extension"). GBK was in turn extended to GB 18030, which adds characters of some Chinese ethnic minorities and encodes some rare characters in 4 bytes.

In the GB series (GB 2312, GBK and GB 18030; GB 13000 is excluded here and in what follows, and is described separately below), each extension fully retains the encoding of the previous version, so each new version is backward compatible.

It is important to point out that although the GB encodings use multiple bytes to represent a character, they have nothing to do with the character encoding forms (CEFs) of Unicode such as UTF-8, UTF-16 and UTF-32 (UTF-8 also encodes ASCII characters in one byte and non-ASCII characters in multiple bytes).

However, because multiple bytes are needed to represent a character, the GB encodings, like the Unicode encodings described later, are inevitably more complex (in both time and space) than single-byte ASCII.

For example, when multi-byte characters are mixed with the original ASCII characters, there are two options:

  • Either re-encode the original ASCII characters as multi-byte sequences so that they are uniform with the other multi-byte characters (as UTF-16 and UTF-32 do);
  • Or keep each ASCII character as a single byte, and set the highest bit (the first bit) of the bytes of multi-byte characters to 1 so that they cannot collide with ASCII bytes, whose highest bit is 0 (this is the approach used by GB, UTF-8 and others).

The former costs more space, because ASCII characters that used to take one byte now take several. The latter costs more time, because it needs a more complex encoding algorithm to avoid collisions (and to serve other goals such as extensibility and fault tolerance), which takes more computation.
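The space side of this trade-off can be illustrated with a small sketch. The helper name encoded_sizes is hypothetical; it simply measures encoded lengths with Python's built-in codecs (the "-le" variants are used so that no byte order mark is added).

```python
def encoded_sizes(text):
    """Return the number of bytes `text` occupies under several encodings."""
    codecs = ("gb2312", "utf-8", "utf-16-le", "utf-32-le")
    return {codec: len(text.encode(codec)) for codec in codecs}

# ASCII text: single-byte schemes stay compact, UTF-16/UTF-32 grow.
print(encoded_sizes("hello"))
# A Chinese character: every scheme needs more than one byte.
print(encoded_sizes("万"))
```

With these inputs, the five ASCII letters take 5 bytes in GB 2312 and UTF-8 but 10 in UTF-16 and 20 in UTF-32, which is the extra space cost of the first approach.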

In addition, in either case, if the multi-byte encoding uses multi-byte code units, the troublesome byte order problem arises for historical reasons. UTF-16 and UTF-32 use multi-byte code units. UTF-8 encodes non-ASCII characters in multiple bytes but uses single-byte code units, and the GB series encodings likewise encode everything except ASCII in multiple bytes while still using single-byte code units. (Encoding algorithms, code units and byte order are explained later.)
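A minimal sketch of the byte order issue, using the character "万" (U+4E07) and Python's built-in codecs. Only UTF-16, whose code unit is two bytes wide, has two possible byte sequences; the byte sequences of GB 2312 and UTF-8 are fixed because their code units are single bytes.

```python
ch = "万"  # U+4E07

# One 16-bit code unit, two possible byte orders.
be = ch.encode("utf-16-be")   # big-endian: high-order byte first
le = ch.encode("utf-16-le")   # little-endian: low-order byte first
print(be.hex(" "), "|", le.hex(" "))

# Single-byte code units: the byte sequence is the same on every machine.
gb = ch.encode("gb2312")
u8 = ch.encode("utf-8")
print(gb.hex(" "), "|", u8.hex(" "))
```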

GB 2312 code

The GB 2312 encoding, officially "The Basic Set of the Chinese Coded Character Set for Information Interchange", is a national standard issued by China's General Administration of Standards in 1980 and in force from May 1, 1981. Its standard number is GB 2312-1980; it is also written GB/T 2312, GB/T 2312-80 or GB/T 2312-1980.

GB 2312 is intended for information exchange among Chinese character processing and communication systems. It is widely used in mainland China, and also in Singapore and elsewhere. Almost all Chinese-language systems and internationalized software in mainland China support GB 2312. (Hong Kong, Macao and Taiwan instead use several traditional-character encoding systems, such as CCCII, CNS 11643 and Big5, which are not covered in detail here.)

To avoid conflict with ASCII (0 to 127), GB 2312 stipulates that every byte of a Chinese character's code (its internal code) must be greater than 127, i.e. its highest bit is 1. Two such bytes together represent one Chinese character (GB 2312 is a double-byte encoding); the first is called the high byte and the second the low byte. A byte less than or equal to 127 (highest bit 0) still represents an ASCII character (ASCII is a single-byte encoding).

Therefore, GB 2312 can be regarded as a Chinese extension of ASCII (it is fully and directly compatible with ASCII), just as EASCII is a European extension of ASCII. Both can be called ASCII supersets or 8-bit versions of ASCII.

However, the byte values 128 to 255 mean different things in GB 2312 and in EASCII. In other words, although both are compatible with ASCII, GB 2312 is not compatible with the extended part of EASCII.

In fact, almost all other popular character encodings in the world are compatible with ASCII (directly or indirectly, as described later), but apart from the ASCII part they are not compatible with one another.

The GB 2312 standard contains 6,763 Chinese characters: 3,755 first-level and 3,008 second-level characters. In addition to Chinese characters, it contains 682 other characters, including Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters.

Besides Chinese characters, GB 2312 even includes digits, punctuation, letters and other characters already available in ASCII, perhaps for reasons of visual appearance. In other words, characters that ASCII encodes in one byte were encoded again in GB 2312 in two bytes.

These 682 double-byte characters are commonly called "full-width" characters, and the corresponding single-byte ASCII characters are called "half-width" characters.
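The two forms of the same letter can be compared in a short sketch. It assumes Python's gb2312 codec and the standard unicodedata module; U+FF21 is the Unicode code point of the double-byte Latin capital A.

```python
import unicodedata

half = "A"        # ordinary ASCII letter
full = "\uff21"   # the double-byte (wide) form of the same letter

half_bytes = half.encode("gb2312")   # one byte, identical to ASCII
full_bytes = full.encode("gb2312")   # two bytes, both above 127
print(half_bytes.hex(" "), "|", full_bytes.hex(" "))

# Unicode records the display width class: "Na" (narrow) versus "F" (fullwidth).
widths = (unicodedata.east_asian_width(half),
          unicodedata.east_asian_width(full))
print(widths)
```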

Full-width and half-width

Full-width characters are a legacy of Chinese display technology and double-byte Chinese encoding.

Because early dot-matrix displays had few pixels, the width used for ASCII Western characters (say, 8 pixels) was too narrow to display a Chinese character legibly (early dot-matrix printers had the same problem). Chinese characters were therefore displayed at twice the width of ASCII characters (say, 16 pixels).

As a result, ASCII Western characters are half as wide as Chinese characters. Perhaps so that Chinese and Western characters would line up neatly in mixed typesetting, Western letters, digits, punctuation and other symbols were also designed in versions that occupy the visual space (mainly the width) of one Chinese character and, like Chinese characters, take 2 bytes of storage. These characters, displayed as wide as Chinese characters, are called full-width characters.

By contrast, the original ASCII Western characters are called half-width characters, because they occupy only half the visual width of a Chinese character and take 1 byte of storage.

Later, some of these full-width characters came into wide use because they proved practical, such as the full-width comma "，", question mark "？" and exclamation mark "！". (In an input method, these marks come out the same in Chinese input mode whether half-width or full-width is selected; in English input mode, the full-width setting gives the same result as Chinese input mode, while the half-width setting gives marks about half as wide.) They are now used specifically in Chinese, Japanese and Korean text and have become the standard punctuation in those languages. Many other full-width characters gradually lost their value (it is now rarely necessary to align Chinese and Western characters in plain text) and are seldom used.

Today the de facto standard for character encoding worldwide is the Unicode character set together with encodings based on it, such as UTF-8 and UTF-16. Unicode absorbed many legacy encodings and kept all of their characters for compatibility. The full-width characters of the Chinese encodings have therefore been retained, and national standards still require fonts and software to support them.

However, in UTF-8, UTF-16 and so on, the relationship between half-width and full-width characters is no longer a simple matter of 1 byte versus 2 bytes. See below for details.

This subsection is compiled, with many modifications, from several answers to the Zhihu question "Why do Chinese input methods distinguish between full-width and half-width?"

GBK code

GB 2312-1980 contains 6,763 Chinese characters, covering 99.75% of the usage frequency in Mainland China and basically meeting the computer processing needs of Chinese characters.

However, GB 2312 cannot handle the rare characters that appear in personal names and classical Chinese. It lacks, for example, some characters simplified only after GB 2312-1980 was published (such as "啰"), some characters used in personal names (such as the "喆" in singer David Tao's name), traditional characters used in Taiwan and Hong Kong, and Japanese and Korean characters.

Therefore, the National Technical Committee on Information Technology Standardization used the code point space left unused by GB 2312-1980 to include all the characters of GB 13000.1-1993. On December 1, 1995 it issued the "Chinese Internal Code Specification", known as GBK (from "Guobiao Kuozhan", "national standard extension"), which is based on GB 13000.1-1993 (described in detail below) and extends GB 2312-1980.

Although GBK contains all the characters of GB 13000.1-1993 and is an extension of GB 2312-1980, its encoding method is not exactly the same as that of GB 2312-1980, and it is completely different from that of GB 13000.1-1993, which follows the international standard ISO/IEC 10646.

Like GB 2312, GBK is a double-byte encoding, but GBK only requires that the first (high) byte be greater than 127 to mark the start of a Chinese character (the highest bit of the first byte must be 1; values 0 to 127 still represent ASCII characters). It no longer requires, as GB 2312 does, that the second (low) byte also be greater than 127 (the highest bit of the low byte may be 0 or 1).
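The relaxed rule for the low byte can be seen in a small sketch. It assumes Python's gbk and gb2312 codecs; the helper name bytes_in_gb2312_range and the choice of U+4E02 as a sample GBK-only character are illustrative.

```python
def bytes_in_gb2312_range(encoded):
    """True if this is a two-byte code with both bytes in 0xA1..0xFE."""
    return len(encoded) == 2 and all(0xA1 <= b <= 0xFE for b in encoded)

wan = "万".encode("gbk")            # a character that GB 2312 also contains
gbk_only = "\u4e02".encode("gbk")   # a character that GBK added

# The GB 2312 character keeps both bytes above 127; the GBK-only one does not.
print(wan.hex(" "), bytes_in_gb2312_range(wan))
print(gbk_only.hex(" "), bytes_in_gb2312_range(gbk_only))

# GB 2312 itself cannot encode the GBK-only character.
try:
    "\u4e02".encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False
print(in_gb2312)
```

Here the second byte of the GBK-only character is 0x40, which is the ASCII code of "@". A program scanning GBK text therefore cannot tell a character boundary from the high bit alone; it must track whether it is on a lead byte.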

Because of this, GBK, although also a double-byte encoding, can hold more characters than GB 2312.

The GBK character set is fully backward compatible with GB 2312 and also supports some simplified characters, traditional characters and Japanese kana not supported by GB 2312-1980 (it does not support Korean characters, which is a shortcoming compared with Unicode in practical use). It contains 21,003 Chinese characters and 883 symbols, provides 1,894 code points for user-defined characters, and combines simplified and traditional characters in one set.

The GBK coding framework (code scheme) is divided into regions: GBK/1 contains the GB 2312 symbols plus other supplementary symbols, GBK/2 contains the GB 2312 Chinese characters, GBK/3 contains CJK Chinese characters, GBK/4 contains CJK Chinese characters and supplementary Chinese characters, GBK/5 contains non-Chinese characters, and UDC is the user-defined character area.

Microsoft adopted GBK in the Simplified Chinese edition of Windows 95 as an extension of its earlier code page 936 (CP936), which had been almost identical to GB 2312-1980. (Code pages are discussed later.)

Microsoft’s CP936 is generally regarded as equivalent to GBK, and the Internet Assigned Numbers Authority (IANA) also uses CP936 as an alias for GBK.

In fact, however, GBK defines 95 more characters than CP936 (15 non-Chinese characters and 80 Chinese characters), none of which were in ISO/IEC 10646 (UCS) or Unicode at the time. (UCS and Unicode are discussed later.)

GB 18030 code

China's State Bureau of Quality and Technical Supervision released GB 18030-2000 on March 17, 2000 to replace GBK. Besides retaining all the Chinese characters encoded in GBK, GB 18030-2000 extends the second byte once more, adding about 100 Chinese characters and a four-byte encoding space.

GB 18030, "Supplement to the Basic Set of the Chinese Coded Character Set for Information Interchange", is the most important Chinese character encoding standard after GB 2312-1980 and GB 13000-1993, and is one of the basic standards that Chinese computer systems must follow.

In 2005, GB 18030-2000 was extended to produce GB 18030-2005, "Information Technology: Chinese Coded Character Set".

As mentioned above, GB 18030-2000 is an upgrade of GBK; its main feature is the addition of the CJK Unified Ideographs Extension A characters. The main feature of GB 18030-2005 is the addition of the CJK Unified Ideographs Extension B characters on top of GB 18030-2000.

Microsoft also defined a dedicated code page for GB 18030, CP54936, but it is not really usable in practice: Windows 7 offers no such option under "Control Panel" – "Region and Language" – "Administrative" – "Language for non-Unicode programs". In the Windows CMD command line you can switch to it with the command chcp 54936, after which CMD can display Chinese but does not support Chinese input.

GB 13000 code

Among the GB encodings, besides GB 2312, GBK and GB 18030, which extend one another step by step and remain backward compatible, there is also GB 13000, which is not compatible with them. (Note that although GBK was created mainly to include all the characters of GB 13000, its encoding is completely different from that of GB 13000. The term "GB series encodings" therefore usually does not include GB 13000.)

To give a unified encoding to all the characters of every country and region, so that computers worldwide can process all characters uniformly, the International Organization for Standardization developed a new coding standard, ISO/IEC 10646 (the Universal Character Set, UCS), which is compatible with the Unicode standard developed by the Unicode Consortium (see below).

The standard was first published in 1993. Only its first part, ISO/IEC 10646.1:1993, was issued at that time; besides other characters from around the world, it included 20,902 Chinese characters from mainland China, Taiwan, Japan and South Korea.

To align with the international standard, China formulated the national standard GB 13000.1-1993, "Information Technology: Universal Multiple-Octet Coded Character Set (UCS), Part 1: Architecture and Basic Multilingual Plane", which corresponds to ISO/IEC 10646.1:1993.

In 2010, China released its replacement, GB 13000-2010, "Information Technology: Universal Multiple-Octet Coded Character Set (UCS)", which is equivalent to the international standard ISO/IEC 10646:2003.

At present, GB 13000, ISO/IEC 10646 and the Unicode standard are basically consistent on the Basic Multilingual Plane (BMP, see below).

Figure: the relationship between the Chinese character encoding schemes. (Big5 is an encoding scheme for traditional Chinese characters, mainly used in Hong Kong, Macao and Taiwan.)

CJK code

CJK refers to the CJK Unified Ideographs (China, Japan, Korea), also known as Unihan. Its aim is to assign a single code point in the Unicode standard and in ISO/IEC 10646 to ideographs from Chinese (including Zhuang), Japanese, Korean and Vietnamese that share the same origin and original meaning and have the same or only slightly different shapes. (Unicode and ISO/IEC 10646 are discussed later.)

CJK is an acronym of Chinese, Japanese and Korean. As the name suggests, it can support all three, but in reality, CJK can support a wide range of Asian double-byte scripts including Chinese (including Zhuang), Japanese, Korean, and Vietnamese.

The "ideographs that share the same origin and original meaning and have the same or only slightly different shapes" are mainly Chinese characters, both traditional and simplified. They also include characters modeled on Chinese characters: Zhuang square characters, Japanese kanji, Korean hanja (한자), and the Vietnamese Chữ Nôm and Chữ Nho. PS: I cannot read the Vietnamese characters myself.

The project originally covered the Chinese characters and Chinese-derived characters used in Chinese, Japanese and Korean, collectively called CJK Unified Ideographs. Vietnamese Chữ Nôm was added later, so the set is also known as CJKV.

Summary of the GB series

The GB character sets are double-byte character sets (DBCS).

Note that "GB character set" here means only the part outside the single-byte ASCII characters, which is the narrow sense. Strictly speaking, in the broad sense a GB character set includes both single-byte ASCII characters and double-byte non-ASCII characters, so it is a mixed single-byte and double-byte character set. Whether the narrow or the broad sense is meant depends on the context.

The main feature of DBCS-based encodings is that two-byte Chinese characters and one-byte English (ASCII) characters are fully compatible and can coexist in the same file.

Consequently, in the era of DBCS-based encodings, a program that supported Chinese had to examine the value of every byte in a string: a value greater than 127 signaled the start of a character from the double-byte character set.

When using a GB encoding, always keep in mind that a Chinese character consists of two bytes (it takes as much storage as two English characters).
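The byte-by-byte check described above can be sketched as follows. The function name split_dbcs is hypothetical, and the sketch assumes a pure one-byte/two-byte encoding such as GB 2312 or GBK; it does not handle the four-byte sequences of GB 18030.

```python
def split_dbcs(data):
    """Split DBCS-encoded bytes into per-character chunks.

    A byte <= 127 is a one-byte ASCII character; any other byte is the
    lead byte of a two-byte character and consumes the following byte.
    """
    chunks = []
    i = 0
    while i < len(data):
        if data[i] <= 0x7F:
            chunks.append(data[i:i + 1])
            i += 1
        else:
            if i + 1 >= len(data):
                raise ValueError("truncated double-byte character")
            chunks.append(data[i:i + 2])
            i += 2
    return chunks

encoded = "A万b".encode("gb2312")
print(split_dbcs(encoded))        # three characters stored in four bytes
print(len(encoded), len(split_dbcs(encoded)))
```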

1.4 Simplified Chinese character coding implementation

How are the GB series encodings (GB 2312, GBK and GB 18030) actually implemented? What is the location code? What is the national standard (GB) code? What do internal code, external code and glyph code mean? How are they converted into one another, and why?

The following uses GB 2312 code as an example.

Location code

The GB 2312 character set is divided into 94 areas ("qu"), each with 94 positions ("wei", sometimes translated as "bit"), and each position holds exactly one character. A character is identified by its area number and position number (this is in effect its code point value, or character number), so the result is called the location code, or quwei code.

In other words, GB 2312 arranges all of its characters, Chinese characters included, in a 94 × 94 two-dimensional table in which each row is an "area" and each column is a "position". Every character is uniquely located by its area and position, and the two numbers combined form its location code.

For example, the character "万" (wan, "ten thousand") sits at position 82 of area 45, so its location code is 45 82. (Because GB codes are double-byte codes, 45 corresponds to the high byte and 82 to the low byte.)
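The location code of a character can be recovered with a short sketch. It relies on Python's gb2312 codec, which yields bytes that are offset by 0xA0 (160) from the area and position numbers; that offset is derived later in this article. The function name location_code is hypothetical.

```python
def location_code(ch):
    """Return (area, position) of a single GB 2312 double-byte character."""
    encoded = ch.encode("gb2312")
    if len(encoded) != 2:
        raise ValueError("not a double-byte GB 2312 character")
    high, low = encoded
    # Each byte is the area or position number plus 0xA0.
    return high - 0xA0, low - 0xA0

print(location_code("万"))  # (45, 82)
```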

PS: In a multi-byte code, the first byte is the high byte and the following byte is the low byte.

This layout comes from ISO 2022, which was designed for compatibility with the 7-bit communication protocols and equipment of the time. In a 7-bit code space, 0x00 to 0x1F are reserved for control characters and 0x20 to 0x7F are used for printing ("graphic") characters. A 7-bit code space therefore holds 94 graphic characters (when the space occupies 0x20 and DEL occupies 0x7F) or 96. A double-byte 7-bit code space can hold 94 × 94 = 8,836 graphic characters, and a three-byte one 94 × 94 × 94 = 830,584 (although no three-byte character set has been registered with ISO). From the 1970s through the 1980s, the number of Chinese characters in the Chinese, Japanese and Korean character sets was basically within this range. The code point of a double-byte character is called kuten ("area-point") in Japanese and quwei ("area-position") in Chinese, which is why GB 2312 and the related national character set standards use the "location code".

In the GB 2312 character set:

  • Areas 01 to 09 (682 characters): special symbols, digits, English letters, box-drawing characters and so on, including Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters; 682 full-width characters in all;

  • Areas 10 to 15: empty, reserved for extension;

  • Areas 16 to 55 (3,755 characters): common Chinese characters (first-level Chinese characters), sorted by pinyin;

  • Areas 56 to 87 (3,008 characters): less common Chinese characters (second-level Chinese characters), sorted by radical and stroke count;

  • Areas 88 to 94: empty, reserved for extension.
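The layout above can be expressed as a small lookup, shown here as a sketch. The function name classify_area and the label strings are illustrative and not part of the standard.

```python
def classify_area(area):
    """Map a GB 2312 area number (1..94) to the kind of content it holds."""
    if not 1 <= area <= 94:
        raise ValueError("GB 2312 areas are numbered 1 to 94")
    if area <= 9:
        return "symbols and non-Chinese letters"
    if area <= 15:
        return "reserved"
    if area <= 55:
        return "first-level Chinese characters"
    if area <= 87:
        return "second-level Chinese characters"
    return "reserved"

print(classify_area(45))   # the area that holds "万"
# Areas 56 to 87 are completely full: 32 areas x 94 positions = 3,008 characters.
print((87 - 56 + 1) * 94)
```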

National standard code (exchange code)

To avoid the non-displayable ASCII control characters 0000 0000 to 0001 1111 (hexadecimal 00 to 1F, decimal 0 to 31) and the space character 0010 0000 (hexadecimal 20, decimal 32), the national standard code (also called the exchange code) places Chinese characters in the range (0010 0001, 0010 0001) to (0111 1110, 0111 1110), which is (21, 21) to (7E, 7E) in hexadecimal and (33, 33) to (126, 126) in decimal. (Why these characters must be avoided, and why only ASCII 0 to 32, is explained later.)

Therefore, 32 (hexadecimal 20, written 20H or 0x20; the suffix H and the prefix 0x both mark a hexadecimal number) is added to the area number and to the position number to obtain the national standard code. That is, the national standard code is the location code shifted up by 32 so that it does not collide with the control characters and the space character at 0 to 32 in ASCII.

GB 2312 is a double-byte character set (DBCS), so the national standard code is a double-byte code.

Thus the national standard code of "万" is (45 + 32, 82 + 32) = (77, 114), which is (4D, 72) in hexadecimal and (0100 1101, 0111 0010) in binary.

Internal code

However, the national standard code cannot be used directly inside a computer, because it would collide with the long-established ASCII code and produce garbled text.

For example, the high byte 77 of the national standard code of "万" collides with ASCII "M", and the low byte 114 collides with ASCII "r". To avoid such collisions, the highest bit of each byte of the national standard code is changed from 0 to 1, which is equivalent to adding 128 (hexadecimal 80, i.e. 80H; binary 1000 0000) to each byte. The result is the machine-internal representation of the national standard code, called the "internal code" for short, and it is compatible with basic ASCII.

Since ASCII uses only the low seven bits of a byte, the highest bit (the first bit) can serve as a flag: when the computer processes a byte whose highest bit is 1, it treats it as part of a Chinese character code; when the highest bit is 0, it treats the byte as an ASCII character.

For example:

77 + 128 = 205 (binary 1100 1101, hexadecimal CD);

114 + 128 = 242 (binary 1111 0010, hexadecimal F2).

We can verify this. Open Notepad, type the character "万", set the encoding to ANSI (on a Simplified Chinese system, the default ANSI encoding of Windows Notepad is a GB encoding, as explained later) and save the file, as shown in the following figure.

Then open the saved file in a binary editor (such as UltraEdit) and switch to hexadecimal mode. You will see CD F2, which is the internal code of "万", as shown in the figure below.

Converting between the three codes

Start from the location code (defined by the national standard); add 32 (20H) to the area number and the position number to get the national standard code; then add 128 (80H) to each byte to get the internal code, which no longer collides with ASCII.

Therefore, adding 160 (A0H, since 32 + 128 = 160) to the area number and the position number of the location code gives the internal code directly. In hexadecimal:

Location code (area, position) + (20H, 20H) = national standard code
National standard code + (80H, 80H) = internal code (high byte, low byte)
Location code (area, position) + (A0H, A0H) = internal code (high byte, low byte)
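The conversions can be traced step by step in a minimal sketch. The function names to_national and to_internal are hypothetical, and Python's gb2312 codec is used only to confirm that the resulting bytes decode to the expected character.

```python
def to_national(area, position):
    """Location code -> national standard code: add 20H to each number."""
    return area + 0x20, position + 0x20

def to_internal(high, low):
    """National standard code -> internal code: add 80H to each byte."""
    return high + 0x80, low + 0x80

location = (45, 82)                  # location code of "万"
national = to_national(*location)    # (0x4D, 0x72)
internal = to_internal(*national)    # (0xCD, 0xF2)

print([hex(b) for b in national], [hex(b) for b in internal])
# The shortcut: adding A0H to the location code gives the same result.
print(internal == (location[0] + 0xA0, location[1] + 0xA0))
# The internal code is what the gb2312 codec reads and writes.
print(bytes(internal).decode("gb2312"))
```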

Note: Hexadecimal numbers can be represented by either the suffix H or the prefix 0x.

Why add 20H and 80H?

Converting between the location code, the national standard code and the internal code is very simple (the three can be understood as differing only by offsets relative to ASCII), but the puzzling question is why the conversions are needed at all.

First, note that although GB 2312 is an encoding for Chinese characters, it also encodes the 26 English letters and some special symbols. It would seem more reasonable to use the ASCII codes directly rather than re-encode the characters that overlap with ASCII (33 to 126).

Adding 20H

When GB 2312 was drawn up, it was decided to re-encode the printable ASCII characters, namely the English letters, digits and symbols (33 to 126; 127 is the non-printable DEL), in GB 2312 using two bytes. These are the full-width characters (they are twice as wide as ASCII characters on screen, and the corresponding ASCII characters later came to be called half-width characters for this reason).

The 33 non-printing characters of ASCII, namely the 32 control characters (ASCII codes 0 to 31) and the space character (ASCII code 32), keep their ASCII encoding and are not re-encoded.

Because these 33 characters are kept, the location code cannot serve directly as the internal code that the computer processes; it must be shifted up by 32 to avoid collisions. (Why 32 and not 33? Because area and position numbers count from 1, whereas ASCII codes count from 0.)

Decimal 32 is 20H in hexadecimal, which is why 20H is added to both the area number and the position number of the location code to obtain the national standard code.

Adding 80H

However, if the national standard code were used directly as the computer's internal code, it would still collide with ASCII and produce garbled text.

Compared with the location code, the national standard code avoids the first 33 non-printing characters of ASCII (0 to 32), but it does not avoid the printable letters, digits and symbols (33 to 126, 94 characters in all) or the non-printing DEL (127). That is, the national standard code is not fully ASCII compatible.

PS: As mentioned above, GB 2312 re-encodes the characters 33 to 126 as new double-byte (full-width) characters, whereas the originals in basic ASCII are half-width characters; the two are not compatible.

To avoid collisions with ASCII completely, and considering that ASCII uses only the low 7 bits of a byte with the highest bit (the first bit) always 0, it was decided to set the highest bit of each byte of the national standard code to 1 (each byte of the national standard code, like an ASCII code, actually uses only the low 7 bits). The result is the GB 2312 machine-internal code (the internal code), or simply the GB 2312 code.

This completely separates ASCII from GB 2312, and it is why (80H, 80H) is added to the national standard code to obtain the internal code.

Unresolved questions

If the only goal is to avoid collisions with ASCII, why not simply change the highest bit of the area number and the position number from 0 to 1 (equivalent to adding 128 to each) from the start, skipping the seemingly superfluous intermediate step of the national standard code? Then there would be no need to shift by 32 and waste that code space.

I am also puzzled by this. I searched online for a long time without finding an answer, so the actual reason is unknown to me. Perhaps the original design was not fully thought through? Perhaps some space was reserved for future extension? Or perhaps there is another reason? If you know, please enlighten me.

As far as I know, the internal code probably avoids the ASCII control characters for the sake of fault tolerance. When text is stored or transmitted, a bit is occasionally flipped (0 becomes 1 or 1 becomes 0). If the error hits one of the low seven bits of a byte, the Chinese character simply turns into a different one. If it hits the highest bit, the damaged byte turns into some other displayable character. Had 20H not been added in the design, this second kind of error could turn the byte into a control character, which might cause much bigger trouble on devices that act on control characters. On teleprinters and telegraph equipment, for example, turning one character into another character or a letter does little harm, but turning it into a control character such as backspace or tab garbles the layout and the message and can even cause the device to malfunction.
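This conjecture can at least be checked arithmetically. The sketch below flips the highest bit of every possible internal-code byte and compares the result with a hypothetical design that adds only 80H to the location code; that alternative design is an assumption made for illustration and was never an actual standard.

```python
def flip_high_bit(byte):
    """Simulate a single-bit transmission error in the highest bit."""
    return byte ^ 0x80

# Actual design: internal bytes are location + A0H, i.e. 0xA1..0xFE.
actual = [flip_high_bit(b) for b in range(0xA1, 0xFF)]
# Hypothetical design without the 20H shift: location + 80H, i.e. 0x81..0xDE.
hypothetical = [flip_high_bit(b) for b in range(0x81, 0xDF)]

# ASCII control characters are the values below 0x20.
actual_hits_control = any(b < 0x20 for b in actual)
hypothetical_hits_control = any(b < 0x20 for b in hypothetical)

print(hex(min(actual)), hex(max(actual)), actual_hits_control)
print(hex(min(hypothetical)), hex(max(hypothetical)), hypothetical_hits_control)
```

In the actual design a damaged byte always lands in 0x21 to 0x7E, the printable ASCII range; in the hypothetical design it could land on a control character.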

GB 2312 mapping table of location codes, national standard codes and internal codes (covering the 6,763 Chinese characters, internal codes B0A1 to F7FE)

External code (input code)

The external code, also called the input code, is the sequence of keyboard symbols used to enter a Chinese character into the computer.

English has only 26 letters, so every character fits on the keyboard (which is why Western European and American encoding standards need no input code, i.e. no external code). It is impossible to put every Chinese character on a keyboard in this way, so a Chinese character system needs its own input code scheme that maps keyboard input to Chinese characters.

At present, the commonly used Chinese character input codes fall into the following categories:

  • Numeric codes, such as the location code;

  • Pinyin codes, such as full pinyin, double pinyin (Shuangpin), and Natural Code (Ziranma);

  • Shape-based codes, such as Wubi, Biaoxing code, and Zheng code.

Duplicate codes are common among Chinese external codes. A duplicate code arises when one external code corresponds to several Chinese characters; put the other way around, several Chinese characters share the same external code, which is why it is called a “duplicate code”. When pinyin is used as the external code, for example, duplicates are very common, because a pinyin input method has to deal with a large number of homophones.

When duplicates occur, a selection number usually has to be appended to specify which character is intended (after typing with the input method, you press a number key to pick the character from the candidate list). In this case the external code can be thought of as implicitly including the selection number.
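The idea that an external code plus a selection number identifies exactly one character can be sketched in Python as follows. The candidate table and the function name are hypothetical and far smaller than anything a real input method uses; the order of the candidates is likewise only an assumption.

```python
# Hypothetical, tiny candidate table: one pinyin external code -> several characters.
CANDIDATES = {
    "ma": ["妈", "马", "吗", "骂"],
    "zhong": ["中", "种", "重"],
}


def resolve(external_code: str, selection: int = 1) -> str:
    """Return the character picked by an external code plus a 1-based selection number."""
    candidates = CANDIDATES[external_code]
    if not 1 <= selection <= len(candidates):
        raise ValueError("selection number out of range")
    return candidates[selection - 1]


print(resolve("ma", 2))   # 马
print(resolve("zhong"))   # 中 (first candidate when no number is given)
```

Seen this way, the pair ("ma", 2) is the complete input code, and the pinyin alone is ambiguous.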

Glyph code (output code)

The glyph code, also called the font code or output code, is a dot-matrix code.

To output Chinese characters on a display or printer, each character is designed as a dot-matrix image of its graphic form, and the corresponding dot-matrix code (the glyph code) is derived from it.

In other words, 0s and 1s are used to represent the glyph of a Chinese character. The character is placed in a grid of n rows × n columns (a dot matrix). The grid has n^2 cells, each represented by one binary digit: a cell crossed by a stroke has the value 1, and a cell not crossed by any stroke has the value 0.

A Chinese character is usually displayed with a 16×16, 24×24, or 48×48 dot matrix. Once the size of the dot matrix is known, the number of bytes needed to store one character can be calculated.

For example, a 16×16 dot matrix represents each Chinese character with 16 rows of 16 dots. One dot needs 1 bit, so 16 dots need 16 bits (that is, 2 bytes), and 16 rows × 2 bytes/row = 32 bytes. In other words, the glyph code of a Chinese character in a 16×16 dot matrix needs 32 bytes.

Therefore, number of bytes = number of rows × (number of columns / 8).
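The formula can be checked with a few lines of Python. This is only a sketch: the helper name glyph_bytes is made up for illustration, and rounding each row up to a whole number of bytes is an assumption that only matters for widths not divisible by 8.

```python
def glyph_bytes(rows: int, cols: int) -> int:
    """Bytes needed for one rows x cols dot-matrix glyph at 1 bit per dot."""
    bytes_per_row = (cols + 7) // 8  # assumption: each row is padded to whole bytes
    return rows * bytes_per_row


for size in (16, 24, 48):
    print(f"{size}x{size}: {glyph_bytes(size, size)} bytes")
# 16x16: 32 bytes
# 24x24: 72 bytes
# 48x48: 288 bytes
```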

Ps: Note that the dot matrix of a Chinese character described here is not the pixel matrix of the display; the two concepts should not be confused.

The glyph code is mainly used inside the computer to output glyphs (on screen or in print). What we see is only the shape of the text; the glyph code itself cannot be “seen” directly.

Clearly, characters represented by glyphs can be called “concrete” characters, in contrast to the “abstract” characters in the abstract character repertoire (ACR), because they already have a “concrete” appearance.

Font library (character pattern library)

To display or print Chinese characters, a Chinese character information processing system also needs a Chinese character font library (also called a glyph library, or simply a font), which stores the glyph information of the characters in one place.

By output method, font libraries are divided into display fonts and print fonts. A display font is used for screen output and has to be loaded into memory when in use. A print font is used for printed output and does not need to be loaded into memory when in use.

By storage method, font libraries are divided into soft fonts and hard fonts. A soft font is stored on the hard disk in the form of font files, and this is the approach used today. A hard font is burned into a dedicated memory chip, which is combined with other necessary components into an interface card that plugs into the computer, usually called a Han card (Chinese character card). This approach is now obsolete.

Summary

To sum up: the code that represents Chinese characters inside the computer under a unified encoding scheme is the internal code. The code designed to make Chinese characters convenient to enter is the external code, also called the input code. The code used to display and print Chinese characters is the glyph code, also called the font code or output code.

Chinese characters from input to output process

The external code is entered through the keyboard. The input method converts the external code (input code) into the character number (that is, the code point value) of the operating system's current default character encoding scheme, and the character number is then converted through the code page table into the internal code (machine code) of the Chinese character (code pages are covered later in this article). This completes the input of the character. Next, according to the selected font, the glyph code (output code) that corresponds to the internal code is looked up in the font library, converting the internal code into the glyph code so that the character can be displayed or printed.

In fact, the input, processing, and display of English characters work in roughly the same way, except that English characters need no input code (external code): the letters can be typed directly on the keyboard.

ASCII, EASCII, the ISO 8859 series, the GB series, Big5, and Shift JIS, that is, the ANSI codes, are all ASCII-compatible yet incompatible with one another. (“ANSI code” is a collective name for the ASCII-compatible, mutually incompatible character encodings defined by individual countries and regions; the background of this name is explained in detail below.) They all belong to the traditional character encoding model rather than the modern one, so it is difficult to describe them directly and simply with the concepts of the modern character encoding model.

For the GB series, the location code is roughly equivalent to the character number of the coded character set (CCS) in the modern character encoding model, the national standard code is roughly equivalent to the code unit sequence of the character encoding form (CEF), and the internal code (machine code) is roughly equivalent to the byte sequence of the character encoding scheme (CES).

However, although the GB codes are multi-byte encodings, their code unit is a single byte (code units are introduced in detail in a later article). There is therefore no byte order problem, and the big-endian and little-endian variants of a character encoding scheme (CES) do not arise (byte order, big-endian, and little-endian are also introduced in detail in a later article).

1.5 ANSI codes and code pages

ANSI code

As mentioned above, before the UCS/Unicode encoding scheme, which covers all countries and regions of the world, came out, each country and region designed its own encoding scheme on top of ASCII so that computers could store and display its own characters.

In Europe, the EASCII and ISO/IEC 8859 series of character encoding schemes were designed. To display Chinese characters and related symbols, China designed the GB series of encodings (“GB” is the pinyin acronym of “Guo Biao”, meaning “national standard”).

Similarly, Japanese, Korean, and the other languages of the world each have their own encodings. Microsoft refers collectively to all these character encodings, developed independently by individual countries and regions, ASCII-compatible yet incompatible with one another, as ANSI codes. (Here “incompatible with one another” refers to the part of each encoding outside the ASCII range, and the same applies below.)

So even if you know that a text uses an ANSI code, you still need to know which country's or region's ANSI code it is before you can decode it. Moreover, a given text can be encoded with only one ANSI encoding scheme; for example, a single ANSI encoding cannot represent both Chinese and Korean text.
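To see why the country or region matters, the following Python sketch encodes two Chinese characters with GBK and then decodes the same bytes with several other ANSI encodings. The codec names are those of Python's standard library; errors="replace" is used so that any undecodable byte shows up as U+FFFD instead of raising an exception.

```python
text = "中文"
data = text.encode("gbk")      # the bytes as stored under the simplified Chinese ANSI code
print(data.hex(" ").upper())   # D6 D0 CE C4

# Decode the very same bytes as if they belonged to other regions' ANSI codes.
decoded = {
    codec: data.decode(codec, errors="replace")
    for codec in ("gbk", "big5", "shift_jis", "cp949")
}
for codec, result in decoded.items():
    print(f"{codec:10} {result!r}")
```

Only the "gbk" line reproduces the original text; the other lines show unrelated characters, which is exactly the garbling that appears when the wrong ANSI code is assumed.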

Why Microsoft calls them ANSI codes

Strictly speaking, ANSI is not a character encoding at all, but the acronym of the American National Standards Institute, a nonprofit organization in the United States. ANSI has done a great deal of standards work, including the C language specification ANSI C and the “code page” standards that correspond to the ASCII-compatible, mutually incompatible character encodings of various countries and regions.

For example, ANSI specifies that the code page of the simplified Chinese GB code is 936, so the GB code is also called ANSI code page 936.

This is why Microsoft collectively calls the ASCII-compatible, mutually incompatible character encodings ANSI encodings.

Later, perhaps for the sake of uniform naming, Microsoft also referred to some code pages that were not ANSI standards at the time as ANSI code pages, such as the Japanese code page 932 (CP932).

In Windows, “ANSI code” generally denotes the system's default encoding rather than one particular encoding. On a simplified Chinese operating system, ANSI code refers by default to a GB series encoding (GB 2312, GBK, GB 18030); on a traditional Chinese operating system, it refers by default to Big5 (the traditional Chinese character encoding used in Hong Kong, Macao, and Taiwan); on a Japanese operating system, it refers by default to Shift JIS; and so on. You can view and change it in the system locale settings (more on this later in this article).

Code page

Author's note: little material about code pages is available online, so the following could not be cross-checked against multiple sources.

A code page, commonly abbreviated CP and also known as an internal code table, is a character encoding table that a computer associates with a particular character set (more precisely, with a character encoding form, CEF, of that character set). The “character encoding” here is actually a character encoding scheme (CES), so the table is really a “character to byte” or “character to byte sequence” mapping, as explained further below.

Origin of code page

Originally, IBM used the term code page for the character encodings supported by the BIOS of its computers. The common operating systems of the time had command-line interfaces and displayed characters directly with the character drawing capability of the BIOS (or with the set of glyphs built into the graphics card's character generator). These BIOS code pages are also referred to as OEM code pages.

With the spread of operating systems with graphical user interfaces (the first widely accepted one being Windows 3.1), the operating system itself took over character drawing. Before Microsoft moved to UTF-16 (which predates the now widely accepted UTF-8) as the encoding for Windows, that is, before Windows 2000 was released, it defined a series of code pages supporting ANSI encodings, based on the ANSI code page standard (“ANSI code page” being the early name for ANSI encodings); these are therefore called “ANSI code pages”. Typical examples are code page 1252 (CP1252), which implements ISO 8859-1 (Latin-1), and code page 936 (CP936), which implements GBK.

Different languages and locales in modern operating systems may use different code pages.

In addition to the more common ANSI code pages mentioned above (the code page standard adopted by Microsoft) and the IBM code pages, other large vendors have their own code pages, such as the Oracle code pages and the SAP code pages, and some code pages were developed jointly by several companies. One example is the EUC code page. EUC stands for Extended Unix Code; it was jointly developed for Unix systems by several Unix vendors, uses 8-bit code units (that is, single-byte code units) to represent characters, and was standardized in 1991. EUC is mainly used to represent and store Chinese, Japanese, and Korean characters on Unix, Mac, and Linux.

In addition to the ANSI code pages originally defined for ANSI encodings, organizations and vendors have generally also defined code pages for the UTF encodings of the Unicode character set (UTF-8, UTF-16, UTF-32, and so on).

Code page and character set mapping

In addition, different organizations or vendors often use different code page names (usually numbers) for the same encoding. For example, UTF-8 is code page 1208 (CP1208) at IBM, code page 65001 (CP65001) at Microsoft, and code page 4110 (CP4110) at SAP. Likewise, Windows uses code page 936 (CP936) for the GBK encoding, while Mac and other Unix-like systems use the EUC-CN code page for the corresponding GB encoding (EUC-CN is the usual name for the GB 2312 encoding on Unix-like systems; Windows CP936 covers it and extends it to GBK).

(Attached: List of Code pages defined by Microsoft: Code Page Identifiers – Windows Applications)

It is important to note that in practice a code page is generally not identical to the character set it nominally corresponds to; for various reasons (for example, the standard not keeping up with practical needs) it often extends that character set.

For example, the ANSI code page 1252 (CP1252) that Microsoft adopted for the ISO 8859-1 character set extends Latin-1: most of the byte values 128 to 159, which ISO 8859-1 does not assign to printable characters, are also defined as characters in CP1252. CP1252 is used for English and most Western European languages (Spanish and the various Germanic and Scandinavian languages, among others). Similarly, code page 932 (CP932), which exists in both IBM and Microsoft variants for the Japanese Shift JIS character set, extends Shift JIS, as does IBM's code page 943 (CP943).
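The difference between CP1252 and Latin-1 in the range 128 to 159 can be observed with Python's standard "latin-1" and "cp1252" codecs. The helper name below is made up for this sketch, and the byte values are chosen from those that Python's cp1252 codec defines (a few values in the range, such as 0x81, are left undefined there and would raise an error).

```python
def decode_both(value: int) -> tuple[str, str]:
    """Decode one byte value as Latin-1 and as CP1252 and return both results."""
    raw = bytes([value])
    return raw.decode("latin-1"), raw.decode("cp1252")


for value in (0x80, 0x85, 0x93, 0x99, 0xE9):
    latin1, cp1252 = decode_both(value)
    print(f"{value:#04x}  latin-1: {latin1!r:8}  cp1252: {cp1252!r}")
```

For 0x80 the Latin-1 result is an invisible control character while CP1252 yields the euro sign; for 0xE9, outside the extended range, both give the same letter é.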

Code page representation

A code page can be represented as a table of characters mapped to single-byte or multi-byte values.

Note that although ANSI encodings belong to the traditional character encoding model, from the perspective of the modern character encoding model the single-byte and multi-byte values in a code page are platform-specific physical byte sequences. They are not code unit sequences in the logical, platform-independent sense (although for the earlier encoding schemes of the traditional model, the code unit sequence and the byte sequence are in fact the same).

This is especially true of the UTF encodings of the Unicode character set, which belong to the modern character encoding model. For example, a code page defined for UTF-16 stores one specific character encoding scheme (CES) of the UTF-16 CEF, that is, either the big-endian or the little-endian form (both concepts are described later).
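A short Python sketch makes the distinction concrete. The same character has one UTF-16 code unit but two possible byte sequences, depending on byte order, whereas a GB encoding with single-byte code units has only one. The codec names are standard Python codecs; the choice of character is arbitrary.

```python
ch = "中"  # U+4E2D, a single 16-bit code unit in UTF-16

encoded = {codec: ch.encode(codec) for codec in ("utf-16-be", "utf-16-le", "gbk")}
for codec, raw in encoded.items():
    print(f"{codec:10} {raw.hex(' ').upper()}")
# utf-16-be  4E 2D
# utf-16-le  2D 4E
# gbk        D6 D0
```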

Because of this, code pages are also called internal code tables.

In other words, a code page is a concrete encoding implementation of a character set in a computer. From the point of view of the modern character encoding model in particular, a code page can be regarded as one specific character encoding scheme (CES) for a character encoding form (CEF) of the character set, and can be understood as a “character to byte” (more precisely, “character to byte sequence”) mapping table. The computer performs the two-way “translation” between characters and bytes by table lookup.
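The two-way table lookup can be sketched with a miniature, hypothetical code page in Python. The table below holds only two entries (their byte values are the GBK ones), bytes below 0x80 are treated as ASCII, and any other byte is assumed to start a two-byte sequence; a real code page is of course far larger, and the function names are invented for this example.

```python
# Hypothetical miniature "code page": byte sequence -> character.
TABLE = {b"\xd6\xd0": "中", b"\xce\xc4": "文"}
# The reverse direction is the same table turned around.
REVERSE = {ch: raw for raw, ch in TABLE.items()}


def decode(data: bytes) -> str:
    """Translate bytes to characters by table lookup."""
    out = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:            # single byte: same meaning as ASCII
            out.append(chr(data[i]))
            i += 1
        else:                         # otherwise this byte and the next form one character
            out.append(TABLE[data[i:i + 2]])
            i += 2
    return "".join(out)


def encode(text: str) -> bytes:
    """Translate characters to bytes by reverse table lookup."""
    return b"".join(
        ch.encode("ascii") if ord(ch) < 0x80 else REVERSE[ch] for ch in text
    )


print(decode(b"A\xd6\xd0\xce\xc4"))        # A中文
print(encode("A中文").hex(" ").upper())    # 41 D6 D0 CE C4
```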

Code page mechanism, lookup table

Code pages are mainly used to implement the physical storage and display of the characters of each encoding scheme in a computer system. When the computer reads a binary byte, it has to look up in a code page stored on the computer which character the byte belongs to. This process is called table lookup.

For example, when Chinese characters are entered with an input code (external code), the input method software has to convert the input code (for a duplicate code, together with the selection number) into the internal code (machine code) through the code page, that is, by table lookup, so that it can be stored. The glyph code is then looked up in the font file of the selected font according to the internal code, so that the character can be displayed (ps: this was described in detail earlier, under “1.4 Simplified Chinese character coding implementation > Summary > Chinese characters from input to output process”).

Code page settings in Windows

In Windows, the default code page (the default system locale) can be changed (in Windows 7; ps: the same applies to Windows 10) under “Control Panel > Region and Language > Administrative > Language for non-Unicode programs > Change system locale” by selecting a language from the list.

Note that the system locale determines the default encoding scheme (obviously mainly an ANSI encoding scheme) and the font used for input and display of characters in programs that do not use Unicode (non-Unicode programs). This allows non-Unicode programs to run properly on the computer in the specified language (essentially, with the specified ANSI encoding).

Therefore, if garbled characters appear after you install a non-Unicode program, you may need to change the default system locale. Choosing a different language for the system locale does not affect the display language of Windows itself or of other programs that use Unicode (Unicode programs).

Obviously, however, if several non-Unicode programs on the same operating system use different ANSI encodings, only the programs that use one of those ANSI encodings can be displayed correctly at any one time; the text of non-Unicode programs that use other ANSI encodings appears garbled. Likewise, a single non-Unicode program cannot use two different ANSI encodings at once, such as one for Chinese and one for Korean, because one of the two will always be garbled.

However, now that Unicode encoding schemes have become mainstream, non-Unicode programs are rare.

Locales in Windows

To accommodate the cultural backgrounds and habits of users in different parts of the world, Microsoft designed the locale feature of Windows.

A locale is a set of settings specific to a particular country or region, including the code page and the formats for numbers, currency, times, and dates.

Inside Windows there are actually two locale settings: the system locale and the user locale. The system locale determines the code page, and the user locale determines the formats of numbers, currency, times, dates, and so on.

You can set the system locale (Language for non-Unicode programs) and the user locale (Standards and formats) under Regional and Language Options in the Windows Control Panel.

The code page corresponding to the system locale is used as the default code page of Windows. When the encoding of a text is not explicitly specified, Windows interprets the text data according to the default code page specified by the system locale, which essentially represents an encoding scheme. This default code page is usually called the ANSI code page (ACP). As mentioned earlier, it is called ACP because the system locale is set mainly for non-Unicode programs, even though code pages are also defined for the non-ANSI UTF encodings.

The code pages of the various languages (essentially encoding schemes) can be seen in the Code Page Conversion Tables list on the Advanced tab of Regional and Language Options in Windows XP (they are not directly visible in Windows 7).
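As a rough way to inspect this default from a program, Python's standard library reports the encoding it derives from the current locale, and on Windows the Win32 function GetACP returns the ANSI code page number itself. The sketch below assumes CPython; the helper name is made up, and the result depends entirely on the machine, so no output is shown.

```python
import locale
import sys


def default_text_encoding() -> str:
    """Encoding Python assumes for text when none is given explicitly.

    On Windows this normally reflects the ANSI code page of the system locale
    (unless Python's UTF-8 mode is enabled).
    """
    return locale.getpreferredencoding(False)


print("default text encoding:", default_text_encoding())

if sys.platform == "win32":
    import ctypes

    # GetACP is a Win32 API; the call is only attempted on Windows.
    print("ANSI code page (ACP):", ctypes.windll.kernel32.GetACP())
```

On a simplified Chinese Windows installation one would expect the code page 936 mentioned earlier, but that is an expectation rather than something this snippet guarantees.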

Character Encoding (III: Unicode Encoding System and Byte Order)