Welcome to Python Data Scientist

【 Key points 】

1. The concept of character encoding and decoding
2. The development from ASCII encoding to Unicode encoding
3. Easily confused: character encoding vs. character code

The last two episodes familiarized us with the common uses of strings, but they also raised some concepts I had long been puzzling over: ASCII, Unicode, character encodings and so on. Let's talk about them today.

There are so many terms surrounding character encoding that it is easy to get confused: ASCII, GB2312, Unicode, UTF-8, UTF-16, encode, decode, and more.

To make these issues clear, I'm going to take a different approach today: not programming, but storytelling, a historical timeline of how computers evolved in countries with different languages, so that we can nail down these concepts once and for all.

What is character encoding and decoding?

The "language" that the computer itself understands is binary. The smallest unit of information is the binary bit, and eight bits make one byte. The language we understand is a character set made up of English letters, Chinese characters, punctuation marks, Arabic numerals and many other symbols. For the computer to do what humans intend, the character set humans use must be converted into binary code the computer can understand; this process is encoding, and the reverse process is called decoding.

The evolution from ASCII to Unicode

When computers were first invented and used in the United States, the character set that needed encoding was not very large: English letters, numbers, and some simple punctuation marks, so a single-byte encoding was enough. Under this scheme, characters in the desired character set are mapped one by one onto 128 binary numbers whose highest bit is 0, using the remaining 7 bits to form 00000000 to 01111111 (0x00 to 0x7F).

The 32 values 0x00 to 0x1F encode control characters and special communication characters (such as LF newline and BS backspace), with DEL delete at 0x7F. The 95 values 0x20 to 0x7E encode Arabic digits, upper- and lower-case letters, underscores, brackets and other printable symbols. The process of mapping this character set onto the binary codes 0x00-0x7F is the basic ASCII encoding, through which the computer converts human language into its own language and stores it. Conversely, decoding is the process of reading the binary numbers back from disk and converting them into letters and digits for display.
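We can poke at this mapping directly with Python's built-in `ord()` and `chr()`:

```python
# 'A' occupies code 65 (0x41) in ASCII; ord()/chr() expose the mapping.
print(ord("A"))          # 65
print(hex(ord("A")))     # 0x41
print(chr(65))           # 'A'

# Control characters sit in 0x00-0x1F; LF newline is code 10.
print(ord("\n"))         # 10
```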

As computers spread rapidly, people in non-English-speaking countries in Europe found that the character set the Americans had designed was not enough; accented characters, Greek letters and so on were missing. So the ASCII rules were extended by changing the highest bit from 0 to 1, opening up the 128 binary numbers 10000000~11111111 (0x80~0xFF). The best known of these extensions is ISO 8859-1, often referred to as Latin-1. Latin-1 uses the values 128 to 255 to cover enough additional characters for the basic Western European languages, while remaining compatible with ASCII encoding in the range 0 to 127.
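A quick illustration: 'é' does not exist in ASCII, but Latin-1 places it in the extended range, while the lower range stays byte-for-byte identical to ASCII.

```python
# 'é' is outside ASCII; Latin-1 maps it to 0xE9 in the extended 0x80-0xFF range.
print("é".encode("latin-1"))   # b'\xe9'

# The 0-127 range is byte-for-byte compatible with ASCII:
print("A".encode("latin-1"))   # b'A'
```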

As more countries adopted computers, the character sets that needed encoding naturally grew. The original ASCII character set, limited to a single byte, was nowhere near sufficient, for instance under the pressure of thousands of Chinese characters. So the Chinese national standards administration issued the national standard "Chinese Coded Character Set for Information Interchange", standard number GB 2312-1980. This character set contains 6763 Chinese characters and 682 non-Chinese graphic characters. It is encoded with two bytes and is backward compatible with ASCII. In short, the whole character set is divided into 94 rows of 94 cells each, with the row and the cell each represented by one byte; every position corresponds to one character, so the row-cell pair gives a two-byte encoding for Chinese characters. Later, rare characters, traditional Chinese characters and Japanese and Korean characters were also included, giving rise to the GBK character set and its encoding specification, which is in turn backward compatible with GB2312.
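Python ships with both codecs, so we can see the two-byte scheme and the ASCII compatibility directly:

```python
# Each Chinese character becomes two bytes under GB2312,
# while ASCII characters remain a single byte.
print("中".encode("gb2312"))   # b'\xd6\xd0' -- two bytes
print("A".encode("gb2312"))    # b'A'        -- one byte, ASCII-compatible

# GBK is a superset: anything GB2312 encodes, GBK encodes identically.
print("中".encode("gbk"))      # b'\xd6\xd0'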

While this was happening in China, countries and regions around the world were each developing their own encoding systems as computers spread, and these systems multiplied. The problem became obvious, especially in the Internet era: computers equipped with different encoding systems could not understand what the other was "saying". Characters converted to binary under encoding system A could not be recovered by decoding on a computer using encoding system B; instead, strange unexpected characters would appear, which is what we call garbled text (mojibake).
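The mismatch is easy to reproduce: encode text under one system and decode it under another.

```python
# Bytes written by a machine using GBK...
raw = "中文".encode("gbk")       # b'\xd6\xd0\xce\xc4'

# ...read back on a machine that assumes Latin-1: garbled text.
print(raw.decode("latin-1"))     # 'ÖÐÎÄ' -- mojibake
```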

To meet the need for cross-language, cross-platform text processing, the International Organization for Standardization (ISO), together with the Unicode Consortium, put forward a new standard, Unicode, comprising the Unicode character set and a set of encoding specifications. The Unicode character set covers all the characters and symbols in the world, and the standard assigns each character in it a uniform, unique binary code, completely resolving the conflicts and garbled text of the separate encoding systems. The scheme, in brief: the specification defines 17 groups (called planes), each containing 65536 code points (group 0, for example, is 0x0000~0xFFFF). Each code point corresponds to a unique character. Most characters sit at code points in plane 0; a small number sit in the other planes.
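In Python 3, `str` is a sequence of Unicode code points, so `ord()` and `chr()` work across all planes:

```python
# Every character has a unique Unicode code point.
print(hex(ord("中")))         # 0x4e2d -- in plane 0 (0x0000-0xFFFF)
print(chr(0x4E2D))            # '中'

# An emoji lives outside plane 0:
print(hex(ord("😀")))         # 0x1f600
# Plane number = code point // 0x10000
print(ord("😀") // 0x10000)   # 1
```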

Distinguishing character codes from character encodings

Now that Unicode has come up, what are the UTF-8 and UTF-16 encoding schemes that so often accompany it? Up to this point we have been conflating two things: the character code, which is the ordinal number of a particular character within a character set, and the character encoding, which is the byte sequence used to represent that character during transmission and storage. In the ASCII system the two coincide: take the character A, whose ordinal in the ASCII character set, i.e. its character code, is 65, and whose bit sequence stored on disk is 01000001 (0x41, also 65 in decimal). Likewise, in the GB2312 system the character code and the character encoding have the same value, so we tend to ignore the difference between the two.
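A short sketch of the distinction: for ASCII the code point and the stored byte are the same number, while for a Chinese character under UTF-8 they are not.

```python
# ASCII: the character code equals the stored byte value.
print(ord("A"), "A".encode("ascii"))          # 65 b'A' (byte 0x41 == 65)

# Unicode + UTF-8: code point 0x4E2D, but the stored bytes differ from it.
print(hex(ord("中")), "中".encode("utf-8"))   # 0x4e2d b'\xe4\xb8\xad'
```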

In the Unicode standard, a character's code can be represented in four bytes (UCS-4), with the character codes 0~127 compatible with the ASCII character set. The character codes of common Chinese characters are also concentrated below 65535; characters with codes above 65535, i.e. those requiring more than two bytes to represent, are relatively rare. So if we kept the character encoding identical to the character code, English letters and digits that originally needed only one byte would now need four, and Chinese characters that originally needed two bytes would also need four, which is wasteful for both storage and transmission.
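The waste is easy to measure with Python's fixed-width UTF-32 codec (four bytes per character, equivalent to storing the raw UCS-4 code) against the variable-width schemes:

```python
# A fixed four-byte code quadruples the size of ASCII text:
print(len("ABCD".encode("utf-32-be")))   # 16 bytes
print(len("ABCD".encode("utf-8")))       # 4 bytes

# Chinese text doubles in size compared with a two-byte scheme:
print(len("中文".encode("utf-32-be")))   # 8 bytes
print(len("中文".encode("utf-16-be")))   # 4 bytes
```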

So the character code needs to be re-encoded into a separate character encoding, which is where UTF-8, UTF-16 and the other encoding schemes come in. UTF-8 is designed to map character codes in different ranges to encodings of different lengths. It works in units of bytes and is fully compatible with ASCII: for the codes 0x00-0x7F, the encoding is identical to the code. The codes of common Chinese characters in Unicode lie in 4E00-9FA5, and in the correspondence table at the end of the article we can see that such characters are encoded in three bytes. Similarly, UTF-16 re-encodes the characters of the Unicode character set in units of 16-bit binary numbers; the principle is the same as UTF-8.
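The variable lengths line up with the ranges in the correspondence table:

```python
# UTF-8 length varies with the character code's range:
print("A".encode("utf-8"))    # b'A'            -- 1 byte  (0x00-0x7F)
print("é".encode("utf-8"))    # b'\xc3\xa9'     -- 2 bytes (0x80-0x7FF)
print("中".encode("utf-8"))   # b'\xe4\xb8\xad' -- 3 bytes (0x800-0xFFFF)

# UTF-16 encodes a plane-0 character in one 16-bit unit (bytes 0x4E 0x2D):
print(len("中".encode("utf-16-be")))   # 2
```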

So we can see that, against today's globally connected background, the Unicode character set and its encodings solve the problem of cross-language, cross-platform communication, while UTF-8 and the other encoding schemes save storage space and transmission bandwidth, which is why they have been so widely adopted.

[girl says] What a good story! These concepts used to feel like a mess to me. This episode walks through the history of character encoding, so it's easy to see why a particular encoding appeared when it did; once you grasp that context, character encoding is easy to understand.

Appendix:

Unicode character codes and their corresponding UTF-8 encodings:

0000 0000 - 0000 007F | 0xxxxxxx
0000 0080 - 0000 07FF | 110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

