ASCII
ASCII is the American Standard Code for Information Interchange, a scheme that uses 7 or 8 binary bits to assign numerical values to characters, including letters, digits, punctuation, control characters, and other symbols. With 8 bits it can cover up to 256 (2^8) characters. The basic 7-bit ASCII character set has 128 characters: 95 printable characters (the space plus the commonly used letters, digits, and punctuation marks) and 33 control characters (codes 0 to 31 and 127).
If every character in a file is an ASCII printable code or the space code, the file is called an “ASCII text file”, or simply a “text file”, and can usually be exchanged directly between different computer systems. Files containing control codes or non-ASCII codes cannot normally be exchanged directly between different computer systems. Such files are commonly called binary files.
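The split of the 7-bit range can be checked with a short Python sketch. It assumes the usual convention that DEL (code 127) counts as a control character and the space (code 32) counts as printable; the helper name `is_ascii_text` is invented for illustration and follows the strict definition of a text file given above.

```python
# Partition the 128 codes of 7-bit ASCII.
control = [c for c in range(128) if c < 32 or c == 127]
printable = [c for c in range(128) if 32 <= c <= 126]

def is_ascii_text(data: bytes) -> bool:
    """Hypothetical helper: True if every byte is a printable ASCII code
    (the space, code 32, is included in that range)."""
    return all(32 <= b <= 126 for b in data)

print(len(control), len(printable))       # 33 95
print(ord("A"), chr(97))                  # 65 a
print(is_ascii_text(b"Hello world"))      # True
print(is_ascii_text("中".encode("gbk")))  # False
```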
ANSI
To extend ASCII so that it could display their own languages, different countries and regions developed their own standards, resulting in GB2312, BIG5, JIS, and other encodings. These extended encodings, which typically use two bytes to represent a character such as a Chinese character, are called ANSI encodings, also known as MBCS (Multi-Byte Character Set). On a Simplified Chinese system, ANSI means GB2312; on a Japanese system, ANSI means JIS. So on Chinese Windows, converting text to GB2312 or GBK (an extension of GB2312) only requires saving it with ANSI encoding. Different ANSI encodings are incompatible with each other, so text in two different languages cannot be stored together under one ANSI encoding, which is a problem for international exchange. The major drawback is that the same code value represents different characters in different encodings, which easily leads to confusion.
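The incompatibility can be seen by decoding the same bytes under several code pages. This is a minimal Python sketch; the codec names `gbk`, `big5`, and `shift_jis` are Python's standard codecs, used here as stand-ins for the Chinese and Japanese "ANSI" encodings.

```python
# Bytes as written on a Simplified Chinese system.
raw = "中".encode("gbk")

as_gbk = raw.decode("gbk")                            # matching code page
as_big5 = raw.decode("big5", errors="replace")        # read as Traditional Chinese
as_sjis = raw.decode("shift_jis", errors="replace")   # read as Japanese

print(raw.hex())     # d6d0
print(as_gbk)        # 中
print(as_big5, as_sjis)   # two different, unrelated results
```

Only the decoder that matches the writer recovers the original character; the others silently produce something else.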
GB2312
GB2312 is one of the ANSI encodings; like the others, it extends the original ASCII code. To meet the need for Chinese characters on computers in China, the State Administration of Standards of China issued a series of national standards for Chinese character sets, collectively known as GB codes, or national standard codes. The most influential is the “Chinese Coded Character Set for Information Interchange: Basic Set”, published in 1980 with the standard number GB 2312-1980. Because it is so widely used, it is often simply called the national standard code. GB2312 is widely used in mainland China, and Singapore and other places also use it. Almost all Chinese systems and internationalized software support GB2312. GB2312 is a Simplified Chinese character set consisting of 6,763 common Chinese characters and 682 full-width non-Chinese characters.
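As a small illustration of how GB2312 sits on top of ASCII, the following Python sketch encodes a mixed string with Python's standard `gb2312` codec. The sample string is arbitrary.

```python
text = "A中"
encoded = text.encode("gb2312")

# The ASCII letter stays one byte; the Chinese character becomes two.
print(" ".join(f"{b:02x}" for b in encoded))   # 41 d6 d0
print(len(text), len(encoded))                 # 2 3

# Both bytes of the Chinese character have the highest bit set,
# which keeps them out of the 7-bit ASCII range.
print(all(b >= 0x80 for b in encoded[1:]))     # True
```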
GBK
GB 2312 basically met the needs of computer processing of Chinese characters, but it cannot handle the rare characters found in personal names and classical Chinese, which led to the GBK and GB 18030 character sets. GBK is an extension of GB2312-80 and is compatible with it. It contains 21,886 characters in total: 21,003 Chinese characters (including radicals and components) and 883 graphic symbols. It also provides 1,894 user-defined code positions, and it holds simplified and traditional characters in a single set. For comparison, GB2312, the Basic Set released by the State Administration of Standards in 1980, contains only 6,763 Chinese characters and 682 non-Chinese graphic characters.
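The relationship between the two sets can be probed with Python's standard `gb2312` and `gbk` codecs. This is only a sketch: `can_encode` is a hypothetical helper, and the traditional character 龍 was chosen as one example of a character outside GB2312.

```python
def can_encode(ch: str, encoding: str) -> bool:
    """Hypothetical helper: report whether `ch` is representable in `encoding`."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

common = "中"   # in GB2312, so also in GBK
trad = "龍"     # traditional form: outside GB2312, inside GBK

print(can_encode(common, "gb2312"), can_encode(common, "gbk"))   # True True
print(can_encode(trad, "gb2312"), can_encode(trad, "gbk"))       # False True

# Compatibility: a GB2312 character keeps the same bytes in GBK.
print(common.encode("gb2312") == common.encode("gbk"))           # True
```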
Big5
Taiwan, Hong Kong, and Macao use traditional Chinese characters. GB2312, released in 1980, covers simplified Chinese characters only and does not support traditional ones. In these regions, different vendors had proposed many mutually incompatible character set encodings, which made information exchange difficult. To unify the encoding of traditional Chinese, in 1984 five major Taiwanese manufacturers (Acer, MiTAC, JiaJia, Zero One, and FIC) jointly formulated a traditional Chinese encoding scheme. Because of its origin it was named Big5 in English, which was later translated back into Chinese and is commonly known as the “big five code”. Big5 is a traditional Chinese character set containing 13,053 traditional Chinese characters and 808 punctuation marks, Greek letters, and special symbols.
Unicode
Garbled characters often appear in emails and web pages because the sender may be using, say, the Japanese ANSI encoding while the reader is using the Chinese one. The two sides interpret the same binary values with different encodings, so the text comes out garbled. This problem led to the creation of Unicode. If one code table included every symbol in the world, whether English, Japanese, Chinese, or anything else, and everyone used that table, there would be no encoding mismatches. Each symbol corresponds to a unique code, and the garbling problem disappears. That is the idea behind Unicode.
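To make "each symbol corresponds to a unique code" concrete, here is a small Python sketch. Python's built-in `ord` returns a character's Unicode code point; the helper name `code_point` is invented for illustration.

```python
def code_point(ch: str) -> str:
    """Hypothetical helper: format a character's Unicode code point as U+XXXX."""
    return f"U+{ord(ch):04X}"

for ch in ["A", "中", "あ"]:
    print(ch, code_point(ch))
# A U+0041
# 中 U+4E2D
# あ U+3042
```

English, Chinese, and Japanese characters all live in one table, so no code value is shared between them.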
Unicode, known in Chinese as the universal code, is developed by the Unicode Consortium and assigns codes to most of the world’s writing systems. ISO was doing the same thing in parallel: its ISO/IEC 10646 project defines the Universal Multiple-Octet Coded Character Set, or UCS. Over time both sides realized that the world did not need two universal character sets, so they began to merge their work, and by Unicode 2.0 the Unicode and UCS code assignments were essentially the same, although some differences between the two standards remain.
Unicode is now widespread, and UTF-8 (one encoding form of Unicode) is especially popular. The UCS encodings are roughly equivalent to UTF-16 and UTF-32 (UCS-2 corresponds to UTF-16 without surrogate pairs, and UCS-4 to UTF-32), so the UCS names have largely dropped out of view.
UTF-8
Unicode does unify the code assignments, but storing them directly is not very efficient. For example, UCS-4 (one of the encoding forms associated with Unicode) stores every symbol in four bytes, so each English letter is preceded by three zero bytes, which is wasteful for storage and transmission. UTF-8 was created to make Unicode encoding more efficient. UTF-8 chooses the encoded length according to the symbol, from one to four bytes. An English letter, for example, takes a single byte.
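The variable length is easy to observe in Python. In this sketch the sample characters are arbitrary, and Python's `utf-32-be` codec stands in for the four-byte UCS-4 form, which is an assumption made for illustration.

```python
# Sample characters: Latin letter, accented letter, Chinese character, emoji.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

# Number of bytes each character needs in UTF-8.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)              # [1, 2, 3, 4]

# In the four-byte form, "A" carries three leading zero bytes.
print("A".encode("utf-32-be"))   # b'\x00\x00\x00A'

# ASCII text is byte-for-byte identical in UTF-8.
print("Hello".encode("utf-8") == "Hello".encode("ascii"))   # True
```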
UTF-16
UTF-16 is another encoding form of Unicode. Its advantage over UTF-8 is that most characters (those in the Basic Multilingual Plane) are stored in a fixed length of 2 bytes, while the remaining characters take 4 bytes. However, UTF-16 is not compatible with ASCII.
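A short Python sketch of the UTF-16 sizes, using the big-endian codec `utf-16-be` so that no byte order mark is added. The sample characters are arbitrary.

```python
for ch in ["A", "\u4e2d", "\U0001F600"]:
    encoded = ch.encode("utf-16-be")   # big-endian, no byte order mark
    print(len(encoded), encoded.hex())
# 2 0041
# 2 4e2d
# 4 d83dde00

# Not ASCII-compatible: "A" is two bytes in UTF-16 but one byte in ASCII.
print("A".encode("utf-16-be") == "A".encode("ascii"))   # False
```

The last sample is an emoji outside the two-byte range, which is why it needs a four-byte pair.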
Base64
For historical reasons (as if only America would ever use email), some email systems, such as some foreign mailboxes, do not support transmitting non-English characters such as Chinese characters. An English letter is stored in ASCII, and although it occupies one byte (8 bits), only 7 bits are actually used: the highest bit is unused and set to 0. Such a system therefore treats any byte whose highest bit is 1 as an error. Encodings such as GB2312 not only use multiple bytes per character, but those bytes usually have the highest bit set to 1, so the mail system replaces the 1 with 0 and the recipient finds the message garbled.
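The damage done by a 7-bit channel can be simulated in a few lines of Python. This is only a sketch of the effect, not of any particular mail system: it simply clears the highest bit of every byte.

```python
raw = "中".encode("gb2312")               # two bytes, both with the highest bit set
stripped = bytes(b & 0x7F for b in raw)   # what a 7-bit-only channel would keep

print([f"{b:08b}" for b in raw])        # ['11010110', '11010000']
print([f"{b:08b}" for b in stripped])   # ['01010110', '01010000']
print(stripped.decode("ascii"))         # VP
```

One Chinese character arrives as two unrelated English letters, and the original bytes cannot be recovered from them.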
For such a mail system to send and receive messages correctly, symbols stored in other encodings must be converted to ASCII for transmission. For example, the sender takes GB2312 bytes -> applies the Base64 rules -> obtains ASCII text, and the receiver takes the ASCII text -> applies the Base64 rules -> recovers the GB2312 bytes.
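The round trip can be sketched with Python's standard `base64` module. The variable names and the sample text are illustrative only.

```python
import base64

original = "中文"
gb_bytes = original.encode("gb2312")   # sender: text -> GB2312 bytes
wire = base64.b64encode(gb_bytes)      # GB2312 bytes -> ASCII-only Base64 text

# Every byte on the wire is 7-bit ASCII, so a 7-bit mail system leaves it intact.
print(wire, all(b < 0x80 for b in wire))

# Receiver: reverse both steps.
received = base64.b64decode(wire).decode("gb2312")
print(received == original)   # True
```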
BMP
UCS-4 is divided into 2^7=128 groups according to the highest byte, whose top bit is always 0. Each group is divided into 256 planes according to the second-highest byte. Each plane is divided into 256 rows according to the third byte, and each row contains 256 cells. Cells in the same row differ only in the last byte; everything else is the same. Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. Put another way, the BMP is the set of UCS-4 code points whose two high bytes are both zero. Removing those two leading zero bytes from a BMP code point in UCS-4 gives UCS-2, and prefixing a two-byte UCS-2 code with two zero bytes gives the UCS-4 code for the same BMP character. Early versions of the UCS-4 specification assigned no characters outside the BMP; later versions do, for example emoji and rarer Chinese characters in the supplementary planes.
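The group/plane/row/cell layout amounts to slicing a code point into bytes. The Python sketch below uses the hypothetical helpers `ucs4_parts` and `in_bmp`, and it uses Python's `utf-32-be` and `utf-16-be` codecs as stand-ins for UCS-4 and UCS-2, which holds for BMP characters.

```python
def ucs4_parts(code_point: int):
    """Split a UCS-4 code point into (group, plane, row, cell)."""
    group = (code_point >> 24) & 0x7F   # highest byte, top bit always 0
    plane = (code_point >> 16) & 0xFF   # second-highest byte
    row = (code_point >> 8) & 0xFF      # third byte
    cell = code_point & 0xFF            # last byte
    return group, plane, row, cell

def in_bmp(code_point: int) -> bool:
    """True when the code point lies in group 0, plane 0."""
    return ucs4_parts(code_point)[:2] == (0, 0)

print(ucs4_parts(ord("中")))   # (0, 0, 78, 45)
print(in_bmp(ord("中")))       # True
print(in_bmp(0x1F600))         # False

# For a BMP character, the four-byte form is the two-byte form
# with two zero bytes in front.
print("中".encode("utf-32-be") == b"\x00\x00" + "中".encode("utf-16-be"))   # True
```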
Further reading
- Web coding is that kind of thing
- Thinking Logic of Computer Program (6) – How to Recover from Gibberish (1)
- Thinking Logic of Computer Program (6) – How to Recover from Gibberish (2)
- Thinking Logic of Computer Program (8) – The True Meaning of char
References
- www.fmddlmyy.cn/text6.html
- Zhidao.baidu.com/question/87…
- baike.baidu.com/item/ character code /8…
- Baike.baidu.com/item/UTF-16…