Character set encoding – ASCII/ISO-8859-1/GBK/ Unicode /UTF-8
Why encode?
A: There are far more symbols in the world than any early standard covered, and a computer can only store characters that some agreed-upon standard maps to numbers.
Fixed-length encodings are convenient for computers to process: GBK is not a fixed-length encoding, while the original Unicode scheme (UCS-2) used a fixed two bytes per character.
ASCII – American Standard Code for Information Interchange
Represented by the lower seven bits of a byte; the earliest widely used character set, an American standard, containing 2^7 = 128 characters.
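The 7-bit limit is easy to verify in Python (a small illustration; the string is just an example):

```python
# ASCII uses only the low 7 bits of a byte: code points 0..127.
text = "Hello"
encoded = text.encode("ascii")          # one byte per character
print(list(encoded))                    # [72, 101, 108, 108, 111]
print(all(b < 128 for b in encoded))    # True: every byte fits in 7 bits
```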
Problem: cannot represent characters from other countries
ISO-8859-1
One full byte (8 bits), 2^8 = 256 characters in total, covering Latin and Western European characters, twice as many as ASCII
Problem: still cannot cover the characters of all other countries, such as Chinese
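The full 8-bit range, and its limit, can be seen with Python's built-in codecs (the strings are just examples):

```python
# ISO-8859-1 (Latin-1) uses all 8 bits of a byte: code points 0..255.
s = "café"                       # 'é' is U+00E9, inside Latin-1's range
data = s.encode("latin-1")
print(data)                      # b'caf\xe9' -- still one byte per character
print(len(data))                 # 4

# Characters outside the 256-code-point range, such as Chinese,
# cannot be encoded and raise UnicodeEncodeError.
try:
    "好".encode("latin-1")
except UnicodeEncodeError:
    print("no room for Chinese characters")
```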
GBK
Two bytes represent one Chinese character; an encoded Chinese character looks like a byte pair such as A5 B6
GB2312/GBK/GB18030
- GB2312 covers simplified characters only
- GBK can represent both traditional and simplified characters, and is backward compatible with GB2312
- GB18030 contains all Chinese characters
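The compatibility relationship in the list above can be checked with Python's built-in codecs (the character 好 is just an example):

```python
# In the GB family, every Chinese character occupies two bytes.
ch = "好"
gb2312_bytes = ch.encode("gb2312")   # simplified-Chinese subset
gbk_bytes = ch.encode("gbk")         # GBK is compatible with GB2312
print(gbk_bytes == gb2312_bytes)     # True: same bytes for GB2312 characters
print(len(gbk_bytes))                # 2 bytes per Chinese character
```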
The problem: there is still no universal encoding; every country has its own character set
Unicode
Unicode emerged to solve the problem of incompatible encoding standards around the world: its character set can represent every character in the world. The original standard used a fixed two bytes per character (UCS-2); the concrete byte formats are left to the encodings that implement it.
The Unicode character set is designed to collect all characters in the world, assigning each character a unique number known as a code point. The code space is divided into 17 planes, ranging from U+0000 to U+10FFFF, a total of 1,114,112 code points.
A code point is the numeric representation of a character.
Unicode, in other words, is a very thick dictionary that records one number for every character in the world. How the space is subdivided is not the concern here; what matters is that Unicode assigns a number to every character.
U+597D → 好 ("good")
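In Python, `ord` and `chr` expose exactly this mapping between characters and code points:

```python
# A code point is just the number Unicode assigns to a character.
print(hex(ord("好")))   # 0x597d -- the code point U+597D
print(chr(0x597D))      # 好 -- from number back to character
print(hex(ord("A")))    # 0x41 -- ASCII characters keep their old numbers
```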
Problem: a character that used to need only one byte now takes two bytes
UTF-8
UTF-8 (Unicode Transformation Format, 8-bit) is one implementation of the Unicode standard
To solve the problem that plain two-byte Unicode doubles the size of characters that fit in one byte, the encodings UTF-8, UTF-16, and UTF-32 emerged
UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character and is fully ASCII-compatible
- For a single-byte character, the first bit is set to 0 and the remaining 7 bits hold the character's Unicode code point. For the characters 0–127 this is identical to ASCII, so documents from the ASCII era open in UTF-8 encoding without any problem.
- For a character that needs N bytes (N > 1), the first N bits of the first byte are set to 1, bit N+1 is set to 0, the first two bits of each of the remaining N−1 bytes are set to 10, and the remaining bits are filled with the binary form of the character's Unicode code point.
| Unicode code point range (hex) | UTF-8 binary |
|---|---|
| 0000 0000 – 0000 007F | 0xxxxxxx |
| 0000 0080 – 0000 07FF | 110xxxxx 10xxxxxx |
| 0000 0800 – 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| 0001 0000 – 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
The leading bits of the first byte therefore tell a decoder whether a character occupies 1 byte or N bytes
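The rules in the table can be applied by hand; a short Python sketch of the three-byte case, using U+597D (好) as the example:

```python
# Manually apply the UTF-8 rules above to U+597D ("好").
cp = 0x597D                            # falls in 0000 0800 - 0000 FFFF: 3-byte form
b1 = 0b11100000 | (cp >> 12)           # 1110xxxx: top 4 bits of the code point
b2 = 0b10000000 | (cp >> 6) & 0x3F     # 10xxxxxx: middle 6 bits
b3 = 0b10000000 | cp & 0x3F            # 10xxxxxx: low 6 bits
print(bytes([b1, b2, b3]))             # b'\xe5\xa5\xbd'
print("好".encode("utf-8"))            # same bytes -- the built-in codec agrees
```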