Character set encoding – ASCII / ISO-8859-1 / GBK / Unicode / UTF-8

Why encode?

A: Computers store only numbers, so every symbol has to be mapped to a number. There are a great many symbols to represent, and without an agreed standard for that mapping, computers cannot reliably recognize or exchange characters

Fixed-length encodings are convenient for computers to process. GBK is a variable-length encoding (one or two bytes per character), whereas Unicode was originally designed as a fixed-length two-byte code

ASCII – American Standard Code for Information Interchange

ASCII is the American standard and one of the earliest character sets. It uses the lower seven bits of a byte, which gives 2^7 = 128 characters

Problem: it cannot represent the characters used in other countries
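The seven-bit limit is easy to observe. The snippet below is a small sketch using Python's built-in `ascii` codec; the sample strings are arbitrary choices for illustration.

```python
# Every ASCII character has a code below 128, so the top bit of its byte is 0.
text = "Hello"
codes = [ord(ch) for ch in text]
encoded = text.encode("ascii")

print(codes)                            # [72, 101, 108, 108, 111]
print(all(b < 0x80 for b in encoded))   # True

# A character outside the 128-entry table cannot be encoded at all.
try:
    "\u00e9".encode("ascii")            # U+00E9, Latin small letter e with acute
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(ascii_ok)                         # False
```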

ISO-8859-1

ISO-8859-1 uses all eight bits of a single byte, which gives 2^8 = 256 code positions, twice as many as ASCII. It adds the accented Latin letters used by Western European languages

Problem: it still cannot represent the characters of every country, Chinese for example
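As a quick check of both points, the sketch below encodes a Western European word and then a Chinese character with Python's `iso-8859-1` codec. The sample text is an assumption chosen for illustration.

```python
# "café" fits in ISO-8859-1: one byte per character, with é stored as 0xE9.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")
print(latin1_bytes)        # b'caf\xe9'
print(len(latin1_bytes))   # 4

# A Chinese character has no slot in the 256-entry table.
try:
    "好".encode("iso-8859-1")
    latin1_can_encode_chinese = True
except UnicodeEncodeError:
    latin1_can_encode_chinese = False

print(latin1_can_encode_chinese)   # False
```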

GBK

GBK uses two bytes to represent a Chinese character. The encoded form is a byte pair, written in hexadecimal in the form A5 B6

GB2312/GBK/GB18030

  • GB2312 covers only simplified characters
  • GBK can represent traditional and simplified characters at the same time and is compatible with GB2312
  • GB18030 contains all Chinese characters
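The relationship between the three standards can be sketched with Python's built-in `gb2312`, `gbk`, and `gb18030` codecs. The characters used here (好 and the traditional character 龍) are picked only as examples.

```python
ch = "好"  # a simplified character, present in all three standards

gb2312_bytes = ch.encode("gb2312")
gbk_bytes = ch.encode("gbk")
gb18030_bytes = ch.encode("gb18030")

print(gbk_bytes.hex())                               # bac3
print(gb2312_bytes == gbk_bytes == gb18030_bytes)    # True

# ASCII characters still take one byte, so GBK text is variable-length.
mixed = "a好".encode("gbk")
print(len(mixed))                                    # 3

# A traditional character: representable in GBK, but not in GB2312.
traditional = "\u9f8d"  # 龍
try:
    traditional.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

print(in_gb2312)                                     # False
print(len(traditional.encode("gbk")))                # 2
```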

Problem: there is no universal code; every country has its own character set

Unicode

The Unicode specification emerged to solve the problem of incompatible encoding standards around the world. It is designed to represent the characters of every character set in the world. Unicode was originally conceived as a fixed-length two-byte code. It is the underlying standard; encodings such as UTF-8 are concrete implementations of it

The Unicode character set is designed to collect all the characters in the world, assigning each one a unique number known as a Code Point. The code space is divided into 17 planes of 65,536 code points each (offsets 0000 to FFFF within a plane). It runs from U+0000 to U+10FFFF, for a total of 1,114,112 code points, the highest being number 1,114,111
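The plane arithmetic can be worked through in a few lines. Counting U+0000 through U+10FFFF inclusive gives 17 × 65,536 = 1,114,112 code points, so the highest code point is 1,114,111. The names `PLANE_SIZE`, `PLANE_COUNT`, and `plane_of` below are made up for this sketch.

```python
import sys

PLANE_SIZE = 0x10000    # 65,536 code points per plane (offsets 0000 to FFFF)
PLANE_COUNT = 17        # planes 0 through 16

total_code_points = PLANE_SIZE * PLANE_COUNT
highest_code_point = total_code_points - 1


def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that a code point belongs to."""
    return code_point >> 16


print(total_code_points)                       # 1114112
print(hex(highest_code_point))                 # 0x10ffff
print(highest_code_point == sys.maxunicode)    # True
print(plane_of(0x597D))                        # 0
print(plane_of(0x1F600))                       # 1
```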

A Code Point is a numeric representation of a character.

Unicode is, in effect, a very thick dictionary that records one number for every character in the world. How the numbers are allocated or grouped does not matter here; what matters is that Unicode assigns a number to every character.

U+597D -> 好 (good)
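The mapping between a character and its code point can be checked with Python's built-in `ord` and `chr`. This is only a minimal illustration; the variable names are arbitrary.

```python
ch = "好"

code_point = ord(ch)                 # character -> number
label = f"U+{code_point:04X}"        # conventional U+XXXX notation

print(code_point)                    # 22909
print(label)                         # U+597D
print(chr(0x597D) == ch)             # True, number -> character
```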

Problem: a character that used to take one byte now needs two bytes
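The overhead is easy to see for plain English text. The sketch below uses Python's `utf-16-be` codec as a stand-in for a two-byte form; that choice is an assumption made for illustration, since for characters like these it produces exactly two bytes each.

```python
one_byte = "A".encode("ascii")
two_bytes = "A".encode("utf-16-be")   # stand-in for a two-byte-per-character form

print(one_byte.hex())    # 41
print(two_bytes.hex())   # 0041

# The same English text doubles in size.
text = "Hello"
ascii_size = len(text.encode("ascii"))
two_byte_size = len(text.encode("utf-16-be"))
print(ascii_size, two_byte_size)   # 5 10
```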

UTF-8

UTF-8 (Unicode Transformation Format, 8-bit) is an implementation of Unicode: it encodes the code points defined by the Unicode standard

UTF-8, UTF-16, and UTF-32 emerged as encodings of Unicode. UTF-8 in particular solves the problem that characters which could be represented by one byte would otherwise require two

UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character and is fully compatible with ASCII
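A short check of the 1 to 4 byte range and of ASCII compatibility, using Python's built-in `utf-8` codec. The four sample characters are arbitrary picks, one for each length.

```python
samples = [
    "A",            # U+0041, basic Latin
    "\u00e9",       # U+00E9, e with acute accent
    "\u597d",       # U+597D, the Chinese character for "good"
    "\U0001F600",   # U+1F600, an emoji outside the first plane
]

lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)   # [1, 2, 3, 4]

# Pure ASCII text produces byte-for-byte identical output in both encodings.
ascii_compatible = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(ascii_compatible)   # True
```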

  1. For a single-byte character, the first bit is set to 0 and the remaining 7 bits hold the character's Unicode code point. For the characters 0-127 used in English, UTF-8 is therefore exactly the same as ASCII, which means documents from the ASCII era open without problems as UTF-8.
  2. For a character that needs N bytes (N > 1), the first N bits of the first byte are set to 1 and bit N + 1 is set to 0. Each of the remaining N - 1 bytes starts with the two bits 10. All remaining bits are filled with the bits of the character's Unicode code point

| Unicode code point range (hex) | UTF-8 binary |
| --- | --- |
| 0000 0000 – 0000 007F | 0xxxxxxx |
| 0000 0080 – 0000 07FF | 110xxxxx 10xxxxxx |
| 0000 0800 – 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| 0001 0000 – 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |

The leading bits of the first byte therefore tell a decoder whether the character occupies 1 byte or N bytes, as laid out in the encoding table
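The two rules and the table can be turned into a small hand-written encoder. What follows is a sketch for illustration, not how any particular library does it; the function names `utf8_encode` and `utf8_sequence_length` are hypothetical, and the result is compared against Python's built-in codec.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by following the four rows of the table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point <= 0x7F:          # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:         # 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point <= 0xFFFF:        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    return bytes([                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


def utf8_sequence_length(first_byte: int) -> int:
    """Read the leading bits of a first byte to find the sequence length."""
    if first_byte >> 7 == 0b0:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid leading byte (possibly a continuation byte)")


encoded = utf8_encode(0x597D)                  # the character 好
print(encoded.hex())                           # e5a5bd
print(encoded == "好".encode("utf-8"))          # True
print(utf8_sequence_length(encoded[0]))        # 3
```

U+597D falls in the third row of the table, so its 16 bits (0101 1001 0111 1101) are split 4 + 6 + 6 across the three bytes E5 A5 BD.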
