Character set encoding – ASCII/ISO-8859-1/GBK/ Unicode /UTF-8
Why encode?
A: There are far more symbols in the world than any early standard covered, and a computer can only store characters that some agreed-upon standard maps to numbers.
Fixed-length encodings are convenient for computers to process: GBK is not a fixed-length encoding, while the original Unicode scheme (UCS-2) used a fixed two bytes per character.
ASCII – American Standard Code for Information Interchange
Represented by the lower seven bits of a byte; the earliest widely used character set, an American standard, containing 2^7 = 128 characters.
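The 7-bit limit is easy to verify in Python (a small illustration; the string is just an example):

```python
# ASCII uses only the low 7 bits of a byte: code points 0..127.
text = "Hello"
encoded = text.encode("ascii")          # one byte per character
print(list(encoded))                    # [72, 101, 108, 108, 111]
print(all(b < 128 for b in encoded))    # True: every byte fits in 7 bits
```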
Problem: cannot represent characters from other countries
ISO-8859-1
One full byte (8 bits), 2^8 = 256 characters in total, covering Latin and Western European characters, twice as many as ASCII
Problem: still cannot cover the characters of all other countries, such as Chinese
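The full 8-bit range, and its limit, can be seen with Python's built-in codecs (the strings are just examples):

```python
# ISO-8859-1 (Latin-1) uses all 8 bits of a byte: code points 0..255.
s = "café"                       # 'é' is U+00E9, inside Latin-1's range
data = s.encode("latin-1")
print(data)                      # b'caf\xe9' -- still one byte per character
print(len(data))                 # 4

# Characters outside the 256-code-point range, such as Chinese,
# cannot be encoded and raise UnicodeEncodeError.
try:
    "好".encode("latin-1")
except UnicodeEncodeError:
    print("no room for Chinese characters")
```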
GBK
Two bytes represent one Chinese character; an encoded Chinese character looks like a byte pair such as A5 B6
GB2312/GBK/GB18030
- GB2312 covers simplified characters only
- GBK can represent both traditional and simplified characters, and is backward compatible with GB2312
- GB18030 contains all Chinese characters
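The compatibility relationship in the list above can be checked with Python's built-in codecs (the character 好 is just an example):

```python
# In the GB family, every Chinese character occupies two bytes.
ch = "好"
gb2312_bytes = ch.encode("gb2312")   # simplified-Chinese subset
gbk_bytes = ch.encode("gbk")         # GBK is compatible with GB2312
print(gbk_bytes == gb2312_bytes)     # True: same bytes for GB2312 characters
print(len(gbk_bytes))                # 2 bytes per Chinese character
```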
The problem: there is still no universal encoding; every country has its own character set
Unicode
Unicode emerged to solve the problem of incompatible encoding standards around the world: its character set can represent every character in the world. The original standard used a fixed two bytes per character (UCS-2); the concrete byte formats are left to the encodings that implement it.
The Unicode character set is designed to collect all characters in the world, assigning each character a unique number known as a code point. The code space is divided into 17 planes, ranging from U+0000 to U+10FFFF, a total of 1,114,112 code points.
A code point is the numeric representation of a character.
Unicode, in other words, is a very thick dictionary that records one number for every character in the world. How the space is subdivided is not the concern here; what matters is that Unicode assigns a number to every character.
U+597D → 好 ("good")
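In Python, `ord` and `chr` expose exactly this mapping between characters and code points:

```python
# A code point is just the number Unicode assigns to a character.
print(hex(ord("好")))   # 0x597d -- the code point U+597D
print(chr(0x597D))      # 好 -- from number back to character
print(hex(ord("A")))    # 0x41 -- ASCII characters keep their old numbers
```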
Problem: a character that used to need only one byte now takes two bytes
UTF-8
UTF-8 (Unicode Transformation Format, 8-bit) is one implementation of the Unicode standard
To solve the problem that plain two-byte Unicode doubles the size of characters that fit in one byte, the encodings UTF-8, UTF-16, and UTF-32 emerged
UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character and is fully ASCII-compatible
- For a single-byte character, the first bit is set to 0 and the remaining 7 bits hold the character's Unicode code point. For the characters 0–127 this is identical to ASCII, so documents from the ASCII era open in UTF-8 encoding without any problem.
- For a character that needs N bytes (N > 1), the first N bits of the first byte are set to 1, bit N+1 is set to 0, the first two bits of each of the remaining N−1 bytes are set to 10, and the remaining bits are filled with the binary form of the character's Unicode code point.
| Unicode code point range (hex) | UTF-8 binary |
|---|---|
| 0000 0000 – 0000 007F | 0xxxxxxx |
| 0000 0080 – 0000 07FF | 110xxxxx 10xxxxxx |
| 0000 0800 – 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| 0001 0000 – 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
The leading bits of the first byte therefore tell a decoder whether a character occupies 1 byte or N bytes
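The rules in the table can be applied by hand; a short Python sketch of the three-byte case, using U+597D (好) as the example:

```python
# Manually apply the UTF-8 rules above to U+597D ("好").
cp = 0x597D                            # falls in 0000 0800 - 0000 FFFF: 3-byte form
b1 = 0b11100000 | (cp >> 12)           # 1110xxxx: top 4 bits of the code point
b2 = 0b10000000 | (cp >> 6) & 0x3F     # 10xxxxxx: middle 6 bits
b3 = 0b10000000 | cp & 0x3F            # 10xxxxxx: low 6 bits
print(bytes([b1, b2, b3]))             # b'\xe5\xa5\xbd'
print("好".encode("utf-8"))            # same bytes -- the built-in codec agrees
```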