MySQL has two UTF-8 character set implementations: utf8 and utf8mb4.
With utf8, storing emoji and some rare Chinese characters (including some rare traditional characters) fails with an error.
Why is that? This article explains the cause from the ground up.
What is a character set?
A character is any written symbol: the letters of the world's scripts, punctuation marks, emoji, digits, and so on. A character set is a collection of characters. There are many character sets, and each covers a different range of characters. Some character sets, for example, cannot represent Chinese characters.
Computers can only store binary data, so how are English letters, Chinese characters, emoji, and other characters stored?
Each character has to be mapped to binary data, for example “a” to “01100001” and “A” to “01000001”. Mapping characters to binary data is called “character encoding”, and parsing binary data back into characters is called “character decoding”.
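As a minimal sketch in Python (not part of the original article; the helper names `to_bits` and `from_bits` are made up for illustration), encoding and decoding can be observed directly:

```python
def to_bits(text: str, encoding: str = "ascii") -> str:
    """Encode text and return its bytes as a space-separated bit string."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))


def from_bits(bits: str, encoding: str = "ascii") -> str:
    """Decode a space-separated bit string back into text."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode(encoding)


lower_a = to_bits("a")                        # character -> binary (encoding)
upper_a = to_bits("A")
round_trip = from_bits("01100001 01000001")   # binary -> character (decoding)

print(lower_a)     # 01100001
print(upper_a)     # 01000001
print(round_trip)  # aA
```

Note that “a” and “A” differ in exactly one bit.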
What are the common character sets?
Common character sets include ASCII, GB2312, GBK, and UTF-8, among others.
The main differences between character sets are:
- The range of characters that can be represented
- The encoding rules
ASCII
ASCII (American Standard Code for Information Interchange) is a character set designed primarily for modern American English, and that is also its limitation.
Why doesn't ASCII cover other characters such as Chinese? Computers were invented in the United States, and when ASCII was published computing was still in its infancy and not yet widely used in other countries. As a result, ASCII was designed without regard for other languages.
ASCII defines 128 characters, of which 33 are non-printable control characters (such as carriage return and delete).
An ASCII code is stored in one byte, that is, eight bits. For example, “A” corresponds to the ASCII code “01000001”. The highest bit is always 0 (it was historically available as a check bit), and each of the remaining 7 bits can be 0 or 1, so ASCII can define 128 (2^7) characters.
Because 128 characters are too few, ASCII was later extended. The extended ASCII character set uses all 8 bits to represent a character, so it can define 256 (2^8) characters.
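The 7-bit limit can be checked with a short Python sketch. This is an illustration only; `ASCII_SIZE` and `is_ascii` are hypothetical names introduced here.

```python
# ASCII uses 7 bits, so valid code points run from 0 to 127.
ASCII_SIZE = 2 ** 7


def is_ascii(text: str) -> bool:
    """Return True if every character fits in the 7-bit ASCII range."""
    return all(ord(ch) < ASCII_SIZE for ch in text)


# Control characters are code points 0-31 plus 127 (delete).
control_count = sum(1 for cp in range(ASCII_SIZE) if cp < 32 or cp == 127)

print(ASCII_SIZE, control_count)           # 128 33
print(is_ascii("MySQL"), is_ascii("中"))   # True False

# A character outside the range cannot be encoded as ASCII at all.
try:
    "中".encode("ascii")
except UnicodeEncodeError as exc:
    print("not representable in ASCII:", exc.reason)
```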
[Figure: ASCII character encoding]
GB2312
As noted above, ASCII is only suited to modern American English. As a result, many countries developed character sets for their own languages.
GB2312 is a character set designed for simplified Chinese. It contains more than 6,700 Chinese characters, covering the vast majority of commonly used ones. However, it does not support most rare characters or traditional characters.
For English characters, GB2312 encodes the same way as ASCII, using 1 byte. Non-English characters, such as Chinese characters, require 2 bytes.
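A small Python sketch, assuming Python's built-in `gb2312` codec, shows the two encoding lengths:

```python
english = "a".encode("gb2312")
chinese = "中".encode("gb2312")

# English letters take 1 byte and match ASCII byte for byte.
print(len(english), english == "a".encode("ascii"))  # 1 True

# A Chinese character takes 2 bytes.
print(len(chinese), chinese.hex().upper())           # 2 D6D0
```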
GBK
GBK can be seen as an extension of GB2312. It is compatible with GB2312 and contains more than 20,000 Chinese characters.
The K in GBK is the initial letter of “Kuo” in “Kuo Zhan”, the pinyin for “extension”.
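The relationship between the two character sets can be sketched in Python using its built-in `gb2312` and `gbk` codecs. The sample characters and the helper `can_encode` are choices made for this illustration.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if the codec can represent every character in text."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


simplified = "龙"    # simplified form, part of GB2312
traditional = "龍"   # traditional form of the same character

# Compatibility: a GB2312 character keeps the same bytes in GBK.
same_bytes = simplified.encode("gb2312") == simplified.encode("gbk")
print(same_bytes)                              # True

# Extension: GBK covers characters that GB2312 lacks.
print(can_encode(traditional, "gb2312"))       # False
print(can_encode(traditional, "gbk"))          # True
```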
GB18030
GB18030 is fully compatible with GB2312 and GBK, and also covers the scripts of China's ethnic minorities as well as Japanese and Korean characters. It is the most comprehensive Chinese character set so far, with more than 70,000 Chinese characters in total.
BIG5
BIG5 focuses on traditional Chinese, with more than 13,000 characters.
Unicode & UTF-8 encoding
Many character sets were created to suit individual languages.
As mentioned above, character sets differ both in the range of characters they can represent and in their encoding rules. This leads to a serious problem: opening a file with the wrong encoding produces garbled characters.
For example, opening a GB2312-encoded file as UTF-8 produces garbled characters. GB2312 encodes the character “牛” (niu) as “C5A3” in hexadecimal, but decoding “C5A3” as UTF-8 yields “ţ”.
You can try encoding and decoding online at: www.haomeili.net/HanZi/ZiFuB…
This shows the nature of garbled text: different or incompatible character sets are used for encoding and decoding.
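The garbling example can be reproduced with a few lines of Python. This is a sketch using Python's built-in codecs and the character 牛 (niu):

```python
original = "牛"

# Encode with GB2312.
gb_bytes = original.encode("gb2312")
print(gb_bytes.hex().upper())         # C5A3

# Decode the same bytes with the wrong character set: garbled text.
garbled = gb_bytes.decode("utf-8")
print(garbled)                        # ţ

# Decode with the character set that was used for encoding: correct text.
restored = gb_bytes.decode("gb2312")
print(restored)                       # 牛
```

The bytes never change; only the character set used to interpret them does.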
To solve this problem, people thought, “If only there were a character set that included every character in the world!”
Unicode was born with this mission.
The Unicode character set contains almost every known character in the world. However, Unicode itself does not specify how to store these characters (that is, how to represent them as binary data).
That is the job of UTF-8 (8-bit Unicode Transformation Format) and the similar UTF-16 and UTF-32.
UTF-8 uses 1 to 4 bytes per character, UTF-16 uses 2 or 4 bytes per character, and UTF-32 uses a fixed 4 bytes per character.
UTF-8 chooses the encoding length according to the character. English characters, for example, need only 1 byte, which is enough to cover the ASCII character set. For English characters, therefore, UTF-8 and ASCII encodings are identical.
UTF-32 has the simplest rules, but its drawback is obvious: for characters such as English letters it takes four times as much space as UTF-8.
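The byte counts of the three encodings can be compared with a short Python sketch. The helper `byte_lengths` and the sample characters are chosen for this illustration; the little-endian codec variants are used so that no byte-order mark is counted.

```python
def byte_lengths(ch: str) -> dict:
    """Return how many bytes one character takes in each Unicode encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # -le: no byte-order mark
        "utf-32": len(ch.encode("utf-32-le")),
    }


for sample in ["A", "中", "😘"]:
    print(sample, byte_lengths(sample))
# A  {'utf-8': 1, 'utf-16': 2, 'utf-32': 4}
# 中 {'utf-8': 3, 'utf-16': 2, 'utf-32': 4}
# 😘 {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}

# For English characters, UTF-8 bytes are identical to ASCII bytes.
print("A".encode("utf-8") == "A".encode("ascii"))  # True
```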
UTF-8 is the most widely used character encoding.
MySQL character set
MySQL supports many character encodings, such as UTF-8, GB2312, GBK, and BIG5.
You can run the SHOW CHARSET command to list them.
In general, we recommend using UTF-8 as the default character encoding.
There is, however, a small pitfall.
MySQL has two UTF-8 character sets:
- utf8: supports only 1 to 3 bytes per character. In utf8, common Chinese characters take 3 bytes, while digits, English letters, and ASCII symbols take 1 byte. Emoji, however, take 4 bytes, as do some rare Chinese characters (including some rare traditional characters), so utf8 cannot store them.
- utf8mb4: the complete implementation of UTF-8, the real thing! It supports up to 4 bytes per character, so it can store emoji.
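Whether a string is affected by the 3-byte limit can be checked before it reaches the database. The Python sketch below only approximates the rule described above by counting UTF-8 bytes per character; it does not call MySQL, and the helper names `max_utf8_bytes` and `fits_mysql_utf8` are hypothetical.

```python
def max_utf8_bytes(text: str) -> int:
    """Return the largest number of UTF-8 bytes used by any single character."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


def fits_mysql_utf8(text: str) -> bool:
    """True if no character needs more than 3 bytes (the utf8 limit above)."""
    return max_utf8_bytes(text) <= 3


print(fits_mysql_utf8("guide"))     # True:  1 byte per character
print(fits_mysql_utf8("中文"))      # True:  3 bytes per character
print(fits_mysql_utf8("guide😘"))   # False: the emoji needs 4 bytes
print(fits_mysql_utf8("𠮷"))        # False: a rare Chinese character, 4 bytes
```

Strings for which `fits_mysql_utf8` returns False need a utf8mb4 column.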
Why are there two UTF-8 implementations? MySQL's original utf8 character set was limited to 3 bytes per character, and utf8mb4 was added later (in MySQL 5.5.3) to cover the full 4-byte range.
Therefore, if you need to store emoji or rare characters in a MySQL database, the character set must be utf8mb4 rather than utf8. Otherwise the insert fails with an error.
Let's demonstrate (environment: MySQL 5.7+).
We create a table whose character set is utf8.
CREATE TABLE `user` (
  `id` varchar(66) CHARACTER SET utf8 NOT NULL,
  `name` varchar(33) CHARACTER SET utf8 NOT NULL,
  `phone` varchar(33) CHARACTER SET utf8 DEFAULT NULL,
  `password` varchar(100) CHARACTER SET utf8 DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Inserting a row whose name contains emoji then fails. The bytes \xF0\x9F\x98\x98 in the error message correspond to 😘.
INSERT INTO `user` (`id`, `name`, `phone`, `password`) VALUES ('A00003', 'guide😘😘', '181631312312', '123456');
The following error message is displayed:
Incorrect string value: '\xF0\x9F\x98\x98\xF0\x9F...' for column 'name' at row 1
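The bytes quoted in the error message can be decoded to see which character was rejected. This is a small Python sketch; the variable names are illustrative.

```python
# The first four bytes reported in the error message.
reported = b"\xF0\x9F\x98\x98"

char = reported.decode("utf-8")
code_point = f"U+{ord(char):04X}"

print(char, code_point, len(reported))  # 😘 U+1F618 4
```

The character takes 4 bytes in UTF-8, one more than a utf8 column accepts.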