ACMUG solicits original technical articles. For details, please add A_CMUG or scan the QR code at the end of this article to follow our wechat official account. Please send your articles to: [email protected]. Description: Knowledge is priceless, labor is paid, ACMUG guest writer reward plan (revised version)
About the author: Zhou Xiao
Common network ID Seanlook. I worked as Oracle DBA for TP-Link for 2 years before I became a full-time MySQL manager. I will write down some problems and experience in my work and put them on my website http://seanlook.com when I am free
01
Utf8 is similar to UTf8MB4
Look at the official manual instructions: https://dev.mysql.com/doc/refman/5.6/en/charset-unicode-utf8mb4.html
The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. The utf8mb4 character set uses a maximum of four bytes per character supports supplementary characters:
-
For a BMP character, utf8 and utf8mb4 have identical storage characteristics: same code values, same encoding, same length.
-
For a supplementary character, utf8 cannot store the character at all, whereas utf8mb4 requires four bytes to store it. Because utf8 cannot store the character at all, you have no supplementary characters in utf8 columns and need not worry about converting characters or losing data when upgrading utf8 data from older versions of MySQL.
MySQL added utF8MB4 encoding after 5.5.3. Mb4 is most Bytes 4. Utf8mb4 is a superset of UTF8 and fully compatible with UTF8, which can store more characters in four bytes.
But regardless of databases, standard UTF-8 character set encodings can encode 21 characters in 1 to 4 bytes, which covers almost every language in the world that can be seen. However, utF8 in MySQL is up to 3 bytes long, which means it supports only the basic multitext plane of Unicode (U+0000 to U+FFFF). It contains most international characters, including control characters, Latin characters, Chinese characters, Japanese characters and Korean characters, but not all of them. The most common are emoji, which are commonly used on mobile phones, and some less commonly used Chinese characters, such as “shu”, which require four bytes to be encoded.
Note: QQ inside the built-in expression does not count, it is through a special mapping to a GIF image. The general input method comes with it.
That is, when you are required to store these expressions or wide characters in your database, you can define the field as UTF8MB4. At the same time, note that the connection character set should also be set to UTF8MB4. Otherwise, Incorrect String value will appear in strict mode: / xF0 xA1 x8B/xBE/xE5 / xA2… Errors such as for column ‘name’ will truncate subsequent data in non-strict mode.
Tip: Another way to store emojis is not to care about the database table character set, as long as the connected character set uses latin1, but trust me, you don’t want to do this because the mixed character set management is extremely poor and the storage space is very large (you can wonder why).
02
How to choose between utF8MB4_ unicode_ CI and UTF8MB4_ general_ ci
In addition to storage, characters also need to be collated or sized, involving collation corresponding to the encoding character set. Utf8mb4_unicode_ci utf8MB4_general_ci utf8MB4_ci utf8MB4_general_ci utf8MB4_ci utf8MB4_general_ci utf8MB4_ci What’s the difference between UTF8_general_CI and UTF8_unicoDE_ci
Mainly from two aspects of sorting accuracy and performance:
-
accuracy
Utf8mb4_unicode_ci is based on standard Unicode to sort and compare, and can sort accurately between various languages
Utf8mb4_general_ci does not implement Unicode collation, and the collation results may not be expected in the case of certain special languages or characters.
But in the vast majority of cases, the order of special characters does not have to be that precise. Unicode, for example, considers ß, œ for SS, and OE; In general, these are thought to be S, e, or AAAa āă, respectively, equal to A.
-
performance
Utf8mb4_general_ci is faster for comparison and sorting
Utf8mb4_unicode_ci In special cases, Unicode collation implements a slightly more complex sorting algorithm to be able to handle special characters.
But in the vast majority of cases, no such complex comparison occurs. General may theoretically be faster than Unicode, but it is far less of a performance consideration than today’s cpus, indexing and SQL design are. My personal recommendation is UTf8MB4_unicode_ci, and 8.0 will most likely use the default rule as well.
This is another way of telling us not to use fields as primary keys or unique indexes that might generate garbled characters. I came across a case where the URL was used as the only index, but it could be garbled, making it very difficult to fix them later.
03
How to convert from UTF8 to UTF8MB4
3.1 “fake” conversion
If your table definition and concatenation character set are UTF8, execute directly on your table
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8mb4;
The character type of all columns on the table can be changed to UTF8MB4, and the default character set of the table definition can also be changed. When connecting, use set Names UTF8MB4 to insert four-byte characters. (If you still use UTF8 connection, it’s fine as long as there are no four-byte characters).
Mysql 5.6 native ONLINE DDL; mysql 5.6 native ONLINE DDL; mysql 5.6 native ONLINE DDL; Second, it may automatically define the table field type, such as VARCHAR being converted to MEDIUMTEXT, which can be modified to specify the original type.
ALTER TABLE TBL_name DEFAULT CHARACTER SET UTF8MB4 ALTER TABLE TBL_name DEFAULT CHARACTER SET UTF8MB4
Most importantly, the latin1 character set you are using to write historical data, the table definition is latin1 or UTF8, do not expect to use ALTER… CONVERT … Utf8 enables you to achieve the purpose of historical Chinese data reading, no egg use, honest logic dump. That’s why I call it a “fake” conversion
3.2 character – set – server
Once you have decided to use utf8mb4, it is strongly recommended that you modify the server side character-set-server= UTf8mb4. Different languages handle this differently. However, the Java driver relies on the character-set-server option, described below.
Also be careful of some special options, such as Tencent cloud CDB connection character set to set a pit. Setting global init_connect is not recommended for individuals.
04
Key 768 long error
When the character set is changed from UTF8 to UTF8MB4, the most common problem is that the index key is too long.
For COMPACT or REDUNDANT table row formats, InnoDB has a maximum of 768 bytes for a single index, and fields are defined as the number of characters that can be stored. For example, VARCHAR(200) is used to store 200 Chinese characters, and indexes are defined as the maximum length of the character set type. Error 1071: utf8 maxbytes=3 utf8MB4 maxbytes=4 utf8MB4 maxbytes=4 utf8 maxbytes=3 utf8MB4 maxbytes=4 Specified key was too long; Max Key Length is 767 bytes.
COMPRESSED and DYNAMIC formats are not restricted, but it is still not recommended that the index is too long, wasting space and CPU search resources.
If you have defined a field longer than this length, you can add a prefix index. If you cannot add a prefix index (like a unique index), you can change the character set of the field back to UTF8 or latin1.
But, (banging on the blackboard, very important), To guard against Illegal mix of collations (UTf8_general_ci,IMPLICIT) and (UTf8MB4_general_ci, utfcoercible) for operation ‘=’ error: The connection character set uses UTF8MB4, but if the SELECT/UPDATE WHERE condition contains columns of type UTF8 and characters to the right of the condition that do not belong to UTF8, this exception will be raised. I stepped in the hole.
One more friendly note: Key_len in the EXPLAIN result refers to the length of the search index, in bytes, and is measured in the maximum number of single-character bytes supported by the character set, which is one reason why INDEX_LENGTH is inflated.
05
C/C++ memory allocation problem
This is a thorny issue on our side of the development. C or C++ uses the libmysqlClient dynamic library on Linux to connect to MySQL. After obtaining data, the program transfers data according to a user-defined network protocol and a fixed number of bytes defined by the MySQL field. C++ memory allocation for single character from utf8 to utf8mb4 is increased from 3 to 4.
When using UTF8MB4, the official recommendation is to use vARCHar instead of char as much as possible to reduce fixed storage space waste (see here for the choice between char and varchAR). However, the size of vARCHAR can not be arbitrarily increased during the development and design of the table, although it is a variable length, but when the client defines variables to obtain data, it is based on the defined length, rather than the actual length. Allocate as needed to avoid using too much memory for your program.
06
Java driver use
The utF-8 encoding in Java supports 4 bytes, so mb4 is not required. However, if you read and write emoji from MySQL, the MySQL driver should be 5.1.13 or later. The database connection is still characterEncoding=UTF-8.
But that’s not all. There’s a big hole. The official manual also reads:
Connector/J did not support UTF8MB4 for Servers 5.5.2 and newer.
Connector/J now auto-detects servers configured with character_set_server=utf8mb4 or treats the Java encoding utf -8 passed
using characterEncoding=… as utf8mb4 in the SET NAMES= calls it makes when establishing the connection. (Bug #54175)
The Java driver automatically checks the server character_set_server configuration. If it is UTF8MB4, the driver sets SET NAMES UTF8MB4 when establishing the connection. Other languages, however, do not rely on such features.
07
Master/slave replication error
I haven’t encountered this problem, I just read the official documents and have seen similar technical articles.
Some character sets are not supported because the secondary library version is lower than the primary library version. Or manually changing the character set definition of a table or field from a library can cause an exception.
08
Join query problem
MySQL > alter table character set error: MySQL > alter table character set error: MySQL > alter table character set error: MySQL > alter table character set error: MySQL > alter table character set error: MySQL > alter table character set error:
CREATE TABLE t1 (
f_id varchar ( 20 ) NOT NULL ,
f_action char ( 25 ) NOT NULL DEFAULT ” COMMENT ” ,
PRIMARY KEY ( `f_id` ),
) ENGINE = InnoDB DEFAULT CHARSET =utf8 ROW_FORMAT=DYNAMIC;
CREATE TABLE t1_copy_mb4 (
f_id varchar ( 20 ) CHARACTER SET utf8mb4 NOT NULL ,
f_action char ( 25 ) NOT NULL DEFAULT ” COMMENT ” ,
PRIMARY KEY ( `f_id` ),
) ENGINE = InnoDB DEFAULT CHARSET =utf8 ROW_FORMAT=DYNAMIC;
1. EXPLAIN extended select * from t1 INNER JOIN t1_copy_mb4 t2 on t1.f_id=t2.f_id where t1.f_id= ‘421036’ ;
2. EXPLAIN extended select * from t1 INNER JOIN t1_copy_mb4 t2 on t1.f_id=t2.f_id where t2.f_id= ‘421036’ ;
Corresponding to the screenshots of 1,2 above:
Where 2 has convert:
-
(convert(t1.f_id using UTf8MB4) = ‘421036’)
The explanation can be found at the beginning of the website:
Similarly, the following comparison in the WHERE clause works according to the collation of utf8mb4_col:
SELECT * FROM utf8_tbl, utf8mb4_tbl
WHERE utf8_tbl.utf8_col = utf8mb4_tbl.utf8mb4_col;
Only index failures occur in utF8MB4 column to the left of the condition. (See here for more information about implicit type conversions in MySQL.)
09
reference
-
https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-conversion.html
-
http://forums.mysql.com/read.php?103,187048,188748#msg-188748
-
Why are we using utf8mb4_general_ci and not utf8mb4_unicode_ci?
-
How to support full Unicode in MySQL databases
-
10 minutes Learn to understand and solve the problem of MySQL garbled characters
Note: Copyright of technical articles included in ACMUG belongs to the original author. If you have any questions, please contact the author.
Read the forward, hand left lingering fragrance. Focus on us and make progress together.
Pay attention to the public account of ACMUG, participate in community activities, exchange open source technology, share learning experience and make progress together.