This article was published by horstxu

Mysql > show create table
show create table

CREATE TABLE `table1` (
    `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
    `field1` text COLLATE utf8_unicode_ci NOT NULL COMMENT 'field 1'.`field2` varchar(128) COLLATE utf8_unicode_ci NOT NULL DEFAULT ' ' COMMENT Fields' 2 ',
    PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8_unicode_ci;
Copy the code

We can read most of the fields, but today’s topic is the COLLATE keyword. What does utf8_unicoDE_CI mean after this value? Use this topic to test a DBA during the interview, should be able to stump a large number of people.

What is COLLATE used for?

Development using PHPMyAdmin may look familiar because the Chinese header already gives the answer:

Utf8_unicode_ci is actually a rule for sorting. For columns of character type in mysql, such as VARCHAR, CHAR, and TEXT, a COLLATE type is required to tell mysql how to sort and compare the columns. In short, COLLATE affects the ORDER of ORDER BY statements, filters BY less-than signs in WHERE conditions, and queries with **DISTINCT**, **GROUP BY**, and **HAVING**. In addition, when mysql creates an index, if the index column is of character type, it will affect the index creation, but the effect is not perceptible. In general, wherever character type comparison or sorting is involved, COLLATE is involved.

The difference between various collates

COLLATE is usually related to data encoding (CHARSET). Generally, each CHARSET has multiple collates that it supports, and each CHARSET specifies a COLLATE as the default value. For example, the default COLLATE of Latin1 is latin1_Swedish_ci, the default COLLATE of GBK is GBk_chinese_ci, and the default COLLATE of UTF8MB4 is UTf8MB4_general_ci.

As an aside, mysql has utF8 and UTF8MB4 encodings. In mysql, please forget ** UTF8 ** and always use ** UTf8MB4 **. This is a legacy of mysql. Utf8 in mysql only supports characters of up to 3bytes. For some characters of up to 4bytes, utf8mb4 is not supported.

Many collates come with _CI, short for Case Insensitive, which means that “A” and “A” are sorted and compared indiscriminately. Selection * from table1 where field1=”a” select * from table1 where field1=”a” Meanwhile, collates with the _cs suffix are Case Sensitive.

Run the show collation command in mysql to view all collates supported by mysql. In the case of UTF8MB4, all collates supported by this encoding are shown below.

In the picture, we can see the order of languages in many countries. In China, utF8MB4_general_CI (default), UTF8MB4_unicode_CI, and UTF8MB4_bin are commonly used. Let’s explore the differences between these three:

First, utf8MB4_bin compares all characters as binary strings, and then compares them from the highest bits to the lowest bits. So obviously it’s case sensitive.

There is no difference between UTF8MB4_unicode_CI and UTF8MB4_general_ci for Chinese and English. For the system we developed for domestic use, either one is fine. For some western letters, utf8MB4_unicode_ci is more appropriate than UTf8MB4_general_ci. General is an older standard in mysql. For example, the German letter “ß” is equivalent to the letter “ss” in UTF8MB4_unicode_ci (German custom), but it is equivalent to the letter “s “in UTF8MB4_general_ci. However, the subtle differences between the two encodings are hard to perceive for normal development. We rarely use text fields to sort things directly, but at the very least, can a letter or two in the wrong order really have catastrophic consequences for the system? According to various posts and discussions found on the Internet, more people recommend using UTF8MB4_unicode_CI, but for the system that uses the default value, I do not reject it very much and do not think it is a big problem. Conclusion: UTF8MB4_unicoDE_CI is recommended, and there is no need to spend time retrofitting systems that already use UTF8MB4_general_CI.

Note also that since mysql 8.0, the mysql default CHARSET has been changed from Latin1 to UTF8MB4 (see link), and the default COLLATE has been changed to UTF8MB4_0900_ai_ci. Utf8mb4_0900_ai_ci is basically a further refinement of Unicode. 0900 refers to the Unicode Collation Algorithm version. Ai means accent insensitive, such as E, E, E, e and E are insensitive. Related references 1, Related references 2

COLLATE Sets the level and priority

Setting COLLATE can be specified at the example level, library level, table level, column level, and SQL level. The instance-level COLLATE setting is the collation_connection system variable in the mysql configuration file or startup directive.

Set COLLATE at library level as follows:

CREATE DATABASE <db_name> DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Copy the code

If CHARSET and COLLATE are not set at the library level, the library level default CHARSET and COLLATE are set at the instance level. In versions below mysql8.0, the default CHARSET is Latin1 and the default COLLATE is latin1_swedish_ci if you don’t change anything. Since mysql8.0, the default CHARSET has been changed to UTf8MB4 and the default COLLATE has been changed to UTf8MB4_0900_AI_CI.

CREATE TABLE COLLATE (COLLATE); CREATE TABLE COLLATE (COLLATE);

CREATE TABLE (...... ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;Copy the code

If CHARSET and COLLATE are not set at the table level, the table level inherits the CHARSET and COLLATE at the library level.

The column level is specified when the column is declared in CREATE TABLE, for example

CREATE TABLE (

`field1`VARCHAR (64CHARACTER SET UTF8MB4 COLLATE UTF8MB4_general_ci NOT NULL DEFAULT' ',...) ...Copy the code

If CHARSET and COLATE are not set for the column level, the column level inherits the CHARSET and COLLATE of the table level.

Finally, you can override the COLLATE setting of any database table column by declaring the COLLATE setting when you write a SQL query.

SELECT DISTINCT field1 COLLATE utf8mb4_general_ci FROM table1;

SELECT field1, field2 FROM table1 ORDER BY field1 COLLATE utf8mb4_unicode_ci;
Copy the code

If all displays are set, the order of priority is SQL Statement > Column Level Settings > Table level Settings > Library level Settings > Instance level Settings. That is, a COLLATE specified on a column can override a COLLATE specified on a table, and a COLLATE specified on a table can override a library-level COLLATE. If not specified, the next-level Settings are inherited. If no COLLATE is specified, the COLLATE of the column is the same as that of the table.

So that’s COLLATE for mysql. However, in the system design, we should try to avoid the system to rely heavily on the sorting results of Chinese fields, in mysql query should also try to avoid the use of Chinese query conditions.

This article has been authorized by the author to Tencent Cloud + community, more original text pleaseClick on the

Search concern public number “cloud plus community”, the first time to obtain technical dry goods, after concern reply 1024 send you a technical course gift package!