MySQL Character set introduction
In simple terms, the character set in MySQL is a set of symbols and encodings. For example, we have an alphabet that contains the letters A through Z. We assign a number to each letter, for example, A = 1, b = 2, etc. So a is a symbol, and the number 1 is the code. The combination of all letters from A to Z and their corresponding codes is called a character set.
A collation is a set of rules for comparing characters in a character set. Similarly, if we want to compare two string values, for example, a and B. The easiest way to do this is to use their code, so for example, a is 1, b is 2, and we can see from the code that 1 is less than 2, so we can say that a is less than B.
MySQL supports many character sets, and each character set uses at least one collation that defines how characters in the character set are compared. We can use the following statement to view all character sets available in the MySQL database server:
SHOW CHARACTER SET; .+----------+-----------------------------+---------------------+--------+
| Charset | Description | Default collation | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |
| dec8 | DEC West European | dec8_swedish_ci | 1 |
| cp1251 | Windows Cyrillic | cp1251_general_ci | 1 |
| utf16 | UTF- 16 Unicode | utf16_general_ci | 4 |
| utf16le | UTF- 16LE Unicode | utf16le_general_ci | 4 |
| cp932 | SJIS for Windows Japanese | cp932_japanese_ci | 2 |
| eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci | 3 |
+----------+-----------------------------+---------------------+--------+
Copy the code
Or use the following statement to view the collation.
show collation like 'utf8%';
Copy the code
The default character set
In general, character sets in MySQL work like this:
- When the database is created, the character set is derived from the server character_set_server variable.
- When you create a table, the character set is derived from the database.
- When a column is created, the character set is derived from the table.
1. View the CHARACTER_set_server variable
SHOW GLOBAL VARIABLES LIKE 'character_set_server%';
+----------------------+---------+
| Variable_name | Value |
+----------------------+---------+
| character_set_server | utf8mb4 |
+----------------------+---------+
Copy the code
2. View the database character set
1.create database test;
2.use test;
3.select @@character_set_database, @@collation_database;
+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| utf8mb4 | utf8mb4_general_ci |
+--------------------------+----------------------+
Copy the code
The third command is written as follows:
SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'test'
Copy the code
Note, however, that in Mysql8.0, the default character set has changed from latin1 to UTF8MB4.
3. Change the default character set.
To do a test, first change the default character_set_server, then create the database and check the character encoding, but note that the character_set_server change does not appear to be immediately effective, you need to disconnect the connection.
1.set global character_set_server = utf8;
2.create database utf8database
3.SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'utf8database';
+----------------------------+------------------------+
| DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME |
+----------------------------+------------------------+
| utf8 | utf8_general_ci |
+----------------------------+------------------------+
Copy the code
To obtain the length of the
MySQL provides the LENGTH() function to return the LENGTH of a string in bytes, and the CHAR_LENGTH() function to find the LENGTH of a string in characters. If we use the LENGTH() function to calculate the LENGTH of the string of characters, we get a higher result than CHAR_LENGTH(). For example, a Chinese character is 3 bytes, and a letter is 1 byte.
1selectLength (" nuggets "),char_length(" nuggets "), length (" ABC ");+------------------+-----------------------+---------------+
|Length (" nuggets ")| char_length(" nuggets ")| length("abc") |
+------------------+-----------------------+---------------+
| 6 | 2 | 3 |
+------------------+-----------------------+---------------+
1 row in set (0.000 sec)
Copy the code
Of course we can CONVERT a string to a specific character set using the CONVERT function. For example, in the following example, it converts the UTF8-encoded string to GBK, where a Chinese character is 2 bytes, so the final result is 4.
select length(convert(" nuggets"using gbk));
+-------------------------------------+
| length(convert(" nuggets"using gbk)) |
+-------------------------------------+
| 4 |
+-------------------------------------+
1 row in set (0.000 sec)
Copy the code
Controls the configuration of communication between the server and client
Some MySql clients do not have the ability to support multiple character sets at the same time and can only use one character set at a time. Character set conversion between client and server is controlled by the following system variables.
-
character_set_server
MySql service default character set
-
character_set_database
Database default character set
-
character_set_client
MySql assumes that the client is using the CHARACTER_set_client variable as the character set for sending statements.
-
character_set_connection
The server converts the statement sent by the client from character_set_client to Character_set_connection
-
character_set_result
The server query results are converted to this character set when returned to the client. This includes result data (such as column values), result metadata (such as column names), and error messages. If you do not want to convert, you can set the character_set_results value to NULL or binary.
Trap utf8
Try not to use UTF-8 in Mysql, if you do use utF8MB4, because Mysql UTF-8 encoding is different from the real UTF-8 encoding. This is because it does not provide full Unicode support, which can lead to data loss or security issues, such as emoticons or special characters that cannot be displayed. Utf8mb4 is the “safest” character set because it supports 4-byte Unicode, while UTF8 only supports a maximum of 3 characters.
In addition, UTF8MB4_general_CI is a set of simplified collation rules, just for speed, and UTF8MB4_unicode_CI can sort accurately in multiple languages.