MySQL Character set introduction

In simple terms, the character set in MySQL is a set of symbols and encodings. For example, we have an alphabet that contains the letters A through Z. We assign a number to each letter, for example, A = 1, b = 2, etc. So a is a symbol, and the number 1 is the code. The combination of all letters from A to Z and their corresponding codes is called a character set.

A collation is a set of rules for comparing characters in a character set. Similarly, if we want to compare two string values, for example, a and B. The easiest way to do this is to use their code, so for example, a is 1, b is 2, and we can see from the code that 1 is less than 2, so we can say that a is less than B.

MySQL supports many character sets, and each character set uses at least one collation that defines how characters in the character set are compared. We can use the following statement to view all character sets available in the MySQL database server:

SHOW  CHARACTER SET; .+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp1251   | Windows Cyrillic            | cp1251_general_ci   |      1 |
| utf16    | UTF- 16 Unicode              | utf16_general_ci    |      4 |
| utf16le  | UTF- 16LE Unicode            | utf16le_general_ci  |      4 |
| cp932    | SJIS for Windows Japanese   | cp932_japanese_ci   |      2 |
| eucjpms  | UJIS for Windows Japanese   | eucjpms_japanese_ci |      3 |
+----------+-----------------------------+---------------------+--------+

Copy the code

Or use the following statement to view the collation.

show collation like 'utf8%';
Copy the code

The default character set

In general, character sets in MySQL work like this:

  • When the database is created, the character set is derived from the server character_set_server variable.
  • When you create a table, the character set is derived from the database.
  • When a column is created, the character set is derived from the table.

1. View the CHARACTER_set_server variable

SHOW GLOBAL VARIABLES LIKE 'character_set_server%';

+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| character_set_server | utf8mb4 |
+----------------------+---------+
Copy the code

2. View the database character set

1.create database test;

2.use test; 

3.select @@character_set_database, @@collation_database;

+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| utf8mb4                  | utf8mb4_general_ci   |
+--------------------------+----------------------+

Copy the code

The third command is written as follows:

SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME  FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'test'
Copy the code

Note, however, that in Mysql8.0, the default character set has changed from latin1 to UTF8MB4.

3. Change the default character set.

To do a test, first change the default character_set_server, then create the database and check the character encoding, but note that the character_set_server change does not appear to be immediately effective, you need to disconnect the connection.

1.set global character_set_server = utf8;

2.create database utf8database

3.SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME  FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'utf8database';

+----------------------------+------------------------+
| DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME |
+----------------------------+------------------------+
| utf8                       | utf8_general_ci        |
+----------------------------+------------------------+

Copy the code

To obtain the length of the

MySQL provides the LENGTH() function to return the LENGTH of a string in bytes, and the CHAR_LENGTH() function to find the LENGTH of a string in characters. If we use the LENGTH() function to calculate the LENGTH of the string of characters, we get a higher result than CHAR_LENGTH(). For example, a Chinese character is 3 bytes, and a letter is 1 byte.

1selectLength (" nuggets "),char_length(" nuggets "), length (" ABC ");+------------------+-----------------------+---------------+
|Length (" nuggets ")| char_length(" nuggets ")| length("abc") |
+------------------+-----------------------+---------------+
|                6 |                     2 |             3 |
+------------------+-----------------------+---------------+
1 row in set (0.000 sec)

Copy the code

Of course we can CONVERT a string to a specific character set using the CONVERT function. For example, in the following example, it converts the UTF8-encoded string to GBK, where a Chinese character is 2 bytes, so the final result is 4.

select length(convert(" nuggets"using gbk));
+-------------------------------------+
| length(convert(" nuggets"using gbk))   |
+-------------------------------------+
|                                   4 |
+-------------------------------------+
1 row in set (0.000 sec)

Copy the code

Controls the configuration of communication between the server and client

Some MySql clients do not have the ability to support multiple character sets at the same time and can only use one character set at a time. Character set conversion between client and server is controlled by the following system variables.

  1. character_set_server

    MySql service default character set

  2. character_set_database

    Database default character set

  3. character_set_client

    MySql assumes that the client is using the CHARACTER_set_client variable as the character set for sending statements.

  4. character_set_connection

    The server converts the statement sent by the client from character_set_client to Character_set_connection

  5. character_set_result

    The server query results are converted to this character set when returned to the client. This includes result data (such as column values), result metadata (such as column names), and error messages. If you do not want to convert, you can set the character_set_results value to NULL or binary.

Trap utf8

Try not to use UTF-8 in Mysql, if you do use utF8MB4, because Mysql UTF-8 encoding is different from the real UTF-8 encoding. This is because it does not provide full Unicode support, which can lead to data loss or security issues, such as emoticons or special characters that cannot be displayed. Utf8mb4 is the “safest” character set because it supports 4-byte Unicode, while UTF8 only supports a maximum of 3 characters.

In addition, UTF8MB4_general_CI is a set of simplified collation rules, just for speed, and UTF8MB4_unicode_CI can sort accurately in multiple languages.