Almost all articles on the Internet regard “UTF8” as the real UTF-8, including my previous articles and projects (face covering); So I hope more friends can read this article.
I believe that there are many people in the same boat as me, this is inevitable.
So, when you set up MySQL and MariaDB databases in the future, remember to change the database encoding to UTF8MB4. One of these days, a programmer in your place or your boss will find out that you’re a technical genius. Full version Java interview questions address: Java backend questions integration
One, error review
Write emoji text directly into SQL, execute insert statement error;
INSERT INTO ` csjdemo `. ` student ` (` ID `, ` NAME `, ` SEX `, ` AGE `, ` CLASS `, ` GRADE `, ` HOBBY `) VALUES (' 20 ', 'Chen ha ha 😓', 'male', '20', 'Class 181 ',' Grade 9 ', 'Film '); Copy the codeCopy the code
[Err] 1366 – Incorrect string value: ‘\xF0\x9F\x98\x93’ for column ‘NAME’ at row 1
Alter database encoding, system encoding and table field encoding format → UTF8MB4, then you can:
INSERT INTO ` student ` (` ID `, ` NAME `, ` SEX `, ` AGE `, ` CLASS `, ` GRADE `, ` HOBBY `) VALUES (null, 'ha ha Chen 😓 😓' and 'male', '20', 'Class 181 ',' Grade 9 ', 'Film '); Copy the codeCopy the code
MySQL utf8
MySQL’s “UTF8” is not really UTF-8.
In MySQL, the “UTF8” encoding only supports up to three bytes per character, while true UTF-8 supports up to four bytes per character.
In UTF8 encoding, Chinese is 3 bytes, other numbers, English, symbols are 1 byte.
However, emoji symbols account for 4 bytes, as do some more complex characters and traditional characters. Write failure, should be changed to UTf8MB4.
As shown in the figure above, this is the data stored after the encoding is changed to UTF8MB4, you can clearly compare the number of characters and bytes. Because of this, 4 bytes of content into the UTF8 encoding, certainly not, insert ah, is it (Pan hand).
MySQL has never fixed this bug, and in 2010 they released a character set called “UTf8MB4” that cleverly circumvents the problem.
Of course, they didn’t advertise the new character set (probably because they were embarrassed by the bug), so developers are still being advised to use “UTF8” on the web, but that advice is wrong.
1. Utf8mb4 is the real UTF-8
Yes, MySQL’s “UTF8MB4” is the real “UTF-8”.
MySQL’s “UTF8” is a “proprietary encoding” that can encode only a few Unicode characters.
All MySQL and MariaDB users using “UTF8” should switch to “UTf8MB4” and never use “UTF8” again.
So what is coding? What is UTF-8?
As we all know, computers use zeros and ones to store text. For example, if the character “C” is saved as “01000011”, then the computer needs to go through two steps to display this character:
- The computer reads “01000011” and gets the number 67 because 67 is encoded as “01000011”.
- The computer looked for 67 in the Unicode character set and found “C”.
The same:
- My computer maps “C” to 67 in the Unicode character set.
- My computer encoded 67 as “01000011” and sent it to the Web server.
Almost all web applications use the Unicode character set because there is no reason to use any other character set.
The Unicode character set contains millions of characters. The simplest encoding is UTF-32, which uses 32 bits per character. This is the easiest way to do it, because computers have always treated 32 bits as numbers, and that’s what computers are good at. The problem is, it’s a waste of space.
Utf-8 saves space. In UTF-8, the character “C” requires only 8 bits, and some infrequently used characters, such as “😓”, require 32 bits. Other characters may use 16 or 24 bits. An article like this one, if encoded in UTF-8, would take up about a quarter of the space of UTF-32.
A brief history of UTF8
Why did MySQL developers disable “UTF8”? The answer may be found in the MySQL version commit log.
MySQL has supported UTF-8 since version 4.1, in 2003, and the UTF-8 standard used today (RFC 3629) came later.
Older versions of the UTF-8 standard (RFC 2279) supported up to 6 bytes per character. On March 28, 2002, MySQL developers used RFC 2279 in the first MySQL 4.1 preview release.
In September of that year, they made a tweak to the MySQL source code: “UTF8 now only supports 3-byte sequences at most.”
Who submitted the code? Why would he do that? That question is unknown. After moving to Git (MySQL originally used BitKeeper), many of the submitter names in the MySQL code base were lost. There is no clue on the September 2003 mailing list to explain the change.
But we can try to guess:
In 2002, MySQL made the decision that if users could ensure that every row of a data table used the same number of bytes, MySQL would get a big performance boost. To do this, the user needs to define text columns as “CHAR”, each of which always has the same number of characters. If the number of characters inserted is less than the defined number, MySQL will fill in the following space. If the number of characters inserted is more than the defined number, the following portion will be truncated.
MySQL developers initially experimented with UTF-8 using 6 bytes per character, CHAR(1) using 6 bytes, CHAR(2) using 12 bytes, and so on.
It should be said that they did the right thing in the first place, but that version was never released. But that’s what it says in the documentation, and it’s widely circulated, and everyone who knows UTF-8 agrees with what it says in the documentation.
But obviously, MySQL developers or vendors are worried about users doing two things:
- Using CHAR to define columns (CHAR is ancient by now, but it was faster to use it in MySQL back then, though not since 2005).
- Set the CHAR column encoding to “UTF8”.
My guess is that the MySQL developers were trying to help users who wanted a win-win in space and speed, but they screwed up the “UTF8” coding.
So the result is no winner. Users who want to win both space and speed actually use more space and slower than expected when they use the CHAR column “UTF8”. Users who want correctness can’t save characters like “😓” when they use “utf8” encoding because “😓” is four bytes long.
After the invalid character set was published, MySQL could not fix it because it would require all users to rebuild their databases. Finally, MySQL rereleased “UTF8MB4” in 2010 to support true UTF-8.
Third, summary
Recently in view of the Internet company interview asked knowledge points, summed up the Java programmer interview involves most of the interview questions and answers to share with you, I hope to help you review before the interview and find a good job, but also save you on the Internet to search for information time to learn.
Content covers: Java, MyBatis, ZooKeeper, Dubbo, Elasticsearch, Memcached, Redis, MySQL, Spring, SpringBoot, SpringCloud, RabbitMQ, Kafka, Linux and other technology stacks.
Full version Java interview questions address: Java backend questions integration