IDEA Chinese Encoding Problem Analysis (1)

Today, while testing a Spring Boot application, I ran into garbled Chinese characters in IDEA. I took the opportunity to work out some of IDEA's default encoding rules.

Character set and character encoding

The first step is to understand the difference between a character set and a character encoding (GBK is better described as a character encoding). Once you know that Unicode is a character set, while UTF-8 and GBK are character encodings, some of the problems become easy to understand.

IDEA configuration file encoding settings

In figure 1, you can set the encoding of configuration files. With UTF-8 selected, the Chinese characters you type are stored in the file as UTF-8 (common Chinese characters take three bytes each in UTF-8; for example, “信息” (“information”) is encoded as E4BFA1 E681AF, so each character occupies three bytes in the file). With GBK selected, they are stored as GBK (“信息” is encoded as D0C5 CFA2, so each character occupies two bytes in the file).
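The byte values quoted above can be checked outside the IDE. The following is a minimal sketch, not part of the original post: the class name `EncodingBytes` and the helper `hex` are made up for illustration, and it assumes the JDK in use ships the GBK charset (standard builds do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingBytes {
    // Render a byte array as space-separated uppercase hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two characters are written as Unicode escapes so the result
        // does not depend on how this source file itself is encoded.
        String text = "\u4FE1\u606F";

        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // Assumption: the GBK charset is available in this JDK.
        byte[] gbk = text.getBytes(Charset.forName("GBK"));

        System.out.println("UTF-8: " + hex(utf8) + " (" + utf8.length + " bytes)");
        System.out.println("GBK:   " + hex(gbk) + " (" + gbk.length + " bytes)");
    }
}
```

Run as is, this should report six bytes (E4 BF A1 E6 81 AF) for UTF-8 and four bytes (D0 C5 CF A2) for GBK, matching the three-bytes and two-bytes per character described above.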

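When the characters are stored as Unicode escapes instead, “信息” is written as \u4FE1\u606F (code points 4FE1 and 606F), and each escape such as \u4FE1 occupies 6 bytes in the file. (The Unicode code points of Latin letters and similar characters are the same as their ASCII codes, so the \u prefix is what identifies an escape.)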
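To make the six-bytes-per-escape figure concrete, here is a small sketch of this kind of escaping. It is an illustration only: `UnicodeEscapes.escape` is a hypothetical helper, not IDEA's implementation, and it assumes that ASCII characters are left unescaped, as the JDK's old native2ascii tool did.

```java
import java.nio.charset.StandardCharsets;

public class UnicodeEscapes {
    // Replace every non-ASCII UTF-16 code unit with a backslash-u escape
    // of four hex digits; ASCII characters are copied unchanged.
    public static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String escaped = escape("\u4FE1\u606F");
        // The escaped form is pure ASCII, so one character is one byte.
        int byteCount = escaped.getBytes(StandardCharsets.US_ASCII).length;
        System.out.println(escaped);
        System.out.println(byteCount + " bytes");
    }
}
```

The two characters become twelve ASCII bytes in total, six per escape: the backslash, the letter u, and four hex digits.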

By default, Chinese characters in Java source files are decoded as UTF-8 into Unicode code points in memory, as shown in the figure above. When the file is saved, they are written in the file encoding you set (3 bytes per character for UTF-8, 2 bytes for GBK).

To inspect the raw bytes of a file, use a hex editor such as WinHex.

Causes of garbled characters

Once you understand some of IDEA's encoding rules, you can quickly work out what is causing garbled characters.

The configuration file is garbled

In the figure above, if option 2 is unchecked, the Chinese characters in the properties file will be garbled regardless of whether option 1 is set to UTF-8 or GBK. Why? The reason is that, by default, the configuration file is read with each byte treated as one Unicode code point. (For example, the UTF-8 encoding of “信息” (“information”) is E4BFA1 E681AF; each of the six bytes is recognized as a separate code point, producing six characters.) Therefore, for Chinese to be recognized correctly in the configuration file, option 2 must be checked. Also remember that the setting only takes effect after the file has been modified.
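The byte-per-character behaviour described above matches how the JDK itself loads properties: `java.util.Properties.load(InputStream)` decodes its input as ISO 8859-1. The sketch below demonstrates that JDK behaviour only; it makes no claim about IDEA's internals, and the class name `PropertiesDecoding` and the key `greeting` are made up for the demo.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class PropertiesDecoding {
    // Load one key from properties content supplied as raw bytes.
    // Properties.load(InputStream) decodes the bytes as ISO 8859-1,
    // so every byte becomes exactly one character.
    public static String loadValue(byte[] content, String key) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // Case 1: the value is stored as raw UTF-8 bytes (six bytes).
        byte[] rawUtf8 = "greeting=\u4FE1\u606F".getBytes(StandardCharsets.UTF_8);
        // Case 2: the value is stored as backslash-u escapes in plain ASCII.
        byte[] escaped = "greeting=\\u4FE1\\u606F".getBytes(StandardCharsets.ISO_8859_1);

        String garbled = loadValue(rawUtf8, "greeting");
        String correct = loadValue(escaped, "greeting");

        System.out.println("raw UTF-8 value length: " + garbled.length());
        System.out.println("escaped value is intact: " + correct.equals("\u4FE1\u606F"));
    }
}
```

The raw UTF-8 value comes back as six unrelated characters, while the escaped value comes back as the original two, which is consistent with the advice to store Chinese in properties files as escapes.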

One question remains open: when reading a configuration file, IDEA appears to decode it as UTF-8 first and then convert it to Unicode code points. Can that decoding be changed? (This can be checked by setting the encoding to GBK, in which case the file is still decoded as UTF-8.) It may also be that I do not fully understand the whole path from configuration file to in-memory text; I would welcome guidance from an expert who does.

Chinese characters in the source file are garbled

The reason for garbled Chinese characters in a source file is simple: the characters were encoded with one encoding and are being decoded with another. This usually happens when importing source files from someone else's project. For example, if their project is encoded in GBK and you import it as UTF-8, the text will certainly be garbled. As for how to fix it, there are many online tutorials, such as this one.
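The mismatch is easy to reproduce in a few lines. This is a minimal sketch under the assumption that the JDK provides the GBK charset; `MismatchedDecoding.roundTrip` is a hypothetical helper that stands in for "save with one encoding, open with another".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MismatchedDecoding {
    // Encode text with one charset and decode the resulting bytes with another.
    public static String roundTrip(String text, Charset writeAs, Charset readAs) {
        return new String(text.getBytes(writeAs), readAs);
    }

    public static void main(String[] args) {
        String text = "\u4FE1\u606F";
        // Assumption: the GBK charset is available in this JDK.
        Charset gbk = Charset.forName("GBK");

        // Saved as GBK but read as UTF-8: the bytes are not valid UTF-8
        // for this text, so the result differs from the original.
        String wrong = roundTrip(text, gbk, StandardCharsets.UTF_8);
        // Saved as GBK and read as GBK: the text survives.
        String right = roundTrip(text, gbk, gbk);

        System.out.println("GBK read as UTF-8 intact: " + wrong.equals(text));
        System.out.println("GBK read as GBK intact:   " + right.equals(text));
    }
}
```

The first line prints false and the second prints true: the bytes on disk are the same in both cases, and only the encoding chosen for reading decides whether the text is garbled.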

There is an unsolved question here too. By default, Chinese text in a source file is decoded as UTF-8, converted to Unicode code points, and placed in memory. If the text was actually encoded in GBK, it is decoded incorrectly (I have verified this in practice). Can the decoding used here be changed to GBK?

Conclusion

Garbled characters in IDEA are mainly caused by a mismatch between the encoding used when a file is read and the encoding used when it was saved. Understanding what each of IDEA's encoding settings actually does is therefore the key to fixing the problem quickly. IDEA has many other encoding settings that I have yet to explore.
