Always running into garbled characters: how do they arise, and how do you fix them?

Preface

Garbled Chinese text is a common problem in everyday development. How does it arise, and how can it be fixed? This article explains it through basic concepts and examples. I hope you find it useful.

A simple example of garbled code

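package whx;

import java.io.UnsupportedEncodingException;

public class TestEncodeAndDecode {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The text must contain Chinese characters: ASCII-only text has
        // identical bytes in GBK and UTF-8 and would not be garbled.
        String str = "测试中文乱码";
        // Encode with GBK...
        byte[] b = str.getBytes("GBK");
        // ...but decode with UTF-8, which misreads the GBK bytes.
        System.out.println(new String(b, "UTF-8"));
    }
}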

The text is encoded with GBK but decoded with UTF-8, so the program prints garbled characters.

Related basic concepts

To understand the root cause of garbled characters, you need to understand the concepts of bits, bytes, characters, and character sets.

Bit

A bit is the smallest unit of data stored in a computer. A single 1 or 0 is one bit. For example, 10010010 is an 8-bit binary number.

Byte

A byte is the unit used in computing to measure storage capacity. It is a group of 8 binary digits treated as a single unit of information.

1 B = 8 bits (1 byte equals 8 bits)
1 KB = 1024 B
1 MB = 1024 KB
1 GB = 1024 MB
1 TB = 1024 GB

Character

Characters are the letters, digits, words, and symbols used in computers. For example, a, A, B, 大, +, *, and % are each one character.

In ASCII, one letter or digit is stored in 1 byte.
In GB2312 or GBK, one Chinese character is stored in 2 bytes.
In UTF-8, one letter or digit is stored in 1 byte and one Chinese character in 3 to 4 bytes.
In UTF-16, one letter, digit, or commonly used Chinese character is stored in 2 bytes; characters outside the Basic Multilingual Plane take 4 bytes.
In UTF-32, every character is stored in 4 bytes.
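The sizes listed above can be checked with a small sketch. The class and method names here are illustrative only, and the sketch assumes the JDK in use ships the GBK and UTF-32BE charsets (standard JDK builds do). The big-endian variants of UTF-16 and UTF-32 are used so that no byte order mark is counted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengthDemo {
    // Number of bytes needed to store the text in the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String letter = "A";
        String hanzi = "\u4e2d"; // the Chinese character 中, written as an escape
        Charset gbk = Charset.forName("GBK");        // assumed to be available
        Charset utf32 = Charset.forName("UTF-32BE"); // assumed to be available

        System.out.println(byteLength(letter, StandardCharsets.US_ASCII)); // 1
        System.out.println(byteLength(letter, StandardCharsets.UTF_8));    // 1
        System.out.println(byteLength(hanzi, gbk));                        // 2
        System.out.println(byteLength(hanzi, StandardCharsets.UTF_8));     // 3
        System.out.println(byteLength(hanzi, StandardCharsets.UTF_16BE));  // 2
        System.out.println(byteLength(hanzi, utf32));                      // 4
    }
}
```

The Chinese character is written as a Unicode escape so that the result does not depend on the encoding of the source file itself.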

Character set

A character set is a collection of multiple characters. There are many types of character sets, and each character set contains different numbers of characters. Common character set names:

ASCII character set
GB2312 character set
Unicode character set

Encoding and decoding

Computers only understand binary 1s and 0s, while humans have their own languages. For the two sides to exchange information, there must be a conversion from text to 0s and 1s, and a conversion from 0s and 1s back to text.

Encoding: converting text characters into the binary 0s and 1s that a computer can store and process.

Decoding: converting the binary data stored in a computer back into text characters.

Common character sets and encoding methods

Common character sets include ASCII, GBK, and Unicode.

The ASCII character set

ASCII character set: it includes printable characters such as English letters, Arabic numerals, and Western punctuation, as well as control characters such as carriage return and backspace.

ASCII code: a character encoding developed in the United States that maps English characters to binary. It is a 7-bit code that defines 128 characters.

GBXXXX character set

The GBXXXX series includes GB2312, GBK, and GB18030. These standards are used for Chinese character processing and for exchanging Chinese text between systems.

GB2312

The full name is “Chinese Coded Character Set for Information Interchange”. It supports more than 6,000 Chinese characters (6,763).
It is the national simplified Chinese character set, is compatible with ASCII, and is used in mainland China and Singapore.
Each Chinese character and symbol is represented by two bytes.
The high byte ranges from A1 to F7, and the low byte ranges from A1 to FE. Each byte is obtained by adding 0xA0 to the character's row number and cell number in the GB2312 code table, respectively.
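As a rough illustration of the "add 0xA0" rule, the sketch below builds the two bytes for row 16, cell 1 of the code table (the character 啊) and decodes them. The class and method names are made up for this example, and it assumes the JDK in use provides the GB2312 charset.

```java
import java.nio.charset.Charset;

class Gb2312RowCellDemo {
    // Builds the two GB2312 bytes from a row number and a cell number (each 1..94)
    // by adding 0xA0 to each of them.
    static byte[] fromRowCell(int row, int cell) {
        return new byte[] { (byte) (row + 0xA0), (byte) (cell + 0xA0) };
    }

    public static void main(String[] args) {
        byte[] bytes = fromRowCell(16, 1); // 16 + 0xA0 = 0xB0, 1 + 0xA0 = 0xA1
        String decoded = new String(bytes, Charset.forName("GB2312"));
        // Prints the two bytes in hex followed by the decoded character.
        System.out.printf("%02X %02X -> %s%n", bytes[0] & 0xFF, bytes[1] & 0xFF, decoded);
    }
}
```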

GBK

The full name of GBK is “Chinese Internal Code Extension Specification”. It extends GB2312, adds traditional Chinese characters, and supports more than 20,000 Chinese characters.
Each Chinese character and symbol is also represented by two bytes.
The high byte ranges from 81 to FE and the low byte ranges from 40 to FE (excluding 7F).

GB18030

GB18030, full name “Information Technology: Chinese Coded Character Set”, is compatible with GB2312 and GBK and supports 27,484 Chinese characters.
It is a variable-length multi-byte encoding: each character takes 1, 2, or 4 bytes.
Single-byte codes range from 00 to 7F. In two-byte codes, the high byte ranges from 81 to FE and the low byte from 40 to 7E and 80 to FE. In four-byte codes, the first and third bytes range from 81 to FE, and the second and fourth bytes range from 30 to 39.

The Unicode character set

Unicode is a character coding scheme developed by an international organization (the Unicode Consortium) that aims to cover all characters and symbols in the world. The Unicode character set has several encodings, namely UTF-8, UTF-16, and UTF-32.

UTF-8

UTF-8 is a variable-length character encoding for Unicode.
It can represent any character in the Unicode standard, and its single-byte range is identical to ASCII, so software written for ASCII can continue to be used with little or no modification.
UTF-8 uses 1 to 4 bytes per character: ASCII characters need only 1 byte, Latin letters with diacritics and Greek need 2 bytes, most Japanese and Chinese characters need 3 bytes, and characters outside the Basic Multilingual Plane (rare characters, emoji) need 4 bytes.
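To make the 1-to-4-byte rule concrete, here is a minimal hand-written encoder that follows the standard UTF-8 bit layout and compares its output with the JDK's own encoder. It is only a sketch: the class name is made up, and it does not reject surrogates or code points above U+10FFFF as a real encoder must.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ByHandDemo {
    // Encodes one Unicode code point into UTF-8 bytes.
    static byte[] encode(int cp) {
        if (cp < 0x80) {            // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {    // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {  // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        // A Latin letter, a letter with a diacritic, a Chinese character, an emoji.
        int[] samples = { 0x41, 0xE9, 0x4E2D, 0x1F600 };
        for (int cp : samples) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X: %d byte(s), matches JDK: %b%n",
                    cp, mine.length, Arrays.equals(mine, jdk));
        }
    }
}
```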

UTF-16

UTF-16 maps the abstract code points of the Unicode character set to sequences of 16-bit integers (code units) for data storage or transmission.
Its advantage over UTF-8 is that most commonly used characters take a fixed 2 bytes; characters outside the Basic Multilingual Plane take 4 bytes (a surrogate pair). UTF-16 is not byte-compatible with ASCII.

UTF-32

UTF-32 encodes every Unicode code point using exactly 32 bits, while the other Unicode encodings are variable-length.
Because every character takes 4 bytes, locating and processing characters is simple and relatively fast, but the encoding wastes space and is slower to transmit.
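The difference between UTF-16 and UTF-32 shows up once a character falls outside the Basic Multilingual Plane. The following sketch (class and method names are illustrative, and the UTF-32BE charset is assumed to be available in the JDK) counts the bytes for three code points. Big-endian variants are used so that no byte order mark is included in the count.

```java
import java.nio.charset.Charset;

class Utf16Utf32Demo {
    // Bytes needed for a single code point in the named charset.
    static int byteLength(int codePoint, String charsetName) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // 'A', the Chinese character 中, and an emoji outside the BMP.
        int[] samples = { 0x41, 0x4E2D, 0x1F600 };
        for (int cp : samples) {
            System.out.printf("U+%04X: UTF-16 = %d bytes, UTF-32 = %d bytes%n",
                    cp, byteLength(cp, "UTF-16BE"), byteLength(cp, "UTF-32BE"));
        }
    }
}
```

In Java itself, a String holds UTF-16 code units, which is why the emoji has a `length()` of 2 even though it is one character.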

An example to illustrate encoding and decoding

“Hello World” is one of the most familiar phrases to programmers. How can a computer display Hello World when it only knows 0 and 1?

In the previous section, we learned about encodings and character sets. We can use ASCII to translate “Hello World” into the 0s and 1s a computer understands. Those interested can look up each character in an ASCII table.

The computer stores the binary code of Hello World as 0s and 1s. To display it, the binary code is decoded back into the corresponding characters and rendered on the screen.
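The round trip described above can be sketched as follows: the text is encoded to ASCII bytes, shown as groups of 8 bits, and then decoded back. The class and method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;

class HelloBinaryDemo {
    // Encodes ASCII text and renders each byte as 8 binary digits.
    static String toBinary(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.US_ASCII)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad with zeros to a full 8 bits.
            sb.append(String.format("%8s", bits).replace(' ', '0'));
        }
        return sb.toString();
    }

    // Parses space-separated 8-bit groups and decodes them as ASCII.
    static String fromBinary(String binary) {
        String[] groups = binary.split(" ");
        byte[] bytes = new byte[groups.length];
        for (int i = 0; i < groups.length; i++) {
            bytes[i] = (byte) Integer.parseInt(groups[i], 2);
        }
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String binary = toBinary("Hello World");
        System.out.println(binary);             // what the computer stores
        System.out.println(fromBinary(binary)); // what is shown on screen
    }
}
```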

How does garbled code come about?

There are two main causes of garbled characters. One is that text is encoded with one encoding and decoded with a different one. The other is that the character set in use does not contain the characters being written.

Encoding and decoding use different encoding methods

In the example at the start of this article, GBK is used for encoding and UTF-8 for decoding, which produces garbled characters. GBK represents each Chinese character with two bytes, while UTF-8 expects three bytes per Chinese character with a different bit layout, so the UTF-8 decoder cannot interpret the GBK bytes correctly.
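A small sketch makes the mismatch visible by printing the raw bytes in both encodings and then decoding with the wrong and the right charset. The helper names are made up for illustration, and the GBK charset is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MismatchDemo {
    // Encodes with one charset and decodes with another.
    static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    // Renders bytes as space-separated hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // the two characters 中文
        Charset gbk = Charset.forName("GBK");

        System.out.println("GBK bytes:   " + hex(text.getBytes(gbk)));
        System.out.println("UTF-8 bytes: " + hex(text.getBytes(StandardCharsets.UTF_8)));

        // Mismatched charsets: the result differs from the original text.
        System.out.println(transcode(text, gbk, StandardCharsets.UTF_8).equals(text));
        // Matching charsets: the original text comes back intact.
        System.out.println(transcode(text, gbk, gbk).equals(text));
    }
}
```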

Using a character set that lacks the character

GB2312 does not include traditional Chinese characters, so text that contains them cannot be encoded with GB2312; the missing characters come out as garbled or placeholder characters.
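Whether a charset can represent a given character can be checked before encoding. The sketch below uses `CharsetEncoder.canEncode` with the traditional character 龍 and its simplified form 龙; the class and method names are illustrative, and it assumes the JDK provides the GB2312 and GBK charsets.

```java
import java.nio.charset.Charset;

class MissingCharacterDemo {
    // True if the named charset has a code for the character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char traditional = '\u9f8d'; // 龍
        char simplified = '\u9f99';  // 龙

        System.out.println(canEncode("GB2312", simplified));  // simplified form is in GB2312
        System.out.println(canEncode("GB2312", traditional)); // traditional form is not
        System.out.println(canEncode("GBK", traditional));    // GBK added it

        // Encoding anyway silently substitutes the charset's replacement byte,
        // so the original character is lost after a round trip.
        Charset gb2312 = Charset.forName("GB2312");
        String roundTrip = new String(String.valueOf(traditional).getBytes(gb2312), gb2312);
        System.out.println(roundTrip.equals(String.valueOf(traditional)));
    }
}
```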

How to solve garbled characters

Garbled characters can be avoided by using a character set that contains the characters to be displayed, and by using the same encoding for both encoding and decoding.

Here are some classic garbled-character scenarios and their solutions.

IntelliJ IDEA garbled characters

Garbled Chinese in project files? Go to File -> Settings -> Editor -> File Encodings and set the encoding to UTF-8.

Garbled Chinese in the IDE console? One approach to try: open the IDE installation directory and find

Database garbled characters

To view the database encoding:

show variables like 'character_set%'

To change the database encoding:

-- session scope
set character_set_server=utf8;
set character_set_database=utf8;
-- global scope
set global character_set_database=utf8;
set global character_set_server=utf8;

Alternatively, modify or add the following in the my.ini configuration file for MySQL (Windows):

[mysql]
default-character-set=utf8
[mysqld]
character-set-server=utf8
[client]
default-character-set=utf8

Garbled characters in your own code

Garbled Chinese in the code you write? Trace the problem to the places where the text is encoded and decoded, and make both use the same encoding.
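One way to apply this in Java is to name the charset explicitly at every point where text becomes bytes or bytes become text, rather than relying on the platform default. The sketch below writes and reads through in-memory streams; the class and method names are illustrative, not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetIoDemo {
    // Writes text to bytes using an explicitly chosen charset.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Reads bytes back into text using an explicitly chosen charset.
    static String read(byte[] data, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // 中文
        byte[] data = write(text, StandardCharsets.UTF_8);
        // The same charset on both sides restores the text.
        System.out.println(read(data, StandardCharsets.UTF_8).equals(text));
    }
}
```

The same idea applies to files, sockets, and HTTP bodies: whichever charset the writer uses must be the one the reader uses.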
