An in-depth look at garbled text in Java

Preface

I have been reading about TCP/IP networking lately, because I am curious about how my computer talks to machines on the LAN and on the public internet. The MySQL-related posts have been delayed.

I haven't written a blog post for a long time, so I tidied up an earlier post and published it on my website.

The garbled-text problem is hard to explain, but the fix fits in one sentence: encode and decode with the same charset (character set).

If you dig deeper, there is a lot to know:

The relationship between strings and Unicode in Java
The relationship between Unicode, UTF-8, and GBK
Sign-magnitude, ones' complement, and two's complement (Java stores integers in two's complement)

Strings in Java

Defining strings in Java

    @Test
    public void run1() {
        String a = "a";
        System.out.println(a);      // a
        String b = "\u261d";
        System.out.println(b);      // ☝
        String ding = "\u4E00";
        System.out.println(ding);   // 一
    }

What is the relationship between strings and Unicode in Java?

    public class Str {
        public static void main(String[] args) {
            char char1 = '一';
            String str1 = "丁万";
            System.out.println(char1);
            System.out.println(str1);
        }
    }

String constants in the compiled class file are stored as Unicode, regardless of the encoding of the source file. (In the class file's constant pool they are written in a modified UTF-8 form.)

For example, a UTF-8 source file compiles to a class file whose strings are Unicode.

A GBK source file also compiles to a class file whose strings are Unicode.

This conclusion can be inferred from the results shown below.

Online base conversion

The Unicode code point of 一 is 4E00, which is 19968 in decimal.

The Unicode code point of 丁 is 4E01.

The Unicode code point of 万 is 4E07.
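The code points 4E00, 4E01, and 4E07 can also be read straight from a Java string. The following is a minimal sketch; the class name CodePointDemo and the helper describe are made up for illustration and are not part of the original post.

```java
final class CodePointDemo {
    // Formats each code point of the text as U+XXXX, separated by spaces.
    // (Hypothetical helper, not from the original post.)
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Escapes are used so the result does not depend on this file's encoding.
        System.out.println(describe("\u4E00\u4E01\u4E07")); // U+4E00 U+4E01 U+4E07
        System.out.println(0x4E00);                          // 19968
    }
}
```

Writing the characters as escapes sidesteps the question of how the source file itself is encoded, which is exactly the variable this article is trying to isolate.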

ByteCodeViewer is a plugin bundled with IDEA; jclasslib is a plugin I installed. Both can be opened from the View menu in the navigation bar.

Strings in Java memory use the Unicode character set (as UTF-16 code units); this is known as the internal encoding. No matter which charset is used to encode or decode, the resulting String is always Unicode.

    import java.io.UnsupportedEncodingException;

    public class Str {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String str = "compiled as a String Unicode character set";

            final String charsetName = "UTF-8";
            // encode with UTF-8
            final byte[] bytes = str.getBytes(charsetName);
            // decode with UTF-8
            final String x = new String(bytes, charsetName);
            // code point of the first character
            System.out.println(x.codePointAt(0));
            System.out.println(x);

            final String gbk = "GBK";
            // encode with GBK
            final byte[] gbks = str.getBytes(gbk);
            // decode with GBK
            final String x1 = new String(gbks, gbk);
            // same code point as above: the String is Unicode either way
            System.out.println(x1.codePointAt(0));
            System.out.println(x1);
        }
    }

A garbled-text exercise

Let's work through an example to get a feel for solving the problem. This is a simulated garbled-text scenario.

    @Test
    public void run100() throws UnsupportedEncodingException {
        String str = "张攀钦";
        final byte[] gbks = str.getBytes("GBK");
        final String s = new String(gbks, "UTF-8");
        // Do you have a way to change s so that it is no longer garbled?
    }

The following verifies my understanding of how the Unicode character set relates to UTF-8 and GBK.

The relationship between Unicode and UTF-8

Unicode code table

Unicode is a character set. It assigns a code point to every character or symbol we use. Code points are conventionally written in hexadecimal.

Unicode assigns 一 the code point 4E00.

Unicode assigns 丁 the code point 4E01.

Computers only work with binary zeros and ones. UTF-8 is an encoding rule: it defines how a code point is stored as binary bytes.

Decimal range	Unicode code point range (hexadecimal)	UTF-8 encoding (binary)
0-127	0000 0000-0000 007F	0xxxxxxx
128-2047	0000 0080-0000 07FF	110xxxxx 10xxxxxx
2048-65535	0000 0800-0000 FFFF	1110xxxx 10xxxxxx 10xxxxxx
65536-1114111	0001 0000-0010 FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The table above summarizes how UTF-8 encodes Unicode code points.

First, convert the hexadecimal code point to decimal.
Then find which range of the table the decimal value falls in; that row gives the encoding template.
Then convert the code point to binary and fill the x slots of the template from the lowest bit upward, padding with 0. The result is the bytes of the character.
Finally, note that Java reads each byte as a signed two's complement value, so a byte whose highest bit is 1 prints as a negative number.
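The steps above can be sketched as a small hand-written encoder and compared with what String.getBytes produces. This is an illustration only: the class name Utf8ByHand is made up, and the sketch assumes a valid code point (it does not reject surrogates or values above 10FFFF).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8ByHand {
    // Encodes one code point by filling the x slots of the matching template row.
    // Assumes cp is a valid Unicode scalar value; no validation is done here.
    static byte[] encode(int cp) {
        if (cp <= 0x7F) {
            return new byte[] {(byte) cp};
        } else if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0b1100_0000 | (cp >> 6)),
                (byte) (0b1000_0000 | (cp & 0b11_1111))};
        } else if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0b1110_0000 | (cp >> 12)),
                (byte) (0b1000_0000 | ((cp >> 6) & 0b11_1111)),
                (byte) (0b1000_0000 | (cp & 0b11_1111))};
        } else {
            return new byte[] {
                (byte) (0b1111_0000 | (cp >> 18)),
                (byte) (0b1000_0000 | ((cp >> 12) & 0b11_1111)),
                (byte) (0b1000_0000 | ((cp >> 6) & 0b11_1111)),
                (byte) (0b1000_0000 | (cp & 0b11_1111))};
        }
    }

    public static void main(String[] args) {
        int cp = 0x4E00; // the code point used as the example above
        byte[] manual = encode(cp);
        byte[] library = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(manual));          // [-28, -72, -128]
        System.out.println(Arrays.equals(manual, library));   // true
    }
}
```

The cast to byte is where the negative numbers come from: the bit pattern is unchanged, only its signed reading differs.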

Java basics: sign-magnitude, ones' complement, and two's complement

Verifying the conjecture with code

Take 赵 (zhao) as an example. Its code point is 8D75.

8D75 in decimal is 36213.

36213 falls in the range 2048-65535, so the UTF-8 template is 1110xxxx 10xxxxxx 10xxxxxx.

8D75 in binary is 1000 110101 110101 (grouped 4, 6, 6 to match the template).

Fill 1000 110101 110101 into the x slots of 1110xxxx 10xxxxxx 10xxxxxx, padding with 0 where bits are missing.

Result: 11101000 10110101 10110101

Java reads each of these three bytes as a signed two's complement number:

Stored bits (two's complement): 11101000 10110101 10110101

Sign-magnitude form of the same values: 10011000 11001011 11001011

The first bit of each byte is the sign: 1 means negative, 0 means positive. To get the sign-magnitude form of a negative byte, keep the sign bit, invert the other seven bits, and add 1.

So the byte array in Java is: {-24,-75,-75}
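A short sketch can show the stored bits next to the signed and unsigned readings of each byte. The class name SignedByteView and the helper bits are hypothetical names chosen for this example.

```java
final class SignedByteView {
    // Returns the eight bits actually stored in the byte, as a string.
    static String bits(byte b) {
        String s = Integer.toBinaryString(b & 0xFF);
        return "00000000".substring(s.length()) + s;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xE8, (byte) 0xB5, (byte) 0xB5};
        for (byte b : utf8) {
            // stored bits, signed value, unsigned value
            System.out.println(bits(b) + " " + b + " " + (b & 0xFF));
        }
        // 11101000 -24 232
        // 10110101 -75 181
        // 10110101 -75 181
    }
}
```

Masking with 0xFF is the usual way to get the unsigned value back, which is handy when comparing Java output against a hex code table.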

    @Test
    public void run454() throws UnsupportedEncodingException {
        String str = "赵";
        final byte[] bytes = str.getBytes("UTF-8");
        StringBuilder stringBuilder = new StringBuilder();
        for (byte aByte : bytes) {
            stringBuilder.append(aByte).append(",");
        }
        System.out.println(stringBuilder.toString());
    }

To double-check my reasoning, I verify it again with the character 且.

Code point of 且: 4E14

4E14 in decimal is 19988.

19988 falls in the range 2048-65535, so the UTF-8 template is 1110xxxx 10xxxxxx 10xxxxxx.

4E14 in binary is 100111000010100.

Fill 100 111000 010100 into the x slots of 1110xxxx 10xxxxxx 10xxxxxx, and set any unfilled x to 0.

Stored bits (two's complement): 11100100 10111000 10010100

Sign-magnitude form of the same values: 10011100 11001000 11101100

So the byte array is: {-28,-72,-108}

    @Test
    public void run43() throws UnsupportedEncodingException {
        // {-28,-72,-108}
        String str = "且";
        final byte[] bytes = str.getBytes("UTF-8");
        StringBuilder stringBuilder = new StringBuilder();
        for (byte aByte : bytes) {
            stringBuilder.append(aByte).append(",");
        }
        System.out.println(stringBuilder.toString());
    }

Converting between Unicode and GBK

GBK code table

The GBK code of 赵 is D5D4.

D5D4 in binary is 11010101 11010100.

Stored bits (two's complement): 11010101 11010100

Sign-magnitude form of the same values: 10101011 10101100

So the byte array is: {-43,-44}
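To compare encodings without reading negative numbers, the bytes can be printed as hex and checked directly against the code tables. This is a sketch with a made-up class name, HexBytes; it assumes the running JVM ships the GBK charset, which standard JDK builds normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class HexBytes {
    // Encodes the text with the given charset and renders the bytes as hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String zhao = "\u8D75"; // the character with code point 8D75
        System.out.println(hex(zhao, StandardCharsets.UTF_8)); // E8B5B5
        // Assumes GBK is available in this JVM.
        System.out.println(hex(zhao, Charset.forName("GBK"))); // D5D4
    }
}
```

The same String yields three bytes under UTF-8 and two under GBK, which is why bytes produced by one cannot be decoded by the other.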

    @Test
    public void run454() throws UnsupportedEncodingException {
        String str = "赵";
        final byte[] bytes = str.getBytes("GBK");
        StringBuilder stringBuilder = new StringBuilder();
        for (byte aByte : bytes) {
            stringBuilder.append(aByte).append(",");
        }
        // -43,-44
        System.out.println(stringBuilder.toString());
    }

Back to the exercise above

    @Test
    public void run100() throws UnsupportedEncodingException {
        String str = "张攀钦";
        final byte[] gbks = str.getBytes("GBK");
        final String s = new String(gbks, "UTF-8");
        // Do you have a way to change s so that it is no longer garbled?
    }

You might try this:

    @Test
    public void run100() throws UnsupportedEncodingException {
        String str = "张攀钦";
        final byte[] gbks = str.getBytes("GBK");
        final String s = new String(gbks, "UTF-8");
        // re-encode the garbled string with UTF-8
        final byte[] error_s = s.getBytes("UTF-8");
        // decoding those bytes as GBK still does not give back str
        System.out.println(new String(error_s, "GBK"));
    }

Encoding: string to bytes.

Decoding: bytes to string.

When we read a file we are actually reading bytes. The bytes are then decoded into a string according to the file's encoding. This is where garbled text comes from: the decoder uses a different charset from the one that produced the bytes.
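As a rough illustration of the file case, the sketch below writes GBK bytes to a temporary file and decodes them twice, once with the matching charset and once with the wrong one. The class name ReadWithCharset, the helper read, and the temp-file prefix are all invented for this example, and GBK is assumed to be available in the JVM.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ReadWithCharset {
    // Reads the raw bytes of the file and decodes them with the given charset.
    static String read(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        Charset gbk = Charset.forName("GBK"); // assumes GBK is available
        String zhao = "\u8D75";
        Path file = Files.createTempFile("garble-demo", ".txt"); // hypothetical file
        try {
            Files.write(file, zhao.getBytes(gbk));
            // Same charset for writing and reading: text survives.
            System.out.println(read(file, gbk).equals(zhao));                    // true
            // Different charset for reading: text is garbled.
            System.out.println(read(file, StandardCharsets.UTF_8).equals(zhao)); // false
        } finally {
            Files.delete(file);
        }
    }
}
```

Passing the charset explicitly, rather than relying on the platform default, keeps the decoding step visible at the call site.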

Don't try to turn a garbled String variable back into a correct one in Java. That line of thinking is wrong. Start instead from the bytes that existed before the text became garbled.

The s here is a garbled string, and no transformation of it will get you back to str.

The bytes behind s are no longer the bytes of str: decoding the GBK bytes as UTF-8 replaced the invalid sequences with the replacement character U+FFFD, so the original bytes are gone.

    @Test
    public void run100() throws UnsupportedEncodingException {
        String str = "张攀钦";
        final byte[] gbks = str.getBytes("GBK");
        final String s = new String(gbks, "UTF-8");
        System.out.println(s);
    }
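To see that the bytes really are lost, one can push GBK bytes through the wrong decoder and compare what comes out with what went in. This is a sketch under assumptions: the class name LostBytes and the helper throughWrongCharset are invented here, and GBK is assumed to be available in the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LostBytes {
    // Decodes the bytes as UTF-8 (the wrong charset for GBK input), then re-encodes.
    static byte[] throughWrongCharset(byte[] gbkBytes) {
        String garbled = new String(gbkBytes, StandardCharsets.UTF_8);
        return garbled.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] original = "\u8D75".getBytes(Charset.forName("GBK")); // D5 D4
        String garbled = new String(original, StandardCharsets.UTF_8);
        // Malformed input was replaced with U+FFFD during decoding.
        System.out.println(garbled.indexOf('\uFFFD') >= 0);                          // true
        // Re-encoding the garbled string does not restore the original bytes.
        System.out.println(Arrays.equals(original, throughWrongCharset(original))); // false
    }
}
```

Because the replacement happens at decode time, the only reliable fix is to go back to the original bytes and decode them again with the right charset.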

The way to solve a garbled-text problem is to find where the text first became garbled, then trace back through the code to find where the bytes were decoded with the wrong charset.

In the scenario above, the wrong charset was used to decode the intact bytes in gbks, so we only need to change the charset used for decoding.

    @Test
    public void run100() throws UnsupportedEncodingException {
        String str = "张攀钦";
        final byte[] gbks = str.getBytes("GBK");
        final String s = new String(gbks, "GBK");
        System.out.println(s);
    }

This article was written by Zhang Panqin and published on his blog, www.mflyyou.cn/. It may be freely reproduced and quoted, provided the author is credited and the source is indicated.

If you repost it to a WeChat official account, please add the author's official-account QR code at the end of the article. WeChat official account name: Mflyyou