Preface
I have been studying TCP/IP network communication lately, because I am curious how my computer talks to machines on the LAN and on external networks. As a result, the MySQL-related posts have been delayed.
I haven't written a blog post in a long time, so I tidied up an earlier post and published it on my website.
The garbled-text problem sounds hard, but the fix fits in one sentence: encode and decode with the same charset (character set).
If you dig deeper, there is a lot to know:
- The relationship between strings and Unicode in Java
- The relationship between Unicode, UTF-8, and GBK
- Sign-magnitude, ones' complement, and two's complement representations (because Java stores integers in two's complement)
Strings in Java
Defining strings in Java

    @Test
    public void run1() {
        String a = "a";
        System.out.println(a);
        String b = "\u261d";
        System.out.println(b); // ☝
        String ding = "\u4E00";
        System.out.println(ding); // 一
    }

What is the relationship between strings and Unicode in Java?

    public class Str {
        public static void main(String[] args) {
            char char1 = '一';
            String str1 = "丁万";
            System.out.println(char1);
            System.out.println(str1);
        }
    }

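The compiled class file uses the Unicode character set.
For example, if your Java source file is saved as UTF-8, the compiler decodes it and the compiled result represents its characters as Unicode code points.
The same is true if the source file is saved as GBK. (Inside the class file, string constants are stored in a modified UTF-8 encoding of those code points.)
This conclusion can be inferred from the results shown below.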
Online base conversion
The Unicode code point of 一 is 4E00, which is 19968 in decimal.
The Unicode code point of 丁 is 4E01.
The Unicode code point of 万 is 4E07.
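As a quick check of the code points listed above, here is a minimal sketch that prints the code point of each character in hexadecimal and decimal. The class name `CodePoints` and the helper `hex` are made up for illustration; `String.codePointAt` is the standard library call.

```java
class CodePoints {
    // Returns the Unicode code point of the first character as uppercase hex.
    static String hex(String s) {
        return Integer.toHexString(s.codePointAt(0)).toUpperCase();
    }

    public static void main(String[] args) {
        // \u4E00, \u4E01 and \u4E07 are the three characters discussed above.
        for (String s : new String[] {"\u4E00", "\u4E01", "\u4E07"}) {
            System.out.println(s + " -> U+" + hex(s) + " (decimal " + s.codePointAt(0) + ")");
        }
    }
}
```

The unicode escapes keep the example independent of the encoding the source file is saved in.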
ByteCodeViewer is a plugin bundled with IDEA, while jclasslib has to be installed separately. Both can be found under View in the menu bar.
Strings in Java memory use the Unicode character set; this is known as the internal encoding. No matter which charset is used to encode or decode, the resulting string is always in the Unicode character set.

    import java.io.UnsupportedEncodingException;

    public class Str {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String str = "compiled as a String Unicode character set";
            final String charsetName = "UTF-8";
            // Encode with UTF-8
            final byte[] bytes = str.getBytes(charsetName);
            // Decode with UTF-8
            final String x = new String(bytes, charsetName);
            System.out.println(x.codePointAt(0));
            System.out.println(x);
            final String gbk = "GBK";
            // Encode with GBK
            final byte[] gbks = str.getBytes(gbk);
            // Decode with GBK
            final String x1 = new String(gbks, gbk);
            System.out.println(x1.codePointAt(0));
            System.out.println(x1);
        }
    }

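To make the point concrete with Chinese text, here is a small sketch that round-trips the same string through UTF-8 and through GBK. It assumes the running JDK ships the GBK charset; the class name `RoundTrip`, the helper `roundTrip`, and the sample text are chosen for illustration only.

```java
import java.nio.charset.Charset;

class RoundTrip {
    // Encodes text with the named charset, then decodes the bytes with the same charset.
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String text = "\u4E00\u4E01\u4E07"; // sample text, chosen for illustration
        for (String name : new String[] {"UTF-8", "GBK"}) {
            String decoded = roundTrip(text, name);
            int byteCount = text.getBytes(Charset.forName(name)).length;
            System.out.println(name + ": " + byteCount + " bytes, first code point U+"
                    + Integer.toHexString(decoded.codePointAt(0)).toUpperCase());
        }
    }
}
```

The byte counts differ between the two charsets, but the decoded string has the same code points in both cases, because encoding and decoding used the same charset each time.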
A thought question on garbled text in Java
Let's work through an example of fixing garbled text. This is a simulated scenario:

    @Test
    public void run100() throws UnsupportedEncodingException {
        String str = "张攀钦";
        final byte[] gbks = str.getBytes("GBK");
        final String s = new String(gbks, "UTF-8");
        // Is there any way to turn s back into non-garbled text?
    }

The following is how I verified my understanding of the Unicode character set, UTF-8, and GBK.
The relationship between Unicode and UTF-8
Unicode code table
Unicode is a character set. It assigns a code point to every character or symbol we use. Code points are conventionally written in hexadecimal.
In the Unicode character set, the code point of 一 is 4E00.
In the Unicode character set, the code point of 丁 is 4E01.
Computers only understand binary zeros and ones. UTF-8 is an encoding rule: it defines how code points are stored as binary.
| Decimal range | Unicode code point range (hexadecimal) | UTF-8 encoding (binary) |
|---|---|---|
| 0-127 | 0000 0000-0000 007F | 0xxxxxxx |
| 128-2047 | 0000 0080-0000 07FF | 110xxxxx 10xxxxxx |
| 2048-65535 | 0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| 65536-1114111 | 0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
The table above summarizes how UTF-8 encodes Unicode code points.
- First, convert the hexadecimal code point to decimal.
- Then find which range of the table the decimal value falls in; that row gives the encoding template.
- Then convert the code point to binary and fill the x positions from the lowest bit to the highest, setting any leftover x to 0. The result is the UTF-8 byte sequence.
- These bytes are stored exactly as they are. Because Java's byte is a signed two's complement type, a byte whose highest bit is 1 prints as a negative number.
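The steps above can be sketched in a few lines of Java. This is an illustrative encoder, not how the JDK implements it: the class name `ManualUtf8` and method `encode` are made up, and the sketch does not reject surrogate code points (D800-DFFF), which a real encoder would. It compares the hand-built bytes with the standard library's result for 一 (U+4E00).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ManualUtf8 {
    // Encodes one code point by filling the x positions of the matching template row.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + cp);
        }
        if (cp <= 0x7F) {            // 0xxxxxxx
            return new byte[] {(byte) cp};
        } else if (cp <= 0x7FF) {    // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0b11000000 | (cp >> 6)),
                (byte) (0b10000000 | (cp & 0b111111))
            };
        } else if (cp <= 0xFFFF) {   // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0b11100000 | (cp >> 12)),
                (byte) (0b10000000 | ((cp >> 6) & 0b111111)),
                (byte) (0b10000000 | (cp & 0b111111))
            };
        } else {                     // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0b11110000 | (cp >> 18)),
                (byte) (0b10000000 | ((cp >> 12) & 0b111111)),
                (byte) (0b10000000 | ((cp >> 6) & 0b111111)),
                (byte) (0b10000000 | (cp & 0b111111))
            };
        }
    }

    public static void main(String[] args) {
        int cp = 0x4E00;
        byte[] manual = encode(cp);
        byte[] library = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(manual));       // [-28, -72, -128]
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```

Shifting and masking six bits at a time is the same operation as filling the x positions from low to high by hand.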
Java basics: sign-magnitude, ones' complement, and two's complement
Verifying the conjecture with code
Take 赵 as an example. Its code point is 8D75.
Converted to decimal, 8D75 is 36213.
36213 falls in the range 2048-65535, so the UTF-8 template is 1110xxxx 10xxxxxx 10xxxxxx.
In binary, 8D75 is 1000 110101 110101 (grouped as 4 + 6 + 6 bits to match the template).
Fill these bits into the x positions of 1110xxxx 10xxxxxx 10xxxxxx, padding with 0 on the left if there are too few bits.
Result: 11101000 10110101 10110101
Reading each of the three bytes as a signed value:
Stored bits (two's complement): 11101000 10110101 10110101
Sign-magnitude form of the same values: 10011000 11001011 11001011
The highest bit of each byte is the sign bit: 1 means negative, 0 means positive.
So the byte array Java reports is {-24, -75, -75}.
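The signed values can be checked directly. The following sketch (class `SignedBytes` and helper `bits` are hypothetical names) prints the stored bits of each byte next to its signed and unsigned reading; masking with `0xFF` is the usual way to get the unsigned value of a Java byte.

```java
class SignedBytes {
    // Renders the 8 stored bits of a byte, e.g. -24 -> "11101000".
    static String bits(byte b) {
        String s = Integer.toBinaryString(b & 0xFF);
        return "00000000".substring(s.length()) + s;
    }

    public static void main(String[] args) {
        // The three UTF-8 bytes derived above: E8 B5 B5.
        byte[] bytes = {(byte) 0xE8, (byte) 0xB5, (byte) 0xB5};
        for (byte b : bytes) {
            System.out.println(bits(b) + "  signed=" + b + "  unsigned=" + (b & 0xFF));
        }
    }
}
```

The bits never change; only the interpretation does, which is why the same byte shows up as E8 in a hex dump and as -24 in Java.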

    @Test
    public void run454() throws UnsupportedEncodingException {
        String str = "赵";
        final byte[] bytes = str.getBytes("UTF-8");
        StringBuilder stringBuilder = new StringBuilder();
        for (byte aByte : bytes) {
            stringBuilder.append(aByte).append(",");
        }
        System.out.println(stringBuilder.toString());
    }

To double-check the logic, I verified it again with the character 且.
The code point of 且 is 4E14.
Converted to decimal, 4E14 is 19988.
19988 falls in the range 2048-65535, so the UTF-8 template is 1110xxxx 10xxxxxx 10xxxxxx.
In binary, 4E14 is 100111000010100.
Fill 100 111000 010100 into the x positions of the template 1110xxxx 10xxxxxx 10xxxxxx, and set any unfilled x to 0.
Stored bits (two's complement): 11100100 10111000 10010100
Sign-magnitude form of the same values: 10011100 11001000 11101100
So the byte array is {-28, -72, -108}.

    @Test
    public void run43() throws UnsupportedEncodingException {
        // {-28,-72,-108}
        String str = "且";
        final byte[] bytes = str.getBytes("UTF-8");
        StringBuilder stringBuilder = new StringBuilder();
        for (byte aByte : bytes) {
            stringBuilder.append(aByte).append(",");
        }
        System.out.println(stringBuilder.toString());
    }

Converting between Unicode and GBK
GBK code table
The GBK code of 赵 is D5D4.
In binary, D5D4 is 11010101 11010100. Stored bits (two's complement): 11010101 11010100. Sign-magnitude form of the same values: 10101011 10101100.
So the byte array is {-43, -44}.

    @Test
    public void run454() throws UnsupportedEncodingException {
        String str = "赵";
        final byte[] bytes = str.getBytes("GBK");
        StringBuilder stringBuilder = new StringBuilder();
        for (byte aByte : bytes) {
            stringBuilder.append(aByte).append(",");
        }
        // -43,-44
        System.out.println(stringBuilder.toString());
    }

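Printing the bytes in hexadecimal makes the comparison with a code table easier than reading signed decimals. A minimal sketch, assuming the JDK in use provides the GBK charset; `GbkHex` and `hex` are illustrative names:

```java
import java.nio.charset.Charset;

class GbkHex {
    // Encodes text with the named charset and renders the bytes as uppercase hex.
    static String hex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String zhao = "\u8D75"; // the character used in the example above
        System.out.println("GBK:   " + hex(zhao, "GBK"));   // D5D4
        System.out.println("UTF-8: " + hex(zhao, "UTF-8")); // E8B5B5
    }
}
```

The same code point yields two bytes in GBK and three in UTF-8, which is exactly why bytes written with one charset cannot be decoded with the other.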
Answering the thought question above

    @Test
    public void run100() throws UnsupportedEncodingException {
        String str = "张攀钦";
        final byte[] gbks = str.getBytes("GBK");
        final String s = new String(gbks, "UTF-8");
        // Is there any way to turn s back into non-garbled text?
    }

You might try something like this:

    @Test
    public void run100() throws UnsupportedEncodingException {
        String str = "张攀钦";
        final byte[] gbks = str.getBytes("GBK");
        final String s = new String(gbks, "UTF-8");
        final byte[] error_s = s.getBytes("UTF-8");
        System.out.println(new String(error_s, "GBK"));
    }

Encoding: string to bytes.
Decoding: bytes to string.
When we read a file, we are actually reading bytes. The bytes are then decoded into a string according to some charset, and if that charset differs from the one the file was written with, the text comes out garbled.
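Here is a small sketch of that file-reading scenario. It writes GBK bytes to a temporary file, then decodes the raw bytes once with the right charset and once with the wrong one. The class `ReadFileDemo` and its helpers are hypothetical names, and the sketch assumes the JDK provides the GBK charset.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadFileDemo {
    // Decoding step: turns raw bytes into a string using the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Reading a file yields bytes; the charset only matters when decoding them.
    static String readAs(Path file, String charsetName) throws IOException {
        return decode(Files.readAllBytes(file), charsetName);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("gbk-demo", ".txt");
        try {
            // Simulate a file that was saved in GBK.
            Files.write(file, "\u8D75".getBytes(Charset.forName("GBK")));
            System.out.println("decoded as GBK:   " + readAs(file, "GBK"));
            System.out.println("decoded as UTF-8: " + readAs(file, "UTF-8"));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The file's bytes are identical in both reads; only the second decoding step garbles the text.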
Don't try to turn a garbled string variable back into a non-garbled one in Java. That line of thinking is wrong. You should start from the bytes that existed before the text became garbled.
Here s is already garbled, and no transformation of s can get back to str.
That is because the bytes behind s are no longer the bytes behind str: when GBK bytes are decoded as UTF-8, each invalid byte sequence is replaced by the replacement character U+FFFD, so the original bytes are lost.
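The loss can be observed directly. This sketch repeats the experiment on a single sample character (赵, U+8D75) rather than the full string; `LostBytes` and `misdecode` are illustrative names, and GBK support in the JDK is assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LostBytes {
    // Encodes with GBK, then decodes with the wrong charset (UTF-8).
    static String misdecode(String original) {
        byte[] gbks = original.getBytes(Charset.forName("GBK"));
        return new String(gbks, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String str = "\u8D75";
        byte[] gbks = str.getBytes(Charset.forName("GBK"));
        String s = misdecode(str);

        // The garbled string contains the replacement character U+FFFD.
        System.out.println("contains U+FFFD: " + s.contains("\uFFFD"));
        // Re-encoding s does not reproduce the original GBK bytes.
        byte[] reencoded = s.getBytes(StandardCharsets.UTF_8);
        System.out.println("original bytes:   " + Arrays.toString(gbks));
        System.out.println("re-encoded bytes: " + Arrays.toString(reencoded));
    }
}
```

Once the invalid bytes have been swapped for U+FFFD there is nothing left in s from which the original bytes could be rebuilt.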

    @Test
    public void run100() throws UnsupportedEncodingException {
        String str = "张攀钦";
        final byte[] gbks = str.getBytes("GBK");
        final String s = new String(gbks, "UTF-8");
        System.out.println(s);
    }

The way to fix garbled text is to find where the text first becomes garbled, then trace back through the code from that point to find where the wrong charset is used for decoding.
In the scenario above, the bytes in gbks are fine; they are simply decoded with the wrong charset, so all we need to do is change the charset used for decoding.

    @Test
    public void run100() throws UnsupportedEncodingException {
        String str = "张攀钦";
        final byte[] gbks = str.getBytes("GBK");
        final String s = new String(gbks, "GBK");
        System.out.println(s);
    }

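When tracing where decoding goes wrong, it can help to make a bad decode fail loudly instead of silently producing garbled text. One possible approach, sketched here with the hypothetical names `StrictDecode` and `decodesCleanly`, is to use a `CharsetDecoder` configured with `CodingErrorAction.REPORT` (GBK support in the JDK is assumed):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class StrictDecode {
    // Returns true if the bytes decode with the named charset without any error.
    static boolean decodesCleanly(byte[] bytes, String charsetName) {
        try {
            Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] gbks = "\u8D75".getBytes(Charset.forName("GBK"));
        System.out.println("valid as UTF-8: " + decodesCleanly(gbks, "UTF-8"));
        System.out.println("valid as GBK:   " + decodesCleanly(gbks, "GBK"));
    }
}
```

A clean decode is necessary but not sufficient: some byte sequences are valid in more than one charset, so this check can rule a charset out but cannot prove it is the right one.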
This article was written by Zhang Panqin and published on his blog, www.mflyyou.cn/. It may be reproduced and quoted freely, provided the author is credited and the source of the article is indicated.
If you repost it to a WeChat official account, please add the author's official account QR code at the end of the article. WeChat official account name: Mflyyou