Character sets and encodings in Java

The concepts of ASCII, Unicode, and UTF-8 are not explained here; their definitions can be found on Wikipedia.

For a quick start, I recommend Ruan Yifeng's "Character Encoding Notes: ASCII, Unicode and UTF-8".

This article focuses on how characters are stored and read in Java.

A series of questions raised by the Chinese character 张 (zhang)

Let's start with the Unicode code point of 张 (zhang), which in hexadecimal is U+5F20.
Converting hexadecimal 5F20 to binary gives 0101 1111 0010 0000.
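The conversion above can be checked with a few standard library calls. This is a minimal sketch; the class name `CodePointDemo` and the helper `toBinary16` are made up for illustration, and the character is written as the escape `"\u5F20"` so the result does not depend on the source file encoding.

```java
class CodePointDemo {
    // Returns the Unicode code point of the first character of s.
    static int codePoint(String s) {
        return s.codePointAt(0);
    }

    // Formats a code point below 0x10000 as 16 binary digits in groups of four.
    static String toBinary16(int codePoint) {
        String bits = String.format("%16s", Integer.toBinaryString(codePoint)).replace(' ', '0');
        return bits.replaceAll("(.{4})(?!$)", "$1 ");
    }

    public static void main(String[] args) {
        int cp = codePoint("\u5F20");
        System.out.println(Integer.toHexString(cp).toUpperCase()); // 5F20
        System.out.println(cp);                                    // 24352
        System.out.println(toBinary16(cp));                        // 0101 1111 0010 0000
    }
}
```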

The rules for converting a Unicode code point to UTF-8 are:
- A code point within the ASCII range is represented by one byte; a code point beyond the ASCII range is represented by multiple bytes. The advantage is that a Unicode file containing only ASCII characters is stored with one byte per character, so it is no different from an ordinary ASCII file. The same holds when reading, so UTF-8 is compatible with existing ASCII files.
- For a code point beyond the ASCII range, the leading bits of the first byte indicate how many bytes the character occupies. For example, in 110xxxxx the first three bits tell us that the character takes 2 bytes; 1110xxxx marks a 3-byte character, and so on. Every following byte has the form 10xxxxxx. The x positions are filled with the bits of the binary representation of the code point; the farther right an x is, the less significant its bit. Only the shortest multi-byte sequence that is sufficient to represent a code point may be used. Note that in a multi-byte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the whole sequence.
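The rules above can be sketched as a small hand-written encoder. This is only an illustration of the bit layout, not how the JDK implements it: the class `Utf8Sketch` and its `encode` method are hypothetical names, and the sketch assumes a valid code point (it does not reject surrogates or values above U+10FFFF).

```java
class Utf8Sketch {
    // Encodes one code point into UTF-8 bytes by following the prefix rules:
    // 0xxxxxxx, 110xxxxx 10xxxxxx, 1110xxxx 10xxxxxx 10xxxxxx, and so on.
    // Assumption: cp is a valid Unicode scalar value; no validation is done.
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
    }

    public static void main(String[] args) {
        for (byte b : encode(0x5F20)) {
            System.out.printf("%02X ", b & 0xFF); // prints the bytes in hexadecimal
        }
        System.out.println();
    }
}
```

For U+5F20 the third branch applies, so the 16 bits 0101 1111 0010 0000 are split into groups of 4, 6 and 6 bits and placed after the prefixes 1110, 10 and 10.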

Because U+5F20 falls between U+0800 and U+FFFF, the character 张 takes 3 bytes in UTF-8. Applying the conversion rules above, its UTF-8 encoding is 1110 0101 1011 1100 1010 0000, that is, E5 BC A0 in hexadecimal.

This matches what you see when you inspect the bytes of a file containing 张 directly (the file's encoding is UTF-8, the default).
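The same bytes can be obtained in code by asking the JDK to encode the string as UTF-8. A minimal sketch; the class `Utf8BytesDemo` and its `hex` helper are illustrative names.

```java
import java.nio.charset.StandardCharsets;

class Utf8BytesDemo {
    // Formats bytes as two-digit uppercase hexadecimal values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u5F20".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 3
        System.out.println(hex(utf8));   // E5 BC A0
    }
}
```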

Read characters with FileReader

File file = new File(System.getProperty("user.dir") + "/src/main/java/com/dsying/IO/a.txt");

Reader reader = new FileReader(file);
System.out.println(reader.read()); // 24352

24352 is indeed the decimal value of U+5F20, the code point of 张.
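FileReader without a charset argument decodes with the default charset, so the result above depends on that default being UTF-8. A hedged variant that names the charset explicitly might look like the sketch below. The class `ExplicitCharsetRead` and its methods are illustrative, and a temporary file stands in for a.txt.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetRead {
    // Reads the first character of the file, decoding it as UTF-8
    // regardless of the default charset.
    static int firstChar(Path path) throws IOException {
        try (Reader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            return reader.read();
        }
    }

    // Writes the three UTF-8 bytes of U+5F20 to a temporary file
    // (a stand-in for a.txt) and reads the first character back.
    static int roundTrip() {
        try {
            Path tmp = Files.createTempFile("zhang", ".txt");
            try {
                Files.write(tmp, new byte[] { (byte) 0xE5, (byte) 0xBC, (byte) 0xA0 });
                return firstChar(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip()); // 24352
    }
}
```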

Read bytes with FileInputStream

File file = new File(System.getProperty("user.dir") + "/src/main/java/com/dsying/IO/a.txt");

FileInputStream is = new FileInputStream(file);
System.out.println(is.read()); // 229

229 in hexadecimal is E5, which is indeed the first byte of the UTF-8 encoding of 张.

Read the byte array with FileInputStream

File file = new File(System.getProperty("user.dir") + "/src/main/java/com/dsying/IO/a.txt");

FileInputStream is = new FileInputStream(file);
// Read 3 bytes at a time
byte[] bytes = new byte[3];
is.read(bytes);
System.out.println(Arrays.toString(bytes)); // [-27, -68, -96]

Why does bytes contain negative numbers? The first byte, E5, should be 229 in decimal, so why is it -27?

This is because a Java byte is a signed 8-bit type that can only hold values from -128 to 127. It cannot hold 229, so the bit pattern 1110 0101 is interpreted as the negative number -27.

How do we turn -27 back into 229?
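The difference between the two reads can be reproduced without a file. In this sketch an in-memory stream stands in for the FileInputStream, and the class and method names are illustrative. The single-byte `read()` returns an int from 0 to 255, while a value stored in a `byte` is signed.

```java
import java.io.ByteArrayInputStream;

class SignedByteDemo {
    // The UTF-8 bytes of U+5F20, standing in for the contents of a.txt.
    static final byte[] ZHANG_UTF8 = { (byte) 0xE5, (byte) 0xBC, (byte) 0xA0 };

    // read() returns the next byte as an int in the range 0..255 (or -1 at end of stream).
    static int readAsInt() {
        return new ByteArrayInputStream(ZHANG_UTF8).read();
    }

    // Reading into a byte[] stores the same bit pattern in a signed byte.
    static byte readAsByte() {
        byte[] buf = new byte[1];
        new ByteArrayInputStream(ZHANG_UTF8).read(buf, 0, 1);
        return buf[0];
    }

    public static void main(String[] args) {
        System.out.println(readAsInt());  // 229
        System.out.println(readAsByte()); // -27
        System.out.println(Byte.MIN_VALUE + " to " + Byte.MAX_VALUE); // -128 to 127
    }
}
```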

In two's complement, the binary form of a negative number is obtained from the binary form of the corresponding positive number by inverting every bit and then adding one.

Binary of 27: 0001 1011
After inverting every bit: 1110 0100 (all higher bits are 1)
After adding one: 1110 0101 (all higher bits are 1)
So the binary form of -27 is 1110 0101 (all higher bits are 1)

Then 1110 0101 & 0xFF clears all the higher bits and leaves 1110 0101, which is 229.
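The masking step can be sketched as follows. `UnsignedByteDemo` and `toUnsigned` are illustrative names; `Byte.toUnsignedInt` is the standard library method that performs the same conversion.

```java
class UnsignedByteDemo {
    // Widens the byte to an int (sign-extending it), then keeps only the low 8 bits.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE5;
        System.out.println(b);                                 // -27
        System.out.println(Integer.toBinaryString(b));         // 32 bits: 24 ones, then 11100101
        System.out.println(Integer.toBinaryString(b & 0xFF));  // 11100101
        System.out.println(toUnsigned(b));                     // 229
        System.out.println(Byte.toUnsignedInt(b));             // 229
    }
}
```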

How many bytes is the char type in Java? If it is 2 bytes, why is getBytes().length sometimes greater than 2?
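A short experiment points at the answer. A char is a 16-bit UTF-16 code unit, so it is always 2 bytes in memory, whereas getBytes() encodes the string into a target charset, and the resulting length depends on that charset. The sketch below uses illustrative names (`CharSizeDemo`, `encodedLength`) and passes the charset explicitly rather than relying on the default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharSizeDemo {
    // Number of bytes the string occupies once encoded in the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String zhang = "\u5F20";
        System.out.println(Character.BYTES);                                 // 2
        System.out.println(zhang.length());                                  // 1 (one char)
        System.out.println(encodedLength(zhang, StandardCharsets.UTF_16BE)); // 2
        System.out.println(encodedLength(zhang, StandardCharsets.UTF_8));    // 3

        // A code point above U+FFFF (here U+1F600) needs two chars, a surrogate pair.
        String emoji = "\uD83D\uDE00";
        System.out.println(emoji.length());                                  // 2
        System.out.println(emoji.codePointCount(0, emoji.length()));         // 1
        System.out.println(encodedLength(emoji, StandardCharsets.UTF_8));    // 4
    }
}
```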

