Brief analysis of String byte and character changes

This is the sixth day of my participation in the More Text Challenge.

Recently, while parsing files, I kept running into File, Buffer, and Stream classes, so I wanted to look at how they handle files in terms of encoding.

Bytes, characters, argh! To spark some interest, I'll start with a JDK optimization that you may or may not have noticed.

In JDK 1.8, the String source code stores its data in a char array:

private final char value[];

In JDK 11, it stores the data in a byte array (this started in JDK 9):

// @Stable tells the VM that the field changes at most once from its default value,
// so a non-null array can be trusted as constant.
// final means the reference cannot be reassigned once the String is initialized.
@Stable
private final byte[] value;

Why did String make this change?

Let’s look at char and byte first

In Java, one char is 2 bytes and one byte is 8 bits, so a char takes 16 bits.
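As a quick sanity check of that arithmetic, the sizes can be read from constants in the standard wrapper classes. This is only a small sketch; the class name `CharSize` and the helper `bitsPerChar` are made up for illustration.

```java
public class CharSize {
    // Hypothetical helper: bits in a char, derived from bytes per char * bits per byte.
    static int bitsPerChar() {
        return Character.BYTES * Byte.SIZE;
    }

    public static void main(String[] args) {
        System.out.println("bits per byte:  " + Byte.SIZE);        // 8
        System.out.println("bytes per char: " + Character.BYTES);  // 2
        System.out.println("bits per char:  " + bitsPerChar());    // 16
    }
}
```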

Let’s take a look at jdk8’s description of String characters:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String. The String class provides methods for dealing with Unicode code points (i.e., characters), in addition to those for dealing with Unicode code units (i.e., char values).

A char is called a code unit. Since a String is a UTF-16 string, let's see how many bytes this one takes when encoded as UTF-16:

String name="name";

Many people will probably guess eight: four characters at two bytes each. When I printed it out, I was dumbstruck: it takes ten bytes, two more than expected. Did a C-style \0 terminator sneak in? No. The reason is that UTF-16 encoded data starts with a 2-byte byte order mark (BOM): [0xFF, 0xFE] for little-endian (LE) and [0xFE, 0xFF] for big-endian (BE). Encoding with the plain UTF-16 charset writes this mark, whereas UTF_16BE and UTF_16LE do not, so with those the count is 8. But that is a little off topic.
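The byte counts above can be reproduced with `String.getBytes` and the constants in `StandardCharsets`. This is a minimal sketch; the class name `Utf16ByteCount` and the helper `encodedLength` are names chosen for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16ByteCount {
    // Hypothetical helper: number of bytes produced when encoding s with the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String name = "name";

        // Plain UTF-16 writes a 2-byte BOM first, then 2 bytes per char.
        System.out.println(encodedLength(name, StandardCharsets.UTF_16));   // 10
        // The explicit-endianness charsets write no BOM.
        System.out.println(encodedLength(name, StandardCharsets.UTF_16BE)); // 8
        System.out.println(encodedLength(name, StandardCharsets.UTF_16LE)); // 8

        // Java's UTF-16 encoder is big-endian, so the BOM bytes are FE FF.
        byte[] withBom = name.getBytes(StandardCharsets.UTF_16);
        System.out.printf("%02X %02X%n", withBom[0] & 0xFF, withBom[1] & 0xFF);
    }
}
```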

Now let's look at how JDK 9 explains the byte-based storage. It supports two encodings: LATIN1 (ISO-8859-1) and UTF-16. The String implementation picks one based on the content: Latin-1 if every character fits in one byte, UTF-16 otherwise. How does it record which one was chosen? The String class has a coder field that indicates whether the bytes are encoded as UTF-16 or Latin-1:

private final byte coder;

ISO-8859-1 covers the ASCII characters plus the additional letters and symbols used by Western European languages, that is, the code points U+0000 to U+00FF. Greek, Thai, Arabic, and Hebrew belong to other parts of the ISO-8859 family and are not in ISO-8859-1. Chinese characters are not included either, so UTF-16 must be retained to cover everything else.
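One way to check which characters ISO-8859-1 can hold is to ask its `CharsetEncoder`. The sketch below does this from the outside; it is not how String decides internally. The class name `Latin1Check`, the helper `fitsLatin1`, and the sample strings are all chosen for illustration.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class Latin1Check {
    // Hypothetical helper: true if every character of s can be encoded in ISO-8859-1.
    static boolean fitsLatin1(String s) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("name"));          // ASCII letters: true
        System.out.println(fitsLatin1("caf\u00E9"));     // U+00E9 (e with acute): true
        System.out.println(fitsLatin1("\u03B1"));        // Greek alpha, U+03B1: false
        System.out.println(fitsLatin1("\u540D\u5B57"));  // two Chinese characters: false
    }
}
```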

In day-to-day development, most of the strings we store are single-byte. We rarely define one in Chinese for no reason; they are usually letters or digits, which need only one Latin-1 byte each, so spending a two-byte char on them is wasteful. For the minority of strings that contain Chinese, the two-byte UTF-16 form is still needed.

While writing this article I came across a bit of history. The Java designers originally intended to use UCS-2, the fixed-width 16-bit predecessor of UTF-16, to represent every Unicode character. But Unicode grew so large that two bytes were no longer enough, so the definition of char was changed: it no longer means a character but a code unit. The characters covered originally were later named the Basic Multilingual Plane (BMP), and that code table carried over into UTF-16. A single char can therefore only describe code points within the BMP and cannot represent anything outside it, which is why today's UTF-16 is variable-length.

Since a char is fixed at two bytes, how are four-byte characters such as emoji represented? With two chars, a surrogate pair (note that you have to hold them in a String, not in a single char). As said earlier, a char is a code unit, and two of them, four bytes, can represent every character defined today. In Java you cannot use a single char for anything outside the BMP, although most of the time we don't need to think about that.
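To make the surrogate pair concrete, here is a small sketch that uses U+1F600 (an emoji outside the BMP) as an assumed example. The class name `SurrogatePairDemo` and the helper `codePoints` are illustrative; the String and Character methods are standard JDK APIs.

```java
public class SurrogatePairDemo {
    // Hypothetical helper: number of Unicode code points (characters) in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 written as its UTF-16 surrogate pair: high D83D, low DE00.
        String emoji = "\uD83D\uDE00";

        System.out.println(emoji.length());      // 2 code units (chars)
        System.out.println(codePoints(emoji));   // 1 code point

        System.out.println(Character.isHighSurrogate(emoji.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(emoji.charAt(1)));  // true

        // Going the other way: one code point outside the BMP needs two chars.
        System.out.println(Character.toChars(0x1F600).length);          // 2
        System.out.println(Integer.toHexString(emoji.codePointAt(0)));  // 1f600
    }
}
```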

I just realized this wasn't off topic after all.

Storing single-byte characters as Latin-1 does not affect existing usage, and UTF-16 is retained for everything else. When we call the length() method, it still counts chars (code units).

*** Do not confuse the two: the API is unchanged and only the internal encoding has changed. The encoding determines how many bytes each char uses, and therefore how many bytes to read ***

"name" on its own is stored as ISO-8859-1: four bytes, so the length is 4.
A string of two Chinese characters is stored as UTF-16 and lies in the BMP range: four bytes, two chars, so the length is 2.
"name" followed by those two Chinese characters is stored as UTF-16 in 12 bytes. Note that because the whole string is UTF-16 encoded, each character of "name" also takes two bytes, just like the Chinese characters: 6 chars, so the length is 6.
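The three cases above can be modelled with a short sketch. It does not read the JDK's internal byte array; it only mirrors the rule described in this post (one byte per char if everything fits in Latin-1, otherwise two bytes per char) and assumes compact strings are enabled, which is the default since JDK 9. The class name, both helpers, and the two Chinese characters U+540D U+5B57 are assumptions made for illustration.

```java
public class CompactStringModel {
    // Hypothetical helper: true if every char is in the Latin-1 range U+0000..U+00FF.
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical model of the internal storage size in bytes, not a JDK API.
    static int modelStorageBytes(String s) {
        return fitsLatin1(s) ? s.length() : s.length() * 2;
    }

    static void report(String s) {
        System.out.println("length=" + s.length() + " bytes=" + modelStorageBytes(s));
    }

    public static void main(String[] args) {
        report("name");                  // length=4 bytes=4
        report("\u540D\u5B57");          // length=2 bytes=4
        report("name\u540D\u5B57");      // length=6 bytes=12
    }
}
```

Note that length() is the same whichever encoding is chosen; only the modelled byte count changes.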
