Unless you're a Java 8 holdout, you've probably noticed that the source code of the String class has been changed to store string content in a byte[] instead of a char[]. Why?
To cut to the chase, the primary purpose of moving from char[] to byte[] is to save the memory that strings occupy. A smaller memory footprint also means fewer GC cycles.
First, why optimize String to save memory space?
The jmap command can report statistics on the objects in the heap (as well as other information, such as the finalizer queue); running jmap -histo:live pid | head -n 10 prints the top ten lines of the live-object histogram.
Taking the Programming Meow project I run (based on Java 8) as an example, the result looks like this.
There are 17,638 String objects, occupying 423,312 bytes of memory, ranking third.
Since String in Java 8 is still implemented on top of char[], it is no surprise that the number-one memory consumer is the char array.
There are 17,673 char[] objects, occupying 1,621,352 bytes of memory, ranking first.
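For reference, the top of that histogram looks roughly like this (the rank-2 line is left out because the article doesn't give it, and exact formatting varies by JDK version); [C is the JVM's internal name for char[]:
 num     #instances         #bytes  class name
----------------------------------------------
   1:         17673        1621352  [C
 ...
   3:         17638         423312  java.lang.String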
This shows that optimizing String to save memory is well worth doing; optimizing a class that is not used nearly as often as String would bring far less benefit.
Second, why does byte[] save memory?
As we all know, a char takes up two bytes in the JVM and is a UTF-16 code unit, with values ranging from '\u0000' (0) to '\uffff' (65,535) inclusive.
That is, when String is backed by a char[], every character takes up two bytes, even if it could be represented in a single byte.
In practice, single-byte characters are still used more often than double-byte characters.
Of course, simply swapping char[] for byte[] is not enough by itself; the change is paired with the Latin-1 encoding, which represents each character in a single byte and therefore takes half the space of UTF-16.
In other words, for:
String name = "jack";
In this case, with Latin-1 encoding, four bytes are enough.
But for:
String name = "小二";
Because the string contains Chinese characters, which fall outside the Latin-1 range, it can only be encoded with UTF-16.
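A minimal sketch of that decision (fitsInLatin1 is a hypothetical helper, not the JDK's internal code, but it is the check the JVM effectively performs): a string can be stored one byte per character only if every char fits in the Latin-1 range.
// Hypothetical helper, for illustration only: can this string be stored as Latin-1?
static boolean fitsInLatin1(String s) {
    for (int i = 0; i < s.length(); i++) {
        if (s.charAt(i) > 0xFF) {   // beyond the Latin-1 range (0-255)
            return false;
        }
    }
    return true;
}
fitsInLatin1("jack") returns true, so "jack" can be stored in four bytes; fitsInLatin1("小二") returns false, so that string has to fall back to UTF-16.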
To tell the two encodings apart, the JDK 9 String source code adds a coder field.
/**
* The identifier of the encoding used to encode the bytes in
* {@code value}. The supported values in this implementation are
*
* LATIN1
* UTF16
*
* @implNote This field is trusted by the VM, and is a subject to
* constant folding if String instance is constant. Overwriting this
* field after construction will cause problems.
*/
private final byte coder;
Java automatically sets the coder to the appropriate encoding, either Latin-1 or UTF-16, depending on the content of the string.
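Condensed from the JDK 9+ String(char[], ...) constructor, the decision looks roughly like this (COMPACT_STRINGS, StringUTF16.compress, LATIN1 and UTF16 are real identifiers in the JDK source; the snippet is simplified and may differ slightly between versions):
if (COMPACT_STRINGS) {
    // try to squeeze every char into a single byte
    byte[] val = StringUTF16.compress(value, off, len);
    if (val != null) {            // every char fit, so use Latin-1
        this.value = val;
        this.coder = LATIN1;
        return;
    }
}
this.coder = UTF16;               // at least one char needs two bytes
this.value = StringUTF16.toBytes(value, off, len);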
That is, after the move from char[] to byte[], a Chinese character takes two bytes while a pure-English string takes one byte per character; before the change, Chinese and English characters both took two bytes each.
Third, why use UTF-16 instead of UTF-8?
In UTF-8, characters 0-127 are represented by 1 byte, using the same encoding as ASCII. Only characters 128 and above are represented by two, three, or four bytes.
- If a character takes one byte, the highest bit is 0;
- If it takes multiple bytes, the first byte starts with as many 1 bits as the total number of bytes, and every following byte starts with 10.
The specific forms are as follows:
- 0xxxxxxx: one-byte form;
- 110xxxxx 10xxxxxx: two-byte form (starts with two 1s);
- 1110xxxx 10xxxxxx 10xxxxxx: three-byte form (starts with three 1s);
- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: four-byte form (starts with four 1s).
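As a concrete illustration (the character '中', U+4E2D, is my own example, not one from the article): its 16 code-point bits are spread across the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx, giving the bytes E4 B8 AD, which you can verify in Java:
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        // '中' (U+4E2D) needs the three-byte UTF-8 form
        byte[] utf8 = "中".getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.printf("%02x ", b & 0xff);   // prints: e4 b8 ad
        }
    }
}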
That is, UTF-8 is a variable-length encoding, which is inconvenient for a class like String that supports random access. Random access means methods like charAt or substring: you pass in an arbitrary index and String gives you the result directly. If each character occupied a variable amount of memory, random access would require counting characters from the beginning of the string until you reached the one you wanted.
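To make the contrast concrete, here is a minimal sketch (the helper names are hypothetical, and the JDK's real code differs, for example in how it handles byte order): with a fixed two-byte unit, reading character i is a simple array lookup, while with UTF-8 you have to walk all the preceding characters first.
// Fixed-width (UTF-16, big-endian layout assumed): O(1), just index the array.
static char charAtUtf16(byte[] value, int index) {
    int hi = value[index * 2] & 0xFF;
    int lo = value[index * 2 + 1] & 0xFF;
    return (char) ((hi << 8) | lo);
}

// Variable-width (UTF-8): finding where character #index starts means walking
// every preceding character and adding up their byte lengths -- O(n).
static int utf8OffsetOf(byte[] value, int index) {
    int pos = 0;
    for (int i = 0; i < index; i++) {
        int lead = value[pos] & 0xFF;
        pos += (lead < 0x80) ? 1 : (lead < 0xE0) ? 2 : (lead < 0xF0) ? 3 : 4;
    }
    return pos;
}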
But isn't UTF-16 also variable-length? Can't a character take up four bytes?
Indeed, UTF-16 uses two or four bytes to store characters.
- For characters whose Unicode code points are between 0 and FFFF, UTF-16 uses two bytes.
- For characters whose code points are between 10000 and 10FFFF, UTF-16 uses four bytes. Specifically, the bits of the code point are split into two parts: the high bits are stored in a two-byte unit in the range D800-DBFF, and the low bits (the remaining bits) are stored in a two-byte unit in the range DC00-DFFF.
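A worked example (U+1F600, the grinning-face emoji, is my own example, not from the article): 1F600 − 10000 = F600; its high 10 bits are 0x3D and its low 10 bits are 0x200, so the two units are D800 + 3D = D83D and DC00 + 200 = DE00.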
In Java, a char is exactly one of these two-byte units; a character whose code point is beyond FFFF is stored as two chars. All operations on a String work in units of char: substring counts from char to char, and even length() returns the number of chars.
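Continuing with the same example character, a few lines of Java (meant to be run inside a main method) make the point:
String s = "😀";                                        // U+1F600, stored as the surrogate pair D83D DE00
System.out.println(s.length());                         // 2 -- length() counts chars, not characters
System.out.println(s.codePointCount(0, s.length()));    // 1 -- only one actual character
System.out.printf("%04x %04x%n", (int) s.charAt(0), (int) s.charAt(1));   // d83d de00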
So in the Java world, UTF-16 can be treated as a fixed-length encoding.