String is one of the most basic and important classes in Java; it provides the core logic for constructing and managing strings. It is a classic immutable class: the class is declared final, and the field that holds its character data is final as well. Because of this immutability, operations such as concatenation and substring extraction produce new String objects. Since string operations are ubiquitous, their efficiency often has a noticeable impact on application performance.
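To make the immutability point concrete, here is a minimal sketch. The class and method names are made up for illustration; only String.concat and String.substring are real API.

```java
class StringImmutabilityDemo {
    // True when concat() returned a different object and left the original unchanged.
    static boolean concatCreatesNewObject() {
        String original = "Hello";
        String combined = original.concat(", world");
        return combined != original
                && original.equals("Hello")
                && combined.equals("Hello, world");
    }

    // substring() also returns a new String; the source string is not modified.
    static String firstWord(String text) {
        int space = text.indexOf(' ');
        return space < 0 ? text : text.substring(0, space);
    }

    public static void main(String[] args) {
        System.out.println(concatCreatesNewObject());   // true
        System.out.println(firstWord("Hello world"));   // Hello
    }
}
```

Every call that appears to "change" a string really hands back another object, which is why long chains of concatenation can leave many short-lived intermediates behind.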

StringBuffer was introduced to solve the problem described above, namely that concatenation creates too many intermediate objects. Its append and insert methods add a string to the end of the existing sequence or at a specified position. StringBuffer is essentially a thread-safe, modifiable sequence of characters. That thread safety carries extra performance overhead, so unless thread safety is required, its successor StringBuilder is recommended.

StringBuilder was added in Java 1.5. It offers the same capabilities as StringBuffer but drops the thread-safety guarantee, which reduces overhead and makes it the preferred choice for string concatenation in most cases.
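As a small illustration of the append and insert operations, the sketch below builds one string in a single mutable buffer. The class name BuilderDemo is hypothetical; the same calls also exist on StringBuffer.

```java
class BuilderDemo {
    static String build() {
        StringBuilder sb = new StringBuilder();
        sb.append("Hello").append("!");   // buffer now holds "Hello!"
        sb.insert(5, ", world");          // insert before the '!' at index 5
        return sb.toString();             // one final String is created here
    }

    public static void main(String[] args) {
        System.out.println(build());      // Hello, world!
    }
}
```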

String design and implementation considerations

As mentioned earlier, String is a typical immutable class. It is inherently thread-safe because its internal data cannot be changed. The same property pays off in copy constructors: copying an immutable object does not require an additional copy of its data.

Now for some implementation details of StringBuffer. It achieves thread safety quite bluntly, by adding the synchronized keyword to every method that modifies data. In fact, this simple and direct approach suits most of the thread-safe classes we commonly write, and there is usually no need to worry about the cost of synchronized. As the saying goes, “premature optimization is the root of all evil”; in most application development, reliability, correctness, and code readability matter most.
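The effect of those synchronized methods can be observed with a rough sketch like the following, in which several threads append to one shared StringBuffer. The class name, method name, and thread counts are arbitrary choices for illustration.

```java
class SynchronizedAppendDemo {
    // Appends one character at a time from several threads and returns the final length.
    static int appendFromThreads(int threads, int appendsPerThread) {
        StringBuffer buffer = new StringBuffer();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < appendsPerThread; j++) {
                    buffer.append('x');   // append is synchronized in StringBuffer
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();            // wait until every worker has finished
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return buffer.length();
    }

    public static void main(String[] args) {
        System.out.println(appendFromThreads(4, 1000));   // 4000
    }
}
```

Because each append holds the buffer's lock, no update is lost and the length always equals threads × appendsPerThread. Swapping in StringBuilder removes that guarantee, so the result could be wrong or an exception could be thrown.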

Also, how large should the internal array of a StringBuffer or StringBuilder be? If it is too small, appending may force the creation of a new array that is large enough. If it is too large, space is wasted. In the current implementation, the initial capacity is the length of the initial string plus 16 (which means that if no string is passed in when the object is constructed, the initial capacity is 16). If we know that many appends will occur and the final size is roughly predictable, we can specify a suitable capacity up front and avoid the overhead of repeated expansion. Expansion is costly: the old array is discarded, a new and larger array is allocated (think of it as growing by a multiple), and the contents are copied over with arraycopy.
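The sizing rule above can be checked through the public capacity() method. This is a minimal sketch; the class and method names are hypothetical, and 64 is just an example size.

```java
class CapacityDemo {
    // No-argument constructor: documented initial capacity of 16.
    static int defaultCapacity() {
        return new StringBuilder().capacity();
    }

    // String constructor: length of the string plus 16.
    static int capacityFor(String initial) {
        return new StringBuilder(initial).capacity();
    }

    // Pre-sizing when the final length is roughly known avoids repeated expansion.
    static int presizedCapacity(int expectedLength) {
        return new StringBuilder(expectedLength).capacity();
    }

    public static void main(String[] args) {
        System.out.println(defaultCapacity());      // 16
        System.out.println(capacityFor("abc"));     // 19
        System.out.println(presizedCapacity(64));   // 64
    }
}
```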

String caching

We gathered some rough statistics by taking heap dumps of common applications and analyzing the object composition. On average, about 25% of the objects were strings, and roughly half of those were duplicates. Avoiding duplicate strings can therefore significantly reduce memory consumption and object-creation overhead.

String provides the intern() method, which asks the JVM to cache the string for reuse. When we call intern() on a string object, the cached instance is returned if an equal string is already in the cache; otherwise the string is added to the cache. Typically, the JVM caches all string literals such as “ABC” and other string constants.
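A short sketch of how intern() behaves next to a literal follows. The class name InternDemo and the returned array layout are illustrative only.

```java
class InternDemo {
    // Returns { built == literal, built.intern() == literal, built.equals(literal) }.
    static boolean[] compare() {
        String literal = "ABC";                                 // literals are cached by the JVM
        String built = new String(new char[] {'A', 'B', 'C'});  // a distinct object on the heap
        return new boolean[] {
            built == literal,            // different objects
            built.intern() == literal,   // intern() returns the cached instance
            built.equals(literal)        // contents are equal either way
        };
    }

    public static void main(String[] args) {
        boolean[] result = compare();
        System.out.println(result[0] + " " + result[1] + " " + result[2]);   // false true true
    }
}
```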

Looks pretty good, doesn’t it? But the reality may surprise you. On older versions such as Java 6, extensive use of intern() is generally not recommended. Why? The devil is in the details. Cached strings are stored in PermGen, the infamous “permanent generation”, which has limited space and is rarely touched by garbage collection outside of a full GC. So, if intern() is used carelessly, you will run into an OutOfMemoryError (OOM).

In later versions, the cache was moved to the heap, which largely avoids the problem of filling up the permanent generation; in JDK 8 the permanent generation itself was replaced by Metaspace (the metadata area). The default cache size has also grown over time, from the original 1009 to 60013 as of 7u40.

intern() is an explicit deduplication mechanism, but it has drawbacks because developers must call it explicitly when writing code. For one thing, calling it everywhere is inconvenient. For another, it is hard to apply effectively, because string duplication is difficult to predict while an application is being developed; some people consider it a practice that pollutes the code.

Fortunately, Oracle JDK 8u20 introduced a new feature: string deduplication under the G1 GC. It works by making strings with identical contents point to the same underlying data. This is a change inside the JVM itself and requires no modification to the Java class library.
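One point that is easy to miss is that this deduplication differs from intern(): only the backing data is shared, while the String objects themselves stay distinct. The sketch below illustrates that under stated assumptions: the class name is hypothetical, and the JVM options in the comment are the ones I would expect to enable the feature on a G1-based JVM.

```java
// Assumed launch options for enabling the feature:
//   java -XX:+UseG1GC -XX:+UseStringDeduplication DeduplicationDemo.java
class DeduplicationDemo {
    // Returns { a.equals(b), a == b } for two separately created, equal strings.
    static boolean[] check() {
        String a = new String(new char[] {'d', 'u', 'p'});
        String b = new String(new char[] {'d', 'u', 'p'});
        // Deduplication may later let a and b share one internal array,
        // but that is invisible here: the two references remain different objects.
        return new boolean[] { a.equals(b), a == b };
    }

    public static void main(String[] args) {
        boolean[] result = check();
        System.out.println(result[0] + " " + result[1]);   // true false
    }
}
```

The program behaves the same with or without the options, which is the point: the saving happens inside the JVM and needs no change to application code.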

The evolution of String itself

In Java 9, we introduced the Compact Strings design, which substantially improves String. It changes the data storage from a char array to a byte array plus an encoding identifier called coder, and all the related string-manipulation classes were modified accordingly. In addition, the related intrinsics were rewritten to ensure that there is no performance penalty.

Of course, in extreme cases strings also lose some capability, such as the maximum string size. With the original char array, the maximum string length was the length limit of the array itself. With a byte array of the same maximum length, a string whose characters need two bytes each can hold only half as many characters. Fortunately, this is a theoretical limit, and no real-world application has been found to be affected by it.