
There is a constant battle between memory and performance. As users, we want things to be fast and to use as little memory as possible. Unfortunately, improving performance usually comes at the cost of memory consumption (and vice versa).

Back in 2014, Chrome switched from a 32-bit to a 64-bit process. This change brought Chrome better security, stability, and performance, but it also increased memory consumption, since each pointer now takes up 8 bytes instead of 4. We took on the challenge of minimizing this overhead in V8.

Before implementing improvements, we need to know where we stand so that we can properly assess the impact. To measure current memory use and performance, we use a set of pages that represent popular real-world websites. The data showed that V8 contributes up to 60% of the memory footprint of Chrome’s renderer process on desktop, with an average of 40%.

Pointer compression is one of several ongoing efforts to reduce V8’s memory footprint. The idea is simple: instead of storing 64-bit pointers, we can store 32-bit offsets from some “base” address. With such a simple idea, how much can we gain from this compression in V8?

The V8 heap contains a whole slew of items, such as floating-point values, string characters, interpreter bytecode, and tagged values. Upon inspecting the heap, we found that on real-world websites these tagged values occupy around 70% of the V8 heap!

Let’s take a closer look at what these tagged values are.

Tagged values in V8

In V8, JavaScript values, whether they are objects, arrays, numbers, or strings, are represented as objects and allocated in the V8 heap. This allows us to represent any value as a pointer to an object.

Many JavaScript programs perform computations on integer values, such as incrementing an index in a loop. To avoid allocating a new number object every time an integer is incremented, V8 uses the well-known pointer tagging technique to store additional or alternative data in V8 heap pointers.

The tag bits serve a dual purpose: they signal either a strong/weak pointer to an object located in the V8 heap, or a small integer. Hence, the value of an integer can be stored directly in the tagged value, without having to allocate additional storage for it.

V8 always allocates objects in the heap at word-aligned addresses, which allows it to use the 2 (or 3, depending on the machine word size) least significant bits for tagging. On 32-bit architectures, V8 uses the least significant bit to distinguish Smis from heap object pointers. For heap pointers, it uses the second least significant bit to distinguish strong references from weak ones:

                        |----- 32 bits -----|
Pointer:                |_____address_____w1|
Smi:                    |___int31_value____0|

Here, w is used to distinguish strong pointers from weak ones.

Note: a Smi value can only carry a 31-bit payload, including the sign bit. In the case of pointers, we have 30 bits that can be used as a heap object address payload. Due to word alignment, the allocation granularity is 4 bytes, which gives us 4 GB of addressable space.
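To make the scheme above concrete, here is a minimal sketch of 32-bit tagging in C++. The constants and helper names are illustrative assumptions for this article, not V8’s actual definitions.

#include <cstdint>

// Illustrative tag constants (assumed, not V8's actual definitions):
// lowest bit 0 = Smi, 1 = heap object pointer; for pointers, the second
// bit distinguishes weak references from strong ones.
constexpr uint32_t kSmiTagMask = 0x1;
constexpr uint32_t kWeakTagMask = 0x2;

inline bool IsSmi(uint32_t tagged) { return (tagged & kSmiTagMask) == 0; }

inline uint32_t TagSmi(int32_t value) {
  // The 31-bit payload (including its sign bit) occupies the upper 31 bits.
  return static_cast<uint32_t>(value) << 1;
}

inline int32_t UntagSmi(uint32_t tagged) {
  // An arithmetic shift right restores the signed value.
  return static_cast<int32_t>(tagged) >> 1;
}

inline bool IsWeakHeapObject(uint32_t tagged) {
  return !IsSmi(tagged) && (tagged & kWeakTagMask) != 0;
}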

On a 64-bit architecture, V8 tagged values look like this:

            |----- 32 bits -----|----- 32 bits -----|
Pointer:    |________________address______________w1|
Smi:        |____int32_value____|0000000000000000000|

Unlike on 32-bit architectures, on 64-bit architectures V8 can use 32 bits for the Smi value payload. The implications of 32-bit Smis for pointer compression are discussed in the following sections.
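For comparison, here is a similar sketch of the full-pointer 64-bit Smi encoding shown above (again with illustrative helper names, not V8’s actual definitions): the 32-bit payload sits in the upper half-word, and the lower half-word is all zeros.

#include <cstdint>

inline uint64_t TagSmi64(int32_t value) {
  // Place the 32-bit payload into the upper half-word.
  return static_cast<uint64_t>(static_cast<uint32_t>(value)) << 32;
}

inline int32_t UntagSmi64(uint64_t tagged) {
  // Recover the payload from the upper half-word.
  return static_cast<int32_t>(tagged >> 32);
}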

Tagged values and the new heap layout

With pointer compression, our goal is to somehow fit both kinds of tagged values into 32 bits on 64-bit architectures. We can fit pointers into 32 bits by:

  • ensuring that all V8 objects are allocated within a 4 GB memory range
  • representing pointers as offsets within this range

Such a tight limit is unfortunate, but V8 in Chrome already limits the heap to 2 GB or 4 GB (depending on how powerful the underlying device is), even on 64-bit architectures. Other V8 embedders, such as Node.js, may require larger heaps. If we imposed a hard maximum of 4 GB, these embedders would not be able to use pointer compression.

The question now is how to update the heap layout so that 32-bit pointers uniquely identify V8 objects.

A Simple Heap Layout

A trivial compression scheme would be to allocate objects in the first 4 GB of the address space.

Unfortunately, V8 cannot do this: Chrome may need to create multiple V8 instances in the same renderer process, for example for Web/Service Workers. Moreover, with this scheme all V8 instances would compete for the same 4 GB of address space, so each of them would be constrained by a 4 GB memory limit.

Heap memory layout, v1

If we place the V8 heap in a contiguous 4 GB region of address space somewhere else, then an unsigned 32-bit offset from the base uniquely identifies a pointer.

If we also make sure the base is 4-GB-aligned, the upper 32 bits are the same for all pointers:

            |----- 32 bits -----|----- 32 bits -----|
Pointer:    |________base_______|______offset_____w1|

We can also make Smis compressible by limiting the Smi payload to 31 bits and placing it in the lower 32 bits, basically making Smis similar to what they are on 32-bit architectures.

         |----- 32 bits -----|----- 32 bits -----|
Smi:     |sssssssssssssssssss|____int31_value___0|

Here, s is the sign bit of the Smi payload, extended into the upper half-word. With this sign-extended representation, we can compress and decompress Smis with just a one-bit arithmetic shift of the 64-bit word.

Now we can see that the upper half-word of both pointers and Smis is completely defined by the lower half-word. Then, we can store only the latter in memory, reducing the memory required to store a tagged value by half:

                    |----- 32 bits -----|----- 32 bits -----|
Compressed pointer:                     |______offset_____w1|
Compressed Smi:                         |____int31_value___0|

Given that the base is 4-GB-aligned, compression is just a truncation:

uint64_t uncompressed_tagged;
uint32_t compressed_tagged = uint32_t(uncompressed_tagged);

The decompression code, however, is a bit more complicated. We need to distinguish between sign-extending a Smi and zero-extending a pointer, as well as whether or not to add the base.

uint32_t compressed_tagged;

uint64_t uncompressed_tagged;
if (compressed_tagged & 1) {
  // pointer case
  uncompressed_tagged = base + uint64_t(compressed_tagged);
} else {
  // Smi case
  uncompressed_tagged = int64_t(compressed_tagged);
}

So, let’s try changing the compression scheme to simplify the decompression code.

Heap memory layout, v2

If we place the base not at the beginning of the 4 GB reservation but in the middle, we can treat the compressed value as a signed 32-bit offset from the base. Note that the whole reservation is no longer 4-GB-aligned, but the base still is.

In this new layout, the compression code is the same as in the heap memory layout V1 above.

The decompression code, however, gets better. Sign-extension is now common to both the Smi and pointer cases, and the only branch left is whether to add the base in the pointer case.

int32_t compressed_tagged;

// Common code for both pointer and Smi cases
int64_t uncompressed_tagged = int64_t(compressed_tagged);
if (uncompressed_tagged & 1) {
  // pointer case
  uncompressed_tagged += base;
}

The performance of branches in code depends on the branch prediction unit in the CPU. We expected that performing the decompression in a branchless way would yield better performance. With a bit of bit-twiddling magic, we can write a branchless version of the code above:

int32_t compressed_tagged;

// Same code for both pointer and Smi cases
int64_t sign_extended_tagged = int64_t(compressed_tagged);
int64_t selector_mask = -(sign_extended_tagged & 1);
// Mask is 0 in case of Smi or all 1s in case of pointer
int64_t uncompressed_tagged =
    sign_extended_tagged + (base & selector_mask);

So, we decided to start with the branchless implementation.

Performance evolution

The initial performance

We test performance using Octane, a peak-performance benchmark we have used in the past. Although we no longer focus on improving peak performance in our day-to-day work, we don’t want to regress it either, especially for something as performance-sensitive as all pointers. Octane remains a good benchmark for this task.

The graph shows the Octane score on the x64 architecture as we worked on pointer compression. In the graph, the higher the line, the better. The red line is the x64 build with uncompressed (full) pointers, and the green line is the pointer-compressed version.

At the beginning, the regression was about 35%.

Bump (1), +7%

First, we verified the hypothesis that “branchless is faster” by comparing branchless decompression with branchful decompression. Our hypothesis turned out to be wrong: on x64, the branchful version was 7% faster. That’s quite a noticeable difference!

Let’s take a look at the x64 assembly.

r13 is a dedicated register used for the base value. Notice that the branchless code is both bigger and requires more registers.

On Arm64, we observed the same phenomenon: the branchful version was noticeably faster on powerful CPUs (even though the code size was the same in both cases).

On low-end Arm64 devices we found very little performance difference in either direction.

Our lesson: branch predictors in modern CPUs are very good, and code size (in particular, the length of the execution path) affected performance more.

Bump (2), +2%

TurboFan is V8’s optimizing compiler, built around a concept called the “Sea of Nodes”. In short, each operation is represented as a node in a graph (see this blog post for a more detailed explanation). These nodes have various dependencies, including both data flow and control flow.

There are two operations that are crucial for pointer compression: loads and stores, because they connect the V8 heap with the rest of the pipeline. If we were to decompress a compressed value every time it is loaded from the heap and compress it before it is stored, the pipeline could keep working just as it does in full-pointer mode. So we added new explicit operations to the node graph: Compress and Decompress.

Decompression is not required in some cases, for example, if a compressed value is simply loaded from one location and stored in a new location.

In order to optimize away unnecessary operations, we implemented a new “Decompression Elimination” phase in TurboFan. Its job is to eliminate decompressions directly followed by compressions. Since these nodes might not be right next to each other, it also tries to propagate decompressions through the graph, in the hope of encountering a compression down the line and eliminating them both. This gave us a 2% improvement of the Octane score.
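As a rough illustration of the idea (with a made-up node representation; the real sea-of-nodes IR and the actual phase are far more involved), the core rewrite is simply that a Decompress node fed by a Compress node can be replaced by the original value:

enum class Op { kCompress, kDecompress, kOther };

struct Node {
  Op op;
  Node* input;  // single value input, for brevity
};

// Decompression Elimination, reduced to its essence:
// Decompress(Compress(x)) ==> x
Node* TryEliminateDecompression(Node* node) {
  if (node->op == Op::kDecompress && node->input != nullptr &&
      node->input->op == Op::kCompress) {
    return node->input->input;
  }
  return node;
}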

Bump (3), +2%

While looking at the generated code, we noticed that decompressing a value that had just been loaded produced code that was a bit too verbose:

movl rax, <mem>   // load
movsxlq rax, rax  // sign extend

Once we fixed this so that the sign extension happens as part of the load, the value could be loaded directly from memory:

movsxlq rax, <mem>

We got another 2% improvement.

Bump (4), +11%

TurboFan’s optimization phases work via pattern matching on the graph: once a sub-graph matches a certain pattern, it is replaced with a semantically equivalent (but better) sub-graph or instruction.

Unsuccessful attempts to find a match do not fail in an explicit way. The presence of explicit Compress/Decompress operations in the graph caused previously successful pattern-matching attempts to fail silently, so optimizations stopped firing with no warning.

One example of a broken optimization is allocation pretenuring. Once we updated the pattern matching to be aware of the new Compress/Decompress nodes, we got another 11% improvement.

Bump (5), +0.5%

We learned a lot while implementing Decompression Elimination in TurboFan. The explicit Compression/Decompression node approach has the following properties:

Advantages:

  • It makes it obvious where unnecessary decompressions can be optimized away by canonical pattern matching on sub-graphs.

However, as the implementation progressed, we discovered shortcomings:

  • The explosion of new internal value representations became unmanageable. In addition to the existing set (tagged Smi, tagged pointer, tagged any, word8, word16, word32, float32, float64, simd128), we now also had compressed pointer, compressed Smi, and compressed any (a compressed value that could be either a pointer or a Smi).

  • Existing optimizations based on graph pattern matching silently stopped firing, which caused regressions here and there. Although we found and fixed them, the complexity of TurboFan kept increasing.

  • The register allocator was increasingly unhappy about the number of nodes in the graph, and quite often generated poor code.

  • The larger node graphs slowed down the TurboFan optimization phases and increased memory consumption during compilation.

We decided to take a step back and think of a simpler way of supporting pointer compression in TurboFan. The new approach is to drop the compressed pointer/Smi/any representations and make all explicit Compression/Decompression nodes implicit within stores and loads, assuming that we always decompress on load and compress before store.

We also added a new phase in TurboFan that replaced Decompression Elimination. This new phase recognizes when we don’t actually need to compress or decompress and updates the loads and stores accordingly. This approach significantly reduced the complexity of pointer compression support in TurboFan and improved the quality of the generated code.

The new implementation works at least as well as the original, and gave an additional 0.5% improvement.

Bump (6), +2.5%

We were getting close to parity, but the gap was still there. We had to come up with fresher ideas. One of them was: what if we ensured that any code that deals with Smi values never “looks” at the upper 32 bits?

Previous decompression implementation:

// Old decompression implementation
int64_t uncompressed_tagged = int64_t(compressed_tagged);
if (uncompressed_tagged & 1) {
  // pointer case
  uncompressed_tagged += base;
}

If we ignore the upper 32 bits of a Smi, we can assume them to be undefined. Then, we can avoid the special-casing between pointers and Smis and unconditionally add the base when decompressing, even for Smis! We call this approach “Smi-corrupting”.

// New decompression implementation
int64_t uncompressed_tagged = base + int64_t(compressed_tagged);

Since we no longer care about sign-extending Smis, this change allows us to return to heap layout v1, the one with the base pointing to the beginning of the 4 GB reservation.

In terms of the decompression code, this change turns a sign-extension into a zero-extension, which is just as cheap. It does, however, simplify things on the runtime (C++) side, for example the code that reserves the address space region (see the implementation details section below).

Here is the generated code for comparison:

So, we adapted all the Smi-handling code pieces in V8 to the new compression scheme, which gave us another 2.5% performance improvement.

Remaining Gap

The remaining performance gap can be explained by two optimizations for 64-bit builds that were disabled due to incompatibility with pointer compression.

32-bit Smi optimization (7), -1%

As a reminder, this is what Smis look like in full-pointer mode on 64-bit architectures:

        |----- 32 bits -----|----- 32 bits -----|
Smi:    |____int32_value____|0000000000000000000|

32-bit Smis have the following benefits:

  • They can represent a greater range of integers without having to box them into number objects
  • This shape provides direct access to the 32-bit value when reading/writing

This optimization cannot be combined with pointer compression, because there is no room in the 32-bit compressed value: one bit is needed to distinguish pointers from Smis. If we disable 32-bit Smis in the full-pointer 64-bit build, we see the Octane score drop by 1%.

Double field unboxing (8), -3%

Boxing is when the compiler automatically wraps a primitive value into an object of the corresponding wrapper class. Unboxing is the reverse.

Under certain assumptions, this optimization attempts to store floating-point values directly in an object’s fields. The goal is to reduce the number of number-object allocations even further than Smis alone can.

Imagine the following code:

function Point(x, y) {
  this.x = x;
  this.y = y;
}
const p = new Point(3.1, 5.3);

Generally speaking, the object p looks like this in memory:

You can read more about hidden classes, properties, and elements backing stores in this article.

On 64-bit architectures, double values are the same size as pointers. So, if we assume that Point’s fields always contain number values, we can store them directly in the object’s fields.

If the assumption breaks for some field, say after executing the following code:

const q = new Point(2, 'ab');

then number values for the y property must be stored boxed instead. Additionally, if there is speculatively optimized code somewhere that relies on this assumption, it must be discarded (deoptimized). The reason for such “field type” generalizations is to minimize the number of shapes of objects created by the same constructor (in a JavaScript application many objects share the same keys, and the JS engine stores this layout information in one place; see Shapes and Inline Caches), which in turn is necessary for stable performance.

When applicable, double field unboxing gives us the following benefits:

  • Provides direct access to the floating-point data through the object pointer, avoiding the additional dereference via a number object (see the sketch below)
  • Allows us to generate smaller and faster optimized code for tight loops doing a lot of double field accesses (for example, in number-crunching applications)
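The following sketch contrasts the two representations with plain C++ structs; it is an illustration of the idea only, not V8’s actual object layout. With boxed fields, each number costs a separate heap allocation and a dereference; with unboxed fields, the 64-bit IEEE value lives directly in the object.

// Illustration only, not V8's actual object layout.
struct HeapNumber {
  double value;  // a separately allocated, "boxed" number object
};

struct PointWithBoxedFields {
  HeapNumber* x;  // reading x costs an extra allocation and dereference
  HeapNumber* y;
};

struct PointWithUnboxedFields {
  double x;  // value stored directly in the object
  double y;
};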

With pointer compression enabled, double values no longer fit into the compressed fields. However, we may adapt this optimization for pointer compression in the future.

Note that, even without double field unboxing (or with it reintroduced in a way that is compatible with pointer compression), high-throughput numeric code can still be written in an optimizable way by storing the data in Float64 TypedArrays, or even by using Wasm.

More optimizations (9), +1%

Finally, tweaking the decompression elimination optimization in TurboFan resulted in another 1% performance increase.

Some implementation details

To simplify the integration of pointer compression into existing code, we decided to decompress values on every load and compress them on every store. Thus, only the storage format of tagged values changes, while the execution format remains unchanged.
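As a rough sketch of what this means in code, loads and stores of tagged fields look roughly like this. The helper names and the global base variable are hypothetical, and the sketch assumes the final “Smi-corrupting” scheme with an unconditional base add.

#include <cstdint>

static uintptr_t heap_base;  // 4-GB-aligned start of the heap reservation (assumed)

inline uint32_t CompressTagged(uintptr_t tagged) {
  return static_cast<uint32_t>(tagged);  // compression is just a truncation
}

inline uintptr_t DecompressTagged(uintptr_t base, uint32_t compressed) {
  return base + compressed;  // zero-extension plus unconditional base add
}

inline void StoreTaggedField(uintptr_t object, int offset, uintptr_t value) {
  *reinterpret_cast<uint32_t*>(object + offset) = CompressTagged(value);
}

inline uintptr_t LoadTaggedField(uintptr_t object, int offset) {
  uint32_t compressed = *reinterpret_cast<uint32_t*>(object + offset);
  return DecompressTagged(heap_base, compressed);
}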

Native code

In order to generate efficient code whenever decompression is needed, the base value must always be available. Fortunately, V8 already had a dedicated register always pointing to a “roots table”, which contains references to JavaScript and V8-internal objects that must always be available (for example, undefined, null, true, false, and many more). This register is called the root register, and it is used for generating smaller and shareable builtins code.

So, we put the roots table into the V8 heap reservation, and the root register can now be used for two purposes at once:

  • As the root pointer
  • As the base value for decompression

C++ side

The V8 runtime accesses objects in the V8 heap through C++ classes that provide a convenient view of the data stored in the heap. Note that V8 objects are rather POD-like structures than C++ objects. The helper “view” classes contain just one uintptr_t field with the corresponding tagged value. Since the view classes are word-sized, we can pass them around by value with zero overhead (thanks to modern C++ compilers).

Here is pseudocode for a helper class:

// Hidden class
class Map {
 public:
  ...
  inline DescriptorArray instance_descriptors() const;
  ...
  // The actual tagged pointer value stored in the Map view object.
  const uintptr_t ptr_;
};

DescriptorArray Map::instance_descriptors() const {
  uintptr_t field_address = FieldAddress(ptr_, kInstanceDescriptorsOffset);

  uintptr_t da = *reinterpret_cast<uintptr_t*>(field_address);
  return DescriptorArray(da);
}

In order to minimize the number of changes required for a first run of the pointer-compressed version, we integrated the computation of the base value needed for decompression into the getters.

inline uintptr_t GetBaseForPointerCompression(uintptr_t address) {
  // Round the address down to the 4 GB boundary.
  const uintptr_t kBaseAlignment = uintptr_t{1} << 32;
  return address & -kBaseAlignment;
}

DescriptorArray Map::instance_descriptors() const {
  uintptr_t field_address = FieldAddress(ptr_, kInstanceDescriptorsOffset);

  uint32_t compressed_da = *reinterpret_cast<uint32_t*>(field_address);

  uintptr_t base = GetBaseForPointerCompression(ptr_);
  uintptr_t da = base + compressed_da;
  return DescriptorArray(da);
}

Performance measurements confirmed that computing the base on every load hurts performance. The reason is that C++ compilers don’t know that GetBaseForPointerCompression() returns the same result for any address from the V8 heap, so the compiler cannot merge the computations of the base values. Given that the code consists of several instructions and a 64-bit constant, this causes significant code bloat.

In order to address this, we reused the V8 instance pointer as the base for decompression (remember, the V8 instance data is part of the heap layout). This pointer is usually available in runtime functions, so we simplified the getters code by requiring a V8 instance pointer, and that recovered the performance:

DescriptorArray Map::instance_descriptors(const Isolate* isolate) const {
  uintptr_t field_address =
      FieldAddress(ptr_, kInstanceDescriptorsOffset);

  uint32_t compressed_da = *reinterpret_cast<uint32_t*>(field_address);

  // No rounding is needed since the Isolate pointer is already the base.
  uintptr_t base = reinterpret_cast<uintptr_t>(isolate);
  uintptr_t da = DecompressTagged(base, compressed_da);
  return DescriptorArray(da);
}

Results

Let’s take a look at the final results of pointer compression! For these results, we use the same site tests described at the beginning of this article. As a reminder, they represent real-world usage of popular websites.

We found that pointer compression reduces the V8 heap size by 43%! In turn, it reduces the memory footprint of Chrome’s renderer process on desktop by 20%.

Another important thing to note is that not every website improves by the same amount. For example, Facebook used more V8 heap memory than the New York Times without pointer compression, but with this optimization the situation is actually reversed. This difference can be explained by the fact that some websites have more tagged values than others.

In addition to these memory improvements, we also saw real-world performance improvements. On real websites we use less CPU and garbage collector time!

Conclusion

The journey to get here was not easy, but it was worth it. After 300+ commits, V8 with pointer compression performs as if it were a 64-bit application while having the memory footprint of a 32-bit one.

We are always looking for further performance improvements, and have the following related tasks in our pipeline:

  • Improve the quality of the generated assembly code. We know that in some cases we can generate less code, which should improve performance.
  • Address the related performance regressions, including a mechanism that allows unboxing double fields again in a pointer-compression-friendly way.
  • Explore the idea of supporting larger heaps, in the 8 to 16 GB range.