Translator: is_january

The original link

With the recent addition of SharedArrayBuffer, concurrency is finding its way into the JavaScript language. This feature allows JavaScript programs to perform concurrent accesses to SharedArrayBuffer objects. WebKit supports SharedArrayBuffer, and it has full optimization support in our compiler pipeline. Unfortunately, JavaScript does not allow any objects other than SharedArrayBuffer to be shared.

This article considers a fanciful experiment: what would it take to extend concurrency to the entire JavaScript heap? In a world where any object can be shared with other threads, this would be no small change. Current JavaScript virtual machine (VM) optimizations exploit the fundamental fact that there is only one thread of execution, so concurrency certainly presents some performance problems. The question this article considers is whether this is technically possible, and if so, at what cost.

We provide a basic strawman API to illustrate what we mean by concurrency, and most of this article focuses on how WebKit's JavaScript VM (called JavaScriptCore, or JSC for short) could implement the strawman. Implementing it would take a lot of work, and we want any implementation to satisfy the following goals:

  • No performance degradation for code that does not use concurrency.

  • Linear scalability when a program runs in parallel with itself without intentionally sharing any objects. We don't expect two threads to be exactly twice as fast as one, since we assume there will be some concurrency overhead, but if you have two CPUs, two threads should be faster than one, and ideally close to twice as fast.

  • Linear scalability, including a speed-up over the best serial baseline, on a parallel code base in which some objects really are shared.

  • Sensible semantics, including a memory model no weaker than what modern hardware provides.

  • Compatibility. For example, concurrent JavaScript programs should be able to use the DOM without requiring the DOM's implementation logic to be rewritten.

Our proposed implementation targets 64-bit systems, but that's mainly because our engine is already 64-bit-centric.

We cannot yet evaluate the performance of this scheme empirically, but our implementation sketch does give a sense of what the performance would likely be. Specifically, the scheme guarantees that most property accesses cost at most about one extra arithmetic instruction (i.e., almost nothing), while some extremely complicated (and rare) accesses cost about seven times as much as today. At the end of this article, we compare our scheme to other techniques for achieving concurrency.

A strawman for concurrent JS

Our strawman proposal for concurrent JS is simply to add threads. Threads get separate stacks but share everything else. Threads are great for our experiment because they are fairly general. We can imagine implementing many other kinds of concurrent programming models on top of threads. So, if we can get our VM to support threads, we can probably get it to support many other concurrent and parallel programming models. This article is thus really about the technical feasibility of adding any concurrent programming model to JavaScript.

This section makes the strawman somewhat concrete, mostly to provide context on which kinds of usage would be hard or easy to implement. Knowing what the API looks like is also useful for understanding what constraints it imposes.

Each program starts with one thread, which can start other threads. Threads live in the same heap and can share objects with each other. This section shows the API, describes the memory model, and shows how the concurrency model interacts with the DOM.

Possible API

We propose:

  • A simple API for creating threads,

  • A change to the Atomics object to support building lock objects,

  • A lock and condition variable API that can be built on top of Atomics,

  • A way of creating thread-local variables, and

  • Some helpers to allow for incremental adoption.

This article uses this strawman to understand exactly what it would take to implement threads in JavaScriptCore.

We want to create threads easily:

new Thread(function() { console.log("Hello, threads!"); });

This starts a new thread, which will eventually print "Hello, threads!". Note that even this simple example shares many things between threads. For example, the function object captures the lexical scope in which it was created, so when the thread accesses the console variable, it is accessing a shared object.

Threads can be joined to wait for them to complete and get results:

let result = new Thread(() => 42).join(); // returns 42


In browsers, the main thread cannot block, so join() throws an exception when the above code runs on the main thread. We can support an asynchronous version of blocking operations:

new Thread(() => 42).asyncJoin().then((result) => /* result is 42 */)

You can always get the thread object of the current thread:

let myThread = Thread.current;

Threads may need to wait for each other to prevent races, which can be achieved with locks. Rather than simply providing a lock API, we want to extend the Atomics API (mentioned earlier) to allow users to build any locking mechanism they like. We provide a good lock implementation in our proposed API, but we want to encourage the creation of other kinds of synchronization primitives. The existing SharedArrayBuffer specification already allows developers to create custom locking mechanisms using the Atomics API. That API lets you say things like:

Atomics.wait(array, index, expectedValue);

And:

Atomics.wake(array, index, numThreadsToWake);

Currently, array must be an integer typed array backed by a SharedArrayBuffer. We propose extending all Atomics methods that take an array and an index to also accept an object and a property name. Since an index is a property name, this change does not alter the behavior of code that already uses these APIs with SharedArrayBuffer. It also means that Atomics methods that currently take integer values (for storing or comparing elements in typed arrays) would accept any JavaScript value when used with ordinary JavaScript properties. Atomics.wake, Atomics.wait, and Atomics.compareExchange are sufficient for implementing any locking mechanism over a single JavaScript property.
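For example, here is a minimal sketch of a mutex built on the proposed object-and-property forms of these methods. The SpinLock class and its state property are our own illustration (usable off the main thread, since Atomics.wait blocks), not part of the proposal; it assumes the extended Atomics semantics described above:

class SpinLock {
    constructor() { this.state = 0; } // 0 = unlocked, 1 = locked.
    hold(callback) {
        // Try to flip state from 0 to 1; if another thread holds the
        // lock, sleep until it wakes us, then retry.
        while (Atomics.compareExchange(this, "state", 0, 1) !== 0)
            Atomics.wait(this, "state", 1);
        try {
            return callback();
        } finally {
            Atomics.store(this, "state", 0);
            Atomics.wake(this, "state", 1); // Wake at most one waiter.
        }
    }
}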

In addition, we propose adding a Lock API:

let lock = new Lock();
lock.hold(function() { /* ... perform work with lock held... */ });

Locking on the main thread could be done with promises:

lock.asyncHold().then(function() { /* ... perform work with lock held... */ });

This works because each thread has its own runloop. We also propose adding a Condition API:

let cond = new Condition();
cond.wait(lock); // Wait for a notification while the lock is temporarily released.
// ...
cond.asyncWait(lock).then(function() { /* ... perform work with lock reacquired... */ });
// ...
cond.notify(); // Notify one thread or promise.
// ...
cond.notifyAll(); // Notify all threads and promises.

Condition.prototype.wait releases the lock you pass it before waiting, and reacquires it before returning. The asynchronous variant associates the resulting promise with the condition variable, so that if the condition is notified, the promise is fulfilled on the current thread.
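As a concrete illustration, a tiny blocking queue could be built from the proposed Lock and Condition APIs like this. The BlockingQueue class itself is our example, not part of the proposal; since take() blocks, the main thread would use the async variants instead:

class BlockingQueue {
    constructor() {
        this.lock = new Lock();
        this.cond = new Condition();
        this.items = [];
    }
    put(item) {
        this.lock.hold(() => {
            this.items.push(item);
            this.cond.notify(); // Wake one waiting taker, if any.
        });
    }
    take() {
        return this.lock.hold(() => {
            while (this.items.length == 0)
                this.cond.wait(this.lock); // Releases and reacquires the lock.
            return this.items.shift();
        });
    }
}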

With Thread.current and WeakMap, anyone can implement thread-local variables. But sometimes the underlying runtime can do something smarter, and we don't want to require JavaScript programmers to know how to implement thread-local variables with WeakMap. Therefore, we propose a simple API:

let threadLocal = new ThreadLocal();
function foo()
{
    return threadLocal.value;
}
new Thread(function() {
    threadLocal.value = 43;
    print(""Thread sees "" + foo()); // Will always print 43.
});
threadLocal.value = 42;
print(""Main thread sees "" + foo()); // Will always print 42.

Finally, if the user wants to assert that an object may only be used by one thread, we want to make that easy:

var o = {f: ""hello""}; // Could be any object.
Thread.restrict(o);
new Thread(function() {
    console.log(o.f); // Throws ConcurrentAccessError
});


Any object passed to Thread.restrict should throw ConcurrentAccessError in response to any proxyable operation performed by a thread other than the one that called Thread.restrict on it.

The memory model

Processors and compilers both like to reorder memory accesses, and both also like to hoist loads out of memory. Hoisting happens when code like this:

let x = o.f
o.g = 42
let y = o.f

gets transformed by the compiler or processor into:

let x = o.f
o.g = 42
let y = x

In effect, the load into y is hoisted above the store to o.g. Processors do the same optimization dynamically, by caching o.f and reusing the cached result for the second load.

Processors and compilers also like to sink stores into memory. This happens when code like this:

o.f = 42
let tmp = o.g
o.f = 43

gets transformed by the compiler or processor into:

let tmp = o.g
o.f = 43

This is as if the o.f = 42 statement were moved to just before o.f = 43. Even if the compiler doesn't do this transformation, the processor may buffer the store and execute it at a more convenient time. Another way to think of it is that o.f = 42 might execute after let tmp = o.g.

For our strawman, it makes the most sense to follow the existing SharedArrayBuffer memory model. It is already possible to write multithreaded code with boundedly nondeterministic behavior using SharedArrayBuffer, and we are not going to protect completely against such code. But JavaScript objects are much more complicated than a buffer, so it's not easy to preserve the invariants of the JS object model in the face of races. We propose that the fundamental operations on JavaScript objects execute atomically. An operation being atomic means that if many threads perform it concurrently, each behaves as if they had all executed in some sequential order, with no concurrency. Each of the low-level operations that JS can do to an object should be atomic:

  • Adding a property

  • Deleting a property

  • Getting a property's value

  • Setting a property's value

  • Changing a property's configuration

  • Snapshotting the set of property names

This doesn't always mean that an expression like o.f is atomic, because the expression may involve more than loading one property value. Specifically:

  • If f is a plain property stored directly on o, then the access is atomic.

  • If o.f is a prototype access, then loading the prototype is separate from loading f from the prototype.

  • If f is a getter, then loading the getter is one step (which is atomic if the getter is defined directly on o), but calling the getter is not atomic, since the getter may execute arbitrary code.

We propose that the low-level object operations be atomic. Some operations, like getting and setting a property's value, might be implemented using hardware primitives that allow reordering. We propose allowing gets and sets to be reordered around each other, subject to the same memory model that gets and sets on SharedArrayBuffer already have. While our strawman does allow races and some memory-model strangeness, it does not allow the invariants of the JavaScript object model to be broken. For any heap created by a concurrent JS program, it should be possible to write a sequential JS program that creates an indistinguishable heap.
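For example, under this model the following racy program (written with the strawman Thread API) is boundedly nondeterministic; the comments describe the allowed outcomes, assuming plain sets and gets may reorder as described above:

let o = {f: 0, g: 0};
new Thread(function() {
    o.f = 1; // These two plain sets may be reordered with each other...
    o.g = 1;
});
// ...so this read may observe (0, 0), (1, 0), (0, 1), or (1, 1), but o
// always remains a well-formed object: no crash, no lost property,
// no corruption.
console.log(o.f, o.g);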

Finally, we propose that memory management of the concurrent JS heap happen just as it does in other garbage-collected multithreaded languages. Simply put, garbage collection must be atomic and should only occur at well-defined safe points, such as loop back-edges, allocation sites, or calls to native code that doesn't touch the heap. This requirement is met by popular garbage collection algorithms, such as those in HotSpot or MMTk. It is satisfied by classic algorithms like mark-sweep and semi-space, and even by collectors with no global stop-the-world phase, including ones that are so lock-free that they don't even stop threads to scan their stacks. WebKit's Riptide GC already mostly supports multiple threads, because our JIT threads can access the heap.

Interaction with the DOM

It's hard enough to extend concurrency to all of JavaScript; extending it to all of the DOM is even harder. We propose extending threads just far enough into the DOM for this strawman to be useful.

We propose that, by default, DOM objects throw ConcurrentAccessError in response to any proxyable operation from any thread other than the main thread, just as if Thread.restrict had been called on them from the main thread.

However, a few objects need to allow concurrent access just for the language itself to work. Something as obvious as:

new Thread(function() { console.log("Hello, threads!"); });

requires concurrent access to a DOM object. In WebKit, the DOM global object is responsible for storing the window's variable state. Ordinary JS properties of the global object (including console, Object, and so on) have to be accessible to concurrent threads, because those properties are the scripts' global variables. Accesses from other threads to the global object's other exotic properties, or to properties on the global object's prototype chain, would throw. Attempts to add or delete properties on the global object from non-main threads would also throw. These restrictions mean that handling this limited kind of concurrent property access to the DOM global object in WebKit is not much harder than handling the same accesses to ordinary JS objects.

Additionally, namespace objects like console can be accessed by concurrent threads, because they can simply use the JS object model. There's no reason to restrict them to the main thread, and it's important that console be accessible because of its debugging functionality.

Strawman summary

This section has proposed a strawman threads API for the JavaScript language. It is sufficient for implementing custom synchronization primitives, and powerful enough to allow racy programs to execute, but it doesn't allow races to break the language. Threads are general enough that many other programming models could be implemented on top of them.

Implementing concurrent JS in WebKit

This section shows that we could implement our thread-based strawman without disabling any of JavaScriptCore's fundamental performance optimizations. Our scheme aims for near-zero overhead, even when multiple threads are simultaneously reading and writing the same object.

JavaScript has a lot in common with languages like Java and .NET, which already support threads. Here are some commonalities between those languages and JavaScript:

  • Like JavaScript, those languages use tracing-based garbage collectors. Our garbage collector mostly supports multiple threads already, since JIT threads can already read the heap. The biggest missing piece is thread-local allocation, which would enable concurrent allocation. For our garbage collector, this means giving each allocator a separate FreeList per thread. Our garbage collector has a fixed number of allocators, and we already have fast thread-local storage, so this is a mechanical change.

  • Like JavaScript, those languages are implemented with multiple JIT tiers and possibly an interpreter. Our WebAssembly VM already supports multi-threaded tier-up from BBQ (Build Bytecode Quickly) to OMG (Optimized Machinecode Generation). The same approach should work for JavaScript.

  • Like JavaScript implementations, those languages use inline caches to speed up dynamic operations. We will need to make some changes to our inline caches to support concurrency, which we describe later in this article. But this is not the hardest part of adding concurrency; it isn't new territory, since it has been done before.

These similarities imply that many of the techniques already used to support concurrency in Java virtual machines can be reused to implement concurrency in JavaScript.

The really hard part is JavaScript's ability to dynamically reconfigure objects. Java and .NET have fixed-size objects (once allocated, an object never changes size), whereas JavaScript objects can change size. Concurrency in those statically typed languages relies on the fact that concurrent accesses to fixed-size objects are atomic by default, up to the machine's pointer width (so 64-bit systems perform atomic 64-bit accesses by default). A pointer value in Java or .NET points to a contiguous slice of memory holding the object's data; a field access needs only some address arithmetic (like adding an offset) and a single memory access instruction. Even if the memory model allows surprising reorderings, racing accesses to the same field (or to different fields of the same object) never corrupt the whole object or cause a crash. JavaScript's variable-size objects, on the other hand, mean that object accesses sometimes require multiple memory access instructions, and a sequence of operations containing multiple memory accesses is not atomic by default. In the worst case, the VM could crash because of corrupted internal object state. Even short of that, races could cause lost writes or time travel of a write (writing A and then B to a field could cause readers to see A, then B, then A again).

In our strawman proposal, the fundamental operations of concurrent JavaScript, such as adding a property to an object, changing a property's value, reading a property, and deleting a property, all execute atomically. No race should ever crash the VM, lose a write, or cause a property value to time-travel.
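To make that concrete, here is the kind of guarantee the strawman gives, using the Thread API from earlier. Whatever the interleaving, both property additions take effect:

let o = {};
let t1 = new Thread(() => { o.f = 1; });
let t2 = new Thread(() => { o.g = 2; });
// (In a browser, join() must be called off the main thread, or via asyncJoin().)
t1.join();
t2.join();
// Adding a property is atomic, so neither addition can clobber the other:
// this always prints 1 2.
console.log(o.f, o.g);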

We propose an algorithm that allows most JavaScript object accesses to be wait-free, with minimal overhead compared to our existing serial JS implementation. A wait-free operation executes without ever blocking and completes in a bounded number of steps regardless of contention. The algorithm borrows ideas from real-time garbage collection, locking algorithms, and type inference. We plan a tiered defense against concurrency overhead:

  1. We propose using our existing polymorphic-inline-cache-based type inference system to infer a relaxed notion of thread-locality of objects (and their types), which we call transition-thread-locality (TTL). TTL objects use the same object model as today, and accesses to them have close to zero overhead. TTL does not imply true thread-locality; for example, even objects that are read and written by many threads can be treated as TTL.

  2. We propose using a segmented object model for objects that fail the TTL inference. This introduces an extra load instruction, and some extra arithmetic, on the fast path of property accesses. We fear that this technique alone would not be fast enough to meet our performance goals, but since the segmented object model would be applied only to the precise set of objects (or types) that fail the TTL inference, we suspect the net cost would be small enough to be practical.

  3. Operations on non-TTL objects that don't benefit from the segmented object model use locks. While developing our Riptide concurrent garbage collector, we already added an internal lock to every object. This lock occupies only two bytes, and its lock/unlock fast paths need just one atomic compare-and-swap (CAS) and a branch each.

Before getting into the details, we first describe some hardware assumptions our proposed system relies on. We then describe the proposal in reverse: we consider the cost of using per-object locking alone; we then review JSC's existing object model and explain which operations are already atomic in it; next we introduce segmented butterflies, which are the key to allowing dynamic reconfiguration of objects; then we show how TTL inference lets us keep the existing object model for most of the JavaScript heap. The rest of the article covers follow-up issues, such as how to make inline caches concurrent, how resizable arrays fit with TTL and segmented butterflies, how to handle large thread counts using the local optimizer lock (LOL), and how to handle native state that isn't thread-safe.

Hardware requirements

This article describes how to convert JavaScriptCore into a JavaScript VM that supports concurrency. JSC is currently optimized for 64-bit systems, and most of its existing concurrency support (concurrent JIT, concurrent garbage collection) only works on 64-bit systems. This section briefly summarizes what we expect from 64-bit systems in order for our scheme to work.

JSC is optimized for x86-64 and ARM64. The scheme assumes the weaker of the two memory models (ARM64), and assumes that the following atomic instructions are cheap enough that it's practical to use them as optimizations over locking:

  • 64-bit loads and stores are atomic by default, and we expect that these accesses may be reordered around other accesses. But we also expect that if memory access instruction B depends on data produced by memory access instruction A, then A always executes before B. For example, in a dependent load chain like a->f->g, a->f always executes before ->g.

  • 64-bit CAS (compare-and-swap). JSC uses 64-bit words for JavaScript properties and for two important pieces of metadata in the object header: the type header and the butterfly pointer. We want to be able to atomically compare-and-swap any JS property, and either of the metadata words in the object header.

  • 128-bit DCAS (double-word compare-and-swap). Sometimes, we want to CAS all of a JS object's metadata at once: that is, CAS the 64-bit type header together with the adjacent 64-bit butterfly pointer.

Per-object Locking

Each object in JSC already has a lock, which we use to synchronize certain fundamental JavaScript operations with the garbage collector. We could use this lock to protect any operations that need to be synchronized between JavaScript threads. The rest of this article deals with optimizations that let us avoid this lock, but first, let's consider what the lock costs. Our internal object locking algorithm requires one CAS and a branch to lock, and another CAS and branch to unlock. In the best case, a CAS takes about 10 cycles to execute; that is the optimistic cost, assuming the CAS succeeds. On some hardware, the cost is much higher. Assuming a branch costs one cycle, that's at least 22 extra cycles for every object operation that needs the lock. Some rare operations are already expensive enough that 22 extra cycles is fine, but many fast-path operations could not absorb that cost. Below, we consider some operations and how they would be affected if they had to use the lock:

  • Adding a new property currently requires only one load, a branch, and two stores on the optimized fast path. Each of those is a single-cycle instruction, for a total of about four cycles in the fastest case. Adding locking raises this to 26 cycles, roughly a 7x slowdown. In some cases, we can optimize adding a new property down to a single store; in those cases, locking makes it 23x slower.

  • Changing the value of an existing property requires a load, a branch, and a store, which can be as cheap as three cycles. Adding two CASes and two branches brings it to 25 cycles, an 8x slowdown. In some cases, we can optimize property stores down to a single store instruction; locking makes those nearly 23x slower.

  • Loading an existing property's value requires a load, a branch, and another load. Locking makes this up to 8x slower, or up to 23x slower in the cases where we can optimize the access down to a single load instruction.

  • Deleting a property is already slow, so locking it is acceptable. Deletion is relatively rare, but we have to support it anyway.

  • Dictionary lookups, JSC-speak for dynamic property accesses that don't benefit from any inline caching or compiler smarts, would see only a small relative overhead from locking. That code path is already expensive enough that locking it is unlikely to be a problem.

  • Snapshotting the set of properties doesn't change. Concurrent threads can already snapshot the property set of any object in JSC, because we already use locking there to support concurrent garbage collection.

Although some operations, like deletion, are already expensive enough that locking them isn't a problem, we don't believe it's practical to impose this cost on the fast cases of JavaScript object access. A language implementation that locked everything would simply be too slow.

Designing a fast concurrent JS implementation requires introducing new algorithms for property access that can run concurrently with each other without any locking, except in rare cases. In the sections that follow, we describe such algorithms. First we review JavaScriptCore's object model. Then we show cases where JavaScriptCore's object model already behaves as if objects were fixed-size, so that no locking is needed; next we present a technique called segmented butterflies, which allows wait-free concurrent access to most objects. Segmented butterflies alone are still too expensive for our purposes, so we then show how to infer which object types are transition-thread-local (TTL), which lets us avoid synchronization when transitioning those objects. That allows us to use flat butterflies (the existing object model) for most objects. After that, we address deletion, dictionary lookups, inline caches, arrays, and thread-unsafe objects.

The JavaScriptCore object model

Figure: the JavaScriptCore object model. JavaScript values and object pointers are implemented as pointers to cells. A cell never changes location or size, but contains a pointer to a butterfly, which can grow to the left (for out-of-line named properties) or to the right (for array elements).

JavaScriptCore's object model allows for four kinds of state, each of which is optional:

  • Native state, represented by an ordinary C++ object

  • Data for named properties that the object was expected to have (for example, o.f), which can be stored inside the cell

  • Data for named properties added dynamically, forcing a change in the object's size

  • Data for indexed properties, which can change the object's size in many ways

The first two kinds of state don't change the object's size; the latter two do. In JSC, the fixed-size state is stored directly in the object's cell, which is what an object pointer points to. Inside the cell is a butterfly pointer, which can be used to store the dynamically allocated, variable-size state in a butterfly. A butterfly stores named properties to the left of where the butterfly pointer points, in out-of-line slots, and indexed properties to the right, as array elements. Each of those locations may store a tagged JavaScript value, which can be a number, a pointer to another cell (representing a string, symbol, or object), or a special value (true, false, null, or undefined).

Every object has a structure, which contains the hash table used to map property names to slots in the object. Objects typically share a structure with other objects that have the same properties in the same order. The structure is referenced with a 32-bit index into a structure table, to save space. The 32-bit structure index and the cell state byte leave several spare bits in the cell header; we use two of those spare bits for the per-object lock. There are also spare bits in the butterfly pointer, since pointers don't need all 64 bits on a 64-bit system.

The butterfly is optional; many objects don't have one. When a butterfly is allocated, either side of it is optional. The left side holds the out-of-line slots, and it's possible to allocate only that side; in that case, the butterfly pointer points 8 bytes to the right of the end of the butterfly's memory. The right side holds the array header and the array elements. The public length is the length as reported by array.length, while the vector length is the number of array element slots that were allocated.

Freebies

In this object model, accesses to anything in the cell are atomic by default, just as accesses to Java objects are atomic by default. This is a big deal, since our optimizations are largely successful at putting the most important object fields inline in the cell. Concurrent JS programs that mostly rely on data that ends up inside the cell will experience almost zero extra overhead relative to their serial equivalents.

Additionally, if we know that the butterfly will never be reallocated again (i.e., the butterfly pointer is immutable), then we can access it directly without any problems. Those accesses are atomic by default.

Alternatively, if we know that only the current thread will ever resize the butterfly and all other threads will only read it, then accesses to the butterfly are also atomic by default.

Problems arise when one thread tries to transition an object (i.e., add a property and/or reconfigure the butterfly) while another thread is writing to it (or transitioning it too). If we used our current implementation of object accesses in a concurrent setting, transitions racing with other accesses could misbehave in ways that don't match the semantics we want:

  • A write initiated by another thread could disappear or appear to time-travel, as described earlier. That race happens because the transitioning thread first copies the butterfly and then transitions the object; a write from another thread that lands between those two steps goes into the old butterfly and is lost. If a third thread repeatedly reads the affected field, it will first see the state before any writes, then the write, and then, once the transition completes, the original state again.

  • Two transitions executing concurrently could cause the butterfly pointer and the object's type to get out of sync. The butterfly's format is determined by fields in the object header, which are not stored in the same 64-bit word as the butterfly pointer. We can't allow the butterfly to be out of sync with the header, since that would lead to memory corruption.

  • Two transitions executing concurrently can corrupt the heap in small ways even when no butterfly is involved. If a transition adds an inline property in two steps, first changing the object's type and then storing the new property, then two racing transitions adding two different properties could result in one property being added whose value ends up being the value intended for the other property. For example, a race between o.f = 1 and o.g = 2 could result in o.f == 2 and "g" in o == false.

In the next sections, we show how to create an object model that has none of these races, at some cost. After that, we show how to use TTL inference to keep using our existing object model in most cases, even in programs that share objects between threads.

Segmented Butterflies

Transitioning an object means allocating a new butterfly and copying the contents of the old butterfly into the new one, while other threads may be reading or writing the old one. We want the copy to look as if it happened in one atomic step. Lots of research has gone into supporting concurrent copying of objects in the context of real-time garbage collection. The approach we propose is based on the arraylet object model of the Schism real-time garbage collector. This section reviews the arraylet object model and then shows how to use it for butterfly transitions.

Schism set out to implement a copying garbage collector whose copy phase is completely concurrent. It is built on the observation that copying an object concurrently with other threads reading it is easy if the object is immutable; it only becomes hard if the object can be written to by the other threads. Schism solves the problem of copying mutable objects concurrently by boxing the mutable state in small fixed-size fragments (32 bytes). The object that gets copied is really just a spine, used as an index for finding fragments. All object accesses use an extra indirection to find the fragment that contains the data being accessed. Spines are immutable objects, because the fragments they point to never move.

WebKit already uses arraylet-like structures, such as the WTF::SegmentedVector class template. We use it to prevent C++ objects from moving when the vector resizes. We also use it to implement the variable storage of JSGlobalObject (the global program's var statements). The term arraylets is due to Bacon, Cheng, and Rajan, who used them to control fragmentation in the Metronome garbage collector. Lots of arraylet research has shown that the extra indirection carries a significant cost (often 10% or more). That is: if you double the cost of every array access by adding an extra indirection, the total run time of the program increases by about 10%.

A segmented butterfly comprises a spine (which can resize but contains only immutable state) and zero or more fragments (which don't resize but are mutable). Separating the resizable state from the mutable state makes concurrent layout changes work, a proven technique from concurrent garbage collection. The fragments' shape can match the shape of an unsegmented butterfly, which we will use for some additional optimizations.

We can apply this technique to butterflies. A segmented butterfly has a spine containing pointers to fragments that contain the mutable data. If we need to add new properties and those properties don't fit in the existing fragments, we can grow the spine and allocate more fragments. Growing the spine means reallocating it. Allocating a new spine and memcpy-ing the contents of the old one is safe, because the spine's contents never change. In this world, no writes get lost or time-travel when a butterfly is being transitioned. A transition that races with a write results in the write going into a fragment through either the old spine or the new spine; since both spines contain the same fragment addresses for the properties that existed before the transition, every possible interleaving results in threads reading and writing the same fragments.
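As a plain-JS toy model of this idea (not VM code; the fragment size is arbitrary), the spine/fragment split looks like this:

const SLOTS_PER_FRAGMENT = 4;

function newFragment() {
    return new Array(SLOTS_PER_FRAGMENT).fill(undefined);
}

// The spine never mutates; growing it means allocating a new spine that
// reuses the old fragments.
function grownSpine(oldSpine) {
    return Object.freeze([...oldSpine, newFragment()]);
}

// Every access pays one extra indirection: spine -> fragment -> slot.
function loadSlot(spine, index) {
    return spine[(index / SLOTS_PER_FRAGMENT) | 0][index % SLOTS_PER_FRAGMENT];
}

function storeSlot(spine, index, value) {
    spine[(index / SLOTS_PER_FRAGMENT) | 0][index % SLOTS_PER_FRAGMENT] = value;
}

// Readers holding the old spine and the new spine see the same fragments,
// so a write racing with spine growth is never lost:
let spine1 = Object.freeze([newFragment()]);
let spine2 = grownSpine(spine1);
storeSlot(spine1, 2, "hello");
console.log(loadSlot(spine2, 2)); // "hello"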

The most natural way to implement segmented butterflies is probably to use 32-byte fragments like Schism, since that is also a size our garbage collector allocates very efficiently. The vector length lives in the butterfly's spine, since it's an immutable property of the spine: the vector length tells us how big the spine's right side is. The public length is mutable, so we put it in one of the fragments. Note that the first fragment also contains an old vector length, which is unused while the butterfly is segmented; it lets the spine point at slices of a flat butterfly, allowing us to convert a flat butterfly into a segmented one. We discuss this optimization in a later section.

This still leaves the problem of concurrent transitions. A transition must reallocate the spine, change some data in the type header, and store a property's value. Reallocating the spine is optional, since usually some fragment will already have a spare slot. Changing the type is also optional, for example when an array is being resized. The type header and butterfly pointer can be set atomically using a DCAS (double-word compare-and-swap; 64-bit systems generally support 128-bit CAS), but this isn't enough, because no matter whether we set the property's value before or after the DCAS, we have a race. The new property's slot can be anywhere in memory, so we can't use CAS to perform the whole transition at once.

If we store the newly added value before changing everything else, we risk a race in which one thread tries to use some slot for adding field f while another thread tries to use the same slot for field g. If we aren't careful, the second thread might win the race on the type (everyone thinks property g was added) while the first thread wins the race on the property (the value thread 1 intended for f shows up as if thread 2 had stored it to g).

If we store the newly added value after changing everything else, then all loads would have to contend with the possibility that slots can have "holes". In other words, at any time, an object may claim a type that includes a property f even though the object doesn't have a value for f yet. We could change the language to allow this: adding a new property would first define it as undefined, and then store the real value.

Instead, we choose to guard transitions with a lock. The lock isn't there to protect transitions from racing with other accesses, since those are atomic by default thanks to segmented butterflies; it's there to protect transitions from each other. To ensure that objects never have holes, transitions store the new property's value before changing the type. A transition can use an algorithm like this:

  1. Allocate whatever memory needs to be allocated

  2. Acquire the lock

  3. Check that we allocated the right amount of memory; if not, release the lock and go back to step 1

  4. Store the new property's value

  5. Change the type and butterfly

  6. Release the lock

This results in the following overhead model:

  • Transitions cost about 7x more, since they now require locking

  • Reading or writing existing property slots requires one extra load

  • Deletion and dictionary lookups still work (we describe the details in a later section)

Recall that the arraylet research says costs increase by 10% or more if arrays use a segmented object model. We would also be using it for ordinary properties, whenever they were added in a way subtle enough that we couldn't inline them in the cell. If we implemented this naively for everything, we would suffer a slowdown of at least 10%. What we want is near-zero overhead; the next section describes how we think we can get there.

Transition Thread Locality (TTL) and Flat Butterflies

Segmented butterflies are only necessary when:

  • A thread other than the one that allocated an object tries to transition it

  • The thread that allocated the object tries to transition it while other threads may be writing to it

We can use flat butterflies, our current object model, whenever we know that neither of these things has happened yet. This section describes a hybrid object model that uses flat or segmented butterflies depending on whether we have detected a possible write-transition race. The object model also lets us avoid locking while performing many transitions.

On 64-bit systems, butterfly pointers have 48 bits of pointer information, with the high 16 bits all zero. 16 bits are enough to store:

  • The ID of the thread that allocated the object. We call this the butterfly TID.

  • A bit indicating whether any thread other than the allocating one has ever attempted to write to the butterfly. We call this the butterfly shared-write (SW) bit.

Some systems have more than 48 pointer bits. In a later section, we discuss how to make this scheme work when fewer spare bits are available. Also, we could restrict butterfly allocations to the low 2^48 addresses if we really wanted all 16 spare bits.

Suppose we encode these values as follows:

static const uint16_t mainThreadTID = 0;
static const uint16_t notTTLTID = 0x7fff;

static inline uint64_t encodeButterflyHeader(uint16_t tid, bool sharedWrite)
{
    ASSERT(tid <= notTTLTID); // Only support 2^15 tids.
    return (static_cast<uint64_t>(tid) << 48)
         | (static_cast<uint64_t>(sharedWrite) << 63);
}

static inline uint64_t encodeButterfly(Butterfly* butterfly, uint16_t tid, bool sharedWrite)
{
    return static_cast<uint64_t>(bitwise_cast<uintptr_t>(butterfly))
         | encodeButterflyHeader(tid, sharedWrite);
}

We can use the existing object model, flat butterflies, whenever the TID matches the current thread. We indicate a segmented butterfly by setting the SW bit and using a magic TID value (all bits set). This section shows how these bits allow the use of flat butterflies any time we know that no transition race can occur. We call this transition-thread-locality inference.

Any time a thread attempts to write to a butterfly and sees that the TID matches the current thread, it can simply write to the butterfly without doing anything special.

Any time a thread attempts to write to a butterfly and sees a mismatched TID but a set SW bit, it can also simply write to the butterfly.

Any time a thread attempts to read a butterfly, it only needs to check the TID to determine whether the read should be direct or segmented (using the extra indirection).

Any read or write that sees TID = notTTLTID and SW = true knows to use the segmented object model.

The following cases require special handling (a short illustration follows this list):

  • A thread other than the one identified by the TID tries to write to the butterfly while the SW bit is not yet set. In this case, it needs to set the SW bit. This doesn't mean the butterfly needs to become segmented; it only means that the SW bit must be set, so that any future attempt by the owning thread to transition the butterfly triggers a conversion to the segmented butterfly object model.

  • A thread other than the one identified by the TID tries to transition the butterfly. In this case, it needs to set the SW bit and all the TID bits (i.e., TID = notTTLTID) and perform a segmented butterfly transition. We describe this process in detail below.

  • A thread tries to transition a butterfly whose SW bit is set. This also forces a segmented butterfly transition.

  • Even transitions with TID = current and SW = false need locking, to make sure these assumptions aren't violated in the middle of the transition.
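Here is a hedged illustration, in terms of the strawman API, of which accesses stay fast and which trigger the special cases above; the comments describe the scheme's intended behavior, not measurements:

let o = {f: 1};        // Allocated here: butterfly TID = this thread, SW = false.
o.f = 2;               // Same thread, flat butterfly: fast path.
new Thread(function() {
    let x = o.f;       // Read from another thread: just a TID check, still fast.
    o.f = x + 1;       // First foreign write: sets the SW bit, then writes.
    o.g = true;        // Foreign transition: sets TID = notTTLTID and converts
                       // the butterfly to segmented.
});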

The next sections show further refinements that reduce the overhead even more. First, we consider how to avoid checking TID and SW bits on every operation. Then we show how to use structure watchpointing to avoid locking on the most common transitions. Next, we show how to handle array resizing. Finally, we explain in more detail how a flat butterfly gets converted into a segmented one.

Quick check for TID and SW bits

This section shows how to extend the optimizations JavaScriptCore already uses so that concurrent JS is almost as fast as serial JS in the most important cases.

JSC optimizes heap access code using inline caches. We use inline caches both for accesses to named properties (like o.f) and for array accesses (like a[i]).

Objects that have the same properties at the same offsets usually share the same structure, which is identified by 32 bits in the object's type header. If a property access tends to see the same structure, we generate new machine code for that access which checks that the object has the expected structure and then accesses the property directly. Failed speculation causes the inline cache to be recompiled. If the inline cache stays stable for a long time (doesn't get recompiled much) and the function containing it becomes eligible for optimized JIT compilation, the optimizing JIT compiler may express the inline cache's code directly in its IR. This has two outcomes: the structure check branches to an OSR exit if it fails (abruptly ending the execution of the optimized code), and all of the inline cache's code (the structure check, the memory access, any other steps) becomes eligible for low-level optimization by our DFG and B3 JIT compiler pipelines. Our compilers are excellent at making type-stable JavaScript property accesses perform well.

Array accesses use similar techniques to detect whether an object has array elements and, if so, how they are formatted. We support multiple ways of storing array elements, depending on how the elements are used. Inline caches detect which kinds of arrays each array access has seen, and we then emit code that speculates on those kinds of array accesses.

Another technique we already use is virtual memory. For example, our WebAssembly implementation uses virtual memory tricks to do bounds checking of linear memory accesses for free. We can catch page faults via POSIX signal handlers or Mach exceptions. The handler knows the exact machine state at the point of the fault, and has the power to transfer execution wherever it likes. WebAssembly uses this to throw exceptions, but we can also use it to transfer control flow to slow paths that handle the general case of a memory access. Essentially, this means we can save cycles on safety checks if we can express the check condition as something that causes the virtual memory system to fault. Concurrent JS will need a combination of inline caches and virtual memory tricks to make TID and SW checking cheap.

Inline caches mean that we emit different code for each property access, and then we may recompile each property access many times as we learn new information about what the access may do. Inline caches can get extremely precise information about the behavior of each access, because we always emit a fast path that handles exactly the cases we have seen so far. We learn new information by recording everything we know about the cases that failed the fast path. The complete log of access failures is then LUBed (least-upper-bounded) into a minimal set of AccessCases. We can think about optimizations for new kinds of property accesses, such as those that must check TID and SW bits, by considering how a particular access site might behave specially and optimizing for that. Below is a list of the conditions an inline cache might observe, and how it could handle each of them:

  • Probably many accesses will always see TID = current and SW = false. These accesses simply subtract encodeButterflyHeader(currentTID, 0) from the butterfly pointer before using it. If the speculation is wrong, the resulting high bits will be non-zero and the virtual memory subsystem will issue a page fault on access. We can catch that as a Mach exception or a POSIX signal and divert execution to the slow path. The subtraction of the constant can often be folded directly into the subsequent butterfly access instructions, so these accesses suffer no overhead in concurrent JS compared to what we have today. Note that this optimization does not apply to transitions, which must install a new butterfly while atomically asserting that nothing bad has happened; we consider transitions further below. Note also that this optimization slightly favors the main thread (and all single-threaded programs): if currentTID = 0, we don't need to subtract anything.

  • Probably many accesses will always see TID = current and SW = true. These can be optimized exactly like the previous case. This state covers objects that we already know can be written to from many threads, but that haven't been transitioned since the last time the current thread transitioned them.

  • Probably many object reads will see TID != notTTLTID and SW = anything. In this case, we just need to check that the high bits of the butterfly pointer are not notTTLTID, which takes a single compare/branch. Beyond that, these accesses can proceed exactly as they do in our current engine.

  • Probably many object writes will see TID != notTTLTID and SW = true. This means the object is known to be written from many threads. We are writing to it too, but we don't need to set the SW bit, since it's already set. This check also takes a single compare/branch. Together with the read optimization above, this means that concurrent JS programs that share objects typically pay only one extra cycle (for a fused compare/branch) on butterfly accesses.

  • Sometimes, a write will see TID != notTTLTID and SW = false. This means the SW bit must be set, using a DCAS that sets the SW bit while asserting that the type header hasn't changed. Such accesses are more expensive than their friends, but this only happens the first time any thread writes to a shared object. Our inline cache infrastructure would probably make writes that sometimes see SW = false and sometimes SW = true (but never TID = notTTLTID) branchy: a fast path for SW = true, and a slightly slower but still inline path for SW = false. In the next section, we describe an optimization that requires invoking functions on a structure the first time the SW bit of any object of that structure gets set. Since inline caches are generated with knowledge of their structure, they can ensure that the structure has already been told that objects of its type may have their SW bit set.

  • Probably many accesses will see TID = notTTLTID and SW = true. By convention, we hold that TID = notTTLTID and SW = false cannot happen (inside our VM, it is technically possible to transition an object without "writing" to it, but we pretend that transitions are writes anyway). These accesses subtract encodeButterflyHeader(notTTLTID, true) from the butterfly pointer before using it, and must then perform an extra load when reading from the butterfly. The extra load is necessary because of the segmented object model: the butterfly pointer points at a spine, which we use to find the fragment containing the value we care about. This costs somewhat more than one extra load would suggest, because it creates a load-load dependency.

These optimizations leave one problem unsolved: transitions. Transitions now need to acquire and release a lock, and we'd have to do that any time an object gains a property. In the next section, we show how to solve this problem using watchpoints. Then we describe how to implement the transitions that convert a flat butterfly into a segmented one.

Watchpoint optimization

Transitions are hard to optimize. Every transition, including those that see TID = current, needs to acquire the object's internal lock to ensure that installing the butterfly, adjusting the object's type header, and storing the new property's value all happen as one atomic step. Fortunately, we can dramatically improve transition performance using our engine's structure watchpointing infrastructure.

Every object has a structure, and many objects share the same structure. Most inline cache optimizations begin with a check that the object's structure matches the inline cache's expectations. This means that when the inline cache is compiled, it has a pointer to the Structure object in hand. Every structure can have any number of watchpoint sets. A watchpoint set is simply a boolean field (valid or invalid) plus a set of watchpoints. When the set is fired (the field flips from valid to invalid), all of its watchpoints are invoked. We can add two watchpoint sets to Structure:

  • transitionThreadLocal. Remains valid as long as every object with this structure has TID != notTTLTID.

  • writeThreadLocal. Remains valid as long as every object with this structure has SW = false.

This enables the following optimizations on top of the inline cache strategies described above:

  • Transitions with TID = current and SW = false can proceed exactly as they do in our existing engine, as long as the structure's transitionThreadLocal and writeThreadLocal watchpoint sets are both still valid. This means that transitioning the object requires no extra locking or CAS, even if other threads are concurrently reading it. This works even when building up an object that will eventually be read and written from other threads, since in that case the writeThreadLocal watchpoint set only fires for the structure that is the final transition target used to build the object. The transitions along the way can proceed without locking, because each transition source's watchpoint sets are still valid.

  • Any check for TID != notTTLTID can be omitted if the structure's transitionThreadLocal watchpoint set is still valid and a watchpoint is installed.

  • Any check for SW == false can be omitted if the structure's writeThreadLocal watchpoint set is still valid and a watchpoint is installed.

For accesses where the TID != notTTLTID and SW == false checks can be removed, reads and writes of these objects don't actually check anything; they just mask off the butterfly pointer's high bits.

Most importantly, this means that transitions that happen on the same thread that allocated the object don't need any locking, as long as the structures involved have valid transitionThreadLocal and writeThreadLocal watchpoint sets. Structures are dynamically inferred by our engine so as to be tied to the code that creates those objects. So, if you write a constructor that adds a bunch of fields to this without escaping this, the structures corresponding to the transition chain performed in the constructor will all have valid transitionThreadLocal and writeThreadLocal watchpoint sets. To make sure an object keeps a flat butterfly, either don't add properties to it dynamically, or don't write to it from threads other than the one that constructed it. Objects that follow these rules experience almost no concurrency overhead, since property accesses have at most one extra instruction (a mask) on the fast path.
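For example, under these rules, a typical constructor-built object stays on the fast path (a hedged illustration of the scheme, not measured behavior):

function Point(x, y) {
    // Properties added in the constructor without escaping `this`: the
    // transition chain stays TTL, so these transitions need no locking.
    this.x = x;
    this.y = y;
}
let p = new Point(1, 2);
new Thread(function() {
    console.log(p.x + p.y); // Reads from other threads stay on the fast path.
    p.x = 3;                // A foreign write just sets the SW bit once.
    // p.z = 5;             // But a foreign *transition* like this would force
                            // the segmented butterfly path.
});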

Watchpoints, and the operations that fire them, execute at safe points: if we are performing an operation that finds it has to invalidate a watchpoint, it will do so while all other threads are stopped, as if for a garbage collection. If some optimized code is performing non-atomic transitions involving certain structures, and another thread attempts to write to or transition an object that uses any of those structures, the other thread will not actually perform the write until the optimized code reaches a safe point and gets invalidated. Most of the time, watchpoint sets will tell us that they have already been invalidated even before the optimizing JIT compiler tries to install any watchpoints. We expect few watchpoint invalidations in steady state.

To summarize: if our optimizer can guess which properties you will add to an object at the time of allocation, then the cost model of object accesses doesn't change at all, since inline properties get concurrency for free. If you do have out-of-line properties, they will perform almost exactly as they do now, as long as either the object is only written to by the thread that created it (any thread may read it), or no new properties are added to the object after creation (in which case any thread can read and write it). (Occasionally, one extra arithmetic instruction is involved in computing butterfly reads and writes.) If you break this pattern, transitions become about 7x more expensive and all other accesses to that object become about 2x more expensive. This slowdown is extremely targeted: only accesses involving segmented butterflies feel it. The system is designed to let you add things to your objects dynamically and concurrently, in the hope that if you do it the right way, you won't see much of a performance change.

Arrays

Array element accesses benefit from TTL inference much as named property accesses do:

  • Accesses to TTL arrays are about as fast as they are now

  • Accesses to non-TTL arrays require one extra indirection

We plan to handle array transitions a bit specially. Many array transitions in JavaScriptCore already use locking, because they need to avoid racing with the garbage collector. Using locking for the remaining array transitions could cause performance problems, so this section considers alternatives.

Resizing an array in response to an out-of-bounds store, or to something like array.push, is the most common kind of array transition. Fortunately, this transition changes nothing in the object’s header: it changes only the butterfly pointer. Whether a butterfly has array elements is reflected in the structure. So we can simply use CAS (compare-and-swap) on the butterfly pointer, and adopt the rule that any modification of the butterfly pointer of an object with array elements requires CAS, even while the object’s lock is held. Since the transitionThreadLocal and writeThreadLocal watchpoints of array structures are unlikely to remain intact (any shared use of arrays invalidates them, because arrays share structures), we expect even transitions on TTL arrays to use CAS in the common case. This butterfly-pointer CAS is enough to ensure that a simultaneous attempt to make the array non-TTL does not get confused by a concurrent resize operation.

A CAS on array resizing is probably cheap enough to be realistic: the CAS is small compared with what resizing already does, namely allocating, copying data, and initializing the newly allocated memory.
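As a sketch of this rule, here is what the resize path might look like, assuming an atomic butterfly pointer; the types and function are illustrative stand-ins, not JavaScriptCore’s real API:

```cpp
#include <atomic>

struct Butterfly; // holds public length, vector length, and elements

struct JSObject {
    std::atomic<Butterfly*> butterfly;
};

// Attempt to install a larger butterfly. Allocation, copying, and
// initialization of `larger` have already happened; the CAS is the only
// extra synchronization the resize needs. On failure, another thread
// changed the butterfly (for example, made the array non-TTL), and the
// caller should reload the pointer and retry or take a slow path.
bool tryInstallResizedButterfly(JSObject* object, Butterfly* expected,
                                Butterfly* larger)
{
    return object->butterfly.compare_exchange_strong(expected, larger);
}
```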

Objects that have both named properties and array elements will now need both locking and CAS on some transitions involving named properties. Fortunately, objects with both array elements and named properties are uncommon enough that adding the small cost of CAS to their named-property transitions is probably acceptable.

Fast transition from Flat to Segmented

Converting a flat butterfly to a segmented butterfly requires a special kind of transition. Fortunately, this transition is inexpensive. Butterfly fragments contain only data, and they look exactly like fixed-size slices of a flat butterfly’s payload. Therefore, we can transition a flat butterfly to a segmented one by allocating a spine and pointing its fragment pointers at the original flat butterfly.
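A minimal sketch of that transition, under the same assumed layout as before (the fragment size and types are illustrative, not JavaScriptCore’s real ones): the spine is allocated, but no element data is copied.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t fragmentSize = 8; // elements per fragment (assumed)

struct SegmentedSpine {
    size_t vectorLength;              // the true vector length lives here
    std::vector<uint64_t*> fragments; // each aliases a slice of the payload
};

SegmentedSpine* segmentFlatButterfly(uint64_t* flatPayload, size_t vectorLength)
{
    SegmentedSpine* spine = new SegmentedSpine;
    spine->vectorLength = vectorLength;
    // Point each fragment pointer into the existing flat payload.
    for (size_t i = 0; i < vectorLength; i += fragmentSize)
        spine->fragments.push_back(flatPayload + i);
    return spine; // then installed over the old butterfly pointer
}
```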

Once a butterfly has been converted from flat to segmented this way, any reads and writes issued against the flat butterfly appear, to users of the segmented butterfly, to happen in its fragments. In other words, even though some threads may mistakenly believe the butterfly is still flat, their accesses remain safe after it has been segmented.

To make this work, we ensure that the fragments of a segmented butterfly share the same memory layout as the flat butterfly they were converted from. For this reason, each array fragment contains both a public length and an old vector length: the old vector length holds the vector length that the flat butterfly had been using, while the real vector length lives in the segmented butterfly’s spine. This guarantees that if a flat-butterfly array access races with a flat-to-segmented transition, the flat-butterfly access still sees the flat butterfly’s size correctly, because its vector length does not change. To keep the object model consistent, we also store out-of-line properties in reverse order in the out-of-line fragments, matching what a flat butterfly does.

This mechanism works so long as the flat-butterfly accesses loaded their butterfly pointer before the transition occurred. If an access loads its pointer any later, its TID check will fail, possibly in the form of a page fault on the butterfly access.

Deletion and dictionary lookups

JavaScriptCore tries to make objects share structures whenever possible. This relies on two properties of structures:

  • Structures are immutable

  • When a new structure is needed, we usually hash-cons to find an existing identical one

This is a great optimization when it actually lets objects reuse structures. But some objects end up with a unique structure no matter what the VM tries to do about it. JavaScriptCore tries to detect when this happens, and then puts the object in _dictionary_ mode. In dictionary mode, the structure has a 1:1 mapping to the object, and the structure becomes mutable. Adding and removing properties means editing both the structure and the object.

Deletion is closely related to dictionary mode because deletion immediately puts objects into dictionary mode.

Modifications to dictionaries already require holding the structure’s lock. This is necessary to support concurrent JITs and concurrent garbage collection.

To support concurrent JS, we only need to make the following changes:

  1. Reads of a dictionary need to hold the structure’s lock, in case some other thread is changing the dictionary

  2. Deletion of properties that were added before the object entered dictionary mode must be handled specially

We are not worried about the performance of taking a lock on every dictionary read, because the other costs of reading a dictionary already dominate.
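A minimal sketch of change (1) above, assuming a per-structure mutex and a simple property table; the names Structure and findSlot are stand-ins, not JavaScriptCore’s real declarations:

```cpp
#include <cstdint>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

struct Structure {
    std::mutex lock; // already needed for concurrent JIT and GC support
    std::unordered_map<std::string, unsigned> propertyTable; // name -> slot
};

// Dictionary read: hold the structure's lock across the table lookup so
// a concurrent property addition or deletion cannot mutate it under us.
std::optional<unsigned> findSlot(Structure& structure, const std::string& name)
{
    std::lock_guard<std::mutex> holder(structure.lock);
    auto it = structure.propertyTable.find(name);
    if (it == structure.propertyTable.end())
        return std::nullopt;
    return it->second;
}
```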

Our existing facilities for safely transitioning objects naturally support transitioning to a dictionary. In general, a dictionary transition does not involve deleting properties: it happens when we discover that the program is adding so many properties to an object that it will probably perform better as a dictionary. In that case, the fact that some other thread may be in the middle of accessing the object without holding any locks does not matter. For non-dictionary objects, we only require that transitions take the lock. Reads and writes involve separate steps: checking the object’s type, loading the butterfly, and then accessing the butterfly. But if none of the properties added before the dictionary transition are ever deleted, it is fine for other threads to race to access those old properties. We call this phenomenon a tardy access. Even though tardy accesses do no locking, they are correct for the same reasons they were correct before the object became a dictionary. That issue boils down to the butterfly object model: the butterfly is either flat or segmented, depending on whether the object is still TTL. Dictionary mode operates orthogonally to TTL inference and butterflies.

But if any property added before the dictionary transition does get deleted, then we have to handle the deletion specially. Normally, a deletion makes the deleted slot reusable. We cannot do that here, because then a tardy read of some deleted property f might end up reading the value of some newly added property g. We can prevent this by simply not reusing the space freed by deleted properties, if those properties were added before the dictionary transition.

This does not mean we leak memory without bound. When the garbage collector reaches its safepoint, it knows that all in-flight memory accesses have completed. Therefore, the simplest implementation is to have the garbage collector change the state of those deleted properties when it visits the object. Once the collector flags those property slots at a safepoint, subsequent property additions can reuse them.
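A sketch of that deferred-reuse idea, with assumed names and states; the point is only that slots pass through an intermediate state until the collector, at a safepoint, proves that no tardy access can still see them:

```cpp
#include <cstdint>
#include <vector>

enum class SlotState : uint8_t { Live, PendingReuse, Free };

struct PropertySlots {
    std::vector<SlotState> states;

    // Deleting a property that predates the dictionary transition: do not
    // mark the slot Free yet, so a tardy read of deleted property f can
    // never observe the value of a newly added property g in its place.
    void deletePreTransitionProperty(unsigned slot)
    {
        states[slot] = SlotState::PendingReuse;
    }

    // Called by the collector while visiting the object at a safepoint,
    // when all previously in-flight accesses are known to have completed.
    void gcVisit()
    {
        for (SlotState& state : states) {
            if (state == SlotState::PendingReuse)
                state = SlotState::Free; // now reusable for new properties
        }
    }
};
```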

Inline caches

Once we enable concurrency, reconfiguring an inline cache becomes harder, because you have to be careful when modifying code that may be running right now on another CPU. We plan a tiered defense against this problem: we will infer the thread-locality of code so that it can keep using inline caches as it does today for as long as possible, we will buffer inline cache changes whenever doing so is safe, and finally we will rely on safepoints if an inline cache must be reset eagerly.

We expect that in many programs, inline caches will reach a steady state before the code ever executes on more than one thread. Therefore, we plan to have each block of code track its thread-locality. So long as it has only ever executed on one thread, it remembers this and checks for it at its entry point. While this remains true, any inline caches inside that code can be modified without any extra synchronization. Thread-locality could be tracked at whatever granularity is convenient; it could even be as fine-grained as a basic block. It is probably most natural to use JavaScriptCore’s notion of a CodeBlock as the granularity; this roughly corresponds to a function in the source code.
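Here is a sketch of that per-code-block tracking; CodeBlock here is a simplified stand-in for JavaScriptCore’s class of the same name, and the fields are assumptions:

```cpp
#include <atomic>
#include <thread>

struct CodeBlock {
    std::atomic<std::thread::id> owner{};  // unset until first entry
    std::atomic<bool> threadLocal{true};

    // Checked at the code block's entry point.
    void noteEntry()
    {
        std::thread::id self = std::this_thread::get_id();
        std::thread::id expected{}; // the "no owner yet" value
        if (owner.compare_exchange_strong(expected, self))
            return;                   // first execution ever: we own it
        if (expected != self)
            threadLocal.store(false); // a second thread has entered
    }

    // Inline caches in this block may be patched without synchronization
    // only while this still returns true.
    bool canPatchWithoutSync() const { return threadLocal.load(); }
};
```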

Once a code block’s thread-locality is broken, we can switch to buffering inline cache changes. Our PolymorphicAccess inline cache infrastructure already buffers changes, because that is most efficient even without expensive synchronization. For inline caches that may be executing globally, we can buffer changes globally; for example, the VM could perform a safepoint every millisecond or so to flush changes to global inline caches.
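A sketch of such a global buffer, flushed while all threads are stopped; the class and its methods are illustrative assumptions:

```cpp
#include <functional>
#include <mutex>
#include <vector>

class InlineCacheEditBuffer {
public:
    // Record a pending patch instead of applying it immediately.
    void enqueue(std::function<void()> patch)
    {
        std::lock_guard<std::mutex> holder(lock);
        pending.push_back(std::move(patch));
    }

    // Run at a safepoint: every thread is stopped, so patching machine
    // code cannot race with another CPU executing it.
    void flushAtSafepoint()
    {
        std::lock_guard<std::mutex> holder(lock);
        for (auto& patch : pending)
            patch();
        pending.clear();
    }

private:
    std::mutex lock;
    std::vector<std::function<void()>> pending;
};
```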

Sometimes we need to modify an inline cache immediately. In that case, we use the safepoint system’s ability to stop all threads at a point where we can account for each thread’s state. This is not a fast operation when there are many threads; Java virtual machines already do something similar for class hierarchy invalidation and biased locking. By design, these eager invalidations are unlikely: we only perform optimizations that can lead to eager invalidation when we have profiling data suggesting that it would be rare.

Local optimizer lock

The algorithm so far relies on being able to sneak 16 bits of information into the high bits of the butterfly pointer. Some hardware will not let us do this; for example, fewer than 16 high pointer bits may be available for tagging if the hardware requires the unused high bits of a virtual address to be all zero. If the system only gave us 8 tag bits, we would only be able to support 127 threads. Concurrent JavaScript would probably still be useful even with a hard 127-thread limit, but that would be an unusually harsh limit to impose at the level of the language. This section shows how to overcome it.

If the object model can only handle thread-locality for 127 threads, we could either give the 127th and later threads no thread-locality inference at all, or we could map more than one logical thread onto the same thread identifier. Threads mapped to the same TID cannot execute concurrently with each other. To do this, we can borrow an idea from Python’s GIL (global interpreter lock). The CPython implementation allows only one native thread at a time to interpret Python code, which it achieves with a lock (the GIL) that protects the interpreter. The lock is released and reacquired periodically, which gives the appearance of concurrency. Since we can also release the lock around any potentially blocking operation, we can even avoid deadlocks arising from insufficient concurrency. We can apply that technique here: if we are limited to 127 threads, we can have up to 127 locks protecting JS execution. So long as there are 127 or fewer threads, these locks do nothing: any time we would release one, we do nothing, because the lock tells us that nobody is waiting to acquire it. “Releasing and reacquiring” such a lock is really just a cheap load-and-branch that verifies there is no need to release it.

This lock will be local to a thread pool rather than global to the whole engine, and it will protect all of our optimized paths rather than just the interpreter. Hence the name: local optimizer lock, or LOL for short.
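A minimal sketch of the fast path just described; the design (a waiter counter plus a mutex) is an assumption chosen to show how release can be a single load and branch, not JavaScriptCore code:

```cpp
#include <atomic>
#include <mutex>
#include <thread>

class LocalOptimizerLock {
public:
    void acquire()
    {
        waiters.fetch_add(1, std::memory_order_acq_rel);
        mutex.lock();
        waiters.fetch_sub(1, std::memory_order_acq_rel);
    }

    void release() { mutex.unlock(); }

    // Called at periodic check points and around blocking operations.
    void relinquishIfContended()
    {
        // Fast path: with 127 or fewer threads, nobody else maps to this
        // TID, so this is just one load and one branch.
        if (!waiters.load(std::memory_order_acquire))
            return;
        mutex.unlock();             // let the waiting thread run
        std::this_thread::yield();
        mutex.lock();               // then take our turn again
    }

private:
    std::atomic<unsigned> waiters{0};
    std::mutex mutex;
};
```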

Thread-unsafe objects

DOM objects behave like JavaScript objects, but they are actually proxies for complicated logic implemented in C++. That logic is usually not thread-safe: the native code supporting the DOM transitively uses many native APIs that are meant to be used only from the main thread, and it would take an enormous amount of work to make the DOM completely thread-safe. Our strawman proposal only requires that the DOM global object be able to handle lookups of its own properties, which we need so that threads can access global JavaScript properties like Object and Thread.

In WebKit, variable resolution and self-property resolution on JSDOMWindow mostly follow a pattern of relying on the existing JS property lookup mechanisms, with special behavior only when the window has a null frame. We already install a watchpoint on the frame being non-null. Therefore, by relying on the existing watchpoint set, we can support fast concurrent access to JSDOMWindow properties while handling the null-frame special case with locks. This means some of JSDOMWindow’s slow paths will need locking, but that is acceptable, since most global object accesses are inline-cached.

Native applications that adopt concurrent JS may also want to restrict sharing for some of their own classes. Since the thread-affinity checks on object accesses have to be implemented in the VM itself to give the compilers the ability to optimize them, it should be possible to expose this capability to C and Objective-C API clients.

Summary

We think we can make concurrent JavaScript work in WebKit while providing performance good enough that developers would enjoy using the feature. At the heart of the technique are segmented butterflies, transition-thread-locality (TTL) inference, and lots of cheap per-object locks. So long as programs follow certain behaviors, they should experience minuscule overhead: property accesses will cost about as much as they do today, and objects will not require any extra memory. If some objects do not follow our rules, they get slightly higher costs.

Related work

Segmented butterflies are a straightforward combination of the array object model in the Schism garbage collector and the butterfly object model that we have long used in JavaScriptCore.

TTL inference is inspired by work on biased locking. As in the HotSpot biased locking scheme, we use types to reason about whether we can guarantee thread locality. Unlike that scheme, we do not rely on deoptimization to break a thread-object relationship. Our main mechanism for telling other threads that an object is no longer TTL is to keep distinct tag bits in the butterfly pointer.

The combination of segmented butterflies, TTL inference, and having both schemes fall back on per-object locking when something strange happens is, as far as we know, new. Most object-oriented systems that allow concurrency get there either through an object model that naturally avoids expensive-to-resolve races, as in Java, or through mechanisms that restrict the amount of concurrency in the language.

For example, CPython uses a global interpreter lock (GIL) to ensure that the interpreter never races with itself. This is a powerful technique that captures part of what concurrency is about: once you block on I/O or something similar, you want your other threads to be able to do useful work in the meantime. The GIL allows this, so it makes threads useful for many kinds of applications. Unfortunately, it does not let programs exploit parallelism. The GIL’s best feature is how easy it is to implement; for example, it could be used to implement our strawman thread semantics in any JS engine. The scheme in this article, by contrast, is all about full concurrency optimized for 64-bit platforms.

Another effort, called the Gilectomy, is currently under way to remove CPython’s GIL and replace it with fine-grained locking and clever atomic algorithms. The Gilectomy has not yet matched stock CPython in speed; both single-threaded and multi-threaded programs run significantly slower under it. The performance problems appear to be related to locking in the allocator (we plan to use thread-local allocation buffers) and in the reference counter (we do not use reference counting). The Gilectomy does not remove locking from object accesses. Like JavaScript objects, Python objects are dictionaries that can be resized dynamically. Much of our proposal is about how to make object accesses fast even when multiple threads are reading and writing the same object.

PyPy also has an ongoing effort to remove the GIL, but they have not said much about how they plan to handle synchronizing object accesses aside from using locks. We will have locks as well, but we have also worked out optimizations that avoid locking in most cases.

Another approach is SpiderMonkey’s _title locking_ scheme, which has since been removed. That scheme also used per-object locks, but it did not have the optimizations we propose for avoiding locking in most cases.

Cohen, Tal, and Petrank recently introduced the layout lock, which allows fast reads and writes to objects that may have their layout changed. Reads require an extra load (which must be fenced or dependent), writes must acquire a read lock (which costs about as much as a pair of fenced loads and a branch), and transitions must acquire a write lock and do some extra bookkeeping. We suspect this has a cost model similar to segmented butterflies, except for writes. Segmented butterflies need only an extra dependent load for writes, which is somewhat cheaper than acquiring a read lock, since that requires two loads fenced around the store. Fencing loads around a store is not cheap: for example, forcing a store to happen before a load on x86 requires an xchg instruction for the store, which is more expensive than a plain mov; on ARM it requires the acq/rel variants of loads and stores. By contrast, the extra dependent load for writes to a segmented butterfly requires no extra fencing on either x86 or ARM. Transitions and reads are probably about equally fast with segmented butterflies and layout locks, but writes matter a lot to us: writes happen often enough that segmented butterflies are likely to be noticeably faster than a layout-lock-based JavaScript object model. On the other hand, the segmentation approach is unlikely to be reproducible in many other kinds of data structures, while the layout lock is more general. The layout lock could be useful for more complex JSC data structures like SparseArrayValueMap and Map/WeakMap/Set: those may be too complicated for segmentation, but not too complicated for the layout lock. Like segmented butterflies, the layout lock could be combined with TTL inference to avoid any locking until it is needed to protect racing transitions.

Conclusion

This article has shown how WebKit’s JavaScript implementation could be modified to support threads. Our scheme allows every JavaScript object to be shared, and it avoids locking on fast paths, such as the fast paths for property accesses. We expect the scheme to be implementable at low overhead: serial code would experience no overhead at all, while concurrent code would see significant overhead only when adding properties to an object that is being written to by multiple threads.

The next step in this wild idea is to try it out and see if it works. We will track progress under Bug 174276.