Sorting in V8
This post was published on the V8 developer blog on September 28, 2018.
This translation is for learning and exchange only. Please credit the source when reprinting; for commercial use, contact the original copyright holder.
By Simon Zünd (@nimodota)
Translator: Smith
Array.prototype.sort was among the last built-ins in V8 still implemented in self-hosted JavaScript. Porting it gave us the opportunity to experiment with different algorithms and implementation strategies, and we finally shipped a stable sort in V8 7.0 (Chrome 70).
Background
Sorting in JavaScript is not easy. This blog introduces some of the problems associated with sorting algorithms and JavaScript language features, and documents our process of moving V8 to a more stable algorithm to make performance more predictable.
When comparing sorting algorithms, we usually look at their worst-case and average performance, stated as bounds on the asymptotic growth ("big-O" notation) of the number of memory accesses or comparisons. Note that in dynamic languages such as JavaScript, a comparison is usually far more expensive than a memory access. This is because comparing two values while sorting generally involves a call to user code.
Array.prototype.sort can take a comparison callback, e.g. `[…].sort((a, b) => a - b)`. Values can also be sorted without a callback: for an array such as `[1, '2']`, JavaScript converts the values to strings before comparing them.
Let’s look at a simple example: sorting some numbers in ascending order with a user-provided comparison function. The comparison function returns -1 (or any other negative value), 0, or 1 (or any other positive value) when a is smaller than, equal to, or larger than b, respectively. A comparison function that does not follow this pattern is inconsistent, and it can have arbitrary side effects, such as modifying the array being sorted.
const array = [4, 2, 5, 3, 1];

function compare(a, b) {
  // Arbitrary code goes here, e.g. `array.push(1);`.
  return a - b;
}

// A “typical” sort call.
array.sort(compare);
Even in the following example, where no callback is passed, a call to user code may occur: the default comparison function calls toString on the two values to be compared and then compares the two resulting strings lexicographically.
const array = [4, 2, 5, 3, 1];

array.push({
  toString() {
    // Arbitrary code goes here, e.g. `array.push(1);`.
    return '42';
  }
});

// Sort without a comparison function.
array.sort();
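As a hypothetical illustration of this default behavior (this example is mine, not from the original post), numbers sorted without a comparison function are compared as strings:

```javascript
// With no comparison function, elements are converted to strings and
// compared lexicographically, so 10 sorts before 9 because '1' < '9'.
const numbers = [10, 1, 9];
numbers.sort();
console.log(numbers); // [ 1, 10, 9 ]
```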
More interesting still is the interaction between property accessors and the prototype chain
In this section we leave the spec behind and venture into implementation-defined behavior. The specification has a whole list of conditions which, when met, allow the engine to sort the object/array as it sees fit, or not sort it at all. While a few basic rules must still be followed, everything else is pretty much up in the air. On the one hand, this gives engine developers the freedom to experiment with different implementations. On the other hand, users expect reasonable behavior even where the spec does not require any. This is further complicated by the fact that "reasonable behavior" is not always straightforward to determine.
As this section shows, Array#sort still varies quite a bit from engine to engine. These are mostly edge cases where, as mentioned above, it is not always clear what "reasonable behavior" is. We strongly recommend not writing code like this; engines will not optimize for it.
The first example shows an array being sorted in different JavaScript engines, with some accessors (i.e. getters and setters) defined and logging added. Accessors are the first case where the sort result is implementation-defined:
const array = [0, 1, 2];

Object.defineProperty(array, '0', {
  get() { console.log('get 0'); return 0; },
  set(v) { console.log('set 0'); }
});

Object.defineProperty(array, '1', {
  get() { console.log('get 1'); return 1; },
  set(v) { console.log('set 1'); }
});

array.sort();
Here is the output of this code in different JavaScript engines. Note that there are no "right" or "wrong" answers here; the spec leaves this up to each implementation!
// Chakra
get 0
get 1
set 0
set 1

// JavaScriptCore
get 0
get 1
get 0
get 0
get 1
get 1
set 0
set 1

// V8
get 0
get 0
get 1
get 1
get 1
get 0

// SpiderMonkey
get 0
get 1
set 0
set 1
The next example shows the impact of prototype chains on sorting results. For brevity, we do not print logs.
const object = {
  1: 'd1', 2: 'c1', 3: 'b1', 4: undefined,
  __proto__: {
    length: 10000,
    1: 'e2', 10: 'a2', 100: 'b2', 1000: 'c2',
    2000: undefined, 8000: 'd2', 12000: 'XX',
    __proto__: {
      0: 'e3', 1: 'd3', 2: 'c3', 3: 'b3',
      4: 'f3', 5: 'a3', 6: undefined,
    },
  },
};

Array.prototype.sort.call(object);
The following are the results of sorting this object in various engines. Again, there is no "right" answer here; this example just shows how strange the interaction between indexed properties and the prototype chain can get:
// Chakra
['a2', 'a3', 'b1', 'b2', 'c1', 'c2', 'd1', 'd2', 'e3', undefined, undefined, undefined]

// JavaScriptCore
['a2', 'a2', 'a3', 'b1', 'b2', 'b2', 'c1', 'c2', 'd1', 'd2', 'e3', undefined]

// V8
['a2', 'a3', 'b1', 'b2', 'c1', 'c2', 'd1', 'd2', 'e3', undefined, undefined, undefined]

// SpiderMonkey
['a2', 'a3', 'b1', 'b2', 'c1', 'c2', 'd1', 'd2', 'e3', undefined, undefined, undefined]
What V8 does before actually sorting
V8 performs two preprocessing steps before the actual sort.
First, if the object to be sorted has holes that are covered by elements on the prototype chain, those elements are copied from the prototype chain onto the object itself. This frees all remaining steps from having to worry about the prototype chain. Currently V8 does this only for non-JSArrays, while other engines do it for JSArrays as well.
The second preprocessing step is the removal of holes. V8 moves all elements in the sort range to the beginning of the object, followed by the undefineds. This is required to some extent by the spec, which mandates that undefined always be sorted to the end. As a consequence, undefined is never passed as an argument to a user-provided comparison function. After this step, the sorting algorithm only needs to consider non-undefined values, which also reduces the number of elements that actually get sorted.
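The hole-removal step can be sketched in plain JavaScript (the helper name is mine; V8 performs this on its internal object representation):

```javascript
// Compact all defined elements to the front, then the undefineds,
// leaving holes at the end. Returns the count of non-undefined
// elements, which is all the sort algorithm then needs to look at.
function prepareElementsForSort(arr) {
  const defined = [];
  let undefinedCount = 0;
  for (let i = 0; i < arr.length; i++) {
    if (!(i in arr)) continue;                  // a hole: ignore it
    if (arr[i] === undefined) undefinedCount++;
    else defined.push(arr[i]);
  }
  let i = 0;
  for (const v of defined) arr[i++] = v;
  for (let k = 0; k < undefinedCount; k++) arr[i++] = undefined;
  while (i < arr.length) delete arr[i++];       // holes stay at the end
  return defined.length;
}
```

For example, `prepareElementsForSort([3, , undefined, 1])` returns 2 and leaves the array as `[3, 1, undefined, <hole>]`.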
History
Array.prototype.sort and TypedArray.prototype.sort were based on the same Quicksort implementation written in JavaScript. The sorting algorithm itself is fairly simple: the base is a Quicksort that falls back to Insertion Sort for shorter arrays (length < 10).
The Insertion Sort fallback is also used when Quicksort's divide-and-conquer recursion reaches subarrays shorter than 10 elements, because Insertion Sort is more efficient for short arrays: Quicksort is called recursively twice after each partition step, and each such recursive call has the overhead of creating (and discarding) a stack frame.
Choosing a suitable pivot element has a big impact on Quicksort's performance. V8 used two strategies:
- The first, last, and a "third" element of the array are determined, and the median of these three is chosen as the pivot. For shorter arrays, the "third" element is simply the middle element.
- For longer arrays, a small sample is taken and sorted, and the median of the sorted sample serves as the "third" element in the above calculation.
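A simplified sketch of this hybrid scheme, assuming a median-of-three pivot and the Insertion Sort fallback below length 10 (illustrative only; V8's real implementation differed in many details, e.g. the sampling step for long arrays):

```javascript
const INSERTION_SORT_THRESHOLD = 10;

function insertionSort(a, lo, hi, cmp) {
  for (let i = lo + 1; i <= hi; i++) {
    const v = a[i];
    let j = i - 1;
    while (j >= lo && cmp(a[j], v) > 0) {
      a[j + 1] = a[j];
      j--;
    }
    a[j + 1] = v;
  }
}

function quicksort(a, lo, hi, cmp) {
  while (hi - lo + 1 >= INSERTION_SORT_THRESHOLD) {
    // Median of the first, middle ("third") and last elements.
    const mid = lo + ((hi - lo) >> 1);
    const pivot = [a[lo], a[mid], a[hi]].sort(cmp)[1];
    // Hoare-style partition around the pivot value.
    let i = lo - 1;
    let j = hi + 1;
    while (true) {
      do { i++; } while (cmp(a[i], pivot) < 0);
      do { j--; } while (cmp(a[j], pivot) > 0);
      if (i >= j) break;
      [a[i], a[j]] = [a[j], a[i]];
    }
    // Recurse on the left part; handle the right part iteratively.
    quicksort(a, lo, j, cmp);
    lo = j + 1;
  }
  insertionSort(a, lo, hi, cmp);
}
```

Usage: `quicksort(arr, 0, arr.length - 1, (x, y) => x - y)` sorts `arr` in place.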
One of Quicksort's advantages is that it sorts in place and doesn't require much extra memory: only when handling large arrays do we need to allocate a buffer for the sample, plus O(log n) stack space. Its downsides are that it is not a stable algorithm and that in the worst case its time complexity degrades to O(n^2).
Introducing V8 Torque
If you follow the V8 developer blog, you have probably heard of CodeStubAssembler, or CSA for short. CSA is a component of V8 that allows us to write low-level TurboFan IR directly in C++, which is then translated into machine code for the target architecture by TurboFan's compiler backend.
(TurboFan IR is the intermediate representation of V8's optimizing compiler, TurboFan; see the earlier CSA article on the V8 blog for details.)
CSA is heavily used to write so-called "fast paths" for JavaScript built-ins. A fast-path version of a built-in usually checks whether certain conditions hold (no elements on the prototype chain, no accessors, etc.) and then implements the operation with faster, more specialized logic. This can make execution an order of magnitude faster than the generic version.
The downside of CSA is that it really can be considered an assembly language. Control flow is modeled with explicit labels and gotos, which makes implementing complex algorithms in CSA hard to read and error-prone.
Enter V8 Torque. Torque is a domain-specific language with TypeScript-like syntax that currently uses CSA as its sole compilation target. Torque allows nearly the same level of control as CSA while offering higher-level constructs such as while and for loops. It is also strongly typed, and in the future it will include safety checks such as automatic out-of-bounds checking, providing V8 engineers with stronger guarantees.
The first major built-ins rewritten in V8 Torque were TypedArray#sort and the DataView operations. Both served the additional purpose of giving the Torque developers feedback on which language features were needed and which idioms could express built-ins efficiently. At the time of writing, several JSArray built-ins had their self-hosted JavaScript fallback implementations moved to Torque (e.g. Array#unshift), while others were completely rewritten (e.g. Array#splice and Array#reverse).
Migrating Array#sort to Torque
The initial Torque version of Array#sort was more or less a direct port of the JavaScript implementation. The only difference was that, instead of using a sample for longer arrays, a random element was chosen as the "third" element for pivot selection.
This worked reasonably well, but since it still used Quicksort, Array#sort remained unstable. A request for a stable Array#sort is among the oldest tickets in V8's bug tracker. So next we experimented with Timsort, and got multiple things out of it. First, we like that it is a stable algorithm and offers some nice algorithmic guarantees (see the next section). Second, Torque was still a work in progress, and implementing a complex built-in such as Array#sort with Timsort yielded lots of actionable feedback for the Torque language itself.
Timsort
Timsort, first developed by Tim Peters in 2002, can be described as an adaptive, stable variant of Mergesort. While the implementation details are quite complex and best studied in the author's own notes or the Wikipedia article, the basic concepts are easy to understand. Whereas Mergesort usually works recursively, Timsort works iteratively. It walks an array from left to right, looking for so-called _runs_. A run is a sequence that is already sorted; this includes sequences sorted in reverse order, since they can simply be reversed to form a run. At the start of the sorting process, a minimum run length is determined based on the length of the input array. If Timsort cannot find a natural run of this minimum length, a run is "boosted" artificially using Insertion Sort.
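A sketch of Timsort's run detection in JavaScript (minimum run length and galloping are omitted, and the helper name is mine, not V8's):

```javascript
// Find the run starting at `lo` within a[lo..hi]: the longest
// non-descending or strictly descending prefix. Descending runs are
// reversed in place (strictness is what keeps the overall sort
// stable once the run is reversed). Returns the run's length.
function countRunAndMakeAscending(a, lo, hi, cmp) {
  let end = lo + 1;
  if (end > hi) return 1;
  if (cmp(a[end], a[lo]) < 0) {
    // Strictly descending run.
    while (end < hi && cmp(a[end + 1], a[end]) < 0) end++;
    // Reverse it in place so every run ends up ascending.
    for (let i = lo, j = end; i < j; i++, j--) {
      [a[i], a[j]] = [a[j], a[i]];
    }
  } else {
    // Non-descending (ascending) run.
    while (end < hi && cmp(a[end + 1], a[end]) >= 0) end++;
  }
  return end - lo + 1;
}
```

For example, on `[5, 4, 3, 1, 2]` it detects the descending run `[5, 4, 3, 1]`, reverses it in place, and returns 4.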
Runs that are found are tracked on a stack that records the start index and length of each run. From time to time, runs on the stack are merged together until only one sorted run remains. When deciding which runs to merge, Timsort tries to maintain a balance: on the one hand, you want to merge early, since the data of those runs is likely still in the cache; on the other hand, you want to merge as late as possible to take advantage of patterns that may emerge in the data. To achieve this, Timsort maintains two invariants. Suppose A, B, and C are the three topmost runs:
- |C| > |B| + |A|
- |B| > |A|
(Here, |X| denotes the length of run X.)
If one of the invariants is violated, B is merged with the smaller of its two neighboring runs, A or C. Note that Timsort only merges consecutive runs; this is necessary to maintain stability, since otherwise elements of equal value could leak between runs. In addition, the first invariant ensures that run lengths grow at least as fast as the Fibonacci numbers, so once we know the maximum array length, an upper bound on the run stack's size can be determined.
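The invariant-restoring loop can be sketched as follows (simplified; real Timsort implementations check one more stack entry and merge the actual elements with galloping, and all names here are mine):

```javascript
// Each run is recorded as { base, length }.
// Merge stack[i] with stack[i + 1] into a single run entry. A real
// implementation would also merge the elements themselves.
function mergeAt(stack, i) {
  const merged = {
    base: stack[i].base,
    length: stack[i].length + stack[i + 1].length,
  };
  stack.splice(i, 2, merged);
}

// Restore |C| > |B| + |A| and |B| > |A| after a run was pushed.
function mergeCollapse(stack) {
  while (stack.length > 1) {
    const n = stack.length;
    const A = stack[n - 1];            // topmost (most recent) run
    const B = stack[n - 2];
    const C = n > 2 ? stack[n - 3] : null;
    if (C !== null && C.length <= B.length + A.length) {
      // First invariant violated: merge B with the smaller neighbor.
      mergeAt(stack, C.length < A.length ? n - 3 : n - 2);
    } else if (B.length <= A.length) {
      // Second invariant violated.
      mergeAt(stack, n - 2);
    } else {
      break; // Both invariants hold.
    }
  }
}
```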
One can now see that already-sorted arrays are sorted in O(n) time, since such an array yields a single run that never needs merging. The worst case is O(n log n). These algorithmic properties, together with Timsort's inherent stability, were among the reasons we ultimately chose Timsort over Quicksort.
Implementing Timsort in Torque
Built-ins usually have multiple code paths, with the appropriate one chosen at runtime depending on various variables. The most generic version can handle any kind of object, whether it is a JSProxy, has interceptors, or requires prototype chain lookups when getting or setting properties.
The generic path is quite slow in most cases, since it needs to account for every possibility. But if we know in advance that the object to sort is a simple JSArray containing only Smis, all those expensive [[Get]] and [[Set]] operations can simply be replaced with loads and stores on a FixedArray. The main differentiator is the ElementsKind.
(ElementsKind describes the kind of elements an array contains: e.g. [1, 2] holds only small integers, while [1, 2.1] holds doubles.)
The question now becomes how to implement the fast paths. The core algorithm stays the same in all cases; only the way elements are accessed changes based on the ElementsKind. One way to achieve this is to dispatch to the appropriate accessor at each operation: imagine a switch at every "load"/"store" site that selects the branch for the chosen fast path.
Another approach, which was tried first, is to copy the entire built-in once per fast path and inline the appropriate load/store method into each copy. This turned out to be infeasible for Timsort: it is a large built-in, and making a copy per fast path would have required 106 KB in total, which is far too much for a single built-in.
The final design is slightly different. Each load/store method for each fast path is put into its own mini built-in. The code example below shows the "load" operation for a FixedDoubleArray.
Load<FastDoubleElements>(
    context: Context, sortState: FixedArray, elements: HeapObject,
    index: Smi): Object {
  try {
    const elems: FixedDoubleArray = UnsafeCast<FixedDoubleArray>(elements);
    const value: float64 =
        LoadDoubleWithHoleCheck(elems, index) otherwise Bailout;
    return AllocateHeapNumberWithValue(value);
  }
  label Bailout {
    // The preprocessing step removed all holes by compacting all
    // elements at the start of the array. If we find a hole, the
    // comparison function or ToString must have changed the array.
    return Failure(sortState);
  }
}
By contrast, the most generic "load" operation is simply a call to GetProperty. While the version above generates efficient, fast machine code to load and convert a Number, GetProperty is a call into another built-in that may itself involve a prototype chain lookup or invoke an accessor function.
builtin Load<ElementsAccessor: type>(
    context: Context, sortState: FixedArray, elements: HeapObject,
    index: Smi): Object {
  return GetProperty(context, elements, index);
}
A fast path then simply becomes a set of function pointers. This means we need only one copy of the core algorithm and can set up all the relevant function pointers once, up front. While this significantly reduces the needed code space (down to 20 KB), it comes at the cost of an indirect branch at every access site. This is even exacerbated by the recent change to use embedded built-ins.
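The function-pointer idea can be sketched in plain JavaScript (hypothetical names; V8's real accessors are Torque mini built-ins operating on internal representations):

```javascript
// The core algorithm only ever touches elements through the
// `load`/`store` pointers chosen once, up front.
const fastAccessor = {
  load: (arr, i) => arr[i],               // direct indexed load
  store: (arr, i, v) => { arr[i] = v; },  // direct indexed store
};

const genericAccessor = {
  // The generic path goes through full property access, which may
  // hit the prototype chain, accessors, proxies, etc.
  load: (arr, i) => Reflect.get(arr, i),
  store: (arr, i, v) => { Reflect.set(arr, i, v); },
};

// One copy of the core algorithm (Insertion Sort, for brevity),
// parameterized over the accessor.
function sortWithAccessor(accessor, arr, cmp) {
  const { load, store } = accessor;
  for (let i = 1; i < arr.length; i++) {
    const v = load(arr, i);
    let j = i - 1;
    while (j >= 0 && cmp(load(arr, j), v) > 0) {
      store(arr, j + 1, load(arr, j));
      j--;
    }
    store(arr, j + 1, v);
  }
}
```

Both `sortWithAccessor(fastAccessor, arr, cmp)` and `sortWithAccessor(genericAccessor, arr, cmp)` produce the same result; only the cost of each element access differs.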
Sort state
The figure above shows the "sort state": a FixedArray that keeps track of everything needed while sorting. Each time Array#sort is called, such a sort state is allocated. Entries 04 through 07 are the set of function pointers discussed above that make up a fast path.
The "check" built-in is called every time we return from user JavaScript code, to verify whether we can continue on the current fast path. It uses the "initial receiver map" and "initial receiver length" for this. Should the user code have modified the current object, we simply abandon the sorting run, reset all pointers to their most generic versions, and restart the sorting process. The "bailout status" in entry 08 serves as this reset signal. "Compare" can point to two different built-ins: one calls the user-provided comparison function, while the other implements the default comparison described above (calling toString on both arguments and then comparing lexicographically).
The remaining fields (apart from entry 14, the fast path ID) are Timsort-specific. The run stack (described above) is initialized with a length of 85, which is enough to sort arrays of length up to 2⁶⁴. The temporary array for merging runs grows on demand, but never exceeds n/2, where n is the length of the input array.
Performance trade-offs
Moving Array#sort from self-hosted JavaScript to Torque involves performance trade-offs. Since Array#sort is now written in Torque, it is a statically compiled piece of code: we can still build fast paths for certain ElementsKinds, but it will never be as fast as a highly optimized TurboFan version that can exploit type feedback. On the other hand, with the self-hosted JavaScript implementation, whenever the code was not hot enough for JIT compilation or the call site was megamorphic, we were stuck in the interpreter or a slow generic version. All the parsing, compiling, and possible optimizing of the self-hosted implementation is also gone with the Torque version.
While the Torque implementation doesn't reach the peak performance of the optimized JavaScript version, it does avoid performance cliffs. The result is sorting performance that is much more predictable than before. Note that Torque is very much in flux, and in addition to targeting CSA it may target TurboFan in the future, allowing code written in Torque to be JIT-compiled.
Microbenchmarks
Before we started with Array#sort, we added many different microbenchmarks to better understand the impact the rewrite would have. The first chart shows the "normal" use case of sorting various ElementsKinds with a user-provided comparison function.
Keep in mind that the JIT compiler can do a lot of work in these cases, since sorting is nearly all we do. This also allows the optimizing compiler to inline the comparison function in the JavaScript version, while in the Torque version we pay the call overhead from the built-in back into JavaScript. Even so, the new Timsort implementation performs better in almost all cases.
The next chart shows the impact of Timsort when processing arrays that are already fully sorted, or that contain subsequences sorted one way or another. Using Quicksort as a baseline, the chart shows Timsort's speedup (up to 17× in the "DownDown" case, where the array consists of two reverse-sorted subsequences). As can be seen, Timsort performs better in all cases except random data, even though we are sorting PACKED_SMI_ELEMENTS, where Quicksort outperformed Timsort in the microbenchmark above.
Web Tooling Benchmark
The Web Tooling Benchmark is a collection of workloads of tools commonly used by web developers, such as Babel and TypeScript. The chart below shows Timsort's speedup using the JavaScript Quicksort as a baseline. We achieve nearly the same performance in all benchmarks except chai.
The chai benchmark spends a third of its time in a single comparison function (a string distance calculation). The benchmark is chai's own test suite. Due to the shape of the data, Timsort needs more comparisons in this case, which has an outsized impact on the overall runtime, since so much of the time is spent in that particular comparison function.
Memory impact
Analyzing V8 heap snapshots of roughly 50 sites (on both mobile and desktop) showed no memory regressions or improvements. On the one hand, this is surprising: the switch from Quicksort to Timsort introduced a temporary array for merging runs, which can grow much larger than the temporary arrays Quicksort used for sampling. On the other hand, these temporary arrays are very short-lived (only for the duration of the sort call) and can be allocated and discarded very quickly in V8's new space.
Conclusion
In summary, we feel much better about the algorithmic properties and predictable performance of a Timsort implemented in Torque. Timsort is available starting with V8 v7.0 and Chrome 70. Happy sorting!
Simon Zünd (@nimodota), consistent comparator.