
Ranked first in Java in Alibaba Cloud's first PolarDB database performance competition

I participated in the Tianchi competition: Alibaba Cloud's first PolarDB database performance competition. The contest was set against NVMe Optane SSDs: build a single-machine storage engine on top of them and compete on performance, with both C++ and Java supported. I finished first in Java and 20th overall (1,653 participants, team name: neoremind), 2.1% (under 9s) behind the first-place C++ entry. Java has some natural disadvantages here, but as a long-time Java developer I still wanted to take on the challenge.

This article is the solution report; the source is at github.com/neoremind/2…

Problem introduction

Using Intel Optane SSDs as storage, with cgroup limiting memory to 3GB (for Java), implement a simplified, efficient KV storage engine that supports Write, Read, and Range interfaces.
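
For concreteness, here is a minimal sketch of the engine surface implied by the problem statement. Names are illustrative; the actual contest SDK may differ.

```java
public interface KVEngine {

    // key is 8 bytes, value is 4KB
    void write(byte[] key, byte[] value);

    // point lookup; returns the latest value written for key
    byte[] read(byte[] key);

    // iterate [lower, upper) in global key order; empty bounds mean unbounded
    void range(byte[] lower, byte[] upper, Visitor visitor);

    interface Visitor {
        void visit(byte[] key, byte[] value);
    }
}
```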

The evaluation process is divided into two stages:

  1. Correctness evaluation

At this stage, the evaluation program concurrently writes specific data (key 8B, value 4KB) and executes kill -9 at arbitrary moments to simulate unexpected process exits (the engine must guarantee that persisted data is not lost when the process exits unexpectedly), then reopens the DB and invokes the Read and Range interfaces for correctness verification.

  2. Performance evaluation

    2.1 Random writes: 64 threads write concurrently; each thread uses Write to write random data (key 8B, value 4KB) 1 million times.

    2.2 Random reads: 64 threads read concurrently; each thread uses Read 1 million times.

    2.3 Sequential reads: 64 threads read sequentially and concurrently; each thread iterates over all DB data twice, in global key order, using Range.

Additional notes:

  1. The page cache is dropped at the end of each phase, and the drop time is included in the total time.
  2. Read and Range verify that key and value match, and Range additionally verifies ordering.

Thinking before implementation, and the final result

A single-machine KV engine must deliver high-throughput writes, low-latency point lookups, range queries, and crash consistency. LevelDB and RocksDB, built on WAL + LSM-Tree, were the first ideas that came to mind, but both engines target general workloads. The LSM-Tree architecture converts random writes into sequential writes, paying for it with the write and read amplification of multi-level compaction and lookups. That trade-off made sense in the HDD era, when a random disk write cost up to 1000x more than a sequential one; in the SSD era the cost gap is far smaller, and SSDs handle concurrent reads and writes extremely well. So copying an LSM-Tree directly into this contest is not advisable, especially given the problem's characteristics:

1) fixed-length KV, 2) large 4KB values, 3) 64 concurrent queries.

Following the LSM-Tree line of thought, I recalled the paper "WiscKey: Separating Keys from Values in SSD-Conscious Storage": maximize bandwidth and exploit SSD characteristics such as high sequential I/O throughput and high random-access concurrency. This key-value-separated storage structure optimized for SSDs fits the topic well: large values and concurrent queries. I shaped the storage design and engine implementation around the insights of this paper, and the remaining fixed-length-KV requirement further simplifies the implementation.

The goal of the race is the shortest total time, so the problem becomes how to squeeze the I/O dry and reach maximum throughput.

(4KB value + 8B key) × 64 threads × 1M ≈ 256GB written; random reads also total 256GB; Range scans the 256GB twice, for 512GB. According to Intel's official numbers for the Optane SSD: sequential write 2.0GB/s, sequential read 2.4GB/s, random read 550K IOPS, random write 500K IOPS, read/write latency 10µs. The measured throughput and IOPS came out slightly above the official figures: roughly 2.2GB/s for sequential writes and 2.5GB/s for sequential reads. The theoretical limit is therefore about 410s.
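
For reference, the arithmetic behind that estimate, using the measured bandwidths (the ~2.65GB/s Range figure matches the results below):

```
Write: 256GB / 2.2GB/s  ≈ 116s
Read:  256GB / 2.5GB/s  ≈ 102s
Range: 512GB / 2.65GB/s ≈ 193s
Total:                  ≈ 411s
```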

The finishing result of the first-place C++ player:

413.69s (Write 116s + Read 103s + Range 193s), Write throughput 2.21GB/s, Read throughput 2.49GB/s, Range throughput 2.65GB/s

Basically squeezed the disk dry.

My result, first place in Java:

422.31s (Write 116s + Read 109s + Range 196s), Write throughput 2.21GB/s, Read throughput 2.35GB/s, Range throughput 2.61GB/s

Storage design

Instead of the LSM-Tree model, the WiscKey idea of separating keys from values is used.

The WAL (write-ahead log) stores the key together with the offset of its value in the vlog; the vlog itself is written sequentially. Both WAL and vlog are fixed-length, append-only writes. The vlog seq is stored as a 4-byte int in the program's native byte order; multiplied by 4096 it yields the offset within the vlog file. Since the problem requires no deletes, vlog GC is not a concern.

Both WAL and vlog are sequential I/O writes, without the write amplification of the LSM-Tree model.

Because of the KV separation, writes must be locked, and locking limits performance. Since the writes are random, conflicts can be reduced by divide and conquer: shard the data. My strategy splits keys into 1024 shards in lexicographic order: take the first byte of the 8-byte key plus the top two bits of the second byte and convert them to an int, which routes each key to its shard.
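
A sketch of that routing; shardOf and vlogOffset are illustrative names, not the contest code:

```java
// 10-bit shard id: the first key byte (8 bits) plus the top 2 bits of the
// second byte. Using the high-order bits keeps shard order consistent with
// key order, which the Range stage relies on later.
static int shardOf(byte[] key) {
    return ((key[0] & 0xFF) << 2) | ((key[1] & 0xFF) >>> 6);
}

// vlog seq (unsigned 4-byte int) -> byte offset of the 4KB value in the vlog.
static long vlogOffset(int vlogSeq) {
    return (vlogSeq & 0xFFFFFFFFL) * 4096L;
}
```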

Implementation analysis: Write

With 1024 shards, each shard writes under its own lock; the flow is as follows:

```java
synchronized (lock) {
    writeVlog(value);        // append the 4KB value to the vlog
    writeWal(key, vlogSeq);  // append 8-byte key + 4-byte vlog seq to the wal
    vlogSeq++;
}
```

The available I/O modes are buffered I/O and direct I/O; for buffered I/O, both VFS read/write and mmap are candidates.

A Java FileChannel write goes byte[] -> off-heap direct memory -> kernel page cache -> disk, while an mmap write goes byte[] -> memory-mapped address (also page cache) -> disk, one copy fewer, so the WAL is written through mmap. When closing the DB, truncate removes the unused tail of the mapping. Since vlog seq is monotonically increasing, a seq of 0x00000000 encountered while reading the WAL marks the end of valid data, and everything after it can be discarded.

Total WAL file size = (8-byte key + 4-byte vlog seq) × 64 threads × 1M = 768MB. Since the evaluation program's writes are random enough, each shard's WAL is about 768MB/1024 = 750KB, so each shard can be directly mapped via mmap as a file of 12 bytes << 16 in size. An ensure-capacity check triggers a re-mmap, which copes with the uneven file sizes of the correctness test, whose writes are concentrated in five shards.
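
A sketch of the per-shard mmapped WAL append with the ensure-capacity re-mmap; class and field names are mine, and the doubling policy is an assumption:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class WalWriter {
    private static final int RECORD = 12;         // 8-byte key + 4-byte vlog seq
    private final FileChannel channel;
    private MappedByteBuffer buf;
    private long mappedEnd = (long) RECORD << 16; // 768KB initial mapping

    public WalWriter(String path) throws Exception {
        channel = new RandomAccessFile(path, "rw").getChannel();
        buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, mappedEnd);
    }

    // Called with the shard lock held, per the write flow above.
    public void append(byte[] key8, int vlogSeq) throws Exception {
        if (buf.remaining() < RECORD) {
            remap();  // skewed correctness-test shards overflow the 768KB guess
        }
        buf.put(key8);
        buf.putInt(vlogSeq);
    }

    private void remap() throws Exception {
        long start = mappedEnd;
        mappedEnd *= 2;
        buf = channel.map(FileChannel.MapMode.READ_WRITE, start, mappedEnd - start);
    }
}
```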

The size of a single vlog file is 4KB × 64M / 1024 = 256MB.

The evaluation program drops the page cache at the end of each phase (echo 3 > /proc/sys/vm/drop_caches) and counts the drop time in the total, so dirty pages left behind by buffered writes make that cleanup slow and are charged to the score. Therefore direct I/O is used for the vlog. Since the JDK does not provide a direct I/O API, the options are to use the JNA library directly, to call through JNI, or to use Jaydio, which wraps JNA. I simply used the JNA API.

With sequential I/O settled, writes can be batched into large blocks to fill the bandwidth. To preserve crash consistency, an mmap staging area sits in front of the direct I/O: four values (16KB) are accumulated there before each disk write. If the DB is closed cleanly, the mmap staging files are deleted; otherwise their contents are appended back for recovery during the next initialization.
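
A sketch of that staging idea under those assumptions; directWrite stands in for the JNA direct-I/O path, and names are illustrative:

```java
import java.nio.MappedByteBuffer;

public class VlogStaging {
    private static final int VALUE = 4096, BATCH = 4;  // flush 16KB at a time
    private final MappedByteBuffer staging;  // mmapped tmp file, survives kill -9
    private int staged = 0;

    public VlogStaging(MappedByteBuffer staging) {
        this.staging = staging;
    }

    public void write(byte[] value4k) {
        staging.position(staged * VALUE);
        staging.put(value4k);        // durable against process crash via mmap
        if (++staged == BATCH) {
            directWrite(staging, BATCH * VALUE);  // one aligned 16KB direct IO
            staged = 0;
        }
    }

    // Clean close: flush any partial batch, then delete the staging file.
    // Crash: the next initialization replays the staging file for recovery.
    private void directWrite(MappedByteBuffer src, int len) { /* JNA pwrite */ }
}
```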

The whole write phase triggered only 4 young GCs and no full GC. The on-CPU flame graph for the write phase shows 86% I/O + 4% lock overhead + other. The target was basically achieved.

Implementation analysis: Read

On database initialization, as explained in the write analysis above, both WAL and vlog need some work to guarantee crash consistency. The next question is how to build an index that supports point lookups.

The total WAL size is 768MB, so the index can live entirely in memory.

1) Load the WAL as 8-byte key + 4-byte vlog seq entries.

2) Sort the 12-byte entries lexicographically by key; for duplicate keys keep the entry with the largest vlog seq.

3) Keep the sorted data in memory as the index.

Each shard is independent, so this can run in parallel. This step is a weak spot in Java: loading with 64-way concurrency takes 1.5-2.5s with jitter, versus roughly 300ms in C++. The sort can compare bytes directly, or convert each key to an unsigned long. Since the Range stage needs the same sorted view, to avoid repeating the work the sorted, de-duplicated keys + vlog seq can optionally be persisted to disk as a wal.sort file.
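
For the sort, one option is to pack the 8 key bytes big-endian into a long and compare unsigned, which is equivalent to lexicographic byte order; a sketch:

```java
// Big-endian packing makes unsigned-long order equal to byte-lexicographic
// order, so keys can be sorted with Long.compareUnsigned.
static long keyAsLong(byte[] wal, int record) {
    long v = 0;
    for (int i = 0; i < 8; i++) {
        v = (v << 8) | (wal[record * 12 + i] & 0xFF);
    }
    return v;
}
```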

Key 8 bytes + vlog seq 4 bytes is a fixed-length record, so binary search works directly in memory. My implementation keeps the index in off-heap memory, malloc'ed and freed via the JDK's Unsafe, which avoids the heap's old region and GC overhead. The in-memory binary search is very cheap, roughly 1% of total time.
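
A sketch of the lookup over those fixed 12-byte off-heap records, assuming an Unsafe instance obtained reflectively, a little-endian host (hence the byte swap), and keys laid out big-endian:

```java
import sun.misc.Unsafe;

// base: address of `count` sorted 12-byte records (8B key + 4B vlog seq).
// Returns the vlog seq for a key packed as in keyAsLong above, or -1.
static long search(Unsafe unsafe, long base, int count, long key) {
    int lo = 0, hi = count - 1;
    while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        long addr = base + (long) mid * 12;
        // getLong reads native little-endian; swap to recover big-endian order
        long midKey = Long.reverseBytes(unsafe.getLong(addr));
        int cmp = Long.compareUnsigned(midKey, key);
        if (cmp == 0) {
            return unsafe.getInt(addr + 8) & 0xFFFFFFFFL;  // unsigned vlog seq
        } else if (cmp < 0) {
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return -1;  // not found
}
```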

A point lookup costs one in-memory binary search plus one disk I/O. Reading the 4KB value only requires restoring vlog seq × 4096 to the actual file offset. Whether to read the vlog through buffered or direct I/O deserves thought: if the evaluation program read keys in the same order it wrote them, locality would kick in, the OS read-ahead behind buffered I/O and the page cache would take effect, and reads would be much faster. But the reads are random, so buffered I/O and read-ahead are actually harmful, dragging useless data into the page cache and wasting bandwidth.

Reads therefore use direct I/O, which Java reaches through JNA: allocate an aligned buffer with int posix_memalign(void **memptr, size_t alignment, size_t size), then issue a pread(fd, memPointer, 4096, offset) system call to fill that address and copy the bytes up into the user-space heap. In practice there was little difference between the two schemes.
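
A sketch of the JNA plumbing, assuming JNA 5.x (Native.load); the interface name is mine and the O_DIRECT constant is Linux/x86-64 specific:

```java
import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.Pointer;
import com.sun.jna.ptr.PointerByReference;

public interface CLib extends Library {
    CLib INSTANCE = Native.load("c", CLib.class);

    int open(String pathname, int flags, int mode);
    int posix_memalign(PointerByReference memptr, long alignment, long size);
    long pread(int fd, Pointer buf, long count, long offset);
    int close(int fd);
}
```

A read of one value then looks roughly like this:

```java
int O_RDONLY = 0, O_DIRECT = 0x4000;            // Linux/x86-64 value
int fd = CLib.INSTANCE.open("/data/vlog_0", O_RDONLY | O_DIRECT, 0);
PointerByReference ref = new PointerByReference();
CLib.INSTANCE.posix_memalign(ref, 4096, 4096);  // 4KB-aligned buffer
Pointer buf = ref.getValue();
CLib.INSTANCE.pread(fd, buf, 4096, vlogOffset(vlogSeq));
byte[] value = buf.getByteArray(0, 4096);       // copy up into the Java heap
```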

One small point to avoid frequent young GC: the 64 read threads reuse their 4KB value buffers through a ThreadLocal, avoiding constant allocation (thanks to @Dao Wind, another Java player, for the reminder; see the link for his write-up).

Of the three stages, this one has the largest gap to C++: 2.35GB/s versus C++'s 2.49GB/s. The roughly 140MB/s gap can be ascribed to the slower index creation and to direct I/O via JNA being slower than a raw system call.

4 to 5 young GCs, no full GC.

Implementation analysis: Range

This is the part of the competition that opens the gap. 64 threads each Range twice, which naively means 128 full scans. The alternatives: have all 64 threads wait on a single shared scan that drives their visit callbacks round by round, or have the 64 threads advance in lockstep, each doing its own visits. I took the former and implemented an AccumulativeRunner utility class on top of the java.util.concurrent library.

The basic idea: each Range call submits a Range task and blocks. A background trigger fires once enough Range tasks have accumulated, or after a timeout such as 5s, and runs one full scan, calling back the visitors of all pending Range requests; when the scan completes, all blocked Range threads are notified and unblocked, and the second round proceeds the same way.
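
A minimal sketch of the accumulate-then-run core under those triggers, assuming a fixed caller count and leaving out the timeout path and visitor registration:

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class AccumulativeRunner {
    private final int expected;         // e.g. 64 concurrent Range callers
    private final Runnable sharedScan;  // one full scan serving all visitors
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition roundDone = lock.newCondition();
    private int waiting = 0;
    private long round = 0;

    public AccumulativeRunner(int expected, Runnable sharedScan) {
        this.expected = expected;
        this.sharedScan = sharedScan;
    }

    // Each Range call submits and blocks; the last arrival triggers the scan.
    // The real utility also fires on a timeout (e.g. 5s), not only on count.
    public void submitAndAwait() throws InterruptedException {
        lock.lock();
        try {
            long myRound = round;
            if (++waiting == expected) {
                sharedScan.run();       // calls back all registered visitors
                waiting = 0;
                round++;
                roundDone.signalAll();  // release this round's waiters
            } else {
                while (round == myRound) {
                    roundDone.await();
                }
            }
        } finally {
            lock.unlock();
        }
    }
}
```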

Next came how to perform the full scan efficiently. At first I tried "sliding window + concurrent random I/O reads" to exploit the SSD's random-read concurrency, but the results were disappointing: first, reading back through the page cache is not as fast as loading big I/O blocks straight into memory; second, Java's FileChannel carries an internal position lock that limits performance.

After abandoning that model I switched to "sliding window + concurrent memory lookups", which works well, close to full bandwidth. Because the keys are split into 1024 lexicographically ordered shards, the basic idea is to iterate the 1024 shards in order, serve each shard's accesses from memory, and stitch the shards together seamlessly. Each shard goes through three steps.

1) Prefetch: sort the WAL to build the index, and load the vlog into memory.

2) Range read: iterate the WAL; fetching the value for each key and vlog seq becomes a memory access, the essence of "concurrent memory lookups".

3) Evaluation visit: the evaluation program verifies ordering, value correctness, and so on, which also costs something.

To run these three steps seamlessly across the 1024 shards, a sliding window is used, with each shard in one of five states:

  • Visit finished
  • Range reading and visiting
  • Prefetch finished, ready to read
  • Prefetching
  • Not yet accessed

The sliding window holds at most three shards, so peak memory = vlog (256MB × 3) + WAL index (750KB × 3) ≈ 770MB, which stays safely under the cgroup memory limit.
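
One way to sketch that window bound is a Semaphore capping in-flight shards at three; structure and names here are illustrative:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class SlidingWindowScan {
    private final Semaphore window = new Semaphore(3);  // max 3 shards resident
    private final ExecutorService prefetchPool = Executors.newFixedThreadPool(3);

    public void scan() throws InterruptedException {
        for (int shard = 0; shard < 1024; shard++) {
            final int s = shard;
            window.acquire();                     // wait for a free window slot
            prefetchPool.submit(() -> prefetch(s)); // build index + load vlog
        }
    }

    // The range-read/visit side consumes shards strictly in order and frees
    // the slot afterwards, letting the window slide forward.
    public void onShardVisited(int shard) {
        releaseResources(shard);  // wal index, vlog cache, pooled MemPointers
        window.release();
    }

    private void prefetch(int shard) { /* sort wal + 8-way direct IO vlog load */ }
    private void releaseResources(int shard) { /* return MemPointers to pool */ }
}
```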

Among prefetch, Range read, and the evaluation program's visit, the bottleneck ought to be prefetch, since that is what fills the bandwidth. But, in the spirit of benchmark everything, measurement showed the latter two steps can drag as well; so the old serial-to-parallel trick applies again, building a multi-stage pipeline.

Prefetch consists of sorting the WAL to build the index, and loading the vlog into the cache; the two can run in parallel.

Index building loads the wal or wal.sort file into memory and, exactly as in the read phase, places the sorted keys and vlog seq into off-heap memory as the index for Range reads.

Vlog loading can itself be parallel: a single vlog shard is 256MB and is loaded as eight concurrent 32MB block I/O reads. The load could use mmap, FileChannel, or direct I/O; measured, direct I/O and FileChannel differ little, and the final version loads vlog files with direct I/O. For the destination memory, DirectBuffer versus Unsafe off-heap memory makes little difference, but with a DirectBuffer the subsequent get operations are not thread-safe, so addressing goes through the Unsafe copyMemory API, lock-free. Because loading uses direct I/O, the memPointers are pooled: 3 windows × 8 concurrent reads = 24 pooled 32MB memPointers.

Range reads KVs in batches of 256, two batches in parallel, and pushes the results into a lock-free queue. The evaluation program's visit runs in separate threads that poll the lock-free queue and execute the 64 visitor callbacks with 4-way parallelism over the KV data that was read. These two steps therefore never drag the pipeline.

For shards that have been fully visited, resources must be released: the WAL index, the vlog cache, and the memPointers returned to the pool.

Java implementation specifics

Java is at a certain disadvantage to C++ in this competition, which is exactly why first place in Java was only 20th overall. Still, with the gap to C++ at just 2.1%, under 9s, I was quite satisfied with the result.

Disadvantages of the JVM compared to C++ include:

1) It is not close enough to the metal: execution depends on the JVM interpreting bytecode, and although the JIT compiles hot code to native, it is still less direct.

2) There is GC overhead; anything cached on the heap means frequent GC, which hurts throughput. Concurrent GC alongside user threads alleviates this, but it must be kept to young GCs only, never full GC. This contest is I/O-bound, so CPU is in theory sufficient; 400%-600% CPU usage was typical.

3) OS APIs are awkward to reach: the native JDK does not support direct I/O, mmap memory is hard to release, binding threads to CPU cores is inconvenient, and so on.

To overcome these difficulties as a Java player, a few heavy weapons are needed, summarized in turn below.

1. mmap

Used in the write phase to keep the WAL crash-consistent. The JDK provides a native API, but releasing the mapping is relatively cumbersome.

2. Direct IO

Reached through a JNA wrapper, or via Jaydio; small I/Os are spliced into large I/O writes. FileChannel is not used in this contest because of its internal position lock and its buffered I/O, which do not fit this scenario; for most Java I/O work, though, NIO FileChannel remains the first choice.

3. Off-heap memory

Off-heap memory can be a DirectBuffer, or raw memory via Unsafe (malloc, free).

4. GC control

The JVM parameters for the competition were as follows:

-server -XX:-UseBiasedLocking -Xms2000m -Xmx2000m -XX:NewSize=1400m -XX:MaxMetaspaceSize=32m -XX:MaxDirectMemorySize=1G -XX:+UseG1GC

Young GC is rare in the write and read phases and appears mainly in the range phase: the multi-stage pipeline consumes memory heavily, so young GCs are relatively frequent there, but there is no full GC, which is acceptable.

5. Pooling technology

Direct memory is allocated once, pooled, and overwritten repeatedly to reuse resources. The read phase uses ThreadLocal to reuse value buffers and avoid frequent young GCs.

6. Lock control

With KV separation, writes must lock; the direct I/O loads in the read phase and the copies back to user space also need locks. Minimizing lock granularity and spreading out lock conflicts, like the idea behind ConcurrentHashMap, keeps lock overhead low.

7. Concurrency tools

Use java.util.concurrent well: the Range stage's ride-sharing model, the parallel vlog loading, and the sliding window all rely on thread pools, Lock, Condition, mutexes, and so on. Lock-free libraries are also worth trying, such as ConcurrentLinkedQueue, JCTools' MpmcArrayQueue, and the Disruptor's lock-free queue.

8. Reduce context switching

The contest ran on AliJDK, whose Wisp API provides Java coroutines, usable in scenarios that would otherwise block waiting on resources. After testing, vmstat -w 1 showed the cs column did drop a little, but it did not improve the score.

Thoughts on single-machine database engines

First, in my view the storage architecture design matters more than the quality of the engine implementation, which in turn matters more than the language choice; hence there is no qualitative difference between Java and C++ on this problem.

Second, in platform and application services, Java's engineering strengths, rich class libraries, and design patterns give it broad room. In distributed computing, compared with the gap between HDD and SSD, shaving milliseconds down to microseconds matters less: a distributed call already costs tens of ms across machine rooms and regions. That is why many open-source big data projects such as Spark, Hadoop, and Flink are mostly written in Java; it is easy to engineer and maintain, and it meets the need.

Third, considerations for a storage engine's implementation language:

1) Engineering, such as static typing and abstraction design

2) Tail latency control

3) Runtime overhead

Points 2 and 3 are Java's weak spots relative to C++, for example GC and JVM overhead, yet both are hard requirements for a DB. On the other hand, from the perspective of system layering, C++ and Rust suit the layers sensitive to 2 and 3, while Java and Go have their own scenarios where engineering concerns dominate.

Conclusion

This was my first time in an engineering competition of this kind; masters were everywhere and we fought over fractions of a second, which was very exciting. It was also my first time systematically exercising Java I/O technology, so the goal of learning and gaining experience was met. I hope I will have the energy to compete again in the future, learn from the masters, and keep improving. The road of technology is long; every step of accumulation is for a better tomorrow, shared here with you, the reader.

Please credit neoremind.com when reprinting.