Abstract:

On December 13-14, the "2017 Alibaba Double 11 Technology Twelve Lectures", co-hosted by the Alibaba Cloud community and the Alibaba Technology Association, concluded successfully, sharing the technology behind the 2017 Double 11. This article is a compilation of the talk "Distributed Caching under Double 11's Trillion-Scale Traffic". It starts from the development and application of Tair, then discusses the challenges posed by Double 11, focuses on the practice of performance optimization, and finally presents the solution to the cache hot-spot problem. It reads as follows.

Speaker:



Zong Dai: Senior technical expert at Alibaba. He joined Taobao in 2008 and is in charge of Alibaba's distributed cache and NoSQL database Tair, as well as Tengine.

Tair overview

Tair development history

Tair is widely used in Alibaba. Whether you browse and place orders on Taobao or Tmall, or browse and watch videos on Youku, Tair quietly supports the huge traffic behind the scenes. The development history of Tair is as follows:

  • 2010.04: Tair V1.0 officially launched in Taobao's core system;
  • 2012.06: Tair V2.0 launched the LDB persistence product to meet persistent-storage requirements;
  • 2012.10: Launched the RDB cache product, introducing Redis-like interfaces to meet the storage needs of complex data structures;
  • Launched the FastDump product for full-import scenarios based on LDB, greatly reducing import time and access latency;
  • 2014.07: Tair V3.0 officially launched, with performance improved by X times;
  • 2016.11: The Taidou intelligent O&M platform launched, helping Double 11 enter the 100-billion era in 2016;
  • Another performance leap, hot hashing, and resource scheduling to support trillion-scale traffic.

Tair is a high-performance, distributed, scalable, and highly reliable key/value storage system. Its features are mainly reflected in the following aspects:

  • High performance: low latency is guaranteed under high throughput. Tair is one of the systems with the largest internal call volume in Alibaba Group; during Double 11 it handles 500 million calls per second with average access latency below 1 ms;
  • High availability: automatic failover provides disaster recovery within and across data centers, ensuring the system keeps running under any circumstances;
  • Large scale: deployed in data centers around the world and used by every BU of Alibaba Group;
  • Business coverage: e-commerce, Ant Financial, Heyi (Youku Tudou), Alimama, AutoNavi, Alibaba Health, etc.

In addition to the functions provided by an ordinary key/value system, such as GET, PUT, DELETE, and batch interfaces, Tair offers extra utility functions that make it applicable to a wider range of scenarios. There are four typical Tair application scenarios (a usage sketch follows the list):

1. Typical MDB application scenarios: used as a cache to reduce access pressure on the back-end database, for example Taobao product data is cached in Tair; also used for temporary data storage, where losing some data has little impact on the business. Workloads are read-heavy with few writes, and read QPS reaches tens of thousands or more.

2. Typical LDB application scenarios: general KV storage, transaction snapshots, security risk control, etc. It stores blacklist/whitelist data with very high read QPS, and counters that are updated very frequently and whose data must not be lost.

3. Typical RDB application scenarios: caching and storage of complex data structures, such as playlists and live-streaming rooms.

4. Typical FastDump application scenarios: offline data is periodically and quickly imported into the Tair cluster so that new data can be put to use quickly; online reads demand low latency with no latency spikes.
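To make the interfaces above concrete, here is a minimal cache-aside sketch in C++ for the MDB-style caching scenario. The TairClient type below is a hypothetical, in-memory stand-in for a real client; load_item, the "item:" key prefix, and the TTL value are illustrative, not part of Tair's actual API.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

struct TairClient {                       // hypothetical stand-in for a real client
    std::unordered_map<std::string, std::string> kv;
    void put(const std::string& k, const std::string& v, int /*expire_sec*/) { kv[k] = v; }
    std::optional<std::string> get(const std::string& k) {
        auto it = kv.find(k);
        if (it == kv.end()) return std::nullopt;
        return it->second;
    }
    void remove(const std::string& k) { kv.erase(k); }
};

// Cache-aside read: try the cache first, fall back to the database on a miss.
std::string load_item(TairClient& cache, const std::string& item_id) {
    if (auto hit = cache.get("item:" + item_id)) return *hit;   // MDB-style hot read
    std::string row = "row-from-db:" + item_id;                 // stand-in for a DB query
    cache.put("item:" + item_id, row, 60);                      // cache with a short TTL
    return row;
}
```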

The Double 11 challenge



The figure shows data from 2012 to 2017. In 2012, GMV was less than 20 billion yuan; in 2017 it reached 168.2 billion yuan. The peak order-creation rate grew from 14,000 to 325,000 per second, and Tair's peak QPS grew from 13 million to nearly 500 million. In terms of growth, Tair's access peak grows faster than the transaction-creation peak, which in turn grows faster than total GMV; the question is how to keep cost growth from exceeding the growth of the transaction peak. For a distributed system that shards data, the cache hot-spot problem is hard to solve, and the hot-spot traffic is especially large; after Double 11 2017, we had solved it completely. Moreover, our trading and import systems are deployed across multiple regions, multiple units, and multiple data centers, which raises the question of how to deploy such business systems rapidly.

Multi-region and multi-unit



In the multi-region, multi-unit deployment, the center is the central data center, where we implemented dual-data-center disaster recovery, with several units on each side. Within a data center the system is layered as shown in the diagram: traffic flows from the unified access layer to the application layer, middleware, Tair, and finally the database, and at the data layer we perform multi-region data synchronization.

Elastic site



We need to be able to build sites elastically. Tair is a complex distributed storage system, so we built a complete operation and control platform for it. When a site is built elastically, this platform drives the construction of the Tair system through a series of steps, and we verify at the system level, the cluster level, and the instance-connectivity level to make sure the system is functionally correct and stable before it goes live.

In practice, the water level (resource utilization) of each Tair service cluster is different. In every full-link stress test before Double 11, the Tair resources each business uses change as the business model changes, so the water levels change as well. We therefore need to schedule Tair resources across clusters after each stress test: servers are moved from clusters with low water levels to clusters with high water levels until all cluster water levels are close.
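The sketch below illustrates this rebalancing idea under simple assumptions (identical servers, one server moved per step); Cluster, level(), and rebalance() are illustrative names, not Tair's actual scheduler.

```cpp
#include <algorithm>
#include <cmath>
#include <string>
#include <vector>

struct Cluster {
    std::string name;
    double used;     // resources currently consumed by the businesses on this cluster
    int servers;     // number of identical servers; capacity is proportional to this
    double level() const { return used / servers; }   // "water level" = utilization
};

// Move servers from low-water-level clusters to high-water-level clusters until the
// levels are close (or a further move would no longer shrink the gap).
void rebalance(std::vector<Cluster>& clusters, double tolerance = 0.05) {
    if (clusters.size() < 2) return;
    for (int moves = 0; moves < 10000; ++moves) {
        auto cmp = [](const Cluster& a, const Cluster& b) { return a.level() < b.level(); };
        auto lo = std::min_element(clusters.begin(), clusters.end(), cmp);
        auto hi = std::max_element(clusters.begin(), clusters.end(), cmp);
        double gap = hi->level() - lo->level();
        if (gap < tolerance || lo->servers <= 1) return;
        // Simulate moving one server and stop if it would not reduce the gap.
        double new_lo = lo->used / (lo->servers - 1);
        double new_hi = hi->used / (hi->servers + 1);
        if (std::fabs(new_lo - new_hi) >= gap) return;
        lo->servers -= 1;      // the low-water-level cluster gives up one server...
        hi->servers += 1;      // ...and the high-water-level cluster receives it
    }
}
```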

Data synchronization



For unitized businesses, each unit can access its local Tair; for non-unitized businesses we also provide more flexible access models. Reducing synchronization latency is something we work on continuously; in 2017, data synchronization reached tens of millions of records per second. How to better resolve the conflicts that arise when non-unitized businesses write data in multiple units is something we are still thinking about.

Performance optimization and cost reduction

Our goal is that server cost does not grow linearly with traffic; instead, the unit cost falls by 30 to 40 percent every year. We achieve this mainly through server-side performance optimization, client-side performance optimization, and tailored solutions for different businesses.

Memory data structure



The figure is a schematic of MDB's in-memory data structure. When the process starts, it allocates a large chunk of memory and organizes it internally into, mainly, a slab allocator, a hash map, and a memory pool. When memory is full, data is evicted through the LRU chain. As the number of CPU cores per server keeps growing, overall performance cannot improve unless lock contention is handled well.
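The following is a highly simplified sketch of that layout. It models only the hash-map index plus the LRU eviction chain, and approximates the slab allocator and memory pool with a simple item-count cap; MiniMdb and its members are illustrative names, not Tair's implementation.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

struct Item {
    std::string key;
    std::string value;
};

class MiniMdb {
public:
    explicit MiniMdb(size_t max_items) : capacity_(max_items) {}

    void put(const std::string& key, const std::string& value) {
        auto it = index_.find(key);
        if (it != index_.end()) {                  // overwrite: move entry to LRU head
            it->second->value = value;
            lru_.splice(lru_.begin(), lru_, it->second);
            return;
        }
        if (lru_.size() >= capacity_) evict_one(); // "memory full": evict from LRU tail
        lru_.push_front(Item{key, value});
        index_[key] = lru_.begin();
    }

    const std::string* get(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        lru_.splice(lru_.begin(), lru_, it->second);   // touch: move to LRU head
        return &it->second->value;
    }

private:
    void evict_one() {
        index_.erase(lru_.back().key);
        lru_.pop_back();
    }

    size_t capacity_;
    std::list<Item> lru_;                                    // LRU chain
    std::unordered_map<std::string, std::list<Item>::iterator> index_;  // hash map
};
```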



Drawing on various papers and operating-system designs, we applied fine-grained locking, lock-free data structures, CPU-native data structures, and read-copy-update (RCU, so linked lists can be read without taking a lock). The left figure shows how much each functional module spent on lock contention before optimization; the network and data-lookup parts consumed the most. After optimization (right figure), 80% of the processing time is spent on networking and data lookup, which matches our expectation.
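As one concrete example of reducing lock granularity, the sketch below shards a hash table so that each shard has its own mutex and threads touching different shards no longer contend. It illustrates only the fine-grained-locking step (not RCU or lock-free structures), and StripedMap is an illustrative name rather than Tair's code.

```cpp
#include <array>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

class StripedMap {
    static constexpr size_t kShards = 64;
    struct Shard {
        std::mutex mu;
        std::unordered_map<std::string, std::string> map;
    };
    std::array<Shard, kShards> shards_;

    Shard& shard_for(const std::string& key) {
        return shards_[std::hash<std::string>{}(key) % kShards];
    }

public:
    void put(const std::string& key, const std::string& value) {
        Shard& s = shard_for(key);
        std::lock_guard<std::mutex> lk(s.mu);     // lock only this shard, not the table
        s.map[key] = value;
    }

    std::optional<std::string> get(const std::string& key) {
        Shard& s = shard_for(key);
        std::lock_guard<std::mutex> lk(s.mu);
        auto it = s.map.find(key);
        if (it == s.map.end()) return std::nullopt;
        return it->second;
    }
};
```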

User-mode protocol stack



After the lock optimization, we found that a lot of CPU time was still being spent in kernel mode. We therefore adopted DPDK + Alisocket to replace the original kernel-mode protocol stack: Alisocket uses DPDK to receive NIC packets in user mode and runs its own protocol stack, exposing a socket API for integration. The figure shows a comparison with similar systems in the industry.
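For a feel of what a user-mode receive path looks like, here is a minimal DPDK-style polling loop. It is only a sketch of the kernel-bypass idea, not Alisocket (whose API is not public); error handling and TX-queue setup are omitted, and real deployments need more NIC configuration.

```cpp
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

static const uint16_t kRxRingSize = 1024;
static const unsigned kNumMbufs = 8191;
static const unsigned kMbufCacheSize = 250;
static const uint16_t kBurstSize = 32;

int main(int argc, char* argv[]) {
    if (rte_eal_init(argc, argv) < 0)   // initialize the DPDK environment abstraction layer
        return -1;

    // Pre-allocate packet buffers in a huge-page-backed mempool.
    rte_mempool* pool = rte_pktmbuf_pool_create(
        "MBUF_POOL", kNumMbufs, kMbufCacheSize, 0,
        RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

    // Take over port 0 from the kernel: one RX queue, polled from user space.
    const uint16_t port = 0;
    rte_eth_conf port_conf = {};
    rte_eth_dev_configure(port, 1 /*rx queues*/, 0 /*tx queues*/, &port_conf);
    rte_eth_rx_queue_setup(port, 0, kRxRingSize,
                           rte_eth_dev_socket_id(port), nullptr, pool);
    rte_eth_dev_start(port);

    // Busy-poll the NIC: packets never enter the kernel protocol stack.
    for (;;) {
        rte_mbuf* bufs[kBurstSize];
        const uint16_t nb_rx = rte_eth_rx_burst(port, 0, bufs, kBurstSize);
        for (uint16_t i = 0; i < nb_rx; ++i) {
            // A user-space protocol stack would parse Ethernet/IP/TCP here and
            // hand the payload to the socket-like API exposed to the server.
            rte_pktmbuf_free(bufs[i]);
        }
    }
    return 0;
}
```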

Memory consolidation



Once single-machine performance is high enough, the next problem is that the amount of memory behind each unit of QPS shrinks. How do we solve that? We found that public clusters still had free memory overall, yet some slab classes could not accept writes: for example, the 64-byte slab class was not full while the 72-byte slab class was, and the memory pool was exhausted. So we merge the partially used pages of the 64-byte slab class with each other, which releases a large number of free pages. This inevitably introduces a mode lock: when the merge threshold is triggered, the instance switches into the merge (locked) state. Merging is done during off-peak periods with low traffic, so it causes no problem for peak-time business.
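Below is a conceptual sketch of that page-merge step, assuming items of one size class can simply be moved between pages (in reality each move must also update the hash-map entry that points at the item); SlabPage and merge_free_pages are illustrative names, not Tair's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct SlabPage {
    size_t used = 0;       // live items currently stored on this page
    size_t capacity = 0;   // items one page of this size class can hold
};

// Within one slab size class, drain sparsely used pages into fuller pages so that
// emptied pages can be returned to the shared memory pool. Returns pages freed.
size_t merge_free_pages(std::vector<SlabPage>& pages) {
    // Sort sparsest pages first so we drain them into the fullest pages.
    std::sort(pages.begin(), pages.end(),
              [](const SlabPage& a, const SlabPage& b) { return a.used < b.used; });
    size_t freed = 0;
    size_t src = 0;
    size_t dst = pages.empty() ? 0 : pages.size() - 1;
    while (src < dst) {
        size_t room = pages[dst].capacity - pages[dst].used;
        if (room == 0) { --dst; continue; }               // destination page is full
        size_t moved = std::min(room, pages[src].used);
        pages[src].used -= moved;    // in reality, moving an item also rewrites the
        pages[dst].used += moved;    // hash-map pointer to its new location
        if (pages[src].used == 0) { ++freed; ++src; }     // page empty: return to pool
    }
    return freed;
}
```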

Client optimization



We optimized the client in two areas. First, we replaced the network framework and adapted it to coroutines, replacing the original MINA with Netty, which increased throughput by 40%. Second, we optimized serialization by integrating Kryo and Hessian, which increased throughput by more than 16%.

Memory grid



How can we integrate with the business to reduce the combined cost of Tair and the business? Tair provides multi-level storage integration to solve this. Take the security risk-control scenario: it has a huge volume of reads and writes and a lot of local computation, so we keep the data the business needs on the business machines themselves. Most reads then hit locally, and writes are merged for a period of time before the merged result is written to the remote Tair cluster, which remains the final store. We provide read-through and write-through, including merged writes, on top of Tair's existing multi-unit replication. During Double 11, reads against Tair dropped to 27.68% and writes to 55.75% of their original volume.
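Here is a rough sketch of the read-through plus write-merge behavior described above, assuming a counter-style workload where writes are numeric deltas that can be merged by addition; RemoteTair and MemoryGrid are hypothetical stand-ins, not the real memory-grid component.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

struct RemoteTair {                             // hypothetical remote-cluster client
    std::unordered_map<std::string, long> kv;
    std::optional<long> get(const std::string& k) {
        auto it = kv.find(k);
        if (it == kv.end()) return std::nullopt;
        return it->second;
    }
    void put(const std::string& k, long v) { kv[k] = v; }
};

class MemoryGrid {
public:
    explicit MemoryGrid(RemoteTair& remote) : remote_(remote) {}

    // Read-through: serve from the local cache, fall back to the remote cluster.
    long read(const std::string& key) {
        auto it = local_.find(key);
        if (it != local_.end()) return it->second;
        long v = remote_.get(key).value_or(0);
        local_[key] = v;
        return v;
    }

    // Write merge: accumulate deltas locally (e.g. risk-control counters) instead of
    // sending every single write to the remote cluster.
    void add(const std::string& key, long delta) { pending_[key] += delta; }

    // Called periodically ("after a certain period") to push merged writes remotely.
    void flush() {
        for (auto& [key, delta] : pending_) {
            long merged = remote_.get(key).value_or(0) + delta;
            remote_.put(key, merged);           // remote Tair remains the final store
            local_[key] = merged;
        }
        pending_.clear();
    }

private:
    RemoteTair& remote_;
    std::unordered_map<std::string, long> local_;     // local read cache
    std::unordered_map<std::string, long> pending_;   // merged, not-yet-flushed writes
};
```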

Hot-spot problems have been solved

Cache breakdown



Caches evolved from a single node into distributed systems organized by data shards, but each data shard still behaves as a single point. During a big promotion or a piece of hot news, the hot data usually sits on a single shard, producing single-point access: one cache node cannot handle the pressure, so a large number of requests go unanswered; even rate limiting is lossy and can still lead to a system-wide crash. We found that the root cause was hot-spot access, and it had to be solved completely.

Hot hash



After exploring many schemes, we adopted the hot-hash scheme. We evaluated client-side local caches and second-level cache solutions; both can relieve hot spots to some extent, but each has drawbacks. In hot-hash mode, a hotzone area is added directly on the data nodes to store hot data. The most critical steps of the scheme are:

  • Intelligent identification. Hot data changes constantly; a hot spot may be a frequency hot spot or a traffic hot spot.
  • Real-time feedback. A multi-level LRU data structure is used, with different weights assigned to different LRU levels; when the LRU is full, data is evicted from the low-level LRU chain first, so entries with higher weights are retained.
  • Dynamic hashing. When a hot spot is accessed, the AppServer and the server cooperate to dynamically hash the hot key onto the hotzones of other data nodes according to a preset access model; every node in the cluster takes on this role.

In this way, traffic that would otherwise land on a single point is spread across a number of machines in the cluster.
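A simplified client-side view of that routing decision is sketched below: normal keys go to their owning shard, while keys reported as hot are spread across all nodes' hotzones. Hot-spot identification and the real feedback protocol are not shown, and HotHashRouter and its routing scheme are illustrative, not the production design.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

class HotHashRouter {
public:
    explicit HotHashRouter(std::vector<std::string> nodes) : nodes_(std::move(nodes)) {}

    // Called when the server reports this key as a hot spot (real-time feedback).
    void mark_hot(const std::string& key) { hot_keys_.insert(key); }

    // Pick the node that should serve this read (nodes_ is assumed non-empty).
    const std::string& route(const std::string& key, size_t request_seq) const {
        size_t h = std::hash<std::string>{}(key);
        if (hot_keys_.count(key)) {
            // Hot key: spread successive requests over every node's hotzone.
            return nodes_[(h + request_seq) % nodes_.size()];
        }
        return nodes_[h % nodes_.size()];       // normal key: its owning shard
    }

private:
    std::vector<std::string> nodes_;
    std::unordered_set<std::string> hot_keys_;
};
```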



As the figure shows, at midnight on Double 11 we absorbed more than 8 million hot-spot requests. Without hot hashing, the pre-hash load would have exceeded the fatal water level.

Write hot spots



Write hot spots are handled in a way similar to read hot spots: the hot key is identified in real time, and the I/O thread then hands requests for that key to a merge thread instead of processing them directly; a scheduled processing thread later submits the merged result to the storage layer.
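Below is a conceptual sketch of that write-merging path, assuming counter-style writes that can be merged by addition; WriteMerger, submit, and run_flush_loop are illustrative names, not Tair's actual thread model.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>

class WriteMerger {
public:
    // Called from I/O threads for requests whose key was identified as a write hot spot.
    void submit(const std::string& key, long delta) {
        std::lock_guard<std::mutex> lk(mu_);
        merged_[key] += delta;          // many writes collapse into one pending entry
    }

    // Scheduled processing: periodically push merged values into the storage layer
    // (here a plain map stands in for the real storage engine).
    void run_flush_loop(std::unordered_map<std::string, long>& storage,
                        std::chrono::milliseconds interval,
                        const std::atomic<bool>& stop) {
        while (!stop) {
            std::this_thread::sleep_for(interval);
            std::unordered_map<std::string, long> batch;
            {
                std::lock_guard<std::mutex> lk(mu_);
                batch.swap(merged_);     // take the pending writes in one shot
            }
            for (auto& [key, delta] : batch) storage[key] += delta;  // one write per key
        }
    }

private:
    std::mutex mu_;
    std::unordered_map<std::string, long> merged_;
};
```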

Having passed the Double 11 test for both read and write hot spots, we can now say with confidence that we have completely solved the cache's read and write hot-spot problems, including for the KV storage part.