Author: Sun Xiaoguang is the head of Zhihu's search backend, currently responsible for the architecture design of the Zhihu search backend and the management of its engineering team. He has worked on private cloud products for many years, focuses on cloud native technology, and is a TiKV project Committer.

This article is based on Sun Xiaoguang's talk at TiDB TechDay 2019 in Beijing.

The talk first introduces the business challenges of Zhihu's read service from a macro perspective: the scenario, the architecture, and the design ideas. It then describes how the key components are implemented from a micro perspective, and finally shares what problems TiDB solved for us along the way, and how TiDB helped us move this huge system fully onto the cloud and push it toward a very ideal state.

1. Business scenarios

Since its launch eight years ago, Zhihu has gradually grown into a large-scale comprehensive knowledge content platform. There are now as many as 30 million questions on Zhihu, which have received more than 130 million answers. Zhihu has also accumulated a large number of articles, e-books, and other paid content, and registered users number 220 million. These numbers are quite impressive. With 130 million answers and many more columns, how to effectively distribute the high-quality content that users are most interested in is a very important problem.

The Zhihu home page is the key entry point for traffic distribution, and the problem the read service helps the home page solve is how to recommend content that users are interested in while avoiding content they have already seen. The read service records all content that users have read in depth or quickly skimmed on Zhihu and keeps it for a long time, and this data is used for already-read filtering in the home page recommendation feed and in personalized push. Figure 2 shows a typical flow:

When a user opens Zhihu and enters the recommendation page, the client asks the home page service to pull "new content the user is interested in". The home page recalls new candidate content from multiple recall queues according to the user's profile. Some of the recalled content may already have been seen by the user, so before distributing it, the home page first sends the candidates to the read filtering service, processes the result further, and finally returns it to the client. The business process is quite simple.

The first characteristic of this business is the high availability requirement, because the home page is probably the most important traffic distribution channel of Zhihu. The second characteristic is the very large write volume, with a peak of 40K+ records written per second and nearly 3 billion new records added every day. In addition, we keep the data for a long time: according to the current product design it must be retained for three years. The system has accumulated about 1.3 trillion records so far, and at a rate of nearly 100 billion records per month it could grow to three trillion in about two years.

The query side of the business is also demanding. First, the throughput is high. The read service is queried at least once every time a user refreshes the home page, and the query throughput is further amplified by multiple recall sources and concurrency. At peak time the home page generates about 30,000 independent read queries per second, with an average of 400 documents checked per query and about 1,000 documents in the long tail. In other words, the entire system checks about 12 million documents per second at its peak. At this throughput level the response time requirement is also strict: the overall query response time (end-to-end timeout) must stay within 90ms, which means even the slowest long-tail query cannot exceed 90ms. Another characteristic is that false positives can be tolerated. Even if some content is mistakenly filtered out, the system can still recall enough content that users may be interested in, as long as the false positive rate is kept within an acceptable range.

2. Architecture design

Because of the importance of the Zhihu home page, we set three design objectives for this system: high availability, high performance, and easy extension. First, if users open the Zhihu home page and see a large amount of content they have already seen, that is clearly unacceptable, so the first requirement of the read service is "high availability". The second requirement is "high performance", because of the high throughput and strict response time requirements. The third point is that the system and the business are constantly evolving and iterating, so the "scalability" of the system is very important; a system that can support the business today but not tomorrow is unacceptable.

Next, we will introduce how to design the system architecture from these three aspects.

2.1 High availability

When we talk about high availability, we accept that failures happen all the time. To make the system highly available, we first need a systematic fault detection mechanism that checks the health of every component; then each component needs a well-designed self-healing mechanism so that it can recover automatically after a failure, without human intervention; finally, we want to isolate the effects of these failures so that the business side is, as far as possible, unaware of their occurrence and recovery.
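As a concrete illustration (a minimal sketch, not the actual Zhihu implementation), a component can expose liveness and readiness endpoints for a platform such as Kubernetes to probe; the paths and port below are assumptions:

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// ready flips to 1 once the component has finished initializing
// (for example, after its cache has warmed up).
var ready int32

func main() {
	// Liveness: the process is up and able to answer HTTP at all.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: report healthy only after initialization, so the platform
	// can route traffic away from (or restart) an unhealthy instance.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if atomic.LoadInt32(&ready) == 1 {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	// ... component initialization would run here, then:
	atomic.StoreInt32(&ready, 1)

	http.ListenAndServe(":8080", nil)
}
```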

2.2 High performance

For a typical system, the heavier the state carried by a core component, the higher the cost of scaling it. Intercepting requests layer by layer, so that as few requests as possible penetrate to the core components, is a very effective way to improve performance. First, we enlarge the amount of data the cluster can cache by splitting the cache into slots. Then, multiple replicas within each slot further increase the read throughput of that slot's cached data set, so that a large number of requests are absorbed and served in the caching layer. If a request inevitably reaches the underlying database component, efficient compression can still reduce the I/O pressure on the physical devices.

2.3 Easy extension

The key to improving system scalability is to reduce the scope of stateful components. With routing and service discovery components, stateless components can be scaled out easily, so expanding the scope of stateless services and shrinking the proportion of heavily stateful services significantly improves the scalability of the whole system. In addition, if we can design weakly stateful services that recover their state from external systems and use them to partially replace heavily stateful components, we can reduce the proportion of heavy state even further. As weak-state components expand and heavy-state components shrink, the scalability of the whole system keeps improving.

2.4 Final architecture of read services

Guided by the design goals of high availability, high performance, and easy extension, we designed and implemented the read service. Figure 8 shows its final architecture.

First, the client API and Proxy at the top are completely stateless components that can be scaled out at any time. At the bottom is TiDB, which stores all of the state data. The components in the middle are weakly stateful; their main body is a layered Redis cache. Besides the Redis cache, there are other external components that cooperate with Redis to guarantee cache consistency, which will be covered in detail in the next chapter.

Looking at the system as a whole: the TiDB layer has its own high availability and can heal itself; the stateless components are very easy to scale out; and the weakly stateful components can recover their state from the data saved in TiDB, so they too can heal themselves after a failure. The components responsible for maintaining cache consistency are themselves stateless. Therefore, on the premise that every component has self-healing capability and global fault monitoring is in place, we use Kubernetes to manage the whole system, which mechanically guarantees the high availability of the entire service.

3. Key components

3.1 Proxy

The Proxy layer is stateless, similar in design to common Redis proxies, and very simple from an implementation perspective. First, we split the cache into several slots along the user dimension, and each slot has multiple cache replicas. These replicas improve the availability of the whole system on the one hand, and share the load of reading the same batch of data on the other. How do we keep the cache replicas consistent? We chose "session consistency": a user coming through the same entry point is bound to one replica of the slot for a period of time, and the session stays on that replica as long as no failure occurs.

If one replica in a slot fails, the Proxy first picks another replica in the same slot to keep serving. In more extreme cases, for example when all replicas of a slot fail, the Proxy can sacrifice performance by sending the request to a completely unrelated slot, which holds no cached data for the current request and does not cache the results it receives. The reward for this performance cost is higher availability: the service stays up even if all replicas of a slot fail at the same time.
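A minimal sketch of this routing behavior in Go; the slot count, hashing scheme, and replica addresses are illustrative assumptions, not the service's actual configuration:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const slotCount = 64 // illustrative; the real slot count is a deployment choice

type Replica struct {
	Addr    string
	Healthy bool
}

type Slot struct {
	Replicas []Replica
}

// slotOf maps a user to a fixed slot so the same user's data is cached together.
func slotOf(userID string) int {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return int(h.Sum32()) % slotCount
}

// pickReplica implements "session consistency": a session sticks to one replica
// of the user's slot and only moves to another replica if that one fails.
func pickReplica(slots []Slot, userID string, sessionSeed uint32) (string, bool) {
	s := slots[slotOf(userID)]
	n := len(s.Replicas)
	for i := 0; i < n; i++ {
		r := s.Replicas[(int(sessionSeed)+i)%n]
		if r.Healthy {
			return r.Addr, true
		}
	}
	// All replicas of this slot are down: the caller can fall back to an
	// unrelated slot (serving without caching) to trade performance for availability.
	return "", false
}

func main() {
	slots := make([]Slot, slotCount)
	for i := range slots {
		slots[i] = Slot{Replicas: []Replica{
			{Addr: "cache-a:6379", Healthy: true},
			{Addr: "cache-b:6379", Healthy: true},
		}}
	}
	addr, ok := pickReplica(slots, "user-42", 7)
	fmt.Println(addr, ok)
}
```

In a real deployment, the health flags would be driven by the fault detection mechanism described in section 2.1.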

3.2 Cache

One of the most important aspects of caching is how to improve cache utilization.

The first question is how to cache a larger amount of data with the same resources. In the space composed of "user", "content type", and "content", the cardinalities of the "user" dimension and the "content" dimension are both very high, in the hundreds of millions. Even with a record count on the order of trillions, the data distribution in this three-dimensional space is very sparse, as shown in the left half of Figure 10.

Considering the huge amount of content accumulated on Zhihu, we can tolerate false positives and still recall enough content that users may be interested in. Based on this business characteristic, we transform the original data stored in the database into a much denser BloomFilter cache, which greatly reduces memory consumption, allows more data to be cached with the same resources, and improves the cache hit ratio.
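To make the idea concrete, here is a minimal Bloom filter sketch in Go; the sizing (bit count, hash count) and the "content type:content ID" key layout are assumptions for illustration, not the production parameters:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// BloomFilter is a fixed-size bit array with k hash positions derived by
// double hashing. It can return false positives but never false negatives,
// which matches the read-filtering business: over-filtering a little is acceptable.
type BloomFilter struct {
	bits []uint64
	m    uint64 // number of bits
	k    uint64 // number of hash positions per key
}

func NewBloomFilter(m, k uint64) *BloomFilter {
	return &BloomFilter{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

func (b *BloomFilter) hashes(key string) (uint64, uint64) {
	h := fnv.New64a()
	h.Write([]byte(key))
	h1 := h.Sum64()
	h2 := h1>>33 | h1<<31 // cheap second hash derived from the first
	return h1, h2
}

func (b *BloomFilter) Add(key string) {
	h1, h2 := b.hashes(key)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		b.bits[pos/64] |= 1 << (pos % 64)
	}
}

func (b *BloomFilter) MayContain(key string) bool {
	h1, h2 := b.hashes(key)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		if b.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	// One filter per user, keyed by "contentType:contentID" (illustrative layout).
	f := NewBloomFilter(1<<16, 7)
	f.Add("answer:12345")
	fmt.Println(f.MayContain("answer:12345")) // true
	fmt.Println(f.MayContain("answer:99999")) // almost certainly false
}
```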

There are many ways to increase the cache hit ratio. Besides increasing the density of the cached data as described above, which increases the amount of data that can be cached, we can further improve cache efficiency by avoiding unnecessary cache invalidation.

On the one hand, we designed the cache as a write-through cache that updates entries in place instead of invalidating them on writes. Together with data change subscription, this ensures that the multiple cache copies of the same data converge within a very short time, without ever invalidating the cache.
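A small sketch of the write-through idea under simplified interfaces; the Store and Cache types below are stand-ins, not the service's real API:

```go
package main

import "fmt"

// Store and Cache are simplified stand-ins for the database and the cache.
type Store interface{ Insert(userID, contentID string) error }
type Cache interface{ AddRead(userID, contentID string) error }

// RecordRead writes through: it persists the read record first, then updates
// the cached state in place instead of invalidating it, so the cache entry
// for this user never has to be rebuilt from the database.
func RecordRead(s Store, c Cache, userID, contentID string) error {
	if err := s.Insert(userID, contentID); err != nil {
		return err
	}
	// Other cache replicas converge via the data change subscription described
	// above; this call only updates the replica the session is bound to.
	return c.AddRead(userID, contentID)
}

type memStore struct{}

func (memStore) Insert(u, c string) error { fmt.Println("persist", u, c); return nil }

type memCache struct{}

func (memCache) AddRead(u, c string) error { fmt.Println("cache", u, c); return nil }

func main() { _ = RecordRead(memStore{}, memCache{}, "user-42", "answer:12345") }
```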

On the other hand, thanks to the read-through design, multiple concurrent queries for the same data collapse into one cache miss plus multiple cache reads (the right part of Figure 11), which further improves the cache hit ratio and reduces the pressure that penetrates to the underlying database.
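This "many concurrent misses collapse into one load" behavior is essentially what Go's golang.org/x/sync/singleflight package provides; the loader below is a placeholder for the real database query:

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// loadFromDB stands in for the real query against the underlying database.
func loadFromDB(userID string) ([]string, error) {
	fmt.Println("database hit for", userID)
	return []string{"answer:12345"}, nil
}

// ReadThrough deduplicates concurrent cache misses for the same user: while
// one load is in flight, all other callers wait and share its result.
func ReadThrough(userID string) ([]string, error) {
	v, err, _ := group.Do(userID, func() (interface{}, error) {
		return loadFromDB(userID)
	})
	if err != nil {
		return nil, err
	}
	return v.([]string), nil
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); ReadThrough("user-42") }()
	}
	wg.Wait()
}
```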

Let me also share a few things that are not only about cache utilization. As is well known, a cold cache is dangerous: when the cache goes cold, a large number of requests instantly fall through to the database, which is then likely to fail. When scaling out or iterating the system, we often need to add new cache nodes, so how do we warm them up? If we can control the pace of events such as scale-out or rolling upgrades, we can ramp traffic up slowly and let a new cache node warm up gradually; but when a failure occurs, we want the node to warm up very quickly. Therefore, one difference between our system and other caching systems is that when a new node starts with a cold cache, it immediately transfers the active cache state from a nearby peer, warms up very quickly, and provides online service in a warmed-up state (Figure 12).

In addition, we can design layered caches, and each layer can use different strategies to handle problems at different levels. As shown in Figure 13, L1 and L2 can be used to handle data hotness at the spatial level and at the temporal level respectively. Multiple cache layers reduce, layer by layer, the number of requests that reach the next layer, which matters especially in cross-datacenter deployments where bandwidth and latency are costly: with a hierarchical design, we can place another cache layer between data centers to cut the number of requests sent to the remote data center.
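A minimal sketch of the layered lookup, with L1 as a small hot layer and L2 standing in for a larger shared layer; both layers here are toy in-memory maps rather than real cache clients:

```go
package main

import "fmt"

type Layer interface {
	Get(key string) (string, bool)
	Set(key, value string)
}

type mapLayer struct{ m map[string]string }

func newMapLayer() *mapLayer                    { return &mapLayer{m: map[string]string{}} }
func (l *mapLayer) Get(k string) (string, bool) { v, ok := l.m[k]; return v, ok }
func (l *mapLayer) Set(k, v string)             { l.m[k] = v }

// Get walks the layers from hottest to coldest, falls back to the loader on a
// full miss, and back-fills every layer on the way out so the next request
// stops earlier in the hierarchy.
func Get(layers []Layer, key string, load func(string) string) string {
	for i, layer := range layers {
		if v, ok := layer.Get(key); ok {
			for j := 0; j < i; j++ { // back-fill the hotter layers
				layers[j].Set(key, v)
			}
			return v
		}
	}
	v := load(key)
	for _, layer := range layers {
		layer.Set(key, v)
	}
	return v
}

func main() {
	l1, l2 := newMapLayer(), newMapLayer()
	v := Get([]Layer{l1, l2}, "user-42", func(k string) string { return "loaded:" + k })
	fmt.Println(v)
}
```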

To ensure that different businesses do not interfere with each other and that each caching strategy matches the data access characteristics of its business, we further provide a cache tag isolation mechanism to separate offline writes and queries across multiple business tenants. As mentioned earlier, the read service data not only serves the home page but also serves personalized push. Personalized push is a typical offline task that checks whether users have already seen the content before pushing it. Although the two businesses access the same data, their access characteristics and hot spots are completely different, so the appropriate caching strategies also differ. We therefore isolate cache nodes with tags (Figure 14): different businesses use different cache nodes, and different cache nodes use different caching strategies, which achieves a higher return on investment while also isolating tenants and preventing them from affecting each other.

3.3 Storage

For storage, we initially used MySQL. Such a large amount of data obviously cannot fit on a single machine, so we used sharding across databases and tables plus MHA to improve performance and guarantee high availability. This held up while traffic was not too large, but with a hundred billion new records arriving every month we grew increasingly uneasy and started thinking about how to keep the system sustainable and maintainable, so we began evaluating alternatives. At that point we found that TiDB is compatible with MySQL, which was a very attractive property for us and carried very little risk, so we started the migration. Once the migration was complete, the system's weakest point, extensibility, was resolved.

3.4 Performance Specifications

The whole system is now highly available and easy to extend, and its performance has also improved. Figure 16 shows the performance figures I pulled two days ago. At present the read service handles 40,000 record writes per second, 30,000 independent queries per second, and 12 million document checks per second. Under this pressure, the P99 and P999 response times of the read service stay stable at 25ms and 50ms; the average is in fact much lower. The point is that the read service is very sensitive to the long tail and its response time must be very stable, because no single user's experience can be sacrificed: a timeout for one user is still a timeout.

4. All about TiDB

Finally, I will share some of the difficulties we encountered in migrating from MySQL to TiDB, how we resolved them, and what dividends we have reaped from this rapidly iterating product since TiDB 3.0 was released.

4.1 MySQL to TiDB

The ecosystem of tools for migrating data into TiDB is now quite complete. We used TiDB DM to collect MySQL's incremental binlog and save it first, then used TiDB Lightning to quickly import the historical data into TiDB, which was about 1.1 trillion records at the time. The import took four days, which is quite impressive, because writing the data in logically would have taken at least a month. And four days is not the lower bound: our hardware resources were not particularly abundant at the time, so we imported with one group of machines, finishing one batch of data before starting the next. With enough hardware the import can be even faster; the more resources you invest, the better the result. After all the historical data was imported, we enabled TiDB DM's incremental synchronization, which automatically applies the saved incremental data and new real-time changes to TiDB and keeps TiDB consistent with MySQL in near real time.

After the migration was complete, we started testing with a small amount of read traffic. Right after going live we found latency problems, and this business is very sensitive to latency. PingCAP engineers then worked with us to keep tuning and adapting and resolved the latency issues. Figure 18 summarizes the key lessons.

First, to prevent other queries on the same TiDB instances from affecting latency-sensitive queries, we use dedicated TiDB instances to isolate the latency-sensitive workload. Second, the execution plans chosen for some queries were not particularly good, so we added SQL hints to help the execution engine choose more reasonable plans. In addition, we made some finer-grained optimizations, such as using low-precision TSO and re-using prepared statements to further reduce network round trips, which achieved good results.
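For reference, re-using a prepared statement through Go's database/sql looks roughly like the sketch below; the table schema, index name, and DSN are illustrative, not the service's real SQL:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // TiDB speaks the MySQL protocol
)

func main() {
	db, err := sql.Open("mysql", "user:password@tcp(tidb:4000)/read_service")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Prepare once and re-use the statement; with a MySQL-protocol driver this
	// avoids re-sending the statement text and an extra prepare round trip per
	// query. The USE INDEX hint pins the access path so the optimizer cannot
	// pick a worse plan under load (table and index names are illustrative).
	stmt, err := db.Prepare(
		"SELECT content_id FROM read_records USE INDEX (idx_user_content) " +
			"WHERE user_id = ? AND content_id IN (?, ?, ?)")
	if err != nil {
		log.Fatal(err)
	}
	defer stmt.Close()

	rows, err := stmt.Query(42, 101, 102, 103)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var contentID int64
		if err := rows.Scan(&contentID); err != nil {
			log.Fatal(err)
		}
		log.Println("already read:", contentID)
	}
}
```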

We also did some development work along the way, such as binlog adaptation. Binlog is especially important because the system relies on binlog-driven change propagation to keep the cache replicas consistent. We needed to switch from MySQL binlog to TiDB Binlog, and we ran into some problems in the process. As a single database product, TiDB's binlog must maintain a global ordering, but in our previous sharded setup we never needed that guarantee, so we made some adjustments and changed the binlog so that it can be split by database or table. This relieved the burden of global ordering, and the binlog throughput then met our requirements. Meanwhile, PingCAP has made many optimizations to Drainer, and Drainer is now in a much better state than it was a month or two ago: both its throughput and its latency meet our current online requirements.

The last lesson is about resource assessment, which is probably something we did not do particularly well at the time. At first we did not think hard enough about how many resources would be needed to support the same data. The Raft protocol used by TiDB requires at least three replicas, so enough resources must be prepared; you cannot expect to support the same business with the same amount of hardware as before. In addition, our business uses a very large composite primary key, which is not a clustered index in TiDB, so the data is larger and needs more machine resources. Finally, because TiDB separates storage and computing, the network environment must also be ready. Once these resources are in place, the final benefits are very clear.

4.2 TiDB 3.0

At Zhihu, we use the same technical architecture as the read service to support a set of risk control services for anti-cheating. Unlike the read service with its extreme volume of historical data, the anti-cheat service has an even more extreme write throughput, but only needs to query data written in the last 48 hours online (see Figure 20 for a detailed comparison).

So what does the release of TiDB 3.0 bring to our two businesses, especially the anti-cheating business?

Let's look at the read service first. The write and read throughput of the read service is not small, around 40K+ per second, and TiDB 3.0's gRPC Batch Message and multi-threaded Raft Store help a lot here. Earlier we mentioned writing SQL hints to make sure queries select good execution plans; with Plan Management in TiDB 3.0 we can bind execution plans to queries without touching the business SQL. Plan Management is a very useful feature.

Just now, Ma Xiaoyu introduced TiFlash in detail. When I first heard about this product at TiDB DevCon 2019, I was shocked. You can imagine how much value could be mined from more than a trillion records. However, with such high write throughput and such a large full data set, it used to be hard to synchronize the data to Hadoop every day for analysis with traditional ETL at a feasible cost. With TiFlash, everything becomes possible.

The anti-cheat business pushes the requirements even further. TiDB 3.0's gRPC Batch Message and multi-threaded Raft Store features allow us to achieve the same effect with a lower hardware configuration, and TiDB 3.0 also includes a new storage engine, Titan, which addresses this write pressure. We introduced TiDB 3.0 into the production environment starting from 3.0.0-rc.1 and enabled the Titan storage engine shortly after rc.2 was released. The right part of the figure shows the write and query latency comparison before and after Titan was enabled; when we saw this chart we were very, very shocked.

In addition, we also use the Table Partition feature of TiDB 3.0. By partitioning tables along the time dimension, queries can be restricted to the most recent partitions, which greatly improves query timeliness.
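A rough sketch of what time-based range partitioning can look like, issued over the same MySQL-protocol connection; the table, columns, and partition boundaries are illustrative, and the sketch partitions on a precomputed integer day column to keep the example simple:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:password@tcp(tidb:4000)/antispam")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Range-partition on an integer "day" column so queries for the last 48
	// hours only touch the newest partitions, and old partitions can be dropped
	// cheaply. Table and partition names are illustrative.
	_, err = db.Exec(`
CREATE TABLE IF NOT EXISTS events (
    id      BIGINT NOT NULL,
    day_id  INT NOT NULL,
    payload VARCHAR(255),
    PRIMARY KEY (day_id, id)
) PARTITION BY RANGE (day_id) (
    PARTITION p20190701 VALUES LESS THAN (20190702),
    PARTITION p20190702 VALUES LESS THAN (20190703),
    PARTITION p20190703 VALUES LESS THAN (20190704)
)`)
	if err != nil {
		log.Fatal(err)
	}
}
```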

5. Summary

Finally, a brief summary of what we gained and learned from developing this system and migrating it to TiDB.

First of all, before developing any system we must understand the business characteristics, so that we can design a more sustainable support scheme. At the same time, we hope the architecture has general applicability, just like the architecture of the read service, which besides supporting the Zhihu home page also supports the anti-cheating business.

In addition, we use open source software extensively. We not only use it, but also participate in its development to some extent, and we have learned a lot in the process. So we should not only participate in the community as users, but also contribute back to the community and make TiDB better and stronger together.

Last but not least, the design of our business system may look a bit complicated, but from today's Cloud Native perspective, we hope that even a business system can provide high availability, high performance, and easy extension just like Cloud Native products. We should also embrace new technologies with an open mind when building business systems: Cloud Native from the ground up.

More TiDB user practices: pingcap.com/cases-cn/