Preface
Zhihu, whose name in classical Chinese means “do you know?”, is China’s answer to Quora: a question-and-answer website in which questions are created, answered, edited, and organized by a community of users.
As the largest knowledge-sharing platform in China, we currently have 220 million registered users, 30 million questions, and more than 130 million answers on our website.
As the user base grew, the data size of our application became difficult to manage. Our Moneta application stores about 1.3 trillion rows of data (records of posts that users have already read).
With about 100 billion rows of data being generated every month and growing, that number could reach 3 trillion within two years. We face a tough challenge in scaling the back end while maintaining a good user experience.
In this article, I’ll delve into how we maintain millisecond-level query response times over this much data, and how TiDB, an open source, MySQL-compatible NewSQL hybrid transactional/analytical processing (HTAP) database, enables us to gain real-time insight into our data.
I’ll cover why we chose TiDB, how we used it, what we learned, good practices and some ideas for the future.
Our pain point
This section describes the architecture of our Moneta application, the ideal architecture we tried to build, and database scalability as our main difficulty.
System Architecture Requirements
Zhihu’s Post Feed service is a key system through which users can receive content posted on the site.
The Moneta app on the back end stores posts that users have read and filters them out from the stream of posts on Zhihu’s recommendation page.
The Moneta application has the following characteristics:
1. High data availability is required
The Post Feed is the first screen to appear, and it plays an important role in driving user traffic to Zhihu.
2. Processing large data writes
For example, at peak times more than 40,000 records are written per second, and the number of records grows by nearly 3 billion per day.
3. Store historical data for a long time
Currently, about 1.3 trillion records are stored in the system. With about 100 billion records accumulated every month and growing, historical data will reach 3 trillion records in about two years.
4. Processes high-throughput queries
At peak times, the system processes queries that check an average of 12 million posts per second.
5. Limit the query response time to 90 milliseconds or less
This limit must hold even for the long-tail queries that take the longest to execute.
6. Tolerate false positives
This means that the system can pull up a lot of interesting posts for users, even if some posts are incorrectly filtered out.
Given these facts, we need an application architecture that has the following capabilities:
1. High availability
When users open Zhihu’s recommendation page and find a large number of posts they have already read, it is a bad user experience.
2. Excellent system performance
Our application has high throughput and strict response time requirements.
3. Easy to expand
As the business grows and the applications grow, we want our systems to scale easily.
Exploration
To build an ideal architecture with these capabilities, we integrated three key components into the previous architecture:
1. The proxy
This forwards the user’s request to the available nodes and ensures high availability of the system.
2. The cache
This temporarily handles requests in memory, so we don’t always need to handle requests in the database. This improves system performance.
3. Storage
Before using TiDB, we managed our business data on a standalone MySQL instance. As the amount of data exploded, a standalone MySQL system was no longer enough. We then adopted a solution of MySQL sharding plus Master High Availability Manager (MHA), but even that solution could not cope when 100 billion new records began flooding into our database every month.
Disadvantages of MySQL Sharding and MHA
MySQL sharding and MHA are not a good long-term solution, because each has its own disadvantages.
Disadvantages of MySQL sharding:
1. Application code becomes complex and difficult to maintain (see the sketch below).
2. Changing an existing sharding key is cumbersome.
3. Upgrading the application logic affects the availability of the application.
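To make the first point concrete, here is a minimal sketch of the kind of shard-routing logic that MySQL sharding pushes into the application layer. The shard count, table names, and sharding key are hypothetical, not Zhihu’s actual scheme; the point is that every query must go through such a routing step, and changing the sharding key or shard count means rewriting this logic and migrating data.

```python
# A minimal sketch (hypothetical shard count and table names, not Zhihu's
# actual scheme) of the routing logic that MySQL sharding pushes into the
# application layer.
SHARD_COUNT = 64  # changing this means resharding: data must be migrated


def route_table(user_id: int) -> str:
    """Pick the physical table that holds this user's read history."""
    return f"read_history_{user_id % SHARD_COUNT}"  # sharding key: user_id


def insert_sql(user_id: int, post_id: int) -> tuple:
    """Build the INSERT for the right shard; every query needs this step."""
    table = route_table(user_id)
    return (f"INSERT INTO {table} (user_id, post_id) VALUES (%s, %s)",
            (user_id, post_id))


# Example: user 1234567 lands in read_history_7 (1234567 % 64 == 7).
print(insert_sql(1234567, 42))
```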
MHA disadvantages:
1. We need to write scripts or use third-party tools to implement the virtual IP (VIP) configuration.
2. MHA monitors only the primary database.
3. To configure MHA, we need to configure passwordless Secure Shell (SSH) access, which can lead to potential security risks.
4. MHA does not provide read load balancing for slave servers.
5. MHA can only monitor whether the primary server (not the slave servers) is available.
Until we discovered TiDB and migrated data from MySQL to TiDB, database scalability remained a system-wide weakness.
What is TiDB?
The TiDB platform is a set of components that, when used together, become an HTAP-enabled NewSQL database.
TiDB platform architecture
Inside the TiDB platform, the main components are as follows:
1. The TiDB server is a stateless SQL layer that processes user SQL queries, accesses data in the storage layer, and returns the corresponding results to the application. It is MySQL compatible and sits on top of TiKV.
2. The TiKV server is a distributed transactional key-value storage layer where the data persists. It uses the Raft consensus protocol for replication to ensure strong data consistency and high availability.
3. The TiSpark cluster also sits on top of TiKV. It is an Apache Spark plug-in that works with the TiDB platform to support complex online analytical processing (OLAP) queries for business intelligence (BI) analysts and data scientists.
4. Placement Driver (PD) servers form a metadata cluster, backed by etcd, that manages and schedules TiKV.
In addition to these major components, TiDB also has an ecosystem of tools, such as Ansible scripts for rapid deployment, Syncer and TiDB Data Migration for migrating from MySQL, and TiDB Binlog, which collects the logical changes made to a TiDB cluster and provides incremental backup and replication to downstream systems (TiDB, Kafka, or MySQL).
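Because the TiDB server speaks the MySQL wire protocol, applications can connect to it with an ordinary MySQL client library. The following is a minimal sketch using the pymysql driver; the host, credentials, and table are hypothetical, and the point is only that no TiDB-specific client is needed.

```python
import pymysql  # any MySQL driver works, because TiDB is MySQL compatible

# Hypothetical connection details; a TiDB server listens on port 4000 by default.
conn = pymysql.connect(host="tidb.example.internal", port=4000,
                       user="moneta", password="******", database="moneta")

with conn.cursor() as cur:
    # Ordinary SQL: the stateless TiDB server parses the query and reads the
    # rows from the TiKV storage layer.
    cur.execute("SELECT post_id FROM read_history WHERE user_id = %s LIMIT 10",
                (1234567,))
    print(cur.fetchall())

conn.close()
```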
The main functions of TiDB include:
1. Horizontal scalability.
2. MySQL-compatible syntax.
3. Distributed transactions with strong consistency.
4. Cloud native architecture.
5. Use HTAP for minimal extract, transform, and load (ETL).
6. Fault tolerance and Raft recovery.
7. Online architecture changes.
How we use TiDB
In this section, I’ll show you how TiDB runs in the Moneta application’s architecture and the performance metrics of the Moneta application.
TiDB in our architecture
TiDB architecture in Zhihu’s Moneta application
We deployed TiDB to the system, and the overall architecture of the Moneta application became:
1. Top layer: stateless and scalable client APIs and proxies. These components are easy to scale out.
2. Middle layer: soft-state components and a layered Redis cache as the main part. When a service is interrupted, these components can self-recover by restoring data from the TiDB cluster.
3. Bottom layer: the TiDB cluster stores all the stateful data. Its components are highly available, and it can self-recover its services if a node crashes.
In this system, all components are self-recovering, and the whole system has a global failure-monitoring mechanism. We then use Kubernetes to orchestrate the entire system and ensure high availability of the whole service.
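As a rough illustration of the three layers, here is a hedged sketch of the read path they imply: a stateless API checks the Redis cache first and falls back to the TiDB cluster on a miss, repopulating the cache afterwards. The key scheme, TTL, and table layout are hypothetical.

```python
import json

import pymysql
import redis

# Hypothetical endpoints for the cache (middle layer) and TiDB (bottom layer).
cache = redis.Redis(host="redis.example.internal", port=6379)
tidb = pymysql.connect(host="tidb.example.internal", port=4000,
                       user="moneta", password="******", database="moneta")


def read_post_ids(user_id: int) -> list:
    """Return the post IDs a user has read: cache first, TiDB on a miss."""
    key = f"read:{user_id}"                        # hypothetical key scheme
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                  # served from memory
    with tidb.cursor() as cur:                     # fall back to stateful data
        cur.execute("SELECT post_id FROM read_history WHERE user_id = %s",
                    (user_id,))
        post_ids = [row[0] for row in cur.fetchall()]
    cache.set(key, json.dumps(post_ids), ex=300)   # rebuild the soft state
    return post_ids
```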
Performance indicators of TiDB
After we put TiDB into the production environment, our system became highly available and easy to scale, and system performance improved significantly. As an example, here is a set of performance metrics for the Moneta application in June 2019.
Write 40,000 rows per second at peak times:
Rows of data written per second (thousands)
Handling 30,000 queries and checking 12 million posts per second at peak times:
Queries per second (thousands)
The 99th percentile response time is about 25 milliseconds, and the 99.9th percentile response time is about 50 milliseconds. In fact, the average response time is much lower than these numbers, even for the long-tail queries that require stable response times.
99th percentile response time
99.9th percentile response time
What we have learned
Our migration to TiDB was not entirely smooth, and we’d like to share some lessons learned here.
Faster data import
We used TiDB Data Migration (DM) to collect MySQL incremental Binlog files and then used TiDB Lightning to quickly import the data into the TiDB cluster.
To our surprise, it took only four days to import these 1.1 trillion records into TiDB. If we had written the data into the system logically, it could have taken a month or more. With more hardware resources, we could have imported the data even faster.
Reduce query latency
After the migration, we tested a small amount of read traffic. When the Moneta application first went live, we found that the query latency did not meet our requirements. To solve this latency problem, we worked with PingCAP engineers to tune system performance.
In this process, we accumulated valuable knowledge about our data and data processing:
1. Some queries are sensitive to query latency and others are not. We deployed a separate TiDB database to handle the latency-sensitive queries. (Other, non-latency-sensitive queries are processed in different TiDB databases.)
2. In this way, large queries and latency-sensitive queries are processed in different databases, and the execution of the former does not affect the latter.
3. For queries that do not have an ideal execution plan, we write SQL hints to help the execution engine select an optimal execution plan.
4. We use low-precision Timestamp Oracle (TSO) and prepared statements to reduce network round-trips (see the sketch below).
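As a hedged illustration of points 3 and 4, the sketch below sets TiDB’s `tidb_low_resolution_tso` session variable (the low-precision TSO mentioned above) and pins an index for a query whose default plan is assumed to be suboptimal. The table, index, and filter values are hypothetical; the exact hints and variables Zhihu tuned are not spelled out in this article.

```python
import pymysql

conn = pymysql.connect(host="tidb.example.internal", port=4000,
                       user="moneta", password="******", database="moneta")

with conn.cursor() as cur:
    # Low-precision TSO: reuse a slightly stale timestamp for reads instead of
    # fetching a fresh one from PD for every query, saving a network round-trip.
    cur.execute("SET SESSION tidb_low_resolution_tso = 1")

    # An index hint for a query whose default execution plan is not ideal.
    # Table, index, and values are hypothetical.
    cur.execute(
        "SELECT post_id FROM read_history USE INDEX (idx_user_created) "
        "WHERE user_id = %s AND created_at > %s",
        (1234567, "2019-06-01"))
    rows = cur.fetchall()

conn.close()
```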
Assessment of resources
Before we tried TiDB, we did not analyze how much hardware resources we would need to support the same amount of data on the MySQL side.
To reduce maintenance costs, we had deployed MySQL in a single-master, single-slave topology. In contrast, the Raft protocol implemented in TiDB requires at least three replicas.
Therefore, we needed more hardware resources to support the business data in TiDB, and we had to prepare machine resources ahead of time.
Once our data center was set up properly, we could quickly complete our evaluation of TiDB.
Expectations for TiDB 3.0
At Zhihu, the anti-spam application has the same architecture as the Moneta application. We tried Titan and Table Partition from the TiDB 3.0 release candidates (TiDB 3.0.0-rc.1 and TiDB 3.0.0-rc.2) in the anti-spam application with production data.
① Titan reduces latency
The anti-spam application had long been plagued by severe query and write latency.
We had heard that TiDB 3.0 would introduce Titan, a key-value storage engine designed to reduce write amplification in RocksDB (the underlying storage engine in TiKV) when large values are used. To try this feature out, we enabled Titan after TiDB 3.0.0-rc.2 was released.
The following figure shows the write and query latency compared to RocksDB and Titan, respectively:
Write and query delays in RocksDB and Titan
The statistics show that both write and query latency dropped dramatically after we enabled Titan. It was amazing; when we saw the numbers, we could hardly believe our eyes.
② Table partitioning improves query performance
We also used the table partitioning feature of TiDB 3.0 in our anti-spam application. With this feature, we can divide a table into multiple partitions by time.
When a query arrives, it is executed only on the partitions that cover the target time range. This greatly improved our query performance.
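Here is a minimal sketch of what time-based range partitioning can look like in MySQL-compatible DDL; the table, columns, and daily partition boundaries are hypothetical and not the anti-spam application’s real schema. A query whose filter covers only the last 48 hours then touches just the newest partitions.

```python
import pymysql

conn = pymysql.connect(host="tidb.example.internal", port=4000,
                       user="antispam", password="******", database="antispam")

# Hypothetical schema: rows are range-partitioned by day, so queries over the
# last 48 hours only touch the two or three newest partitions.
ddl = """
CREATE TABLE IF NOT EXISTS spam_events (
    id         BIGINT NOT NULL,
    user_id    BIGINT NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p20190617 VALUES LESS THAN (TO_DAYS('2019-06-18')),
    PARTITION p20190618 VALUES LESS THAN (TO_DAYS('2019-06-19')),
    PARTITION p20190619 VALUES LESS THAN (TO_DAYS('2019-06-20'))
)
"""

with conn.cursor() as cur:
    cur.execute(ddl)
    # The time filter lets TiDB prune to the partitions that cover the range.
    cur.execute("SELECT COUNT(*) FROM spam_events "
                "WHERE created_at >= NOW() - INTERVAL 48 HOUR")
    print(cur.fetchone())

conn.close()
```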
Let’s consider what happens if we implement TiDB 3.0 in Moneta and anti-spam applications in the future.
③ TiDB 3.0 in Moneta application
TiDB 3.0 includes features such as gRPC message batching, a multi-threaded Raftstore, SQL plan management, and TiFlash. We believe these will benefit the Moneta application.
④ Batch processing messages in gRPC and multi-threaded Raftstore
Moneta has a write throughput of over 40,000 transactions per second (TPS). TiDB 3.0 can send and receive Raft messages in batches, and it can process Region Raft logic in multiple threads. We believe these features will significantly improve the concurrency of our system.
⑤ SQL plan management
As mentioned above, we wrote a number of SQL hints to make the query optimizer select an optimal execution plan.
TiDB 3.0 adds an SQL plan management feature that binds queries to specific execution plans directly on the TiDB server. With this feature, we do not need to modify the query text to inject hints.
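The following is a hedged sketch of what such a binding can look like: the first statement is the query shape as the application sends it, and the USING statement is the same query with an index choice pinned. The table, index, and literal values are hypothetical, and the syntax follows TiDB’s SQL plan management documentation rather than anything specific described in this article.

```python
import pymysql

conn = pymysql.connect(host="tidb.example.internal", port=4000,
                       user="moneta", password="******", database="moneta")

# Bind a query shape to a plan that uses a specific index, so application SQL
# can stay free of hints. Table, index, and literals are hypothetical.
binding = """
CREATE GLOBAL BINDING FOR
    SELECT post_id FROM read_history
    WHERE user_id = 1234567 AND created_at > '2019-06-01'
USING
    SELECT post_id FROM read_history USE INDEX (idx_user_created)
    WHERE user_id = 1234567 AND created_at > '2019-06-01'
"""

with conn.cursor() as cur:
    cur.execute(binding)  # later queries matching this shape use the bound plan

conn.close()
```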
⑥ TiFlash
At TiDB DevCon 2019, I first heard of TiFlash as an extended analysis engine for TiDB.
It uses column-oriented storage to achieve a high data compression rate, and it applies an extended Raft consensus algorithm in data replication to ensure data safety.
Because we have large volumes of data with high write throughput, we cannot use ETL to copy data to Hadoop for analysis on a daily basis. But with TiFlash, we are optimistic that we can easily analyze our huge data volume.
⑦ TiDB 3.0 in anti-spam applications
Compared with the Moneta application and its large volume of historical data, the anti-spam application has much higher write throughput.
However, it only queries data stored in the last 48 hours. In this application, the data grows by about 8 billion records and 1.5 terabytes per day.
Because TiDB 3.0 can batch send and receive Raft messages, and it can process Region Raft logic in multiple threads, we can manage the application with fewer nodes.
Previously, we used seven physical nodes, but now we only need five. Even on commodity hardware, these features improve performance.
What’s next
TiDB is a MySQL compatible database, so we can use it just like we use MySQL.
Thanks to the horizontal scalability of TiDB, we are now free to scale our database even though we have more than a trillion records to cope with.
So far, we’ve used quite a bit of open source software in our applications. We also learned a lot about using TiDB to deal with system problems.
We decided to participate in the development of open source tools and in the long-term development of the community. Based on our joint efforts with PingCAP, TiDB will become even stronger.