Zhihu, which in classical Chinese means “do you know?”, is China’s answer to Quora: a Q&A site where questions are created, answered, edited, and organized by a community of users.

As the largest knowledge sharing platform in China, we currently have 220 million registered users, 30 million questions and over 130 million answers on our website.

As the user base grew, our application data grew rapidly. Our Moneta application stores approximately 1.3 trillion rows of data (posts that users have already read).

With about 100 billion rows of data accruing every month and growing, this number will reach 3 trillion within two years. We face serious challenges in scaling the back end while maintaining a good user experience.

In this article, I will delve into how we maintain millisecond query response times over this much data, and how TiDB, an open source, MySQL-compatible NewSQL Hybrid Transactional/Analytical Processing (HTAP) database, provides us with support for real-time insights into our data.

I’ll cover why we chose TiDB, how we used it, what we learned, good practices and some ideas for the future.

Our pain points

This section describes the architecture of our Moneta application, the ideal architecture we tried to build, and database scalability as our main challenges.

System Architecture Requirements

Zhihu’s Post Feed service is a key system that allows users to receive content posted on the site.

The back-end Moneta application stores posts that users have read and filters them from the stream of posts on Zhihu’s recommendation page.

The Moneta application has the following characteristics:

  • **Highly available data:** The Post Feed is the first screen to appear, and it plays an important role in driving user traffic to Zhihu.

  • **Handles huge writes:** For example, more than 40,000 records are written per second at peak times, adding nearly 3 billion records per day.

  • **Long-term storage of historical data:** Currently, approximately 1.3 trillion records are stored in the system. With about 100 billion records accumulating each month and growing, historical data will reach 3 trillion records in about two years.

  • **Handles high-throughput queries:** During peak times, the system handles queries that check an average of 12 million posts per second.

  • **Limits query response times to 90 milliseconds or less:** This holds even for the long-tail queries that take the longest to execute.

  • **Tolerates false positives:** The system can still surface plenty of interesting posts for users, even if a few posts are mistakenly filtered out.

Given the above facts, we need an application architecture with the following capabilities:

  • **High availability:** It is a bad user experience if users open Zhihu’s recommendation page and find a large number of posts they have already read.

  • **Excellent system performance:** Our application has high throughput and strict response-time requirements.

  • **Easy to scale:** As our business and applications grow, we want our systems to scale easily.

Exploration

To build the ideal architecture with the above capabilities, we integrated three key components into the previous architecture:

  • **Proxy:** This forwards user requests to available nodes and ensures high availability of the system.

  • **Cache:** This serves requests from memory so that not every request has to hit the database, which improves system performance.

  • **Storage:** Before TiDB, we managed our business data on standalone MySQL. As data volumes grew, standalone MySQL was no longer enough.

    We then adopted MySQL sharding with Master High Availability Manager (MHA), but this solution could not cope with 100 billion new records flooding into our database every month.

Disadvantages of MySQL Sharding and MHA

MySQL sharding and MHA are not a good solution because both have their drawbacks.

Disadvantages of MySQL sharding:

  • Application code becomes complex and difficult to maintain.

  • Changing an existing shard key is cumbersome.

  • Upgrading application logic affects the availability of the application.

Disadvantages of MHA:

  • We need to implement virtual IP (VIP) configuration either by scripting or using third-party tools.

  • The MHA monitors only the primary database.

  • To configure MHA, we need to configure passwordless Secure Shell (SSH) access, which can lead to potential security risks.

  • The MHA does not provide read load balancing for slave servers.

  • The MHA can only monitor the availability of the primary (not the secondary) server.

Until we discovered TiDB and migrated data from MySQL to TiDB, database scalability was still a system-wide weakness.

What is TiDB?

The TiDB platform is a set of components that, when used together, become a NewSQL database with HTAP functionality.

TiDB platform architecture

Inside the TiDB platform, the main components are as follows:

  • The TiDB server is a stateless SQL layer that processes the user’s SQL queries, accesses the data in the storage layer, and returns the corresponding results to the application. It is MySQL compatible and sits on top of TiKV.

  • The TiKV server is a distributed transactional key-value storage layer for data persistence. It uses the Raft consensus protocol for replication to ensure strong data consistency and high availability.

  • The TiSpark cluster is also located above TiKV. It is an Apache Spark plug-in that works with the TiDB platform to support complex online analytical processing (OLAP) queries for business intelligence (BI) analysts and data scientists.

  • Placement Driver (PD) servers form the metadata cluster, backed by etcd, that manages and schedules TiKV.

In addition to these major components, TiDB also has an ecosystem of tools, such as Ansible scripts for rapid deployment, Syncer and TiDB Data Migration (DM) for migrating from MySQL, and TiDB Binlog, which collects logical changes to the TiDB cluster and provides incremental backups and replication to downstream systems (TiDB, Kafka, or MySQL).

Key features of TiDB include:

  • Horizontal scalability.

  • MySQL compatible syntax.

  • Distributed transactions with strong consistency.

  • Cloud native architecture.

  • Minimal extract, transform, load (ETL) thanks to HTAP.

  • Fault tolerance and Raft recovery.

  • Online schema changes.

How we use TiDB

In this section, I’ll show how TiDB runs in the Moneta application’s architecture and the performance metrics of the Moneta application.

TiDB in our architecture

TiDB architecture in Zhihu’s Moneta application

After we deployed TiDB, the overall architecture of the Moneta application became:

  • **Top layer:** Stateless and scalable client APIs and proxies. These components are easy to scale out.

  • **Middle layer:** Soft-state components, with a layered Redis cache as the main part. When service is interrupted, these components can restore themselves by reloading data saved in the TiDB cluster.

  • **Bottom layer:** The TiDB cluster stores all stateful data. Its components are highly available, and if a node crashes, the cluster can restore its own services.

In this system, all components are self-recoverable, and the whole system has a global failure-monitoring mechanism. We use Kubernetes to orchestrate the entire system and ensure high availability of the whole service.
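To make the interaction between the middle and bottom layers concrete, here is a minimal sketch of the read path, assuming a Redis cache in front of the TiDB cluster and a hypothetical `read_posts` table; the table, column names, and client libraries (`redis`, `mysql-connector-python`) are illustrative assumptions, not our exact implementation.

```python
import mysql.connector  # TiDB speaks the MySQL protocol, so a standard MySQL driver works
import redis            # the layered Redis cache in the middle layer

cache = redis.Redis(host="redis-host", port=6379)
tidb = mysql.connector.connect(host="tidb-host", port=4000, user="moneta",
                               password="***", database="moneta")

def has_read(user_id: int, post_id: int) -> bool:
    """Check whether a user has already read a post: cache first, TiDB as the source of truth."""
    key = f"read:{user_id}:{post_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached == b"1"
    # Cache miss: fall back to the TiDB cluster, which stores all stateful data.
    cur = tidb.cursor()
    cur.execute("SELECT 1 FROM read_posts WHERE user_id = %s AND post_id = %s LIMIT 1",
                (user_id, post_id))
    seen = cur.fetchone() is not None
    cur.close()
    # Repopulate the cache; after an outage the middle layer rebuilds itself from TiDB this way.
    cache.set(key, b"1" if seen else b"0", ex=3600)
    return seen

def filter_unread(user_id: int, candidate_post_ids):
    """Drop already-read posts from a candidate list for the recommendation page."""
    return [p for p in candidate_post_ids if not has_read(user_id, p)]
```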

Performance metrics of TiDB

After we deployed TiDB in our production environment, our system became highly available and easy to scale, and system performance improved significantly. As an example, here is a set of performance metrics for the Moneta application from June 2019.

Writing 40,000 rows of data per second at peak times:

Rows of data written per second (thousands)

Checking 30,000 queries and 12 million posts per second at peak times:


The 99th percentile response time is about 25 milliseconds, and the 999th percentile response time is about 50 milliseconds. In practice, the average response time is far smaller than these numbers, even for long-tail queries that require a stable response time.

99th percentile response time

The 999th percentile response time

What we have learned

Our migration to TiDB did not go entirely smoothly, and here we would like to share some lessons learned.

Faster data import

We used TiDB Data Migration (DM) to collect MySQL incremental Binlog files, and then used TiDB Lightning to quickly import the data into the TiDB cluster.

To our surprise, it took only four days to import these 1.1 trillion records into TiDB. Writing the data into the system logically would have taken a month or more, and with more hardware resources we could have imported the data even faster.
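A quick back-of-the-envelope calculation, using only the numbers above, puts the sustained import rate at

$$\frac{1.1 \times 10^{12}\ \text{records}}{4 \times 86{,}400\ \text{s}} \approx 3.2 \times 10^{6}\ \text{records per second},$$

roughly 80 times our 40,000-records-per-second peak online write rate, which is why the bulk import path is so much faster than writing the data logically.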

Reduce query latency

After the migration, we tested a small amount of read traffic. When the Moneta application first went online, we found that query latency was not what we wanted. To reduce latency, we worked with PingCAP engineers to tune system performance.

In the process, we accumulated valuable knowledge about our data and data processing:

  • Some queries are sensitive to query latency and others are not. We deployed a separate TiDB database to handle latency-sensitive queries, while other, non-latency-sensitive queries are handled in different TiDB databases.

    In this way, large queries and latency-sensitive queries are processed in separate databases, and the execution of the former does not affect the latter.

  • For queries that do not have an ideal execution plan, we wrote SQL hints to help the execution engine choose a better one.

  • We use low-precision Timestamp Oracle (TSO) reads and prepared statements to reduce network round trips (see the sketch after this list).
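Here is a hedged sketch of what that last point can look like from the client side, assuming the `mysql-connector-python` driver and TiDB’s `tidb_low_resolution_tso` session variable; the table, query, and connection details are illustrative, and low-resolution TSO is only appropriate for queries that can tolerate slightly stale reads.

```python
import mysql.connector  # TiDB is MySQL-compatible, so a standard MySQL driver works

cnx = mysql.connector.connect(host="tidb-host", port=4000, user="moneta",
                              password="***", database="moneta")

# Low-resolution TSO: let this session read with a cached, slightly stale timestamp
# instead of requesting a fresh one from PD for every query, saving a round trip to PD.
setup = cnx.cursor()
setup.execute("SET SESSION tidb_low_resolution_tso = 1")
setup.close()

# Server-side prepared statement: prepared once, executed many times with different
# parameters, avoiding repeated parsing and reducing per-query overhead on the wire.
cur = cnx.cursor(prepared=True)
stmt = "SELECT post_id FROM read_posts WHERE user_id = %s AND read_at > %s LIMIT 100"
for user_id in (1, 2, 3):
    cur.execute(stmt, (user_id, "2019-06-01 00:00:00"))
    rows = cur.fetchall()  # recently read posts for this user (illustrative query)
cur.close()
```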

Assessment of resources

Before we tried TiDB, we had not analyzed how many hardware resources we would need to support the same amount of data we had on the MySQL side.

To reduce maintenance costs, we had deployed MySQL in a single-master, single-slave topology. In contrast, the Raft protocol implemented in TiDB requires at least three replicas.

Therefore, we need more hardware resources to support business data in TiDB, and we need to prepare machine resources in advance.

Once our data center was set up correctly, we could complete our evaluation of TiDB quickly.

Expectations for TiDB 3.0

At Zhihu, the anti-spam and Moneta applications share the same architecture. We tried Titan and Table Partition from the TiDB 3.0 release candidates (TiDB 3.0.0-rc.1 and TiDB 3.0.0-rc.2) in the anti-spam application on production data.

① Titan reduces latency

The anti-spam application has long been plagued by severe query and write latency.

We had heard that TiDB 3.0 would introduce Titan, a key-value storage engine designed to reduce write amplification in RocksDB (the underlying storage engine in TiKV) when large values are used. To try this out, we enabled Titan after TiDB 3.0.0-rc.2 was released.

The following figures show write and query latency in RocksDB and Titan, respectively:

Write and query latency in RocksDB and Titan

Statistics show that both write and query latency dropped dramatically after we enabled Titan. This is amazing! We couldn’t believe our eyes when we saw the statistics.

② Table partitioning improves query performance

We also used TiDB 3.0’s table partitioning capability in our anti-spam application. With this feature, we can split a table into multiple partitions by time.

When a query arrives, it is executed only on the partitions that cover the target time range. This greatly improves our query performance.
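As a rough illustration (not our production schema), a time-partitioned table in TiDB 3.0 can be declared with MySQL-style range partitioning; the table, columns, and partition boundaries below are hypothetical, and the accepted partitioning expressions may vary by TiDB version.

```python
import mysql.connector  # TiDB accepts MySQL-compatible DDL

cnx = mysql.connector.connect(host="tidb-host", port=4000, user="antispam",
                              password="***", database="antispam")
cur = cnx.cursor()

# Hypothetical event table, range-partitioned by day, so a 48-hour query
# only needs to touch the two or three most recent partitions.
cur.execute("""
    CREATE TABLE IF NOT EXISTS spam_events (
        event_id   BIGINT NOT NULL,
        user_id    BIGINT NOT NULL,
        created_at DATETIME NOT NULL
    )
    PARTITION BY RANGE (TO_DAYS(created_at)) (
        PARTITION p20190609 VALUES LESS THAN (TO_DAYS('2019-06-10')),
        PARTITION p20190610 VALUES LESS THAN (TO_DAYS('2019-06-11')),
        PARTITION p20190611 VALUES LESS THAN (TO_DAYS('2019-06-12')),
        PARTITION pmax      VALUES LESS THAN MAXVALUE
    )
""")

# Partition pruning: only the partitions covering the requested time range are scanned.
cur.execute("SELECT COUNT(*) FROM spam_events WHERE created_at >= %s",
            ("2019-06-10 00:00:00",))
print(cur.fetchone()[0])
cur.close()
```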

Let’s consider how we plan to use TiDB 3.0 in the Moneta and anti-spam applications in the future.

③ TiDB 3.0 in the Moneta application

TiDB 3.0 has features such as batch messaging in gRPC, a multi-threaded Raftstore, SQL plan management, and TiFlash. We believe these will greatly benefit the Moneta application.

④ Batch messages in gRPC and multithreaded Raftstore

Moneta has a write throughput of over 40,000 transactions per second (TPS). TiDB 3.0 can batch send and receive Raft messages and can process Region Raft logic in multiple threads. We believe these features will significantly improve the concurrency of our system.

⑤ SQL plan management

As mentioned above, we wrote a number of SQL hints to enable the query optimizer to choose the best execution plan.

TiDB 3.0 adds an SQL plan management capability that binds queries to specific execution plans directly in the TiDB server. With this feature, we do not need to modify the query text to inject hints.
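For illustration, a binding might look roughly like the following, assuming the `CREATE GLOBAL BINDING ... USING ...` syntax introduced in TiDB 3.0 and a hypothetical index `idx_user_time`; treat both the query and the hint as placeholders rather than our real workload.

```python
import mysql.connector

cnx = mysql.connector.connect(host="tidb-host", port=4000, user="moneta",
                              password="***", database="moneta")
cur = cnx.cursor()

# Bind a query shape to a plan that forces a (hypothetical) index.
# Applications keep sending the original, hint-free SQL; TiDB applies the bound plan.
cur.execute("""
    CREATE GLOBAL BINDING FOR
        SELECT post_id FROM read_posts WHERE user_id = 1 AND read_at > '2019-06-01'
    USING
        SELECT post_id FROM read_posts USE INDEX (idx_user_time)
        WHERE user_id = 1 AND read_at > '2019-06-01'
""")

# This unchanged production query now runs with the bound execution plan.
cur.execute("SELECT post_id FROM read_posts WHERE user_id = %s AND read_at > %s",
            (42, "2019-06-01 00:00:00"))
rows = cur.fetchall()
cur.close()
```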

⑥ TiFlash

I first heard about TiFlash as an extended analytics engine for TiDB at TiDB DevCon 2019.

It uses column-oriented storage to achieve high data compression rates and applies an extended Raft consensus algorithm in data replication to ensure data safety.

Since we have massive amounts of data with high write throughput, we cannot use ETL to copy data into Hadoop for analysis every day. But with TiFlash, we are optimistic that we will be able to easily analyze our huge volume of data.

⑦ TiDB 3.0 in the anti-spam application

Compared with the Moneta application’s huge historical data size, the anti-spam application has much higher write throughput.

However, it queries only the data stored in the last 48 hours. In this application, data grows by 8 billion records and 1.5 terabytes per day.
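Using only those figures, the queried working set stays roughly bounded at

$$2\ \text{days} \times 8 \times 10^{9}\ \text{records/day} \approx 1.6 \times 10^{10}\ \text{records} \quad (\approx 2 \times 1.5\ \text{TB} = 3\ \text{TB of raw data}),$$

so even though the write rate is high, the data that queries actually touch stays comparatively small.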

Because TiDB 3.0 can batch send and receive Raft messages and can process Region Raft logic in multiple threads, we can manage this application with fewer nodes.

Before, we used seven physical nodes, but now we need only five. Even though we use commodity hardware, these features improve performance.

What’s next

TiDB is a MySQL compatible database, so we can use it just like MySQL.

Thanks to TiDB’s horizontal scalability, we are now free to scale our database even though we have over a trillion records to cope with.

So far, we’ve used quite a bit of open source software in our applications. We also learned a lot about using TiDB to handle system problems.

We have decided to take part in developing open source tools and to participate in the long-term growth of the community. Based on our joint efforts with PingCAP, TiDB will become even stronger.

Source: itindex.net
