Author: Autohome Technical College – Technical Architecture Group

SQL Server + .NET is the standard technology stack of many early Internet companies. TiDB, as a database compatible with the MySQL protocol and ecosystem, is usually adopted in MySQL-style business scenarios; how the Autohome community migrated seamlessly from SQL Server to TiDB is therefore an innovative demonstration for teams embracing new open source technologies.

This article describes in detail how we migrated from SQL Server to TiDB, covering the business background, migration scheme, data synchronization, business code changes, online results, and surrounding tooling. Whether you are an architect, a business developer, or a DBA, you should find something useful at your own level.

I. Project background

The Autohome community was launched in 2005. As one of Autohome's oldest businesses, it has accumulated hundreds of millions of posts and replies over the past 14 years. Today it has tens of millions of daily active users, hundreds of millions of visits per day, and more than one billion interface calls per day. Over the years it has gone through architecture upgrades, refactoring, and technology stack upgrades, but its data has always been stored in SQL Server. As the data kept growing, we hit so many bottlenecks with SQL Server that we had to look for a replacement database.

II. Bottlenecks encountered in using SQL Server

As the business expands, visits to and posts in the Autohome community keep increasing, and we run into more and more database problems. Here are two that had to be solved quickly:

  1. Historically, the Autohome community reply database has used a sharded design (separate databases and separate tables, partitioned by post ID) to work around the performance degradation of an oversized single table in SQL Server. Today the reply data is spread across 100+ databases and 1000+ tables. The sharding itself works fine: the code routes each write to the right shard and reads it back from the right shard. However, as the application evolved and requirements changed, we found the sharded structure increasingly hard to satisfy some requirements; we need the data to be logically in one table (a sketch of this kind of routing follows the next item).

  2. In recent years, as business growth accelerated, the data volume has grown rapidly. However, hard disk capacity is limited, and the number of disks that can be added to each server is also limited, so every so often we had to add storage servers with ever larger capacity to cope. This was complicated at first, touching many related projects, and even now that we are familiar with it, every server change still demands careful attention; large database servers are also expensive. We need capacity expansion to be invisible to the application.
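To make the first pain point concrete, below is a minimal sketch of the kind of routing logic that such a sharded layout forces into application code. The shard counts, naming, and modulo scheme are assumptions for illustration only, not the actual Autohome implementation.

// Hypothetical post-ID routing for a sharded reply store; counts and names are illustrative.
public class ReplyShardRouter {
    private static final int DB_COUNT = 100;      // assumed number of reply databases
    private static final int TABLE_COUNT = 1000;  // assumed total number of reply tables

    // Every read and write must first compute its physical location from the post ID.
    public String locate(long postId) {
        long dbIndex = postId % DB_COUNT;
        long tableIndex = postId % TABLE_COUNT;
        return String.format("reply_db_%02d.reply_%04d", dbIndex, tableIndex);
    }
}

Cross-shard requirements (for example, listing all replies by one user or sorting globally) are exactly what this layout makes painful, which is why we want the data to behave as a single logical table.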

III. Distributed database research

3.1 Direction Determination

At the end of 2018, the company set up a virtual architecture group to investigate new databases that could solve the problems the Autohome community was facing. After various analyses and tests, early this year we settled on distributed databases as the direction and surveyed three popular options: TiDB (PingCAP), Ignite (ASF-TLP), and CockroachDB. After extensive testing, we finally chose TiDB for the following reasons:

  1. Compatible with the MySQL protocol and ecosystem, so the barrier to entry is low;

  2. We maintain good technical communication with the TiDB team;

  3. PingCAP (the company behind TiDB) is based in Beijing, so any problem can be discussed face to face;

  4. TiDB has a well-designed architecture;

  5. The official community is active and the documentation is thorough;

  6. The vendor's engineers often visit our company for technical exchanges.

Here is TiDB's official description:

TiDB is a converged database product for both online transaction processing and online analytical processing (HTAP: Hybrid Transactional/Analytical Processing). It offers one-click horizontal scaling, strong consistency with multiple replicas for data safety, distributed transactions, real-time OLAP, and other important features. It is compatible with the MySQL protocol and ecosystem, easy to migrate to, and very low in operation and maintenance cost.

It is not hard to see that TiDB addresses exactly the pain points we hit with SQL Server:

  • Horizontal scaling: nodes can be added to the cluster at any time, and replacing a node is easy.
  • Massive data support: based on its architecture and industry experience, hundreds of millions or even billions of rows can be handled with ease.
  • High availability: compared with SQL Server's master-slave mode, TiDB is based on the Raft protocol, which keeps data strongly consistent and recovers from failures automatically as long as a majority of replicas are available.
  • HTAP: TiDB itself supports OLAP scenarios to a certain degree, and more complex OLAP analysis can be done through the TiSpark project. We are also steadily moving toward deeper OLAP applications.

3.2 True knowledge comes from practice

With this theoretical backing, we carried out extensive functional tests, performance tests, failure tests, business integration tests, and so on.

  1. OLTP test: we tested with 20 million rows of data and 500 concurrent threads. In this OLTP scenario, 99% of TiDB responses came back in under 16 ms, which meets the service requirement; and as the data volume grows, TiDB shows an even greater advantage. In the future, TiDB/PD/TiKV nodes can be added to further improve read and write performance, as shown in the figure below:

  2. OLAP test: in a 50 GB TPC-H test, TiDB showed a large speed advantage over MySQL:

    TPC Benchmark™ H (TPC-H) is a decision support benchmark. It consists of a suite of business-oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have broad industry-wide relevance. The benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.

  3. Failure test: we tested how PD and TiKV behave during abnormal downtime; the impact on the service was small, and automatic fault recovery worked as expected.

IV. Migration scheme

4.1 Problems to Be Solved Before Migration

There are some practical issues that need to be addressed before real data migration can take place:

  • Some field types differ between SQL Server and TiDB. Consulting the relevant documentation, we mapped the differing types to each other before creating the tables in TiDB, for example the precision of DATETIME (see the mapping sketch after this list).

  • During synchronization, the sharded databases and tables have to be merged into one table. Fortunately, in the original design every row carries, besides the auto-increment primary key ID, a business ID that is unique across all the tables, which saved us a great deal of work.

  • How do we guarantee data consistency during the one-off import of the huge full data set and during the subsequent incremental synchronization?

  • If TiDB hits an unpredictable problem in production that cannot be resolved quickly, we must be able to switch back to SQL Server immediately so the business is not affected. In other words, the data generated in TiDB needs to be synchronized back to SQL Server in real time.

  • Because traffic is heavy, the switchover time must be kept within seconds.

  • Because SQL Server is a commercial database, there are few ready-made solutions for synchronizing its data with an open source database, so the synchronization scheme, architecture design, development, and testing all had to be done by ourselves.
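As an illustration of the first point above, the sketch below shows the kind of type mapping we mean. These are common SQL Server to MySQL/TiDB correspondences given for illustration; the exact choices, especially around precision, should be confirmed against the official documentation for the versions in use.

import java.util.Map;

// Hypothetical type-mapping table consulted when generating TiDB DDL from SQL Server schemas.
// The pairs below are common correspondences, not an exhaustive or authoritative list.
public class TypeMapping {
    static final Map<String, String> SQLSERVER_TO_TIDB = Map.of(
        "BIT", "TINYINT(1)",
        "DATETIME", "DATETIME(3)",        // SQL Server DATETIME has roughly 3 ms precision
        "DATETIME2", "DATETIME(6)",       // MySQL/TiDB DATETIME supports at most 6 fractional digits
        "NVARCHAR", "VARCHAR",
        "UNIQUEIDENTIFIER", "CHAR(36)",
        "MONEY", "DECIMAL(19,4)"
    );
}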

4.2 Overall Migration Architecture Diagram

The following is an architectural diagram of our entire migration process, including full synchronization from SQL Server to TiDB, incremental synchronization, and reverse synchronization from TiDB to SQL Server.

What needed to be determined next was the migration process for the whole project. With the overall direction fixed, the goals during implementation become much clearer.

Given the business characteristics and data volume of the Autohome community, an offline migration of ten or more hours is completely unacceptable. We could only perform the switchover within the 1:00–3:00 a.m. window, and the shorter the better.

Therefore we chose an online migration, which is somewhat more complicated: first prepare the full data, then synchronize incremental data in real time. Once the data has caught up (with a delay of only seconds), switch the application's read traffic to TiDB in a rolling-upgrade fashion.

After confirming that the applications run correctly, briefly revoke write permission on SQL Server to ensure that no more data is written to it; then point the write traffic at TiDB, and the migration is complete.

During the entire migration, the application's reads are not affected; writes are affected only from the moment write permission is disabled until write traffic is pointed at TiDB.

The diagram below is the flowchart we worked out, and we followed it strictly throughout the migration.

The implementation of full and incremental synchronization is described in detail below.

V. Full synchronization

First, we would like to thank the following two open source projects; standing on the shoulders of giants saved us a lot of time.

  • Github.com/alibaba/yug…

  • Github.com/alswl/yugon…

Yugong is a data migration and synchronization tool for Oracle open-sourced by Alibaba, and the author alswl added support for SQL Server as a data source on that basis. We will not repeat how to use yugong here; interested readers can check out the projects above.

After carefully reading and testing the author's project, we found that it did not meet 100% of our needs.

Yugong's data flow is a standard ETL process, implemented in three stages: Extractor, Translator, and Applier.

First, the Extractor. Yugong's original design requires every database and table to be exported to be listed in the configuration file, which is unrealistic for 1000+ tables. So we added a feature that reads all user tables from the database, without configuring each table name, and uses a new configuration item with regular-expression matching to decide which tables need to be synchronized (see the sketch after the queries below).

-- Query databases in the ONLINE state
SELECT name FROM sys.databases WITH (nolock) WHERE state_desc = 'ONLINE'

-- Query the tables that have CDC enabled (the %s placeholders are filled with the database name)
SELECT t.name FROM %s.sys.tables t WITH (nolock) JOIN %s.[cdc].[change_tables] ct WITH (nolock) ON t.object_id = ct.source_object_id
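The sketch below shows what the added regex filtering step could look like; the class name and configuration item are hypothetical illustrations, not the actual yugong code.

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch of the new Extractor behavior: take every user table discovered by the
// queries above and keep only those whose names match a new regex configuration item.
public class TableFilter {
    // e.g. tableNamePattern = "^reply_\\d+$"   (illustrative configuration value)
    public List<String> filter(List<String> allUserTables, String tableNamePattern) {
        Pattern pattern = Pattern.compile(tableNamePattern);
        return allUserTables.stream()
                .filter(name -> pattern.matcher(name).matches())
                .collect(Collectors.toList());
    }
}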

Next comes the Translator. We added a RowDataMergeTranslator, which translates the RowData read from SQL Server while it is held in memory: it discards the original primary key column and lets TiDB generate the primary key for the merged table instead. A configuration item determines which tables need this feature.

record.removeColumnByName(config.getDiscardKey());
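For context, a minimal sketch of this translation step is shown below. The row is modeled as a plain map and the class is illustrative; in yugong the row is a Record object and the discard-key name comes from the configuration file, as in the line above.

import java.util.Map;

// Hypothetical sketch of the merge translation: drop the shard-local auto-increment key so that
// TiDB generates the primary key for the merged table.
public class RowDataMergeTranslator {
    private final String discardKey;   // e.g. "id", read from a per-table configuration item

    public RowDataMergeTranslator(String discardKey) {
        this.discardKey = discardKey;
    }

    public Map<String, Object> translate(Map<String, Object> row) {
        row.remove(discardKey);   // equivalent to record.removeColumnByName(config.getDiscardKey())
        return row;
    }
}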

The final Applier is left unchanged and the processed data is written directly to TiDB.

At this point, the full synchronization that merges the sharded databases and tables into one is complete.

VI. Incremental synchronization and real-time verification

To meet this requirement, we used SQL Server's CDC (Change Data Capture) and added delayed validation of data correctness on top of the incremental synchronization. CDC itself is not covered here; all you need to know is that it gives access to incremental data changes. For details, see the CDC documentation.

Note that CDC must be enabled before the full synchronization starts, so that the CDC records cover all incremental data generated while the full synchronization is running.

As shown in the flowchart above, the Producer reads CDC logs from SQL Server, converts each change into a message containing the table information, column information, and row data, and delivers it to Kafka. A downstream consumer pulls the messages, converts the data into MySQL-compatible SQL statements, and executes them in TiDB (also merging the sharded databases and tables here), which completes the incremental data synchronization.
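The sketch below illustrates the replay half of that consumer: one parsed change message (table name, column names, row values) is turned into a MySQL-compatible statement and executed in TiDB over JDBC. The Kafka polling loop is omitted, and the connection string, merged table name, and use of REPLACE are assumptions for illustration, not the production code.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Collections;
import java.util.List;

// Hypothetical apply step of the incremental-sync consumer. All 1000+ sharded reply tables are
// written into a single merged table in TiDB through the MySQL protocol.
public class TidbApplier {
    private final Connection tidb;

    public TidbApplier(String url, String user, String password) throws Exception {
        this.tidb = DriverManager.getConnection(url, user, password);   // e.g. jdbc:mysql://tidb:4000/reply
    }

    public void apply(String mergedTable, List<String> columns, List<Object> values) throws Exception {
        String placeholders = String.join(",", Collections.nCopies(values.size(), "?"));
        String sql = "REPLACE INTO " + mergedTable
                + " (" + String.join(",", columns) + ") VALUES (" + placeholders + ")";
        try (PreparedStatement ps = tidb.prepareStatement(sql)) {
            for (int i = 0; i < values.size(); i++) {
                ps.setObject(i + 1, values.get(i));
            }
            ps.executeUpdate();
        }
    }
}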

There is also another consumer that implements the data validation function. It consumes the same queue with a delay of about five seconds, looks up the already-written rows in TiDB by primary key (or index), and compares the entire row on both sides (in practice, after removing the primary key). If a mismatch is found, it tries to write the row again; if that still fails, it sends an alert to the relevant people.
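A minimal sketch of that delayed comparison, under the same illustrative assumptions as above (a merged table and a unique business ID column), is:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.List;
import java.util.Objects;

// Hypothetical delayed-validation step: re-read the row from TiDB by business ID and compare it,
// column by column (primary key excluded), with the row carried in the CDC message.
public class RowValidator {
    private final Connection tidb;

    public RowValidator(Connection tidb) {
        this.tidb = tidb;
    }

    public boolean matches(String table, String bizIdColumn, Object bizId,
                           List<String> columns, List<Object> expected) throws Exception {
        String sql = "SELECT " + String.join(",", columns) + " FROM " + table
                + " WHERE " + bizIdColumn + " = ?";
        try (PreparedStatement ps = tidb.prepareStatement(sql)) {
            ps.setObject(1, bizId);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return false;              // row not written yet: retry the write, then alert
                }
                for (int i = 0; i < columns.size(); i++) {
                    if (!Objects.equals(rs.getObject(i + 1), expected.get(i))) {
                        return false;          // mismatch: retry the write, alert if it persists
                    }
                }
                return true;
            }
        }
    }
}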

After implementing all this and entering the testing phase, we found a problem: the 1000+ reply tables correspond to 1000+ CDC log tables, so a single Producer needs to start 1000+ threads. Polling these tables at the designed 5-second interval maxed out the server's CPU, left large numbers of threads waiting, and made the timeliness of CDC polling impossible to guarantee. Analysis of the business, together with queries run by the DBAs, showed that 95% of the replies generated in the Autohome community each day are concentrated in the newest 5% of posts. In other words, only a few dozen tables actually need their CDC logs polled at such a high frequency; for the remaining tables we solved the problem by increasing the polling interval, deploying in batches, and so on (a sketch follows).
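One simple way to express that tiering is sketched below; the thread-pool size and intervals are assumptions for illustration, not the production values.

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical tiered polling: the small set of hot CDC log tables (newest posts) is polled every
// few seconds, while the long tail is polled far less often, so one Producer no longer needs
// 1000+ busy threads.
public class CdcPoller {
    private final ScheduledExecutorService pool = Executors.newScheduledThreadPool(8);

    public void schedule(List<String> hotTables, List<String> coldTables) {
        for (String table : hotTables) {
            pool.scheduleWithFixedDelay(() -> pollCdcTable(table), 0, 5, TimeUnit.SECONDS);
        }
        for (String table : coldTables) {
            pool.scheduleWithFixedDelay(() -> pollCdcTable(table), 0, 60, TimeUnit.SECONDS);
        }
    }

    private void pollCdcTable(String table) {
        // read the new CDC rows for this table and hand them to the Kafka producer (omitted)
    }
}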

Careful readers will notice that this validation logic is not strictly rigorous: if the upstream data changes again within the five-second window, newer data may end up being used to validate older data. There are two solutions:

  • Use a single-partition topic and a single consumer to guarantee that incremental synchronization and validation happen in strictly the same order; however, the throughput of this approach is low and its availability is hard to guarantee.

  • Add a rowversion column to the table rows in SQL Server and synchronize this version stamp to TiDB. During validation, compare the version values; if they differ, abandon the check for that row. This loses a few validation samples, but throughput and availability can be improved by adding partitions and consumers (see the sketch after this list).
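A minimal sketch of the version check in the second option follows. SQL Server's rowversion is an 8-byte value that changes automatically on every update; how it is stored in TiDB (for example as BINARY(8)) is an assumption here.

import java.util.Arrays;

// Hypothetical version guard for the validator: only compare the full rows when the version
// captured in the CDC message still matches the version currently stored in TiDB.
public class VersionGuard {
    public boolean shouldValidate(byte[] rowVersionFromCdc, byte[] rowVersionInTidb) {
        // If the versions differ, the row was updated again upstream within the delay window;
        // abandon this validation sample instead of raising a false alarm.
        return Arrays.equals(rowVersionFromCdc, rowVersionInTidb);
    }
}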

VII. Rollback plan

As mentioned earlier, after switching to TiDB we must guard against unpredictable problems and be able to switch back to SQL Server at any time.

TiDB Binlog makes this easy. We use the official Pump and Drainer components to deliver the binlog into Kafka, parse out the data changes, compute which SQL Server database and table each row belongs to from its business ID, and write the data back.
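The routing step of this reverse synchronization might look like the sketch below; the shard counts and naming convention are the same illustrative assumptions used in the earlier routing example, not the actual production scheme.

// Hypothetical reverse-sync routing: a row parsed out of the TiDB binlog is written back to the
// SQL Server shard derived from its business ID.
public class ReverseRouter {
    private static final int DB_COUNT = 100;
    private static final int TABLE_COUNT = 1000;

    public String targetTable(long businessId) {
        long db = businessId % DB_COUNT;
        long table = businessId % TABLE_COUNT;
        return String.format("reply_db_%02d.dbo.reply_%04d", db, table);
    }
}

The consumer then builds a T-SQL statement against targetTable(...) and executes it on SQL Server, mirroring the forward incremental-sync consumer shown earlier.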

VIII. Migrating and adapting the Autohome community business to TiDB

Because of historical accumulation, there were many places that needed modification during the business transformation, scattered across many projects. We combined searching for interface implementations, searching the code, and having the DBAs capture SQL on the servers to make sure 100% of the affected business code was covered; only then could we guarantee no failures after going live.

  • The data access layer added support for MySQL syntax.

  • Removed the stored procedures that had been used in SQL Server.

  • Statements and functions that differ between SQL Server and TiDB (MySQL) were modified, tested, and optimized one by one (a small example follows this list).

  • Rebuilt the indexes according to TiDB's indexing principles and the SQL statements we had collected.
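As a small illustration of the dialect differences mentioned in the third point, one rewritten query might change along these lines; the table and column names are made up for illustration.

// Hypothetical before/after of a rewritten query: SQL Server's TOP / ISNULL / GETDATE() become
// LIMIT / IFNULL / NOW() in MySQL/TiDB.
public class ReplyQueries {
    static final String SQLSERVER_LATEST_REPLIES =
        "SELECT TOP 20 reply_id, ISNULL(content, '') AS content "
      + "FROM reply WITH (nolock) WHERE post_id = ? AND created < GETDATE() "
      + "ORDER BY created DESC";

    static final String TIDB_LATEST_REPLIES =
        "SELECT reply_id, IFNULL(content, '') AS content "
      + "FROM reply WHERE post_id = ? AND created < NOW() "
      + "ORDER BY created DESC LIMIT 20";
}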

At the same time, we optimized every modified SQL statement to hit the optimal index precisely. As a result, with data at the billion-row level, 99% of TP requests respond within 12 ms and 99.9% within 62 ms.

IX. TiDB peripheral system construction

In addition to the features involved in the migration itself, we have also established development standards and built some practical tools to ensure TiDB is applied well at Autohome.

  • We established complete TiDB development, operations, and release standards, and ran training sessions on them for developers within the company.

  • Developed a real-time slow SQL analysis tool, TiSlowSQL, which provides real-time, multi-dimensional, full-view SQL reports to help us quickly locate cluster-level failures caused by slow SQL.

  • To avoid a single point of failure in monitoring, we developed our own set of tools to monitor TiDB's core components. In the future, this monitoring will be migrated to the Autohome cloud platform.

  • We regularly hold technical training at Autohome University, and regularly share technology and lessons learned within the team.

TiSlowSQL is also the project that the Autohome team took to the TiDB Hackathon. For more details, please look forward to the next article!

X. Summary and outlook

The Autohome community officially went live on the distributed database TiDB at the end of September, and it has been running stably since. After the remaining business migrations are completed, the SQL Server services in the Autohome community will be phased out. Our conclusions about the migration process are as follows:

  • Through continuous SQL optimization, the current online TP99 is stable and shows no significant difference from before the migration, consistent with our test results; users and services are unaware of the change.

  • As the business keeps growing, TiDB copes much better with exploding data volumes: scaling the cluster no longer requires sourcing expensive large-storage machines, and capacity can be added at any time without interrupting online services.

  • In this migration we accumulated a great deal of experience going from SQL Server to TiDB, and we can now provide technical support to other teams adopting the distributed database TiDB, saving them time in their own migrations.

  • We are currently in contact with the TiDB team about contributing the migration solution and the non-business-specific migration logic to the open source community.

Moving from SQL Server to TiDB, from a traditional relational database to distributed HTAP, and from a commercial license to open source is one of the most significant technology shifts in the history of the Autohome community.

Autohome has many application scenarios involving massive data. This migration from SQL Server to the distributed database TiDB lays a good foundation for migrating other businesses to TiDB in the future, and has also established a good regular communication mechanism with the TiDB team. We hope the official team keeps iterating quickly, and we will work with them to develop useful features.

For any questions about or experiences with using TiDB, you can visit www.asktug.com to discuss with the community.