This article is based on a talk given by Huang Wei, senior database architect at WeBank, at PingCAP DevCon 2021. It covers WeBank's practical experience with TiDB: why WeBank chose TiDB, how TiDB is deployed, and how it is used in the loan core batch scenario, and it closes with the best practices and future plans built around WeBank's TiDB-based optimizations.
TiDB’s product advantages
From the end of 2018, when WeBank first engaged with the TiDB team, to the production launch in 2019, TiDB showed a number of distinctive advantages during database selection.
TiDB is compatible with the MySQL protocol as well as the MySQL ecosystem tools for backup, recovery, monitoring, and so on, so the cost and barrier of migrating from MySQL to TiDB are low for the application itself, for operations, and for developers. With TiDB's native compute/storage separation, users do not have to worry about capacity or single-server performance bottlenecks, and can, to some extent, treat TiDB as one large MySQL. At the same time, TiDB's strong consistency across multiple data replicas is essential for financial scenarios. TiDB also natively supports multi-IDC deployment, which lets applications achieve a multi-active architecture across data centers in the same city. In addition, the TiDB open source community is very active; for example, the AskTUG platform collects solutions to problems encountered by typical users and contains many valuable lessons, which further lowers the barrier to adopting TiDB.
More and more users are adopting TiDB, both Internet companies and financial institutions, and in large volumes. This reflects the growing maturity of the product and gives new adopters more confidence in using TiDB.
The deployment architecture of TiDB in WeBank
Are the features of TiDB sufficient to meet the high availability architecture requirements of financial institutions?
The figure shows the deployment architecture of TiDB in WeBank. First, TiKV uses three replicas deployed across three data centers in the same city, achieving IDC-level high availability. A set of TiDB servers is deployed in each IDC and bound to a load balancer to provide a VIP service, so the application can access all three IDCs in a multi-active fashion. This architecture has been validated with a real IDC-level failure drill, in which the entire network of one IDC was taken down and the cluster recovered quickly. We believe TiDB can meet the high-availability requirements of financial scenarios.
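To make the access pattern concrete, below is a minimal sketch of how an application instance in one IDC might reach TiDB through the per-IDC load-balancer VIPs described above, preferring the local VIP and falling back to the other IDCs. The VIP hostnames, database name, and the pymysql driver are assumptions for illustration, not WeBank's actual implementation.

```python
# Illustrative multi-IDC access: prefer the local IDC's TiDB VIP, fall back to
# the other IDCs' VIPs on connection failure. Hostnames are hypothetical.
import pymysql

LOCAL_VIP = "tidb-vip-idc-a"                  # VIP in the same IDC as this app instance
REMOTE_VIPS = ["tidb-vip-idc-b", "tidb-vip-idc-c"]

def connect():
    for host in [LOCAL_VIP] + REMOTE_VIPS:
        try:
            return pymysql.connect(host=host, port=4000, user="app",
                                   password="***", database="loan",
                                   connect_timeout=2)
        except pymysql.err.OperationalError:
            continue                           # try the next IDC's VIP
    raise RuntimeError("no TiDB VIP reachable in any IDC")

conn = connect()
```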
Core batch accounting scenario
Core batch accounting for loans is a classic and very important scenario in the financial industry, and WeBank has moved it onto TiDB. The figure below shows the loan batch architecture before the migration. On the left are many business units: user data is split into units, and although the data in each unit differs, the architecture and deployment model are identical, with each unit running on its own single-instance database. Batch jobs run against every single-instance database, and the batch results are finally ETL'd to the big data platform for downstream use. So what are the bottlenecks or optimization points of this architecture?
This is a purely batch-oriented workload: a large number of batch writes and updates over very large data volumes, at the hundred-million to billion-row level. With rapid business growth, the number of IOUs, users, and transaction records keeps increasing, and hosting all of this on a single-instance database creates several problems. First, the batch is limited by the performance of a single server, so its run time keeps growing; the database load is already high, with I/O and CPU at 70% to 80%, and raising application concurrency to speed up the batch is risky, because under such load primary/standby replication may lag and a fast primary/standby switchover may not be possible during a failure, so efficiency cannot be improved. Second, adding columns to, or managing data in, tables with hundreds of millions or billions of rows is very difficult; WeBank uses online DDL tools such as pt-online-schema-change for table changes, but there is still a small probability of table locking. Third, for resource-utilization reasons the batch system and the online system shared the same single-instance database, so a heavy batch load could easily affect online transactions. Against this background, WeBank upgraded the architecture with TiDB.
The upgraded architecture is shown in the figure below. WeBank synchronizes and aggregates the data of each business unit into TiDB in real time through the DM tool; the batch APP then runs its batch computation directly on TiDB and delivers the results to the big data platform. In effect, TiDB's horizontal scalability is turned into horizontal scaling of batch efficiency. Under the previous MySQL primary/standby architecture, the APP servers had to be deployed in the same IDC as the MySQL primary, because cross-IDC access adds network latency and lengthens the batch window, so the APP servers in the other IDCs sat idle in standby. Under the TiDB architecture, all TiKV nodes can serve reads and writes at the same time, so batch tasks can be started in multiple IDCs simultaneously, maximizing resource utilization.
Value gained
The application of TiDB in WeBank's loan core business scenario brought three main benefits.
- Improved batch efficiency. The left side of the figure below compares the billing-day batch time of one of WeBank's loan businesses. Under the single-instance architecture the batch took about three hours; after the architecture was upgraded and optimized on TiDB, it dropped to about 30 minutes, a clear improvement in absolute efficiency.
- Linear horizontal scaling. WeBank needed not only better efficiency but also horizontal, elastic scaling, because as the business grows the number of IOUs and users keeps increasing, and any hot spot or other bottleneck would make further improvement very difficult. The right side of the figure compares batch times: with the initial resources the batch ran in about 25 minutes; when the data volume doubled it took about 50 minutes; and when the resources were then doubled as well, the time dropped back to about 26 minutes, demonstrating linear scalability. Beyond the efficiency gain, this linear scalability means that as IOU counts and amounts keep growing rapidly, the architecture no longer has to worry about potential technical bottlenecks and the business can focus on the product itself. That is the real business value TiDB brings.
- Separation of the batch system from the online trading system. As mentioned earlier, the batch system used to share a database with the online system for resource reasons. The two are now completely separated, and there is no primary/standby replication lag as with a single-instance database, so resource utilization can be maximized to improve batch efficiency.
Optimization based on TiDB
The benefits above are clear. So what optimizations did WeBank make, and what problems were encountered along the way?
- SQL schema optimization. Because of its distributed architecture, TiDB has higher single-request latency than MySQL, so requests that interact frequently with the database need to be batched to minimize round trips: for example, turning multiple SELECTs into a single SELECT with an IN list, replacing multiple INSERTs with one multi-value INSERT, and replacing multiple UPDATEs with a multi-value REPLACE (a minimal sketch of this batching appears after this list). In addition, because data from multiple business units is aggregated into one TiDB cluster, single-table data volumes become very large, and one inefficient SQL statement can easily bring down the cluster, for example through OOM, so SQL audit and tuning need special attention. Earlier versions could produce inaccurate execution plans; since version 4.0, SQL execution plan binding is supported, so high-frequency SQL can be bound to run more stably. Because WeBank adopted TiDB relatively early, it mainly uses the optimistic locking mode, and the applications were adapted accordingly; this adaptation code has since been consolidated into a common module that new systems can use directly.
- Hotspot and application concurrency optimization. Users familiar with TiDB will know the hotspot problem. Elastic scaling, mentioned earlier, requires data to be sufficiently discrete, and in the early stage of adopting TiDB WeBank ran into hot spots caused by features like MySQL's auto-increment. User card numbers and IOU numbers can also be largely consecutive, so WeBank adjusted both: for example, changing the auto-increment key to AUTO_RANDOM, and, based on the known distribution rules of card numbers, calculating the data distribution intervals in advance and using the Split Region feature to pre-split Regions, so that every node's capacity is fully used during a burst of writes (see the hotspot sketch after this list). In addition, small tables that are modified rarely but read frequently are cached inside the application to relieve read hot spots. Beyond making the data sufficiently discrete, the application itself also had to be transformed for distributed execution: an App Master node shards the data and distributes the shard tasks evenly to the App workers for computation, monitoring the status and progress of each shard task as it runs. Through this joint optimization of data and application, overall horizontal scalability is achieved.
- Data synchronization and data verification optimization. As mentioned above, WeBank aggregates the data of the various business units through the DM tool. The DM 1.0 version used in the early stage had no high-availability feature, which is close to fatal in financial scenarios; DM 2.0 has since landed stably with several features including high availability, compatibility with gray-release DDL, and better usability. On the data verification side, because this is a core batch scenario, synchronization must not lose or corrupt any data, so the application embeds its own checksum logic: data is fragmented in the upstream MySQL databases and the checksum of each fragment is written to a table; the data is then synchronized to the downstream TiDB through DM; and during the batch run each fragment is loaded from TiDB, its checksum is recomputed, and the upstream and downstream checksum values are compared to guarantee data consistency (a sketch of this fragment-level verification appears after this list).
- Failure drills and fallback plan optimization. The batch system was previously built on MySQL, and after moving to TiDB, failures might show up in unexpected ways, so WeBank ran extensive fault drills. First, failures of the various TiDB component nodes were simulated to make sure the application tolerates them, and when a batch is interrupted the application can resume from a breakpoint. Second, for full batch reruns: to quickly restore the rerun scene after a program bug or other unexpected problem, the application implements fast backup and rename-based flashback (sketched after this list). Third, drills for extreme scenarios: assuming, for example, that the TiDB cluster becomes unavailable as a whole, WeBank combines Dumpling and Lightning to quickly back up and restore the entire cluster; the difficult parts include quickly confirming the DM synchronization restore point and manually pre-splitting large tables. The verified results met the correctness and timeliness requirements. Because this architecture involves many data flows, WeBank has drilled a large number of failure scenarios and compiled corresponding SOP plans.
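For the SQL schema optimization item above, here is a minimal sketch of the request-batching idea, assuming the pymysql driver and hypothetical table and column names; it is an illustration, not WeBank's actual code.

```python
# Batching round trips against TiDB: one SELECT ... IN and one multi-value
# INSERT instead of many single-row statements. Table/column names are
# hypothetical examples.
import pymysql

conn = pymysql.connect(host="tidb-vip", port=4000, user="batch", password="***",
                       database="loan", autocommit=False)

iou_ids = ["A001", "A002", "A003"]
result_rows = [("A001", 100.0), ("A002", 250.0), ("A003", 80.0)]

with conn.cursor() as cur:
    # One SELECT with an IN list instead of one SELECT per IOU.
    placeholders = ",".join(["%s"] * len(iou_ids))
    cur.execute(
        f"SELECT iou_id, balance FROM loan_detail WHERE iou_id IN ({placeholders})",
        iou_ids)
    balances = dict(cur.fetchall())

    # One multi-value INSERT instead of one INSERT per row; pymysql's
    # executemany rewrites this into a single multi-row statement.
    cur.executemany(
        "INSERT INTO batch_result (iou_id, amount) VALUES (%s, %s)", result_rows)

conn.commit()
```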
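For the hotspot item, the sketch below shows the two TiDB features the text refers to, AUTO_RANDOM primary keys and Region pre-splitting, issued through the same driver; the table definition, index, and card-number range are hypothetical.

```python
# Hotspot avoidance on TiDB: scatter primary keys with AUTO_RANDOM and
# pre-split Regions over the expected card-number range before a bulk write.
import pymysql

conn = pymysql.connect(host="tidb-vip", port=4000, user="batch", password="***",
                       database="loan", autocommit=True)

with conn.cursor() as cur:
    # AUTO_RANDOM makes newly inserted primary keys discrete instead of sequential.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS repay_plan (
            id BIGINT PRIMARY KEY AUTO_RANDOM,
            card_no VARCHAR(32),
            amount DECIMAL(18, 2),
            KEY idx_card_no (card_no)
        )
    """)
    # Pre-split the secondary index over the known card-number interval so a
    # burst of writes lands on many Regions/TiKV nodes from the start.
    cur.execute(
        "SPLIT TABLE repay_plan INDEX idx_card_no "
        "BETWEEN ('6200000000000000') AND ('6299999999999999') REGIONS 64")
```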
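For the data verification item, here is an illustrative sketch of fragment-level checksum comparison between an upstream MySQL unit and the downstream TiDB cluster; the hashing scheme, shard column, and table names are assumptions rather than WeBank's actual implementation.

```python
# Compare per-fragment checksums computed upstream (MySQL) and downstream (TiDB)
# to verify that DM synchronization lost or corrupted nothing.
import hashlib
import pymysql

def fragment_checksum(conn, shard_id):
    """Order-independent checksum over one fragment's rows."""
    digest = 0
    with conn.cursor() as cur:
        cur.execute(
            "SELECT iou_id, balance FROM loan_detail WHERE shard_id = %s",
            (shard_id,))
        for row in cur.fetchall():
            row_hash = hashlib.md5("|".join(map(str, row)).encode()).hexdigest()
            digest ^= int(row_hash, 16)   # XOR keeps the result order-independent
    return f"{digest:032x}"

upstream = pymysql.connect(host="mysql-unit-1", user="check", password="***",
                           database="loan")
downstream = pymysql.connect(host="tidb-vip", port=4000, user="check",
                             password="***", database="loan")

for shard in range(16):
    src = fragment_checksum(upstream, shard)    # in practice written to a checksum table before sync
    dst = fragment_checksum(downstream, shard)  # recomputed from TiDB before the batch runs
    assert src == dst, f"fragment {shard} checksum mismatch: {src} != {dst}"
```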
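For the failure drill item, the following is a sketch of the "fast backup and rename flashback" idea: before a full rerun, the result table is renamed aside and recreated empty, so a bad rerun can be undone by renaming the backup back into place. Table names and the driver are again illustrative assumptions.

```python
# Rename-based flashback before a batch rerun. RENAME TABLE is a metadata
# operation in TiDB, so it is fast even for very large tables.
import time
import pymysql

conn = pymysql.connect(host="tidb-vip", port=4000, user="batch", password="***",
                       database="loan", autocommit=True)
backup = "batch_result_bak_" + time.strftime("%Y%m%d%H%M%S")

with conn.cursor() as cur:
    cur.execute(f"RENAME TABLE batch_result TO {backup}")    # keep the old results aside
    cur.execute(f"CREATE TABLE batch_result LIKE {backup}")  # fresh, empty table for the rerun
    # ... rerun the batch into the new batch_result ...
    # To flash back if the rerun goes wrong:
    #   DROP TABLE batch_result;
    #   RENAME TABLE <backup> TO batch_result;
```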
Future planning
WeBank started research and a POC in 2018 and launched its first TiDB application in 2019. TiDB at WeBank now covers areas such as lending, interbank business, technology management, and basic technology, and several core business scenarios are currently in POC testing. Future planning covers five aspects:
1. Cloud-native and containerized TiDB, which can improve capabilities such as automated operations and maintenance and resource allocation.
2. A persistence scheme based on Redis + TiKV, mainly to replace the Redis + MySQL fallback scheme, building persistence on TiKV's native high availability.
3. Low-cost applications based on SAS disks. WeBank has many archiving scenarios with very large data volumes that must be retained for a long time for regulatory reasons. For these high-capacity, low-access-frequency scenarios, WeBank will run trials of TiDB on lower-cost SAS disks.
4. TiDB on the ARM platform as part of the localization effort. WeBank already brought TiDB on ARM into business use last year, and with the trend toward localization, investment in this area will continue to grow.
5. Evaluation and application of TiFlash. TiFlash provides HTAP capabilities, particularly suited to scenarios such as real-time risk control and lightweight AP queries, which WeBank plans to focus on in the future.