Author: Zhu Jinsong, system architect at the Quant R&D Center, mainly involved in the development of basic components, the API gateway, and other projects. He is now focused on the architecture design and development of the company's risk control system.
I. Company profile
QuantGroup, founded in 2014, is a data-driven technology company and a national high-tech enterprise. With the vision of "Move the world with data, enlighten life with AI," QuantGroup applies artificial intelligence, machine learning, and big data technologies to provide customized strategies and models for partners in finance, e-commerce, tourism, mobility, the automotive supply chain, and other fields, helping to improve industry efficiency. QuantGroup has established in-depth cooperation with more than 300 institutions and companies at home and abroad, and is committed to creating a more dynamic win-win ecosystem and promoting sustainable economic development.
Our company began investigating TiDB in mid-2017 and set up a TiDB cluster for data storage in our user behavior data analysis system. After more than a year of application and research, we accumulated rich experience. Meanwhile, TiDB officially released its 2.0 GA version, which was more mature, with great improvements in stability and query efficiency. In July 2018 we deployed TiDB 2.0.5 and began applying it to the risk control business. When a user applies for a loan, the risk control system performs real-time calculation and returns the loan decision according to risk control rules combined with models and user features.
II. Business background
The data used in the risk control system can be divided into two parts:
- Original data, which is used to analyze the user's current features.
- Snapshot data, which is used to calculate feature metrics at a specified historical point in time for model training.
There are three main types of raw data:
- Business system data generated by the company's various product lines.
- Data provided by the crawler group, such as user contacts, carrier records, and consumption records.
- User feature data produced after processing.
Our risk control strategies use many models, including neural network models and scoring models. Training these models relies on a large number of historical orders and the associated user features. Training a more precise model requires features in more dimensions, and the accuracy of those features directly affects the training results. For this purpose, we need data snapshots to trace back the user features of each order at a given point in time.
Data snapshots can be implemented with a zipper table. Simply put, three fields are added to each table: new_id, start_time, and end_time. Each update of a record generates a new row and sets the end_time of the original row, thereby recording the change history of the data.
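The zipper-table mechanism can be sketched in a few lines. The schema, table name, and timestamps below are illustrative assumptions rather than our production design, and SQLite stands in for TiDB:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A zipper (history) table: each row is one version of a user record.
# end_time IS NULL marks the currently valid version.
cur.execute("""
    CREATE TABLE user_zipper (
        new_id     INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id    INTEGER NOT NULL,
        phone      TEXT,
        start_time TEXT NOT NULL,
        end_time   TEXT
    )
""")

def zipper_update(user_id, phone, now):
    """Close the currently open version of the record, then insert a new one."""
    cur.execute(
        "UPDATE user_zipper SET end_time = ? "
        "WHERE user_id = ? AND end_time IS NULL",
        (now, user_id),
    )
    cur.execute(
        "INSERT INTO user_zipper (user_id, phone, start_time, end_time) "
        "VALUES (?, ?, ?, NULL)",
        (user_id, phone, now),
    )

zipper_update(42, "13800000001", "2018-01-01 00:00:00")
zipper_update(42, "13800000002", "2018-06-01 00:00:00")  # user changed phone

# Two versions now exist: one closed interval, one open-ended.
rows = cur.execute(
    "SELECT phone, start_time, end_time FROM user_zipper ORDER BY new_id"
).fetchall()
print(rows)
```

Each update thus turns into one UPDATE plus one INSERT, which is why the zipper table grows with every change rather than overwriting in place.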
As the introduction above shows, the volume of business data and crawler data is already very large, and generating the corresponding zipper data multiplies it. Even assuming each record has changed only once since creation, the zipper table is already twice the size of the original table; in the actual production environment, records change far more than once.
From the above, we conclude that data storage for the risk control system must cover the following:
- Business data.
- Zipper tables of business data.
- Crawler data, such as contact information, carrier data, and consumption records.
- Zipper tables of crawler data.
- Other data, such as pre-processed data.
III. The current plan
In the previous solution, HBase was used for data storage. Its horizontal scalability solved the problem of large data volumes. However, there were obvious problems in actual use, the most notable being its weak query API: a single record can be fetched by key, or records can be read in batches through the Scan API. This added extra development cost to feature backtracking and prevented code reuse.
In the real-time computing scenario, to reduce development cost, business data is obtained by querying the MySQL replicas of the online systems. Crawler data is stored uniformly in HBase, so all data used in crawler-related calculations must be pulled into memory before calculation.
In the backtracking scenario, business feature backtracking can be done by querying data earlier than the order time. This method does not work for data that has since changed; such cases can only be handled with data snapshots in HBase, which adds a lot of development work.
3.1 TiDB opens a new horizon for us
From the above, we know that a real-time data warehouse for the risk control system needs the following characteristics:
- High availability, providing robust and stable service.
- Horizontal elastic scalability to meet growing data volumes.
- Good performance and high concurrency.
- Fast response.
- Support for standard SQL, preferably MySQL syntax and the MySQL protocol, to avoid extra development in backtracking.
TiDB fits each of these needs perfectly. After using TiDB in the user behavior data analysis system for a long time, we had accumulated experience. Throughout that process the TiDB team provided long-term technical support and gave timely feedback on the problems we raised, and also held many technical exchanges and offline sharing sessions with our engineers, for which we are very grateful. As the demands on the risk control system kept growing, we carried out a new round of optimization of the overall architecture. The new data access and storage architecture is shown in Figure 1.
As Figure 1 shows, the data generated by the online business systems is stored uniformly in MySQL. Collecting this isolated data into TiDB lets us provide SQL-based query services. We pull data directly from the MySQL instances via binlog and store it in two different forms:
- Source data with the sharded databases and tables merged back together, which reduces the implementation and maintenance cost of real-time feature calculation.
- Data stored in zipper form, implementing the data snapshot function.
After investigation, the first scenario could be implemented quickly with Alibaba's Otter or the TiDB ecosystem tool Syncer, but there was no ready-made mature solution for the second requirement. In the end, we built a custom client based on Alibaba's Canal, which assembles and merges SQL according to different needs and writes it into different TiDB clusters. Data from some tables can also be assembled and sent to Kafka on demand for quasi-real-time analysis scenarios.
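A minimal sketch of what such a client does for the zipper stream: turn one binlog-style row event into the pair of statements that maintain the corresponding zipper table. The event format, `_zipper` naming convention, and column names are assumptions for illustration; the real client also batches and merges statements before writing to TiDB.

```python
# Assemble zipper-table SQL from a binlog-style row event.
# Placeholders (%s) follow the MySQL client convention; parameters
# are returned separately for the driver to bind.
def event_to_zipper_sql(event):
    table = event["table"] + "_zipper"   # assumed naming convention
    ts = event["ts"]
    row = event["row"]
    stmts = []
    if event["type"] in ("UPDATE", "DELETE"):
        # Close the currently open version of this record.
        stmts.append((
            f"UPDATE {table} SET end_time = %s "
            f"WHERE id = %s AND end_time IS NULL",
            (ts, row["id"]),
        ))
    if event["type"] in ("INSERT", "UPDATE"):
        # Open a new version stamped with the event time.
        cols = ", ".join(row) + ", start_time"
        placeholders = ", ".join(["%s"] * (len(row) + 1))
        stmts.append((
            f"INSERT INTO {table} ({cols}) VALUES ({placeholders})",
            tuple(row.values()) + (ts,),
        ))
    return stmts

stmts = event_to_zipper_sql({
    "type": "UPDATE",
    "table": "user_contact",
    "ts": "2018-07-01 12:00:00",
    "row": {"id": 7, "phone": "13800000002"},
})
for sql, params in stmts:
    print(sql, params)
```

An UPDATE event thus fans out into two statements (close the old version, insert the new one), while an INSERT or DELETE produces only one.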
For the data from the crawler group, we consume Kafka directly, assemble SQL, and write it into TiDB.
In actual use, with proper index optimization TiDB fully supports our online real-time query requirements. For feature backtracking, we only need to add a time condition to the query to obtain the feature values at a specified time, which greatly reduces development cost.
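Concretely, a snapshot at time t selects the version whose validity interval covers t. A sketch against an assumed zipper table (again using SQLite in place of TiDB; the schema is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE user_zipper (
    user_id INTEGER, phone TEXT, start_time TEXT, end_time TEXT)""")
cur.executemany(
    "INSERT INTO user_zipper VALUES (?, ?, ?, ?)",
    [(42, "13800000001", "2018-01-01", "2018-06-01"),
     (42, "13800000002", "2018-06-01", None)],
)

def phone_at(user_id, order_time):
    """Feature backtracking: the only extra condition versus a normal
    query is that the order time falls inside the version's validity
    interval [start_time, end_time)."""
    cur.execute(
        "SELECT phone FROM user_zipper "
        "WHERE user_id = ? AND start_time <= ? "
        "AND (end_time IS NULL OR end_time > ?)",
        (user_id, order_time, order_time),
    )
    row = cur.fetchone()
    return row[0] if row else None

print(phone_at(42, "2018-03-15"))  # version valid in March
print(phone_at(42, "2018-07-15"))  # current open-ended version
```

The same query code serves both real-time calculation (t = now) and historical backtracking (t = order time), which is exactly the code reuse that the HBase Scan API could not give us.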
3.2 Problems encountered
In risk control services, the SQL used to extract user features is relatively complex. In actual use, some SQL statements ran longer than they did in MySQL. Through EXPLAIN we found that the optimizer did not use the index we created but performed a full table scan. Further analysis also showed that the EXPLAIN results were not deterministic.
After communicating with TiDB's official technical staff, we dropped similar (redundant) indexes and ran ANALYZE TABLE, but the problem persisted. Figure 2 shows the difference in execution results for exactly the same SQL statement. Finally, following the official suggestion, we added a USE INDEX hint to force the optimizer to use the index, and the execution time dropped from 4 minutes to under 1 second, which solved the business requirement for the time being.
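The hint syntax is standard MySQL/TiDB: USE INDEX goes right after the table name. The table and index names below are made up for illustration; the snippet only assembles the hinted SQL string:

```python
def with_index_hint(table, index, where):
    """Build a SELECT that forces a specific index via USE INDEX
    (MySQL/TiDB hint syntax). Table and index names are hypothetical."""
    return f"SELECT * FROM {table} USE INDEX ({index}) WHERE {where}"

sql = with_index_hint(
    "user_feature",
    "idx_user_time",
    "user_id = 42 AND start_time <= '2018-07-01'",
)
print(sql)
```

The trade-off is that the hint pins the plan: if the data distribution changes and a different index becomes better, the hinted query will not adapt, so hints are a workaround rather than a permanent fix.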
At the same time, TiDB's technical staff collected the relevant information and passed it to their R&D team. Throughout the whole process, they cooperated closely, responded promptly, and showed strong professionalism, which greatly reduced the troubleshooting time. We are very grateful.
IV. Outlook
At present, we have built two TiDB clusters with dozens of physical nodes storing over ten billion rows of feature data. Thanks to TiDB's highly available architecture, they have run stably since launch.
As mentioned above, the application of TiDB in our risk control business is just beginning, and the migration of some businesses has yet to be further verified. However, the benefits TiDB brings us are self-evident, and it opens a new horizon for us in data storage and data analysis. In the future, we will continue to increase our investment in TiDB so that it better serves various scenarios, both online and in offline analytics. We also hope to further strengthen communication and cooperation with the PingCAP team, carry out more in-depth application and research, and contribute to the development of TiDB.