This article was first published on the WeChat official account “Shopee Technical Team”.

Abstract

The Shopee supply chain covers the scenarios that follow a seller handing over goods: picking in the warehouse, handing parcels to third-party logistics providers (hereinafter referred to as 3PL), and final delivery to the buyer. The Logistics Channel Service project (hereinafter referred to as LCS) is the link between the inside and outside of this fulfillment chain, consolidating external interactions such as placing orders with 3PLs and receiving logistics tracking updates.

Since September 2019, with the splitting and migration of upstream systems, more and more 3PLs have come under the responsibility of LCS. In addition, with the rapid growth of the e-commerce business in 2020, the system's monthly order volume has risen quickly, and the demands and pressure on the database have risen with it. In just one year, the database architecture has been adjusted several times to keep up with business growth.

This article briefly reviews how the database architecture has changed since LCS was created, introduces the database sharding work carried out by the project team at the end of 2020, and summarizes the pitfalls encountered, the lessons learned, and our reflections.

In practicing the new scheme, we ran into both the common problems of splitting databases and tables and some problems specific to this project's scenario. There is no perfect solution to them; it is rather a process of trade-offs, in which the most favorable approach has to be chosen according to the characteristics of the business.

1. Background

1.1 Project Origin

LCS is a project based on the Python Django framework. Its core business is the fulfillment process of logistics orders, including order creation and the track and freight updates that connect upstream systems with third-party logistics services. In terms of deployment, LCS is deployed per market, and the application layer in each market uses that market's own database. At the beginning of the project, the databases of these markets shared the same physical cluster, memory, and disk space, which was sufficient for the initial traffic.
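
As an illustration, the per-market database setup can be pictured roughly as a Django multi-database configuration like the sketch below; the alias names, market codes, hosts, and credentials are hypothetical, not the project's actual settings.

```python
# settings.py -- hypothetical sketch of per-market logical databases.
# Each market (e.g. "sg", "id") gets its own database alias; in the early
# setup all aliases pointed at the same shared physical MySQL cluster.
DATABASES = {
    "default": {},  # unused; every query goes through an explicit market alias
    "lcs_sg": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "lcs_sg",
        "HOST": "shared-cluster.internal",  # same physical cluster initially
        "PORT": 3306,
        "USER": "lcs",
        "PASSWORD": "***",
    },
    "lcs_id": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "lcs_id",
        "HOST": "shared-cluster.internal",
        "PORT": 3306,
        "USER": "lcs",
        "PASSWORD": "***",
    },
}

# Queries then pick the market's database explicitly, e.g.:
# Order.objects.using(f"lcs_{market}").filter(order_id=order_id)
```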

As the service grew, the problems of the shared cluster gradually surfaced. Business volume differs from region to region, and so do the order volume and the ratio of tracking updates to orders; resource scheduling at the level of a single physical cluster could not meet our needs. For example:

  • Database reads and writes in a region with a large order volume affect I/O resources in a region with a small order volume.
  • The QPS of track pushes is much higher than that of order placement, and the large volume of track reads and writes affects the core ordering flow.

To minimize the impact on services, we split the databases of the hotspot regions out into a separate physical cluster with dedicated resources, ensuring the stability of services in the hotspot regions and reducing the impact on other regions. Because different markets use different logical databases, the switch required no business-side changes and was completed very quickly with the help of the DBA team.

Facing the high QPS of track pushes, we used temporary tables and message queues to defer the work of parsing a track and binding it to the corresponding order, which significantly reduced the response time the third party sees when calling us. The problem is that the temporary tables lived in the business MySQL database, and although track updates were handled asynchronously, the peak pressure on MySQL was not reduced because there was no rate limiting.
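
The deferred handling described above can be sketched roughly as follows, assuming a Redis list as the queue; the queue name, payload fields, and helper functions are hypothetical placeholders rather than the project's actual implementation.

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379)

TRACK_QUEUE = "lcs:track:raw"  # hypothetical queue name


def find_order_by_tracking_no(tracking_no: str):
    """Placeholder for the real lookup against the business database."""
    ...


def append_track_events(order, events) -> None:
    """Placeholder for persisting parsed track events onto the order."""
    ...


def receive_track_push(payload: dict) -> None:
    """Called in the 3PL-facing API view: store the raw payload and return fast."""
    # Store the untouched payload instead of parsing and binding synchronously.
    r.rpush(TRACK_QUEUE, json.dumps(payload))


def track_worker() -> None:
    """Background consumer: parse the raw track and bind it to its order."""
    while True:
        _, raw = r.blpop(TRACK_QUEUE)
        payload = json.loads(raw)
        order = find_order_by_tracking_no(payload["tracking_no"])  # hypothetical field
        append_track_events(order, payload["events"])
```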

Therefore, after the earlier physical cluster separation, we adopted TiDB to store the temporary track data. First, this separates the temporary data from the business data. Second, the temporary tables are partitioned by day; with MySQL, a single machine is hard to scale, and the historical data would have to be cleaned regularly, producing a large number of DELETE operations that hurt performance. Moreover, the TiDB cluster available to the product line offers large capacity and easy scalability, and subsequent cleanup is handled by the DBA with DROP TABLE after a retention period. The development team considers TiDB a better fit than MySQL in this scenario.
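
A minimal sketch of the day-partitioned temporary tables, assuming a generic DB-API cursor against the TiDB cluster; the table naming convention and columns are hypothetical, and the retention-based DROP TABLE cleanup is performed by the DBA as described above.

```python
from datetime import date
from typing import Optional


def tmp_track_table(d: date) -> str:
    """One temporary track table per day, e.g. tmp_track_20201212 (name is hypothetical)."""
    return f"tmp_track_{d:%Y%m%d}"


def insert_raw_track(cursor, payload_json: str, today: Optional[date] = None) -> None:
    """Write the raw payload into today's temporary table in the TiDB cluster."""
    table = tmp_track_table(today or date.today())
    cursor.execute(
        f"INSERT INTO {table} (payload, created_at) VALUES (%s, NOW())",
        (payload_json,),
    )


# Expired data is not cleaned row by row with DELETE; after the retention
# window the DBA simply drops the whole table, e.g.:
#   DROP TABLE tmp_track_20201101;
```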

1.2 Challenges

Despite these two rounds of splitting and adjustment, order volume kept rising. Logistics orders also carry a great deal of miscellaneous information: the shipping addresses, names, and contact details that upstream systems initially accepted without any length restrictions, the detailed product information needed by the cross-border transportation business, and the descriptive text of logistics tracks. As a result, each order contains a large amount of text data, and the size of the database tables tends to jump.

According to statistics from the second half of 2020, within a very short time after launch the disk usage of the MySQL cluster in the region with the largest business volume had already exceeded 30%, and the remaining space could support only about another half a year. As the share of upstream traffic gradually released (grayscaled) onto the service rose, the monthly increment grew even larger. After the grayscale rollout finished, the major promotions of the second half of the year (9.9, 11.11, and 12.12) pushed the data volume up further, and there was no way to keep the cluster's disk space viable through routine data cleaning and table compression alone.

2. Architecture evolution

During these cluster adjustments, the DBA team repeatedly stressed the problems of fast data growth and disk space planning and usage. The product team also raised operations-related requirements: future business scenarios need historical logistics order data to be retained. This pushed the R&D team to come up with a solution as quickly as possible to cope with a larger order volume and the new product requirements.

2.1 Design Objectives

Under the current design, each region uses a single database. Because the data volume of that single database is skyrocketing, it has to be split within each region. We expect the new architecture to:

  • Carry the data storage needs for a long time to come. Hardware resources are limited, changes are often not easy to implement, and re-architecting is painful for DBAs and business developers alike.
  • Be equally stable and reliable. For a project on the core business link, stability is of paramount importance.
  • Retain reasonable scalability. Re-architecting is difficult, but that does not mean horizontal scaling that is transparent to the business cannot be done within the new architecture.
  • Facilitate data archiving. This is a requirement from the product team, and the data is also a core input for future business design.

2.2 Scheme Design

2.2.1 Sharding model

In the original design, we expected the new sharding model to support data archiving easily, for example moving historical data onto archive machines at a coarse granularity. So one of the first ideas was to shard the database by month. Under the current single-database design, where all data falls into different hashed tables, fishing out historical data for transfer is cumbersome:

  • The DBA team does not touch or maintain the data inside the databases, so the business side would have to do the archiving itself;
  • Reading out and migrating the data of a specified time range means row-level operations against the B+ trees of every hashed table, producing a large number of page merges that hurt online performance.

The point of monthly sharding, then, is to keep the data that can be backed up and archived together. Whether the business side migrates it itself or the DBA operates at the database level, the operational cost is reduced.

However, there is no perfect solution. If the data is split by month, all of a month's data falls into the same database, which inevitably causes problems such as uneven data volume across databases and concentration of hotspot data. And as the monthly data volume grows, dealing with the size of a single database in the future, for example hashing one month's data across several databases, makes the structure even more complex.

The second, simpler scheme is to hash the data across databases from the start, so that it falls evenly into the different shards. But it shares the archiving problem of the single database: because tables are split by hash and data is spread evenly, there is no coarse granularity for locating historical data, and it has to be migrated bit by bit by the application layer.

At this point we need to weigh the two schemes against each other and choose the one that better suits current and future business:

  • Supporting future business: both schemes reduce the data volume of a single database, but as the business keeps growing, the volume of a single database will again become hard to control; the hash scheme at least keeps the distribution uniform.
  • Stability: in deployment and day-to-day use the two schemes are similar, but sharding by time may require manual maintenance by the DBA team, for example preparing the databases for future months. We prefer the scheme with fewer manual operations, and a hash-sharded setup is more reliable once built and delivered.
  • Scalability: a time-based scheme is hard to split further; evolving it into time + hash later, for example, would significantly increase complexity. With hashing, the modulo base can simply be increased (data may need to be migrated), and the business changes are smaller and easier to implement.
  • Data archiving: hash sharding handles archived data poorly. The granularity of backup and migration is too fine, it adds application-layer overhead, it introduces a large number of row-level DELETE operations, and the uncontrolled resource usage would affect online services. Time-based sharding, with DBA support, makes data archiving much easier and greatly reduces the operational difficulty.

Weighing all of these aspects, we finally chose the hash sharding scheme: by giving up convenient data archiving, we hope to make the business application more stable and reliable in the future.
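
To make the trade-off concrete, the two routing ideas can be contrasted with a small illustrative sketch; the shard count and database naming are hypothetical.

```python
from datetime import datetime

HASH_SHARDS = 16  # hypothetical shard count


def month_shard(create_time: datetime) -> str:
    """Time-based scheme: all orders of a month land in the same database."""
    return f"lcs_{create_time:%Y%m}"


def hash_shard(order_id: int) -> str:
    """Hash scheme: orders spread evenly; archiving must later pick rows out one by one."""
    return f"lcs_shard_{order_id % HASH_SHARDS:02d}"
```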

2.2.2 Application Deployment Architecture Splitting

After settling on the sharding model, we hit another problem: resource limits. We wanted to build multiple physical clusters for each regional market so that every hash shard could have its own machines, but the machine rooms could not currently supply that many. With a single MySQL cluster consisting of 1 master and s slaves, n regional markets with m hash shards each would require (1 master + s slaves) × n × m physical machines, which is very expensive. Therefore the DBA team could only deliver n clusters to the business team; that is, each of the n regional markets gets an independent cluster, and the m hash shards within a market share that cluster.

Due to the characteristics of Python applications, each Python process holds its own connection pool, so the number of database connections held by the business side is:

  • Number of single-process connections * Number of processes * Number of machines

However, because of the machine resource constraint, the m hash shards land on the same physical cluster, so for the master of that cluster the upper bound on the number of connections becomes:

  • Number of single-process connections * Number of processes * Number of machines * Number of databases
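
With purely illustrative numbers, the multiplication shows why sharing one physical cluster multiplies the connection total by the shard count:

```python
# All numbers below are illustrative, not production values.
conns_per_process = 2
processes_per_machine = 16
machines = 50
databases = 8  # m hash shards sharing one physical cluster

before = conns_per_process * processes_per_machine * machines             # single logical DB
after = conns_per_process * processes_per_machine * machines * databases  # per master of the shared cluster

print(before, after)  # 1600 vs 12800 connections hitting the same master
```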

At the current traffic level, keeping the existing architecture and connecting it directly to the new sharded cluster would exceed the maximum number of connections configured on the machines, and such a large number of connections is itself harmful to the service. The business side therefore has to reduce the peak connection count, otherwise the new database cluster cannot be used.

The fix can start from the formula above: lowering any one of the four factors will produce the desired effect. Among them:

  • The number of databases is derived from business data so that the volume of a single database stays below a certain threshold in the future; it cannot be reduced further.
  • The number of processes and the connections per pool are sized for the request volume; reducing them degrades the request-handling capacity of a single machine.

So we turned to the number of machines. The services handling order placement and those handling track push updates were not separated, and track traffic is several times the order traffic, so a large fleet of machines was serving both. Track traffic, however, is temporarily stored in TiDB and does not touch MySQL. If the services are deployed separately, with the old service handling only orders (and accessing MySQL) and the new service handling only track updates (and accessing TiDB), the number of machines that need to connect to MySQL can be greatly reduced. According to online statistics, track requests are 9 times as numerous as order placement requests. Taking this ratio into account, plus a margin of redundant machines to protect the core ordering flow, we split the different types of interfaces into separate services at the Nginx proxy layer.

After the split, the number of machines handling orders and related services was reduced by 25%, and the number of database connections accordingly dropped by a quarter.

Of course, there are many reasons for the high connection count, and many possible solutions: rewriting the project in Golang (reducing the number of server processes), building an intermediate service between the application layer and the database cluster to manage connections centrally, using sharding middleware, and so on. For LCS, the cost of a rewrite is relatively high, and the other solutions would need to be designed and promoted separately; splitting the application deployment is probably the most reliable and easiest one to implement.

2.3 Service Transformation

After the database architecture was confirmed, the application layer needed to make the following adjustments:

  • Sharding rules;
  • Data routing in the application layer;
  • Compensation for cross-database operations.

2.3.1 Sharding basis and routing rules

To avoid data migration and let the new database architecture go live smoothly under a grayscale rollout, we delimited a set of tables to be moved into the shards: new data in these tables is read and written on the new sharded cluster, while old data continues to be read and written in the original database.

The business team combed through the read/write patterns of the tables in scope: reads and writes are keyed by order ID, so the order ID is the natural sharding key. Since database routing is managed by the application layer, we only need to adjust the order ID slightly and embed some flag information so that the application layer can tell (a sketch follows the list below):

  • Whether an order ID belongs to the original database or to the new hash shards;
  • Which hash shard the order ID falls into.
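
A minimal sketch of what such flag-based routing might look like; the suffix encoding, shard count, and database aliases are hypothetical, not the project's actual order ID format.

```python
HASH_SHARDS = 16          # hypothetical shard count
NEW_DB_SUFFIX = "N"       # hypothetical flag appended when generating new order IDs


def route_order(order_id: str) -> str:
    """Map an order ID to a database alias.

    The flag encoding here is illustrative only; the real project embeds its
    own marker when generating order IDs for the new sharded cluster.
    """
    if not order_id.endswith(NEW_DB_SUFFIX):
        return "lcs_legacy"                       # old orders stay in the original database
    shard = int(order_id[:-1]) % HASH_SHARDS      # numeric part decides the shard
    return f"lcs_shard_{shard:02d}"
```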

Adjusting the database routing rules in the framework was fiddly, including rewriting the manager classes of the tables involved; because many tables were affected, a large amount of code was changed, which we will not describe in detail.

2.3.2 Cross-database consistency compensation

Because only part of the tables were moved into the shards, operations that touch both those tables and tables in the original database are no longer covered by a single transaction; the database alone cannot keep them consistent. MySQL offers XA cross-database transactions with strong consistency, but in practice XA carries a high performance cost and requires developers to have the relevant knowledge to operate it correctly.

The application layer's final solution is therefore to add a checking mechanism, so that operations on tables in different databases can be restored (rolled back or cleaned up) to a consistent state by an asynchronous checking task.

For example, in the order placement flow below, the following tables sit in different databases but need to stay strictly consistent:

  • Order table: in the shard, keyed by order ID and storing the 3PL tracking number (order information, including the 3PL tracking number, is queried by order ID);
  • Reverse lookup table: in the old database, mapping the 3PL tracking number to the order ID (the order ID is queried by 3PL tracking number).

Since sharding is keyed by order ID, the reverse lookup table's access pattern (query by 3PL tracking number) means it cannot live in the shards: only queries that carry an order ID can be routed correctly in the sharded cluster; otherwise every shard would have to be scanned to find the data.

To keep the reverse lookup table consistent with the order table, we added a Checker table to the old database, so that the Checker and the reverse lookup table can rely on local transactions: whenever the reverse lookup table needs to change, a corresponding Checker record is written.

After that operation completes, the application layer goes on to operate on the order table. If that step fails, the order table and the reverse lookup table become inconsistent. The application layer runs an asynchronous task periodically, reads the Checker records, and verifies whether the order table and the reverse lookup table agree; if not, it fixes the reverse lookup table records to restore consistency.

In this flow, the mapping and order_info data sit in different databases. A single-database transaction guarantees that the mapping data and the check_log data exist together. If the subsequent order_info write fails, the background task later picks up the check_log after a delay, compares the mapping with the order_info data, and decides whether to clean up the mapping.
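
A condensed sketch of this compensation flow, assuming hypothetical Django models (Mapping, CheckLog, OrderInfo), the route_order helper from the earlier routing sketch, and an illustrative check delay; the real task and cleanup rules are more involved.

```python
from datetime import timedelta

from django.db import transaction
from django.utils import timezone


def create_order(order_id: str, tracking_no: str) -> None:
    # Step 1: in the old database, write the reverse-lookup mapping together
    # with its checker record inside one local transaction.
    with transaction.atomic(using="lcs_legacy"):
        Mapping.objects.using("lcs_legacy").create(
            tracking_no=tracking_no, order_id=order_id
        )
        CheckLog.objects.using("lcs_legacy").create(
            order_id=order_id, tracking_no=tracking_no, status="PENDING"
        )

    # Step 2: write the order into its hash shard. If this write fails, the
    # two databases are temporarily inconsistent until the check task runs.
    OrderInfo.objects.using(route_order(order_id)).create(
        order_id=order_id, tracking_no=tracking_no
    )


def consistency_check_task() -> None:
    """Periodic task: pick up check logs after a delay and repair mismatches."""
    deadline = timezone.now() - timedelta(minutes=10)  # illustrative delay
    pending = CheckLog.objects.using("lcs_legacy").filter(
        status="PENDING", created_at__lt=deadline
    )
    for log in pending:
        shard = route_order(log.order_id)
        exists = OrderInfo.objects.using(shard).filter(order_id=log.order_id).exists()
        if not exists:
            # The order write never landed: clean the mapping up so the two
            # databases return to a consistent state.
            Mapping.objects.using("lcs_legacy").filter(order_id=log.order_id).delete()
        log.status = "DONE"
        log.save(using="lcs_legacy")
```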

The Checker plays a role similar to InnoDB's undo log: if the overall operation fails, the Checker drives a rollback-like cleanup that ensures eventual consistency.

Compensating for consistency in the application layer was one of several schemes proposed at the outset. Its drawback is that consistency is only restored after the background task runs, so the tables involved may be temporarily inconsistent in failure cases. Placing a copy of the reverse lookup table in every shard would have preserved the use of local transactions, but because its query condition differs from the sharding key, queries would then have to scan all shards. These choices, including not using XA transactions, are the result of trade-offs between business scenarios and practical costs.

2.4 Maintenance and Outlook

2.4.1 Data Archiving

As mentioned above, under this design it is hard to archive data by time: specific rows in the tables have to be picked out, and the heavy UPDATE and DELETE traffic during archiving causes disk I/O that affects online performance.

The current business only needs to retain the last 6 months of data, so we filter out orders created more than 6 months ago, write them to the archive, and delete them from the business database. When designing the background archiving task, the R&D team tried to avoid migrating data in bursts; in theory, the system stays balanced as long as the archiving rate matches the order-write QPS. The task is therefore scheduled at different intervals according to how busy the database is, keeping the pressure on it within an acceptable range.
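
A simplified sketch of such a throttled archiving job; the batch size, pause, model, field names, and archive sink are all assumptions rather than the project's actual parameters.

```python
import time
from datetime import timedelta

from django.utils import timezone

BATCH_SIZE = 500      # illustrative values, tuned per cluster in practice
PAUSE_SECONDS = 1.0


def write_to_archive(rows) -> None:
    """Placeholder: write the batch to whatever archive storage is used."""
    ...


def archive_old_orders(db_alias: str) -> None:
    """Move orders older than 6 months out of one shard in small, throttled batches."""
    cutoff = timezone.now() - timedelta(days=180)
    while True:
        batch = list(
            OrderInfo.objects.using(db_alias)
            .filter(create_time__lt=cutoff)
            .order_by("id")[:BATCH_SIZE]
        )
        if not batch:
            break
        write_to_archive(batch)
        OrderInfo.objects.using(db_alias).filter(
            id__in=[row.id for row in batch]
        ).delete()
        time.sleep(PAUSE_SECONDS)  # spread the load so online traffic is not hurt
```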

2.4.2 Expanding Capacity without Migration (Rebalance)

Usually, when a database is scaled out, existing data has to be migrated so that it is evenly distributed across the new databases. With archiving in place, however, we find that if enough time is reserved, logically expanding the number of databases does not require data migration at all.

Take expanding from 16 to 64 databases as an example. By changing the order ID rule, new orders are routed with % 64 while old orders continue to use % 16. The data growth rate of each database then becomes 1/64 of the total order growth, i.e. a quarter of the previous 1/16. With growth slowed down and the archiving rate unchanged, the data volume in the original 16 databases gradually shrinks from 1/16 of the total towards 1/64, while the 48 new databases grow from 0 towards 1/64.
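
The expansion can be expressed as a versioned routing rule, sketched below with a hypothetical rule-version flag embedded in the order ID; the real encoding and database aliases differ.

```python
OLD_SHARDS = 16
NEW_SHARDS = 64


def shard_for(order_id: str) -> str:
    """Route each order by the sharding rule in force when it was created.

    The rule version is assumed here to be the last character of the order ID
    ("A" for the %16 era, "B" for the %64 era); the real encoding differs.
    """
    numeric = int(order_id[:-1])
    if order_id.endswith("B"):
        return f"lcs_shard_{numeric % NEW_SHARDS:02d}"
    return f"lcs_shard_{numeric % OLD_SHARDS:02d}"
```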

With this migration-free expansion (rebalance), we avoid the complexity and uncertainty of a double-write scheme and the manual steps needed to verify double-write correctness. The cost is that expansion does not immediately relieve the original databases: a large amount of old-order traffic still hits them, and the pressure only eases gradually over time. With sound monitoring and normal business growth, we can reserve enough time for expansion and let the data rebalance slowly.

2.4.3 Future planning

The current database architecture has been running stably for more than half a year. However, constrained by the original design scenarios and external conditions, it still has shortcomings; in particular, the management and use of historical data remains weak. In the past half year the Shopee supply chain has set up a dedicated Data team, and as the data warehouse matures, support for historical queries and OLAP workloads is becoming much more accessible.

In the future, LCS data can be moved into the data warehouse via ETL or other means, so that raw and aggregated data is served to products from there instead of from the business system; once the ETL link lands, the business system's own migration and archiving functions can be retired. We also hope that, by cooperating with the Data team, we can obtain more accurate analysis for various OLAP needs, such as assessing the timeliness and reliability of third-party logistics, so as to provide sellers with better logistics fulfillment services.

3. Summary

There is no perfect solution; every architectural adjustment is the result of comprehensive consideration and trade-offs. This article has walked through the important milestones in the evolution of the database architecture since LCS went live a year ago, as well as the practice, pitfalls, and reflections from the sharding work at the end of 2020.

For the sharding model, we compared the candidates along the following dimensions to see how well each fits the current business. Hash sharding is better at absorbing the pressure of continuously growing data, and giving up coarse-grained (database/table level) data migration and backup is the concession made in this round of redesign.

  • Hotspot data distribution: time-based sharding concentrates data in the peak season; hash sharding spreads it evenly throughout the year.
  • Stability/maintainability: time-based sharding relies on manual or scheduled scripts to create new databases; hash sharding has a low maintenance cost when no expansion is needed.
  • Scalability: time-based sharding can only narrow the partition interval and bring up more databases; hash sharding needs no data rebalancing for expansion in the current business scenario.
  • Data backup and archiving: time-based sharding allows time-grained migration at the database/table level; hash sharding must iterate over specific rows and migrate them out gradually.

As for cross-database consistency, considering the expected rate of exceptions and the business's tolerance for how long inconsistent data may exist, we chose a lighter solution than XA strong-consistency transactions. The periodic background tasks that maintain eventual consistency between tables in different databases are easy to implement at the business level, require only basic transaction support from the database, and comfortably meet the needs of the current scenarios.

When designing a new project's architecture, people usually want a system that will stay stable and usable for the next few years. But, constrained by various factors, some schemes are ruled out either because their initial cost is high (even though they would reduce future change) or because they are too radical and uncontrollable, which would affect the difficulty of final implementation and delivery.

Architecture design is a task grounded in the present that considers the future and balances cost. Many aspects constrain each other, and there may be no single optimal solution. An architecture whose weaknesses can be compensated for when they surface is more conducive to rapid business growth.

About the author

Jiekun is a back-end engineer on the Shopee Supply Chain team, interested in caching and KV storage.

Join us

The Shopee logistics service team integrates and standardizes a large number of logistics channels and sites across Southeast Asia and Latin America, builds a logistics model in which routing links can be freely assembled, and refines products on that basis, forging a logistics network covering Southeast Asia and Latin America.

Interested students can send their resume to: [email protected] (email subject: Shopee Logistics Services – from tech blog).