After ten years of development, Ele.me's daily order volume has grown from hundreds of thousands to tens of millions. With data growing this fast, how can database teams keep pace with the business while ensuring high availability and data security and improving operation and maintenance efficiency? In this article, Guo Guofei, head of Ele.me's data technology department, shares the answers.
Background of Ele.me's database requirements
Ele.me is a typical example of a startup in a fast-growing industry, where traffic and data volume increase exponentially. With business data growing this rapidly, how do you guarantee database capacity, high availability, data security, and operation and maintenance efficiency?
First of all, you need a good data architecture to ensure sufficient capacity. Secondly, you need to improve database availability, control the data flow, and guarantee data security. Thirdly, you need to carry out large-scale database operation and maintenance efficiently. Finally, you should make full use of the power of Aliyun.
You may have only heard of Ele.me in the last three or four years, but the company has been around for ten years. In 2015, Ele.me had only hundreds of thousands of orders a day; in 2016, as O2O became a hot spot, the order volume reached millions a day, and by the end of 2016 it was eight or nine million daily orders with millions of merchants. From 2017 to now, both orders and waybills have been at a scale of tens of millions per day, with millions of merchants and millions of riders. The whole link is quite deep; we call it "CBD": "C" is the user, "B" is the merchant, and "D" is logistics. The chain is very long, each link operates at a scale of tens of millions, and the accumulated data is therefore also very large.
The evolution of Ele.me's database architecture
So how has Ele.me's database architecture evolved? From 2015 to 2016, when daily orders grew from hundreds of thousands to millions, Ele.me carried out a vertical split of the database. For the core systems we estimated the QPS and TPS they would face; the original architecture could not support that magnitude, so relatively independent modules in the business were split out so that multiple sets of environments could share the access pressure. This structure could support a scale of two to three million orders. By 2016, however, business projections indicated that the architecture would soon hit the same bottleneck again. To solve the capacity problem, we moved to the next stage: horizontal sharding of the core tables. A vertical split cannot break through the bottleneck of the core systems, so the core tables themselves had to be split to spread the pressure. A single logical table might be divided into thousands of small physical tables, so that the load on the core systems could be distributed reasonably across multiple tables in different clusters, providing better underlying scalability for the growth of the business.
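To make the horizontal sharding concrete, here is a minimal sketch of modulo-based routing in Python. The cluster count, table count, and naming scheme are illustrative assumptions, not Ele.me's actual scheme.

```python
# A minimal sketch (not Ele.me's actual implementation) of routing a logical
# "orders" table to one of many physical shards by modulo on the sharding key.
NUM_DB_CLUSTERS = 8       # illustrative number of MySQL clusters
TABLES_PER_CLUSTER = 128  # illustrative number of physical tables per cluster

def route_order(order_id: int) -> tuple[str, str]:
    """Map an order_id to (database cluster, physical table name)."""
    slot = order_id % (NUM_DB_CLUSTERS * TABLES_PER_CLUSTER)
    db_index = slot // TABLES_PER_CLUSTER
    table_index = slot % TABLES_PER_CLUSTER
    return f"order_db_{db_index:02d}", f"orders_{table_index:04d}"

print(route_order(9000000123))  # e.g. ('order_db_04', 'orders_0123')
```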
But was that enough? In 2017, Ele.me again ran into problems at the bottom of its database architecture. Although the databases could be scaled out at the technical level, data center capacity was limited and only so many machines could be added, so a new bottleneck appeared. Therefore, in 2017 Ele.me built a multi-active architecture, so that business growth could be supported across multiple data centers.
In this process, two components are closely related to the database. The first is the self-developed DAL component at the proxy level. It implements database and table sharding: through this component the underlying data can be scaled out and sliced, and reads can be separated from writes. On top of that it also implements connection pooling and connection management, and it can do more to protect the data layer, with mechanisms such as peak shaving, rate limiting, and black/white lists. In addition, the business may need multi-dimensional tables. For Ele.me orders, for example, there is a merchant dimension and a user dimension; when an order is placed, both sides operate on it frequently, so multi-dimensional sharding is needed to satisfy both users and merchants. Of course, the DAL component is somewhat intrusive to the business, since it restricts certain SQL operations.
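The multi-dimensional sharding can be illustrated with a small sketch: the same order is written both to a table sharded by user_id and to a table sharded by merchant_id, so queries from either side stay within one shard. The shard count, table names, and the dual-write approach shown here are assumptions for illustration only.

```python
# Illustrative multi-dimensional sharding: one order, two physical placements.
NUM_SHARDS = 512  # illustrative shard count per dimension

def user_dim_table(user_id: int) -> str:
    return f"orders_by_user_{user_id % NUM_SHARDS:03d}"

def merchant_dim_table(merchant_id: int) -> str:
    return f"orders_by_merchant_{merchant_id % NUM_SHARDS:03d}"

def write_order(order: dict) -> list[str]:
    """Return the physical tables a DAL-style layer would write this order to."""
    return [user_dim_table(order["user_id"]),
            merchant_dim_table(order["merchant_id"])]

print(write_order({"order_id": 1, "user_id": 10086, "merchant_id": 20001}))
```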
The second important component is DRC, which synchronizes data across data centers in the multi-active setup. Its main function is to monitor data changes in one data center, that is, binlog changes, and push them to the peer data center, thereby achieving cross-datacenter data synchronization. It can also do data compression, replay, idempotent processing, and filtered transmission, which gives it great advantages over MySQL's native replication.
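As a rough illustration of the idempotent processing and filtering such a replicator performs, the sketch below consumes row-change events (as a binlog parser might produce) and applies only those newer than what has already been applied. The event format, version field, and filter rules are hypothetical, not DRC's actual protocol.

```python
# Illustrative only: apply row-change events idempotently in the peer data center.
applied_versions: dict[tuple, int] = {}  # (table, primary key) -> last applied version

def should_replicate(event: dict) -> bool:
    # Filter: some tables never cross data centers.
    return event["table"] not in {"local_only_config"}

def apply_event(event: dict) -> bool:
    """Apply a change only if it is newer than what we already have (idempotent)."""
    key = (event["table"], event["pk"])
    if applied_versions.get(key, -1) >= event["version"]:
        return False  # duplicate or stale event: safe to drop on redelivery
    applied_versions[key] = event["version"]
    # ... here the corresponding INSERT/UPDATE/DELETE would run on the local MySQL ...
    return True

events = [{"table": "orders", "pk": 1, "version": 3},
          {"table": "orders", "pk": 1, "version": 3}]  # duplicate delivery
print([apply_event(e) for e in events if should_replicate(e)])  # [True, False]
```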
Once the capacity bottleneck is solved, the next need is to improve the overall availability of the site and the database. This can be approached at several levels. First, any machine can fail, so a good HA scheme is necessary. Ele.me uses the open-source MHA, with its own encapsulation and modifications on top, so that when the primary database goes down the switchover happens quickly, batch management is possible, and message communication between the various components is handled (we later called this EMHA). This solves the availability problems that arise when a single machine fails. The second way to improve availability is to split the core business into multiple shards: when a single shard fails, the impact on the business is only 1/N, which raises overall availability. In addition, availability at the data center or regional level is achieved through remote multi-active deployment, which also provides online support for large maintenance operations.
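The "1/N impact" point can be made concrete with a toy calculation: if the core data is split across N independent shards and one shard fails, only the traffic routed to that shard is affected. The shard counts below are arbitrary examples.

```python
def affected_fraction(num_shards: int, failed_shards: int = 1) -> float:
    """Fraction of requests hit when `failed_shards` out of `num_shards` are down."""
    return failed_shards / num_shards

for n in (1, 8, 64):
    print(f"{n:>3} shards -> {affected_fraction(n):.2%} of requests affected by one shard outage")
```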
Let's look at the data flow. First of all, it needs to be controllable: incoming data must be risk-free, and the underlying environment must be protected when the data flow becomes abnormally large. This can be done in the DAL component, including rejecting risky SQL statements. When data comes in, there also need to be rules and specifications for how it lands, so that it can be handled efficiently. Resources need to be isolated as well, so that a problem in one service cannot cause problems in other services. Moreover, after data processing is completed, the data needs to be pushed promptly to other consumers, search and big data being typical examples. For big data, extracting data for reports used to have poor timeliness, but with the message subscription provided by the DRC component, changes in business data can be seen with a delay of minutes or even seconds. The last point is that the landed data often needs to be shared with other environments, for example imported into development or test; in that case data export specifications are needed, and sensitive content must be desensitized and filtered.
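One of the controls mentioned above, rejecting risky SQL at the DAL layer, could look roughly like the following check. The rules here (no full-table UPDATE/DELETE, no DROP/TRUNCATE through the online path) are only examples, not Ele.me's actual rule set.

```python
# Illustrative proxy-level check for obviously risky SQL.
import re

def is_risky(sql: str) -> bool:
    s = sql.strip().lower()
    # Full-table UPDATE/DELETE without a WHERE clause.
    if re.match(r"^(update|delete)\b", s) and " where " not in f" {s} ":
        return True
    # Dropping or truncating tables through the online query path.
    if re.match(r"^(drop|truncate)\b", s):
        return True
    return False

print(is_risky("DELETE FROM orders"))                       # True: no WHERE
print(is_risky("UPDATE orders SET state = 2 WHERE id = 42"))  # False
```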
Once capacity, availability, and data flow are taken care of, you will see many components emerging. With data at this scale, if there are no corresponding process specifications and no tools for operation and maintenance efficiency, it is very difficult to maintain the databases by hand. Here I have extracted some key points you will run into frequently when operating databases. The first is SQL governance: before tables are created and SQL is written, there needs to be an automatic audit mechanism to ensure compliance with the rules and with efficient design principles, and once SQL reaches the production environment there must be a corresponding SQL tracking and monitoring mechanism so that problems are detected in time. The second is database changes and releases: at our busiest there may be hundreds of DDL table changes a day, which in the early days put a lot of pressure on DBAs. Later we built a platform for self-service releases by developers, so DBAs no longer need to take part in the release process; instead they need to ensure the stability of the platform and keep release risks under control.

The third is hot and cold data separation: storing all data on production hardware is expensive, so hot and cold data need to be separated, keeping frequently accessed data in the production environment and archiving infrequently accessed data onto cheaper storage, which preserves production efficiency and lowers costs; this, too, is now self-service on our platform. To ensure data security, Ele.me has built a data backup and recovery system, including automatic backup, flashback, automatic backup verification, and other functions. Finally, data migration is needed frequently, so we developed the d-bus tool: DBAs only need to configure rules and the system moves the data automatically.
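As an illustration of the hot/cold separation idea, the sketch below moves orders older than a retention window out of a "hot" store into an archive. A real implementation would run in batches against MySQL and the archive storage; the in-memory lists and the 90-day window here are purely illustrative.

```python
# Self-contained sketch of hot/cold separation: old rows leave the hot store.
import datetime

RETENTION_DAYS = 90  # illustrative retention window

def archive_old_orders(hot: list[dict], archive: list[dict]) -> int:
    cutoff = datetime.date.today() - datetime.timedelta(days=RETENTION_DAYS)
    cold = [o for o in hot if o["created_at"] < cutoff]
    archive.extend(cold)                                      # write to cheap storage first
    hot[:] = [o for o in hot if o["created_at"] >= cutoff]    # then trim the production store
    return len(cold)

hot_orders = [{"id": 1, "created_at": datetime.date(2017, 1, 1)},
              {"id": 2, "created_at": datetime.date.today()}]
archive_store: list[dict] = []
print(archive_old_orders(hot_orders, archive_store))  # 1 order archived
```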
Problems that remain after the database architecture evolution
Those are some of the key things Ele.me does with its data. In fact, even with the above problems solved, many issues remain. First of all, the cost of taking components like DAL and DRC from design and development to mature operation is very high, in terms of resources, people, and validation cycles. Overall, operational efficiency and utilization are still not high: resources have to be added at the business peak but cannot be withdrawn in the trough, so utilization is very low during off-peak hours and elasticity is poor. In addition, new technology iterates very quickly, while Ele.me does not have a scale effect in technology, and some technical problems may be too expensive to solve on its own. What Ele.me wants is resources that are available at any time, elastic and scalable, backed by a rich ecosystem, simple to maintain, and products with the corresponding scale effect that have been validated at large scale.
Ele.me has become a heavy user of Aliyun. First of all, our development and test environments are entirely on Aliyun, because requests for them come up quickly but their life cycles are very short. Secondly, Ele.me's business peaks at noon and in the evening, so the demand for resources is high at peak times but relatively flat afterwards; Ele.me therefore hopes to use cloud capacity for elastic scaling up and down and to improve resource utilization. Thirdly, a large part of our current multi-active architecture is also implemented in Aliyun data centers. The advantage is that traffic on Aliyun can be adjusted gradually, without investing a lot of resources at the beginning; once things are stable, the cloud can gradually carry the main traffic and eventually become the primary node. Finally, although Ele.me has built many components and products itself, they are really only a subset of Aliyun's products, and the complete set of solutions can be found on Aliyun. In a fast-growing industry, most of a company's technology exists to serve the business; technology must not become an obstacle that constrains business development, but should quickly meet the needs of the business and push it forward, so a mature cloud solution should be a cost-effective choice.