The hottest topic around Double 11 is Taobao. Recently I happened to talk with a friend at Alibaba about Taobao's technical architecture and found a lot of interesting points. Here is a summary of that analysis:

The technical architecture of Taobao's massive data products

One of the defining characteristics of data products is that data is not written in real time. Because of this, we can treat the whole system's data as read-only within a given time window, which provides a very important foundation for cache design.

 

Figure 1. Technical architecture of Taobao's massive data products

Following the flow of data, we divide the technical architecture of Taobao's data products into five layers (as shown in Figure 1): the data source, the computing layer, the storage layer, the query layer, and the product layer. At the top of the architecture is the data source layer, which contains the databases of users, shops, items, and transactions on Taobao's main site, as well as behavior logs such as users' browsing and searching. This series of data is the raw lifeblood of the data products.

Data generated in real time at the data source layer is transmitted in near real time to a 1,500-node Hadoop cluster through DataX, DbSync, and TimeTunnel, data transmission components developed in-house at Taobao. This cluster, called "Ladder", is the main body of the computing layer. On Ladder, roughly 40,000 jobs a day run different MapReduce computations over 1.5 PB of raw data according to product requirements, and these computations are usually finished by 2 a.m. Compared with what the front-end products show, the results are typically intermediate results, striking a deliberate balance between data redundancy and front-end computation.

It should be mentioned that for data with strict timeliness requirements, such as statistics on search terms, we want to push results to the front end of the data products as soon as possible. For this reason we built a real-time streaming computing platform called "Galaxy". Galaxy is also a distributed system: it receives real-time messages from TimeTunnel, performs real-time computation in memory, and flushes the results to NoSQL storage in the shortest possible time for the front-end products to call.

It is easy to see that neither Ladder nor Galaxy is suited to serving real-time data queries directly to the products. Ladder is positioned purely for offline computing and cannot meet high-performance, high-concurrency requirements; as for Galaxy, even though all of its code is in our own hands, integrating data reception, real-time computation, storage, and query into a single distributed system inevitably calls for layering anyway, which led us back to the current architecture.

To this end, we designed a dedicated storage layer for the front-end products. In this layer we have MyFOX, a distributed relational database cluster based on MySQL, and Prom, a NoSQL storage cluster based on HBase. Later in this article I will focus on the implementation principles of these two clusters. In addition, other third-party modules also fall within the scope of the storage layer.

The growing number of heterogeneous modules in the storage layer poses challenges for the front-end products that use them. To address this, we designed a general-purpose data middle layer, Glider, to shield the front end from this heterogeneity. Glider exposes RESTful interfaces over HTTP, so a data product can obtain the data it wants from a single, unique URL.
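
To make the idea concrete, here is a minimal sketch of how a front-end product might pull data from a RESTful endpoint of this kind. The host name, path, and query parameters are invented for illustration; the article does not document Glider's actual URL scheme.

```python
# Hypothetical sketch: fetching one report from a Glider-style RESTful
# interface over HTTP. Host, path, and parameters are assumptions.
import json
import urllib.parse
import urllib.request

def fetch_report(datasource: str, **params) -> dict:
    """Fetch one report from a (hypothetical) data middle-layer endpoint."""
    query = urllib.parse.urlencode(params)
    url = f"http://glider.example.com/{datasource}?{query}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

# A unique URL identifies the data a product wants, e.g.:
# report = fetch_report("hot-items", date="2011-06-01", category=50012164)
```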

The above is a general introduction to the technical architecture of Taobao's massive data products. Next, I will focus on four aspects of the Data Cube's design.

Relational databases are still king

Relational databases (RDBMS) have been widely used in industry since the model was proposed in the 1970s. More than 30 years of rapid development have produced a number of excellent database products, such as Oracle, MySQL, DB2, Sybase, and SQL Server.


Figure 2. Data growth curve in MyFOX

Although relational databases are at a disadvantage compared with non-relational databases in terms of tolerance to network partitions, their strong semantic expressiveness and their ability to express relationships between data still give them an irreplaceable role in data products.

Taobao's data products use MySQL's MyISAM engine as the underlying data storage engine. On top of this, to cope with massive data, we designed MyFOX, a query proxy layer for a distributed MySQL cluster that makes the partitioning transparent to the front-end applications.


Figure 3. Data query process in MyFOX

At present, the statistical data stored in MyFOX has reached 10 TB, accounting for more than 95% of the Data Cube's total data volume, and it is still increasing by more than 600 million per day (see Figure 2). The data is distributed almost evenly across 20 MySQL nodes and is served transparently by MyFOX at query time (as shown in Figure 3).
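
The following is a conceptual sketch of what such a query proxy does, not MyFOX's actual implementation: the same statistical SQL is fanned out to every MySQL shard and the partial results are merged before being returned, so the partitioning stays invisible to the caller. The node names and the simple concatenation merge are illustrative only.

```python
# Sketch of a MyFOX-style query proxy: fan out, then merge.
from concurrent.futures import ThreadPoolExecutor

SHARDS = [f"mysql-node-{i:02d}" for i in range(20)]   # hypothetical node names

def query_shard(node: str, sql: str) -> list[tuple]:
    """Placeholder for executing `sql` on one MySQL node and returning rows."""
    raise NotImplementedError("wire this to a real MySQL driver")

def query(sql: str) -> list[tuple]:
    """Fan out to all shards in parallel and concatenate the partial rows."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda node: query_shard(node, sql), SHARDS)
    merged: list[tuple] = []
    for rows in partials:
        merged.extend(rows)
    return merged    # a real proxy would also re-aggregate and re-sort here
```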


Figure 4. MyFOX node structure

It is worth noting that MyFOX's existing 20 nodes are not all "equal". In general, users of data products care most about the "last few days" of data, and the older the data, the less likely it is to be accessed. For this reason, and with hardware cost in mind, we divided the 20 nodes into "hot nodes" and "cold nodes" (see Figure 4).

As the name suggests, "hot nodes" hold the most recent, most frequently accessed data. For this data we want to give users the fastest possible query speed, so we use 15,000 RPM SAS disks; with two machines per node, the storage cost works out to roughly 45,000 RMB per TB. Correspondingly, for "cold data" we use 7,500 RPM SATA disks, which fit more data on a single drive, bringing the storage cost down to about 16,000 RMB per TB.
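
A minimal sketch of the hot/cold routing idea described above: recent, frequently queried dates are served from fast "hot" nodes, older data from cheaper "cold" nodes. The 30-day cut-off and the node names are assumptions for illustration; the article does not state the actual boundary.

```python
# Route a query to the hot or cold node group based on the date it touches.
from datetime import date, timedelta

HOT_NODES = ["hot-01", "hot-02"]      # fast SAS disks, recent data (assumed names)
COLD_NODES = ["cold-01", "cold-02"]   # slower SATA disks, historical data

HOT_WINDOW_DAYS = 30                  # assumed boundary between hot and cold

def nodes_for(query_date: date) -> list[str]:
    """Pick the node group that should serve data for `query_date`."""
    if date.today() - query_date <= timedelta(days=HOT_WINDOW_DAYS):
        return HOT_NODES
    return COLD_NODES
```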

Another benefit of separating hot and cold data is that it effectively improves the memory-to-disk ratio. As Figure 4 shows, a single machine on a "hot node" has only 24 GB of memory, while its disks hold about 1.8 TB when full (300 GB * 12 * 0.5 / 1024), a memory-to-disk ratio of roughly 4:300, which is far below a reasonable value for a MySQL server. If the memory-to-disk ratio is too low, there will come a day when even all of the memory can no longer hold the data's indexes; at that point a large share of queries must read indexes from disk, and efficiency suffers badly.

NoSQL is a useful complement to SQL

After MyFOX came along, everything seemed so perfect that developers did not even notice MyFOX was there; a single SQL statement without any special embellishment would do the trick. This went on for a long time, until one day we ran into a problem that a traditional relational database could not solve: full-attribute selectors (see Figure 5).


Figure 5. Full attribute selector

This is a very typical example. For the sake of illustration, let us continue describing it in relational-database terms. For the notebook computer category, the filter criteria a user selects in a query may involve a series of attributes (fields) such as notebook size, notebook positioning, and hard disk capacity, and the distribution of values across these attributes is very uneven. As can be seen from Figure 5, the notebook size attribute has ten enumerated values, while "Bluetooth support" is a Boolean value whose selectivity is very poor.

When the filter conditions a user will select are not known in advance, there are two ways to solve the full-attribute problem. One is to enumerate every possible combination of filter conditions, pre-compute the results on Ladder, and store them in a database to be queried; the other is to store the raw data and, at query time, filter out the matching records and compute the result on the fly. Obviously, because the combinations of filter conditions are practically impossible to enumerate exhaustively, the first scheme is not workable in reality. For the second scheme, where should the raw data be stored? If it still lives in a relational database, how would you index such a table?
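
A quick back-of-the-envelope check of why pre-computing every filter combination is infeasible; the attribute counts below are invented for illustration, and even these modest numbers multiply out quickly.

```python
# Count the pre-computations needed if every filter combination were enumerated.
from math import prod

# hypothetical attributes -> number of possible values (plus 1 for "not set")
attributes = {"size": 10, "location": 30, "disk": 8, "bluetooth": 2}

combinations = prod(v + 1 for v in attributes.values())
print(combinations)   # 11 * 31 * 9 * 3 = 9207 for just four attributes
```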

This series of questions led us to build a custom engine for storage, on-the-fly computation, and query services: Prom, short for Prometheus (Figure 6).


Figure 6. Storage structure of Prom

As Figure 6 shows, we chose HBase as Prom's underlying storage engine, mainly because HBase is built on HDFS and offers a good programming interface to MapReduce. Although Prom is a general-purpose framework for solving this class of problems, we will still use full-attribute selection as the example to illustrate how it works. The raw data here is the previous day's transaction details on Taobao. In the HBase cluster, each attribute pair (a combination of an attribute and one of its values) is stored as a row key. For the value under each row key we designed two column families: an index family storing the list of transaction IDs, and a data family storing the raw transaction details. At write time we deliberately made every element in each family fixed-length, so that a record can be located quickly by offset alone, avoiding complex lookup algorithms and large numbers of random disk reads.
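
Here is a simplified model of that layout, assuming an 8-byte fixed width for transaction IDs and an "attribute:value" key format; these are illustrative assumptions, not Prom's real on-disk encoding. It shows why fixed-length elements allow a record to be found by offset alone.

```python
# Model of a Prom-style row: fixed-width index entries support offset lookup.
import struct

ID_WIDTH = 8  # assume every transaction ID is packed as a fixed 8-byte integer

def pack_index(transaction_ids: list[int]) -> bytes:
    """Pack IDs at fixed width so the i-th ID can be found by offset alone."""
    return b"".join(struct.pack(">Q", tid) for tid in transaction_ids)

def id_at(index_blob: bytes, i: int) -> int:
    """Random access by offset: no search, no extra disk seeks."""
    (tid,) = struct.unpack_from(">Q", index_blob, i * ID_WIDTH)
    return tid

# row key "brand:lenovo" -> index blob for all matching transactions
row = {"index": pack_index([10001, 10002, 10007]), "data": b"...details..."}
assert id_at(row["index"], 2) == 10007
```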


Figure 7. Query process in Prom

Figure 7 uses a typical example to show how Prom serves queries; for reasons of space it is not described in detail here. It is worth mentioning that Prom's computation is not limited to SUM; all common statistical calculations are supported. For on-the-fly computation we extended HBase: Prom requires each node to return data that has already been "locally computed", and the final global result is simply a merge of the local results returned by every node. The design clearly aims to make full use of each node's parallel computing capability and to avoid the network overhead of transmitting large amounts of detail data.
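
A minimal sketch of that compute model, using the SUM case mentioned in the text: each storage node aggregates its own records (the local result), and the coordinator only merges those small partial results instead of shipping raw detail rows over the network. The per-node data is invented.

```python
# Local aggregation on each node, cheap global merge on the coordinator.

def local_sum(records: list[float]) -> float:
    """Runs on each node, next to the data it owns."""
    return sum(records)

def global_sum(partials: list[float]) -> float:
    """Runs on the coordinator: merging partial results is cheap."""
    return sum(partials)

node_records = [[1.0, 2.5], [4.0], [0.5, 3.0, 1.0]]      # per-node detail rows
print(global_sum([local_sum(r) for r in node_records]))  # 12.0
```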

Isolate the front and back ends with an intermediate layer

As mentioned above, MyFOX and Prom provide data storage and underlying query solutions for the different needs of the data products, but a new problem follows: heterogeneous storage modules make life difficult for the front-end products that use them. Moreover, the data needed for a single front-end request often cannot be obtained from any one module alone.

For example, to show yesterday's hot-selling items in the Data Cube, we first fetch a hot-item ranking from MyFOX, but the "item" there is only an ID, with no description, picture, or other detail. We then have to fetch that information from the interface provided by Taobao's main site, match it item by item against the ranking, and finally present the result to the user.


Figure 8. Glider's technical architecture

As experienced readers can imagine, this is essentially a JOIN, in the broad sense, between heterogeneous "tables". So who should be responsible for it? The natural answer is to add a middle layer between the storage layer and the front-end products, one that computes joins and unions across heterogeneous "tables" and isolates the front end from the back-end storage while providing a unified data query service. This middle layer is Glider (Figure 8).
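
A minimal sketch of that broad-sense JOIN, following the hot-items example above: a ranked list of item IDs comes from one source (MyFOX) and descriptive fields come from another (the main site's item interface), and the middle layer stitches them together by ID. Both fetch functions are placeholders, not real Taobao interfaces.

```python
# Join two heterogeneous "tables" (ranking + item details) in the middle layer.

def fetch_hot_item_ids() -> list[int]:
    """Placeholder: top-N item IDs as computed in the statistics store."""
    return [101, 205, 333]

def fetch_item_details(item_ids: list[int]) -> dict[int, dict]:
    """Placeholder: title/picture lookup from the main site's item interface."""
    return {i: {"title": f"item {i}", "pic": f"/img/{i}.jpg"} for i in item_ids}

def hot_items_report() -> list[dict]:
    ids = fetch_hot_item_ids()
    details = fetch_item_details(ids)
    # join the two sources by item ID, preserving the ranking order
    return [{"item_id": i, **details[i]} for i in ids]

print(hot_items_report())
```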

Caching is a systematic project

Besides isolating the front end from the back end and integrating data across heterogeneous "tables", Glider has another important role: cache management. As mentioned above, we treat the data in a data product as read-only within a given time window, which is the theoretical basis for using caches to improve performance.

As Figure 8 shows, there are two levels of caching in Glider: a second-level cache keyed on each heterogeneous "table" (datasource), and a first-level cache keyed on the consolidated, independent request. In addition, each heterogeneous "table" may have its own internal caching mechanism. Careful readers will have noticed MyFOX's cache design in Figure 3: instead of caching the final result of the summary computation, we cache each shard's partial result, which improves the cache hit rate and reduces data redundancy.

The biggest problem that heavy use of caching brings is data consistency: how do you ensure that changes in the underlying data reach the end user in the shortest possible time? This has to be treated as a system-wide effort, especially in an architecture with this many layers.


Figure 9. Cache control system

Figure 9 shows the Data Cube's design for cache control. A user's request carries cache-control "directives", including the query string in the URL and the "If-None-Match" field in the HTTP headers. These directives must be passed down through every layer, all the way to the heterogeneous "table" modules in the underlying storage. Each heterogeneous "table" returns its data together with that data's cache expiration time (TTL), and the expiration time Glider finally emits is the minimum of the expiration times of all the heterogeneous "tables" involved. This expiration time, too, is passed up layer by layer from the underlying storage and is ultimately returned to the user's browser in the HTTP headers.
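
A small sketch of the TTL rule just described: every underlying "table" reports its own cache lifetime along with its data, and the combined answer can only be cached for the minimum of those lifetimes. The response shape and TTL values are invented; only the minimum rule comes from the text.

```python
# Merge per-source payloads and take the most conservative (minimum) TTL.

def merge_responses(parts: list[dict]) -> dict:
    payload = {p["name"]: p["data"] for p in parts}
    ttl = min(p["ttl_seconds"] for p in parts)
    headers = {"Cache-Control": f"max-age={ttl}"}   # passed back to the browser
    return {"headers": headers, "body": payload}

parts = [
    {"name": "myfox", "data": {"gmv": 123}, "ttl_seconds": 600},
    {"name": "prom",  "data": {"uv": 456},  "ttl_seconds": 120},
]
print(merge_responses(parts)["headers"])   # {'Cache-Control': 'max-age=120'}
```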

Another issue a caching system has to consider is cache penetration, along with the avalanche effect of cache expiration. Cache penetration means querying for data that does not exist: since the cache is only written passively on a miss, and, for fault-tolerance reasons, nothing is written to the cache when the storage layer finds no data, every query for that non-existent data goes straight to the storage layer, defeating the purpose of the cache.

There are several effective ways to deal with cache penetration. The most common is a Bloom filter: hash all data that can exist into a sufficiently large bitmap, so that a query for data that definitely does not exist is intercepted by the bitmap and never reaches the underlying storage system. In the Data Cube we take a cruder approach: if a query returns an empty result (whether because the data does not exist or because of a system failure), we still cache the empty result, but with a short expiration time of no more than five minutes.
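
The following is a sketch of that crude-but-effective defence: even an empty answer is cached, just with a much shorter lifetime. The cache interface and the normal TTL are assumptions; only the five-minute cap for empty results comes from the text.

```python
# Cache empty results briefly so repeated misses do not hammer the storage layer.
import time

NORMAL_TTL = 600      # seconds, assumed
EMPTY_TTL = 300       # "no more than five minutes" for empty results

_cache: dict[str, tuple[object, float]] = {}

def get_with_cache(key: str, load_from_storage) -> object:
    hit = _cache.get(key)
    if hit and hit[1] > time.time():
        return hit[0]
    value = load_from_storage(key)          # may legitimately return None
    ttl = EMPTY_TTL if value is None else NORMAL_TTL
    _cache[key] = (value, time.time() + ttl)
    return value
```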

The avalanche effect that cache expiration can have on the underlying system is frightening, and unfortunately there is no perfect solution. Most system designers resort to locks or queues to serialize cache writes in a single thread (or process), so that when entries expire, huge numbers of concurrent requests do not all fall onto the underlying storage system. In the Data Cube, the cache expiration mechanism we designed can, in theory, spread each client's expiration times evenly along the time axis, which to some extent avoids the avalanche caused by many cache entries expiring at once.
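
One simple way to spread expiration times along the time axis is to add random jitter to each TTL, sketched below. The base TTL and jitter range are illustrative; the text does not give the Data Cube's actual expiry formula.

```python
# Jitter each TTL so cached entries do not all expire at the same moment.
import random

def jittered_ttl(base_seconds: int = 600, spread: float = 0.2) -> int:
    """Return a TTL randomly placed within +/-20% of the base value."""
    return int(base_seconds * random.uniform(1 - spread, 1 + spread))
```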

Conclusion

With the architectural features described in this article, the Data Cube is already able to provide 80 TB of pre-compression data storage. Glider, the data middle layer, supports 40 million query requests per day with an average response time of 28 milliseconds (figures as of June 1), which is enough to meet future business growth.

However, the system as a whole still has many imperfections. A typical example is that the layers communicate over HTTP with short-lived connections, a choice that directly leads to a very large number of TCP connections on a single machine during peak traffic. A good architecture can greatly reduce development and maintenance costs, but it must keep evolving as data volume and traffic change. I believe that in a few years, the technical architecture of Taobao's data products will certainly look different.

Abstracts of other articles

[1] The field of massive data covers technical directions such as distributed databases, distributed storage, real-time data computing, and distributed computing.

For massive data processing, at the database level it comes down to two points: 1. How to spread the load; the point of spreading it is to turn a centralized system into a distributed one. 2. Use a variety of storage schemes: RDBMS or KV store, different database software, centralized or distributed storage, or other schemes, chosen according to the business data and its characteristics.

[2] Splitting the database, including horizontal splits and vertical splits.

Horizontal splitting mainly solves two problems: 1. Making the application independent of where the underlying storage lives. 2. Adding machines linearly to support growth in data volume and in access load, both TPS (transactions per second) and QPS (queries per second). It works by splitting one large table across different database servers according to some rule; a sketch follows below.
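
As a sketch of such a rule, the example below maps rows to database servers by a simple modulo on the shard key. The server names are hypothetical, and modulo is only one possible scheme; real systems often use ranges or consistent hashing instead.

```python
# Horizontal splitting: deterministically map each row to one database server.

DB_SERVERS = ["db-00", "db-01", "db-02", "db-03"]   # hypothetical server names

def shard_for(user_id: int) -> str:
    """Pick the database server that owns this user's rows."""
    return DB_SERVERS[user_id % len(DB_SERVERS)]

print(shard_for(42))   # db-02
```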

Moving massive data from a centralized to a distributed mode may also involve disaster recovery and backup across multiple IDCs (data centers).

[3] How Alibaba handles data spread across different regions.

The solution relies on the close coordination of three products: Erosa, Eromanga, and Otter.

Erosa parses the binlogs of MySQL (or other databases) and feeds them into Eromanga, a publish/subscribe product for incremental data. Erosa publishes the continuously changing data to Eromanga, and each subscribing business side (search engine, data warehouse, or other downstream system) then pulls or receives pushes of that changing data into its own systems for further processing. Otter handles data synchronization across IDCs, so that data is reflected at the different active-active sites in a timely manner.

Data synchronization may produce conflicts. For now, the originating site's data is given priority; for example, data from the site in machine room A takes priority.

[4] On caching.

1. Pay attention to the granularity of splitting and choose it according to the business; the finer the cache granularity, the higher the cache hit rate. 2. Determine the valid lifetime of each cache entry.

[5] Splitting strategies

1. Split by field (the finest granularity), for example sharding records according to a field such as COMPANY_ID.

2. Split by table: put this table on one MySQL (cluster) and that table on another MySQL cluster; this is closer to a vertical split.

3. Split by schema: schema splitting is tied to the application. For example, the data serving one module goes into one MySQL cluster and the data serving another module goes into a different MySQL cluster, while the service exposed to the outside is the combination of all these clusters, coordinated by Cobar.

Cobar is similar to MySQL Proxy: it parses the full MySQL protocol and can be accessed as if it were a MySQL server.