preface

Hello everyone, I am ChinaManor, which literally translates to Chinese code farmer. I hope I can become a pathfinder on the road of national rejuvenation, a ploughman in the field of big data, an ordinary person who is unwilling to be mediocre.

This is the mind map for real time technologymanorWill update the reading “Alibaba big Data practice” chapter 5 real-time technology

Data model design is a process of data processing, like streaming data processing, data flow modeling needs to be layered. Real-time modeling is very similar to offline modeling, with the data model as a whole divided into five layers (DS DWD DWS ADS DIM). Due to the limitations of real-time computing, each layer is not as wide as offline, and there are not so many dimensions and indicators, especially indicators involving backtracking state, which are almost absent in real-time data model. On the whole, the real-time data model is a subset of the offline data model. In the process of real-time data processing, many model designs are realized by referring to the offline data model. The following is a detailed explanation from the aspects of data layering, multi-stream association and dimension table use.

1 Data Layering

In the streaming data model, the data model as a whole is divided into five layers.

1. The ODS layer is the same as the definition of the OFFLINE system. The ODS layer belongs to the operational data layer, which is the most original data directly collected from the service system and contains the change process of all services with the smallest data granularity. In this layer, real-time and off-line data are unified at the source. The advantage is that the indicators processed by data are basically unified in caliber, which makes it easier to compare real-time and off-line data. For example, the original order change log the query log of the data server engine. 2. DWD DWD layer is based on the layer, according to business process modeling of real-time fact Ming thin layer, the access log this data (no context, and don’t need to wait for the process of record) and will return to the offline system for the use of downstream, the greatest extent to ensure the real-time and off-line data in layer and layer DWD are consistent. For example, payment schedule, refund schedule, user access log schedule. 3. After DWS subscribes to the detail layer data, the summary metrics of each dimension are calculated in the real-time task. If the dimensions are common across vertical lines of business, they are placed in the real-time common summary layer and used as a common data model. For example, the granularity of sellers on e-commerce websites is related to this dimension as long as it involves the transaction process. Therefore, the seller dimension is a common dimension of all vertical businesses, and the overall indicators are also shared by all business lines. For example: a summary of the major dimensions of e-commerce data — home, commodity, buyer). 4. ADS personalized dimension summary layer. Data of non-common statistical dimensions will be placed in this layer, where the dimensions and indicators that only its own business will pay attention to are calculated. For example: mobile taobao below a shopping, micro tao and other vertical business. 5. Data of DIM real-time dimension surface are basically exported from offline dimension surface and extracted to online system for real-time application call. This layer is static for real-time applications and all ETL processing is done in the offline system. The use of dimension tables in real-time applications is slightly different from that in offline applications, as described in a later section. For example: commodity dimension table, seller dimension table, buyer dimension table, category dimension table. The following simple examples illustrate the data stored in each layer.

  • ODS layer: The change process of order granularity. An order has multiple records
  • DWD layer: Payment record of order granularity, with only one record per order.
  • The real-time transaction amount of DWS sellers, a seller has only one record, and the index is refreshing in real time.
  • The real-time transaction amount of ADS takeout area is only used by takeout business.
  • DIM layer: dimension table of the corresponding relationship between order item category and industry.

Among them, ETL processing from ODS layer to DIM layer is performed inOffline systemAfter the processing is completed, it will be synchronized to the storage system used for real-time computing. The ODS layer and the DWD layer will be placedData middlewareFor downstream subscriptions. The DWS layer and ADS layer will be landed in the online storage system, and the downstream is used in the form of interface call. In each tier, PO, Pl, P2, P3 is classified according to importance, and PO is the highest priority guarantee. Allocate different computing and storage resources to real-time tasks according to their different priorities, and strive to ensure that the most important tasks are best protected.

In addition, field naming, table naming, and indicator naming are defined according to the OneData specification for better maintenance and management. For details On Data, see the following sections.

5.3.2 Multi-stream association

In streaming computing, it is often necessary to associate two real-time streams with the primary key to obtain the corresponding time schedule. In an offline system, the association between two tables is very simple, because offline computing can get the full data of two tables when the task starts, just bucket association according to the association key. However, streaming computing is different. Data arrival is an incremental process, and the time of data arrival is uncertain and disorderly, because details such as the preservation and recovery mechanism of intermediate states are involved in data processing. For example, table and table are associated in real time by ID. Since the arrival order of the two tables cannot be known, each new piece of data in the two data flows needs to be searched in another table. For example, if a certain data of the table arrives, it can be searched in all the data of the table. If the data can be found, it can be associated with the table and spliced into a record and directly output to the downstream. If the data cannot be associated with the table, it needs to be stored in memory or external storage until the records of the table also arrive. One of the key points of multi-stream association is the need to wait for each other. Only when both parties arrive can the association succeed. For example, table and table are associated in real time by ID. Since the arrival order of the two tables cannot be known, each new piece of data in the two data flows needs to be searched in another table. For example, if a certain data of the table arrives, it can be searched in all the data of the table. If the data can be found, it can be associated with the table and spliced into a record and directly output to the downstream. If the data cannot be associated with the table, it needs to be stored in memory or external storage until the records of the table also arrive. One of the key points of multi-stream association is the need to wait for each other. Only when both parties arrive can the association succeed. The following is illustrated by an example (order information table and payment information table association)

In the above example, the data of two tables are collected in real time, and each new data is searched in the other table in memory up to the current full data. If it can be found, it indicates that the association is successful, and the output is direct: if not, the data is put in the data set of its own table in memory and wait. In addition, data in the memory must be backed up to the external storage system regardless of whether the association is successful. When a task is restarted, you can restore the memory data from the external storage system to prevent data loss. Because on reboot, the task is continued and does not restart the previous data. In addition, the change of the order record may occur for many times (such as multiple fields of the order are updated for many times), in this case, it is necessary to remove the weight according to the order ID, to avoid the table and table successfully linked for many times; No output to the downstream will have multiple records, so the data is repeated. The above is the overall two-flow association process. In actual processing, considering the availability of data search, real-time association usually classifs data into buckets according to the association primary key. In addition, real-time association is also performed according to buckets during fault recovery to reduce the amount of data search and improve throughput.

In offline systems, the fact table is associated with the dimension table according to the business partition, because the data in the dimension table is ready before the association. In real-time computing, the associated dimension table generally uses the current real-time data () to associate the t-2 dimension table data, which means that the dimension table data needs to be prepared before the data arrives, and it is a static data. Why do that in real time computing? Mainly based on the following considerations. 1 data cannot be prepared in time when it reaches zero point, real-time stream data must be disassociated from dimension table (because it cannot wait, if it loses its real-time feature), and at this time, dimension table data of T-1 cannot be ready immediately at zero point (because t-1 data needs to be processed and generated in a day), so it is necessary to connect with the giant dimension table. It is equivalent to processing t-2 dimension table data in -1 day. 2. The latest data dimension table cannot be accurately obtained. If the latest data of the current day needs to be obtained in real time, the data of -1 + the change of the current day can obtain the complete data of the dimension table. In other words, the dimension table also serves as a real-time stream input, which requires the use of multi-stream real-time association to achieve. However, because real-time data is disordered and arrival time is uncertain, there are ambiguities in dimension table correlation. 3. Disordered data It is difficult to obtain dimension table data if the dimension table is input as real-time stream. For example, the business data at 10:00 is successfully associated with the dimension table and the relevant field information of the dimension table is obtained. At this time, whether the latest dimension table data has been obtained? This simply means getting the latest status data as of 10:00 (real-time applications never know when the latest state is, because you don’t know if the dimension table will change later). Therefore, t-2 data are generally used for dimension table association in real-time calculation, so that at least the associated dimension table data is determined for businesses (although the dimension table data has a certain delay, the change of many business dimension tables between two days is very little). In some service scenarios, t-1 data can be associated, but the t-1 data is incomplete. For example, at 22:00 in the night of the bucket began to process the dimension table, before the arrival of zero, there are two hours to prepare the data, so that you can close at the time. {# T-1 data, but will be missing two hours of the table change process.

In addition, since real-time tasks are resident processes, the use of dimension tables can be divided into two forms. (I) Full loading In the case of a small amount of dimension table data, it can be loaded into the memory at one time and directly connected with real-time stream data in the memory for association, which is very efficient. The drawback is that memory is always occupied and needs to be updated regularly. For example: category dimension table, only tens of thousands of records every day, at zero o ‘clock every day full load into memory. (2) There is too much data in the incremental loading dimension table to load all the data into memory. You can use the form of incremental lookup and LRU expiration to keep the most popular data in memory. Its advantage is that it can control the amount of memory used; The disadvantage is that external storage systems need to be searched, which reduces the operating efficiency. For example, member dimension table, there are hundreds of millions of records, every time real-time data arrival, to the external database query, and query results in memory, and then every once in a period of time to clean up the least recently used data, to avoid memory overflow. In practical application, these two forms are selected according to the dimension table data volume and real-time performance requirements.

conclusion

The above is alibaba’s big data practice | Real-time technology streaming data model (3) The next chapter will talk about Alibaba’s big promotion challenges and guarantees ~

May you have your own harvest after reading, if there is a harvest might as wellThree even a keySee you next time 👋·