This is my first blog article. I wrote it partly because I could not find relevant documentation for this project, and I decided at the time to write one up after completion. If anything here is poorly considered, please contact me so I can correct it.
The user's business model covers e-commerce retail as well as wholesale and retail through franchise stores. The main requirement was real-time calculation of the metrics the user cares about during Taobao's Double 11, such as order count, order amount, number of product SKUs, order origin, and product rankings. Besides being real-time, these metrics also needed suitable chart designs: Alibaba Cloud DataV was used to provide pie-chart proportion analysis, product and category rankings, a heat map of China, and so on. Since the user's data resides off the cloud, we first migrate the data to the cloud and synchronize it to DataHub through DTS; the Alibaba StreamCompute platform then consumes the DataHub data, runs the stream-computing code, and writes the results to RDS for MySQL; finally, DataV reads the RDS data to render the graphical dashboards. The final technical architecture is shown in the figure below:
Figure: stream-computing data-flow logic design diagram
III. Technical Implementation
1. Data migration and synchronization: Since on-premises data cannot be written to DataHub directly, the Alibaba Cloud DTS tool is used to migrate the data to RDS first (link: dts.console.aliyun.com/), and then DTS's data-synchronization feature synchronizes the RDS data to DataHub. (Note: RDS can be billed monthly, while DTS is billed by the hour.) During synchronization, size the data-transmission channel according to the enterprise's data volume. Also, DataHub automatically creates a topic for each synchronized table, so do not create the topics yourself before synchronization; doing so causes an error. (Note that system-generated topics differ from self-built ones.)
2. StreamCompute development: Its development style and skill requirements are much simpler than traditional open-source products, and the stream-computing platform is feature-rich, especially its monitoring system. Link: stream.console.aliyun.com.
2.1 Referencing the DataHub business table:
Notes:
A. The table created in the stream-computing engine can be given any name, but it is recommended to keep it the same as on DataHub
B. Reference only the fields you need; do not reference unused fields
C. On topics built by the system, this field records whether the row's data was updated, inserted, or deleted
D. Stream computing can reference multiple kinds of data sources; the source type is declared here
E. Fixed writing
F. The project name on DataHub
G. The topic name on DataHub
H. DataHub retains service data for three days by default; this time specifies the point from which the stream-computing engine starts reading data
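To make the notes above concrete, here is a minimal sketch of such a source-table declaration in the platform's SQL dialect. All table, field, project, and topic names are hypothetical, the operation-flag column name is an assumption, and the endpoint and credentials must come from your own account:

```sql
-- Reference a DataHub topic as a source table in the stream-computing engine.
-- All names below are illustrative placeholders, not from the original project.
CREATE TABLE dts_orders (
    order_id           VARCHAR,
    product_id         VARCHAR,
    province           VARCHAR,
    order_amount       DOUBLE,
    gmt_create         TIMESTAMP,
    dts_operation_flag VARCHAR   -- C: on system-built topics, marks the row as insert/update/delete
) WITH (
    type = 'datahub',                                 -- D/E: source type; fixed writing
    endPoint = 'http://dh-cn-hangzhou.aliyuncs.com',  -- assumed regional endpoint
    project = 'my_dts_project',                       -- F: project name on DataHub
    topic = 'dts_orders',                             -- G: topic name, kept identical to the table name (A)
    accessId = '<accessId>',
    accessKey = '<accessKey>',
    startTime = '2017-11-11 00:00:00'                 -- H: read data starting from this point in time
);
```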
2.2 Referencing the dimension table:
Notes:
A. The primary key of the table
B. Fixed writing for a dimension table, indicating when the dimension table is refreshed (what is the default interval? How can the refresh time be adjusted?)
Note that this table's source is RDS, and the connection settings below are no different from an ordinary MySQL connection.
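A minimal sketch of what such a dimension-table declaration looks like; the table and field names are hypothetical, and the WITH block is just an ordinary MySQL connection:

```sql
-- Reference an RDS (MySQL) table as a dimension table.
-- All names below are illustrative placeholders.
CREATE TABLE dim_product (
    product_id   VARCHAR,
    product_name VARCHAR,
    category     VARCHAR,
    PRIMARY KEY (product_id),   -- A: primary key of the dimension table
    PERIOD FOR SYSTEM_TIME      -- B: fixed writing marking this as a dimension table
) WITH (
    type = 'rds',
    url = 'jdbc:mysql://<rds-host>:3306/<db>',
    tableName = 'dim_product',
    userName = '<user>',
    password = '<password>'
);
```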
2.3 The output table is basically the same as the dimension table but without PERIOD FOR SYSTEM_TIME; it must be created in advance on RDS.
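A sketch of the corresponding result-table declaration, again with hypothetical names; the physical table with this schema is assumed to already exist in RDS:

```sql
-- The RDS result table: same shape as the dimension-table declaration,
-- but without PERIOD FOR SYSTEM_TIME. The physical table must be created in RDS beforehand.
CREATE TABLE rds_order_metrics (
    stat_time    VARCHAR,
    province     VARCHAR,
    order_count  BIGINT,
    order_amount DOUBLE,
    PRIMARY KEY (stat_time, province)
) WITH (
    type = 'rds',
    url = 'jdbc:mysql://<rds-host>:3306/<db>',
    tableName = 'rds_order_metrics',
    userName = '<user>',
    password = '<password>'
);
```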
2.4 Application script development: join the referenced business table with the dimension table and output the results to the target table.
Notes: A. Not much different from standard SQL; the main difference is how the dimension table is used, which is also fixed writing that can simply be copied. Note: because the original data carries three operations (insert, delete, and update), DataHub likewise contains records in these three states, which must be handled separately, otherwise the results will be inaccurate.
3. DataV development: omitted here. To summarize briefly: one chart, one SQL query.
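A sketch of such an application script, under the same hypothetical table and field names as above. The FOR SYSTEM_TIME AS OF PROCTIME() clause is the fixed dimension-table join syntax; the operation-flag values (assumed here to be I/U/D) are an assumption, and a real job would need separate handling for updates and deletes as the note warns:

```sql
-- Join the DataHub source with the dimension table and write into the result table.
-- Only inserted rows are counted here; update/delete records need their own handling,
-- otherwise the aggregates will be inaccurate.
INSERT INTO rds_order_metrics
SELECT
    DATE_FORMAT(o.gmt_create, 'yyyy-MM-dd') AS stat_time,
    o.province,
    COUNT(o.order_id)   AS order_count,
    SUM(o.order_amount) AS order_amount
FROM dts_orders AS o
JOIN dim_product FOR SYSTEM_TIME AS OF PROCTIME() AS p  -- fixed dimension-table join writing
    ON o.product_id = p.product_id
WHERE o.dts_operation_flag = 'I'   -- assumed flag value for inserts
GROUP BY DATE_FORMAT(o.gmt_create, 'yyyy-MM-dd'), o.province;
```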
IV. Project Contingency Plan
Because stream computing carries some risk, we considered developing a second scheme based on traditional batch computation. If stream computing fails, we can switch over quickly so the data remains usable, though with potentially higher latency.
After evaluation, since the data volume was not expected to be very large, we decided to compute the metrics into a second set of output tables by periodically invoking stored procedures, and then built a second set of reports with the same appearance as the first. The design is shown below:
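A minimal sketch of this fallback, assuming plain MySQL: a stored procedure recomputes the metrics into the second set of output tables, driven by the MySQL event scheduler. All table, column, and procedure names are hypothetical, and the 30-second interval is only an example:

```sql
-- Fallback scheme: periodically recompute the metrics with a stored procedure.
DELIMITER //
CREATE PROCEDURE refresh_order_metrics()
BEGIN
    -- Rebuild the second set of metric tables from the raw order data.
    TRUNCATE TABLE order_metrics_backup;
    INSERT INTO order_metrics_backup (stat_time, province, order_count, order_amount)
    SELECT DATE(gmt_create), province, COUNT(*), SUM(order_amount)
    FROM orders
    GROUP BY DATE(gmt_create), province;
END //
DELIMITER ;

-- Invoke it on a schedule via the MySQL event scheduler (event_scheduler must be ON).
CREATE EVENT ev_refresh_order_metrics
ON SCHEDULE EVERY 30 SECOND
DO CALL refresh_order_metrics();
```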
Figure: periodic batch-computation design diagram

In actual testing, most of the metrics came out within half a minute, often within 10 seconds, so this latency is just about acceptable. Since the technical implementation is not complicated, it is skipped here.

V. Project Stress Test

To ensure the platform still works stably under a data burst, we ran some related tests. First, we simulated generating a large volume of data to measure the synchronization time from the local database to RDS, the time from RDS to DataHub, and the stream-computing processing time. Repeated tests showed that when the data volume stays high, some data latency appears; the performance bottleneck lies mainly in the RDS and DTS-to-DataHub synchronization links, with the latter step being more pronounced. Platform stability was basically fine, and the processing efficiency of stream computing itself was very satisfactory.
Note: Many thanks to the Alibaba engineers for their great assistance throughout this project!