Compared with batch big data computing, stream computing is still a relatively new concept overall. Let's look at the differences between the two computing models from the user/product perspective.
Batch computing
At present, most traditional data computing and data analysis services are based on the batch data processing model: an ETL system or OLTP system is used to build the data store, and online data services (including ad-hoc queries, dashboards and other services) access that store by constructing SQL queries to obtain analysis results. This approach has been widely adopted as relational databases evolved in the industry. In the era of big data, however, as more and more human activity is informationized and digitized, more and more data processing needs to be real-time and streaming, and such batch-oriented models are beginning to face serious real-time challenges.
Traditional batch data processing model
Traditional batch data processing is usually based on the following model:
1. Load data. An ETL or OLTP system builds the original data store that later data services use for analysis and computation. Users load the data, and the system performs a series of query optimizations on it, such as index construction, according to its own storage and compute characteristics. For batch computing, the data must therefore be pre-loaded into the computing system, and computation can only begin after loading is complete.
2. Submit a compute job. The user or system initiates a compute job (such as a MaxCompute SQL job or a Hive SQL job) against the data system. The computing system then schedules (starts) compute nodes to process the massive data set, which can take a long time, from several minutes to hours. Because the data has been accumulating over time, the data involved in this computation is necessarily historical, so its "freshness" cannot be guaranteed. Users can adjust their SQL at any time to fit their needs, and can even issue ad-hoc queries and modify them on the fly.
3. Return the result. When the computation finishes, the result is returned to the user as a result set, or, if the result data is very large, it is kept in the data computing system so the user can integrate it into other systems. When the result is huge, this overall integration process is long and may itself take minutes or even hours.
Batch computing is batch-oriented, high-latency, and actively initiated by the user. The order in which users use batch computing is:
1. Preload the data.
2. Submit a compute job; the job can be modified to fit business needs and submitted again.
3. The computation result is returned.
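To make the batch model concrete, here is a minimal, self-contained Python sketch. It is not MaxCompute or Hive; the function names and toy data are made up for illustration. The point is the shape of the workflow: all data is loaded first, the job scans the full historical data set, and results appear only when everything finishes.

```python
# Toy illustration of the batch model: load everything first, then compute.
# Names (preload_data, run_batch_job) are hypothetical, not a real API.

def preload_data():
    """Step 1: all data must be loaded (and indexed/optimized) before any computation."""
    return [{"user": f"u{i % 3}", "amount": i} for i in range(10_000)]

def run_batch_job(table, group_key, value_key):
    """Step 2: the job scans the full, historical data set in one pass."""
    totals = {}
    for row in table:
        totals[row[group_key]] = totals.get(row[group_key], 0) + row[value_key]
    return totals

if __name__ == "__main__":
    table = preload_data()                      # loading finishes before the job can start
    result = run_batch_job(table, "user", "amount")
    print(result)                               # Step 3: results are returned only at the end
```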
Stream computing
Unlike the batch computing model, stream computing places more emphasis on the flow of data through the computation and on low latency. The stream data processing model is as follows:
Real-time data integration tools transfer data changes to a streaming data store (i.e., a message queue, such as DataHub) as they happen. Data transmission becomes real-time: instead of accumulating a large volume of data over a long period and transferring it all at once, data is spread across points in time and transmitted continuously in small batches, so low data-integration latency can be guaranteed. Data is written to the stream data store continuously, without a pre-loading step. The stream computing system itself does not provide storage for the streaming data; the data keeps flowing and is discarded as soon as the computation on it is complete.
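The sketch below uses an in-process queue as a stand-in for the streaming data store; it does not use the real DataHub SDK. It only illustrates the data-flow shape described above: changes are pushed continuously in small batches instead of being accumulated and bulk-loaded. The names stream_store and real_time_integration are illustrative assumptions.

```python
# Toy stand-in for a streaming data store (e.g., a message queue such as DataHub).
# The real DataHub SDK is not shown; this only sketches continuous small-batch ingestion.
import queue
import threading
import time

stream_store = queue.Queue()  # plays the role of the message queue / stream store

def real_time_integration(n_batches=5, batch_size=3):
    """Push small batches of changes as they happen, rather than one big load."""
    for b in range(n_batches):
        mini_batch = [{"user": f"u{i % 3}", "amount": 1} for i in range(batch_size)]
        stream_store.put(mini_batch)   # integration latency stays low: no pre-loading step
        time.sleep(0.1)                # data keeps arriving over time
    stream_store.put(None)             # sentinel: end of the demo stream

if __name__ == "__main__":
    threading.Thread(target=real_time_integration, daemon=True).start()
    while (batch := stream_store.get()) is not None:
        print("mini-batch arrived:", batch)   # downstream computation would be triggered here
```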
The data pipeline differs significantly between the stream and batch processing models, because data integration shifts from accumulation to real time. Unlike batch computing, which requires all data to be integrated and ready before a job can run, a stream computing job is a permanently running service: once started, it stays in a state of waiting for events, and as soon as a small batch of data arrives in the streaming data store, the stream computation runs immediately and produces results quickly. In addition, Alibaba Cloud stream computing uses an incremental computing model, processing large volumes of data as incremental updates over small batches, which further reduces the scale of each individual computation and effectively lowers overall latency. From the user's perspective, the computing logic of a streaming job must be defined in advance and submitted to the stream computing system, and the job logic cannot be changed while the job is running. The user can stop the current job and submit a new one, but the data that has already been processed cannot be recomputed.
Result delivery also differs from the batch model, where data is transferred to the online system only after the whole batch computation has finished. In stream computing, the result of each small batch can be written to the online/batch system as soon as it is computed. There is no need to wait for an overall result, so the output of real-time computation can be presented in real time, as sketched below.
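The following Python sketch ties the two previous points together under the same toy-queue assumption: a long-running, event-triggered job keeps incremental state and pushes each mini-batch result to a downstream "online system" immediately. It is not Alibaba Cloud's engine; stream_store, running_totals and online_sink are hypothetical names used only for illustration.

```python
# Sketch of a permanent, event-triggered streaming job with incremental state.
# Not Alibaba Cloud's engine; the queue, sink and aggregation are illustrative only.
import queue
import threading
import time

stream_store = queue.Queue()   # stands in for the streaming data store
running_totals = {}            # incremental state: updated per mini-batch, never rebuilt from scratch

def online_sink(partial_result):
    """Stand-in for the online system: each mini-batch result is delivered immediately."""
    print("pushed to online system:", dict(partial_result))

def streaming_job():
    """Permanent service: waits for events, computes incrementally, emits right away."""
    while True:
        mini_batch = stream_store.get()        # blocks until an event (mini-batch) arrives
        if mini_batch is None:                 # sentinel used only to end this demo
            break
        for row in mini_batch:                 # incremental update: only new rows are processed
            running_totals[row["user"]] = running_totals.get(row["user"], 0) + row["amount"]
        online_sink(running_totals)            # result delivered without waiting for "all" the data

if __name__ == "__main__":
    worker = threading.Thread(target=streaming_job)
    worker.start()
    for i in range(6):                          # events trickle in over time
        stream_store.put([{"user": f"u{i % 2}", "amount": i}])
        time.sleep(0.05)
    stream_store.put(None)
    worker.join()
```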
Stream computing is a continuous, low-latency, event-triggered form of computing. The order in which users use stream computing is:
1. Submit a stream computing job.
2. Wait for streaming data to arrive and trigger the job.
3. Results are produced continuously.
In most big data processing scenarios, given the relative simplicity of today's stream computing model, stream computing is best seen as an effective complement to batch computing, especially where timely processing of event streams matters. It is an indispensable value-added capability for big data computing.
For the detailed course content, see:
The difference between stream computing and batch computing
(The course explains related big data computing technologies, including stream computing and in-memory computing, describes the technologies Alibaba Cloud uses to deliver these capabilities, and details Alibaba Cloud's technical optimizations.)
Tutorial information
Course sessions
Session 1: Overview of stream computing
Session 2: The difference between stream and batch computing 07:16
Session 3: Technical analysis of typical stream computing systems
Session 4: Overview of Alibaba's core computing technology
Session 5: Implementation of stateful computing 17:35
Session 6: StreamSQL 14:11
Session 7: Combining big data and databases
Session 8: Analytical database service ADS 05:55
Session 9: Unified computing framework 16:01
Course objectives
Learn stream computing techniques
Target audience
Big data developers and enthusiasts
Alibaba Cloud University official website (an innovative talent workshop under the cloud ecosystem)