Stream processing engines have gone through three generations, from Storm to Spark Streaming to Flink, and big data processing has likewise evolved from the Lambda architecture to the Kappa architecture. This section takes data analysis on an e-commerce platform as an example to explain how a big data processing platform supports an enterprise's online services. E-commerce platforms record users' search, click, and purchase behaviors in the app or on web pages in the form of logs. These behaviors together form a real-time data stream known as the user behavior log.
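As a concrete illustration, a single user behavior log record might look like the sketch below. The field names and JSON layout are assumptions for illustration, not the format of any particular platform.

```java
// Hypothetical structure of one user behavior log record (field names are illustrative).
public class UserBehaviorEvent {
    public long userId;      // the user who acted
    public long itemId;      // the product involved
    public String behavior;  // e.g. "search", "click", "buy"
    public long timestamp;   // when the behavior happened, in epoch milliseconds
}

// Serialized as JSON, one record in the behavior stream could look like:
// {"userId": 10001, "itemId": 20938, "behavior": "buy", "timestamp": 1622505600000}
```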
Lambda architecture
When the first-generation stream processing engines, represented by Storm, matured, some Internet companies adopted the Lambda architecture shown in the following figure to process data and provide online services, balancing real-time performance against accuracy. The Lambda architecture consists of three main parts: the batch layer, the stream processing layer, and the online service layer. Data flows in from message queues such as Kafka.
Batch layer
At the batch layer, the data stream is persisted into a batch data warehouse, accumulated for a period of time, and then computed with a batch engine. The accumulation period can be an hour, a day, or a month. The results are finally imported into a database that the application system can query online. The batch data warehouse can be HDFS, Amazon S3, or another data warehouse, and the batch engine can be MapReduce or Spark.
Suppose the data analysis department of an e-commerce platform wants to know which goods were purchased most often across the whole site on a given day, and uses the batch engine to compute over that day's data. For platforms at the scale of Taobao or JD.com, the user behavior log is enormous, and even a very simple calculation over it can take several hours. Batch engines are therefore typically started periodically to process the data of the previous day or previous few hours and write the results into a database; querying that online database then takes only milliseconds, compared with hours for computing directly over the user behavior log. Counting the most-purchased goods is a relatively simple example. Real business scenarios usually require more sophisticated statistical analysis and machine learning, such as building user profiles that analyze, based on a user's age, gender, and other information, which kinds of goods certain types of users are most likely to buy; such computations take even longer.
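A minimal sketch of such a batch job, written here with Spark's DataFrame API for concreteness, is shown below. The log path, column names, and output location are assumptions; a real job would read that day's partition of the behavior log and load the result into an online database.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DailyTopItemsJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("DailyTopItems").getOrCreate();

        // Read one day's partition of the user behavior log (path and schema are assumed).
        Dataset<Row> logs = spark.read().json("hdfs:///logs/user_behavior/dt=2021-06-01/");

        // Count how many times each item was bought that day and rank the items.
        Dataset<Row> topItems = logs
                .filter(col("behavior").equalTo("buy"))
                .groupBy(col("itemId"))
                .count()                              // adds a "count" column per item
                .orderBy(col("count").desc());

        // Persist the pre-computed result; in practice it would be loaded into an
        // online store such as MySQL, Redis, or HBase for millisecond-level queries.
        topItems.write().mode("overwrite").parquet("hdfs:///results/daily_top_items/dt=2021-06-01/");

        spark.stop();
    }
}
```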
The batch layer can guarantee the accuracy of the results for a given piece of data, and if the program fails it can simply be restarted. In addition, batch engines generally scale well: even as the volume of data grows, they can scale horizontally by adding nodes.
Stream processing layer
Obviously, if the entire system had only a batch layer, users would have to wait a long time for results, usually with a delay of several hours. The e-commerce data analysis department could only view the statistical results of the previous day and could not obtain current results, leaving a large time gap for real-time decision-making that might cause managers to miss the best moment to act. Therefore, in addition to the batch layer, the Lambda architecture adds a stream processing layer: user behavior logs also flow into this layer, where the stream processing engine generates pre-processed results and imports them into a database, so analysts can view results from the previous hour or even the previous few minutes. This greatly improves the real-time performance of the whole system. However, the data stream suffers from problems such as out-of-order events, and the early stream processing engines could only produce approximately accurate results, in effect sacrificing some accuracy for real-time performance.
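A minimal sketch of this kind of pre-aggregation is shown below, written with Flink's DataStream API for concreteness (the early Lambda stream layers typically used Storm, but the shape of the computation is the same). The inline elements and window size are assumptions; a real job would consume a continuous stream of purchase events from Kafka and write each window's counts to a database.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class StreamLayerSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for a continuous stream of "buy" events; with a real Kafka source the
        // one-minute windows below would fire continuously as new events arrive.
        env.fromElements("item-1", "item-2", "item-1")
           .map(itemId -> Tuple2.of(itemId, 1L))
           .returns(Types.TUPLE(Types.STRING, Types.LONG))
           .keyBy(t -> t.f0)
           // Tumbling one-minute windows based on processing time: low latency, but the
           // counts can be approximate when events arrive late or out of order, which is
           // exactly the accuracy trade-off described above.
           .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
           .sum(1)
           .print();   // a real job would write the per-window counts to a database instead

        env.execute("stream-layer-pre-aggregation");
    }
}
```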
As mentioned earlier, the early stream processing engines had shortcomings in accuracy, scalability, and fault tolerance, so the stream processing layer could not directly replace the batch layer: it could only provide users with approximate results, not consistent and accurate ones. Hence the coexistence of batch and stream processing in the Lambda architecture.
Online service layer
The online service layer directly serves users' concrete requests. It needs to fuse the accurate but delayed pre-processed results from the batch layer with the real-time but less accurate pre-processed results from the stream processing layer; during fusion, data from the batch layer constantly overwrites the older data generated by the stream processing layer. Many data analysis tools, such as Apache Druid, put considerable effort into this kind of data fusion. Alternatively, the pre-processed results from the batch and stream layers can be stored in a very low-latency database, with the application itself controlling how the results are blended. The database storing the pre-processed results can be a relational database such as MySQL, a key-value store such as Redis, or HBase.
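A minimal sketch of this merging logic is shown below. It assumes the serving layer has already loaded per-day purchase counts keyed by (day, itemId) from both layers; the method names, key format, and numbers are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

public class ServingLayerMerge {

    /**
     * Merge pre-computed counts from the two layers. Keys identify a (day, itemId)
     * combination; for any key produced by both layers, the accurate batch result
     * overwrites the approximate stream result, while the stream layer fills in the
     * most recent period that the batch layer has not yet processed.
     */
    public static Map<String, Long> merge(Map<String, Long> batchCounts,
                                          Map<String, Long> streamCounts) {
        Map<String, Long> merged = new HashMap<>(streamCounts);
        merged.putAll(batchCounts);  // batch layer results take precedence
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> batch = new HashMap<>();
        batch.put("2021-06-01|item-1", 1_203_400L);   // accurate, but a day behind

        Map<String, Long> stream = new HashMap<>();
        stream.put("2021-06-01|item-1", 1_198_000L);  // approximate count for the same day
        stream.put("2021-06-02|item-1", 58_300L);     // today's running count, batch not ready yet

        System.out.println(merge(batch, stream));
    }
}
```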
Strengths and weaknesses of the Lambda architecture
The Lambda architecture strikes a balance between real-time performance and accuracy and can solve many big data processing problems; it has been widely deployed at major Internet companies. Its benefits include:
- Batch processing is highly accurate, and during the data exploration stage different methods can be tried repeatedly on the same data. In addition, batch processing has strong fault tolerance and scalability.
- Stream processing is highly real-time and can provide approximately accurate results.
The drawbacks of the Lambda architecture are also obvious:
- Two big data processing engines have to be maintained. If the APIs of the two engines differ, the same logic must be developed and updated in both at the same time, resulting in high maintenance costs and long iteration cycles.
- Early results from the stream processing layer were only approximate.
Kappa architecture
Jay Kreps, one of the creators of Kafka, argued that in many scenarios maintaining a big data processing platform based on the Lambda architecture is time-consuming and costly, and suggested that in some scenarios there is no need to maintain a batch layer at all; a stream processing layer alone is sufficient, as shown in the Kappa architecture below.
There are two main reasons for the rise of the Kappa architecture:
- Kafka serves not only as a message queue but can also store historical data for longer periods, replacing the batch-layer data warehouse of the Lambda architecture. The stream processing engine can simply start consuming from an earlier point in time, thereby acting as a batch processor (see the sketch after this list).
- The Flink stream processing engine solves the problem of result accuracy under out-of-order events.
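The sketch below illustrates both points with Flink's DataStream API: a Kafka source that starts from the earliest retained offset so the job can replay history in place of a batch run, plus an event-time watermark strategy that tolerates out-of-order events. The broker address, topic name, and tolerance are assumptions.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KappaReplaySketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka retains the behavior log for a long period, so starting from the earliest
        // offset lets the same streaming job recompute history, replacing a batch job.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")            // assumed broker address
                .setTopics("user_behavior")                   // assumed topic name
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Event-time watermarks that tolerate a few seconds of disorder are what allow
        // Flink to produce accurate windowed results despite out-of-order events.
        DataStream<String> events = env.fromSource(
                source,
                WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(5)),
                "user-behavior-source");

        events.print();  // windowed aggregations over event time would follow here
        env.execute("kappa-replay-sketch");
    }
}
```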
The Kappa architecture is relatively simple, offers better real-time performance, and requires far fewer computing resources than the Lambda architecture. As the demand for real-time processing continues to grow, more and more enterprises are beginning to adopt the Kappa architecture.
The popularity of the Kappa architecture does not mean that batch processing is no longer needed; batch processing still has advantages in certain scenarios. For example, data exploration and machine learning experiments require batch processing to repeatedly validate different algorithms. The Kappa architecture is well suited to data pre-processing pipelines whose logic is fixed, such as counting the exposures and purchases of goods within a period of time or the number of searches for certain keywords. Flink excels at stream processing but also implements batch processing; it is a unified stream-and-batch big data processing engine that provides the architecture with more reliable data processing capability. In the future, the Kappa architecture will gradually replace the Lambda architecture in more scenarios.
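As a small illustration of that stream-batch integration (assuming Flink 1.12 or later), the same DataStream program can be executed in batch mode over bounded input simply by switching the runtime mode; the inline elements here are illustrative only.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedBehaviorCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // With bounded input, the same DataStream program runs as a batch job.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements("search", "click", "buy", "buy")
           .map(behavior -> Tuple2.of(behavior, 1L))
           .returns(Types.TUPLE(Types.STRING, Types.LONG))
           .keyBy(t -> t.f0)
           .sum(1)          // in batch mode only the final count per behavior is emitted
           .print();

        env.execute("bounded-behavior-count");
    }
}
```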