Definition

In data analysis we often run into the following problem. Suppose we want to build a recommendation system: if we rely on batch jobs alone, refreshing recommendations once a day or even once an hour is clearly too slow. If we switch to streaming jobs, the latency problem is solved, but because we are now working on live data rather than the full historical data, accuracy is no longer guaranteed. We therefore need to combine the historical data of batch processing with the real-time data of stream processing to get both accuracy and low latency. Another example is an anti-cheating system, which must draw on a user's historical behavior while identifying cheating users in real time.

Nathan Marz, the author of Storm, proposed the Lambda architecture in response to this problem. According to Wikipedia, the Lambda architecture is designed to take advantage of both stream and batch processing when working with large amounts of data. It balances latency, throughput, and fault tolerance by using batch processing to provide comprehensive, accurate views of the data and stream processing to provide low-latency views. The results of the two are then combined to serve downstream ad hoc queries.

As the definition suggests, the Lambda architecture consists of three layers: the Batch Layer, the Speed Layer, and the Serving Layer. The architecture diagram is as follows:

The functions of these three layers are described below.

  • Batch Layer: precomputes results over the complete offline historical data so that downstream consumers can query them quickly. Because batch processing runs over the full historical data set, accuracy is guaranteed. The batch layer can be implemented with frameworks such as Hadoop, Spark, and Flink
  • Speed Layer: processes real-time incremental data with a focus on low latency. Its data is not as complete or accurate as the batch layer's, but it fills the gap left by the high latency of batch processing. The speed layer can be implemented with frameworks such as Storm, Spark Streaming, and Flink
  • Serving Layer: with both historical and real-time data available, the Serving Layer's job is naturally to merge the two and write the result to a database or other storage for downstream analysis
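To make the Serving Layer's job concrete, here is a minimal sketch in plain Python (the data and key names are invented for illustration): the batch view is complete but stale, the speed view covers only the increments since the last batch run, and the serving layer combines them at query time.

```python
def merge_views(batch_view: dict, speed_view: dict) -> dict:
    """Serving-layer merge: start from the complete-but-stale batch view,
    then add the fresh-but-partial increments from the speed view."""
    merged = dict(batch_view)  # accurate historical results
    for key, value in speed_view.items():
        # real-time increments fill the gap since the last batch run
        merged[key] = merged.get(key, 0) + value
    return merged

# Example: page-view counts per user
batch_view = {"alice": 100, "bob": 40}      # complete up to last midnight
speed_view = {"alice": 3, "carol": 7}       # increments since last midnight
print(merge_views(batch_view, speed_view))  # {'alice': 103, 'bob': 40, 'carol': 7}
```

In a real system the batch view would be rebuilt from scratch on every batch run, at which point the corresponding speed-view entries are discarded.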

Lambda architecture for Amazon AWS

Here, I will use AWS as an example to introduce the Lambda architecture, which is shown below

Batch Layer: uses S3 buckets to collect data from various data sources, uses AWS Glue for ETL, and writes the output to Amazon S3. The data can also be queried with Amazon Athena (an interactive query tool)

Speed Layer: The Speed Layer has three processes

  • Kinesis Stream ingests the incremental data from the real-time data stream and outputs it either to Amazon EMR at the Serving Layer or to Kinesis Firehose for further processing
  • Kinesis Firehose processes the incremental data and writes it to Amazon S3
  • Kinesis Analytics provides SQL capabilities for analyzing the incremental data

Of the three components above, only the first is directly relevant to the Lambda architecture we are discussing today; the other two simply support real-time processing.

Serving Layer: the Serving Layer uses Spark SQL running on Amazon EMR to merge the data from the Batch Layer and the Speed Layer. Batch data can be loaded from Amazon S3, real-time data can be read directly from Kinesis Stream, and the merged result is written back to Amazon S3.
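The actual Spark SQL merge code is not reproduced here, but the shape of such a merge is simple: union the batch table with the incremental table and re-aggregate. As an illustration only, here is a minimal sketch using Python's built-in sqlite3 in place of Spark SQL, with invented table and column names:

```python
import sqlite3

# In-memory database standing in for the S3-backed batch data and the
# Kinesis-backed incremental data (table and column names are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE batch_views (user_id TEXT, views INTEGER);  -- historical
    CREATE TABLE speed_views (user_id TEXT, views INTEGER);  -- incremental
    INSERT INTO batch_views VALUES ('alice', 100), ('bob', 40);
    INSERT INTO speed_views VALUES ('alice', 3), ('carol', 7);
""")

# The serving-layer merge: union both views, then re-aggregate per key.
merged = conn.execute("""
    SELECT user_id, SUM(views) AS views
    FROM (SELECT * FROM batch_views
          UNION ALL
          SELECT * FROM speed_views)
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()

print(merged)  # [('alice', 103), ('bob', 40), ('carol', 7)]
```

The same UNION ALL + GROUP BY pattern carries over directly to Spark SQL; only the data sources (S3 and Kinesis instead of in-memory tables) differ.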

That concludes our brief look at implementing the Lambda architecture on Amazon AWS.

My story

Next, I would like to share some experience from a previous project. Our project does not exactly match the Lambda architecture described above, but I think the idea is the same: both use batch and stream processing to complement each other, playing to the strengths of each.

Our service processes user location data. Initially, Spark Streaming handled the incremental processing, writing the results to MongoDB in real time, with reads and writes at the granularity of individual user IDs. Because the granularity is so fine, a large volume of records accumulates every day; when the front end queried a long time span, the data had to be re-aggregated per user on every request, which made queries slow and interfered with real-time writes. We therefore added batch jobs to precompute results over the offline historical data and store them in MongoDB. We also built a set of gRPC-based services to act as the Serving Layer: they combine the historical data with the real-time data and return the result to the front end, so the front end never talks to the database directly. In this project both the Batch Layer and the Speed Layer use the Spark framework, which keeps maintenance relatively easy.
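The serving logic in our project can be sketched roughly as follows (plain Python with invented names, not our actual gRPC service): daily per-user totals come from the batch job's precomputation, while today's raw points from the stream are aggregated on the fly, so a long-time-span query no longer re-aggregates fine-grained history.

```python
from datetime import date

def query_user_distance(user_id, daily_totals, todays_points):
    """Serving-layer sketch: combine batch-precomputed daily totals
    with an on-the-fly aggregate of today's streamed points.

    daily_totals:  {user_id: {date: km}}  -- precomputed by the batch job
    todays_points: {user_id: [km, ...]}   -- raw increments from the stream
    """
    historical = sum(daily_totals.get(user_id, {}).values())
    realtime = sum(todays_points.get(user_id, []))
    return historical + realtime

# Hypothetical data: two precomputed days plus today's raw increments
daily_totals = {"u42": {date(2021, 5, 1): 12.0, date(2021, 5, 2): 8.5}}
todays_points = {"u42": [0.4, 1.1]}
print(query_user_distance("u42", daily_totals, todays_points))  # 22.0
```

The point of the design is that the expensive aggregation over history runs once per day in batch, while the query path only sums a handful of precomputed totals plus today's small tail.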

In the earlier Lambda examples, the speed layer fills in the gaps left by the batch layer; in our project it is the other way around, with the batch layer filling in the gaps (slow aggregation queries) left by the speed layer. An architecture is just an idea: the concrete implementation should be adapted flexibly to the business rather than applied mechanically.

Summary

This article gave a brief introduction to the Lambda architecture, showed an example of implementing it on Amazon AWS, and finally shared an example from my own project. Throughout, we have seen the advantages of the Lambda architecture, but what about its disadvantages? There certainly are some; I can think of the following:

  • The batch, speed, and serving layers may use different frameworks, which increases development cost
  • The results of the batch layer and the speed layer can be inconsistent, and if users see the data change under them, the experience suffers
  • If the processing logic of one layer changes, the other layers often have to change with it, so the processing logic of the layers is tightly coupled

Can you think of other problems? Is there a better architecture that solves them? Feel free to share your thoughts.

You are also welcome to follow my WeChat public account, "Du Code".