1. Challenges of big data processing
Let us briefly trace the development of IT. The first phase was the emergence of the major business system platforms, which moved offline processes online to improve efficiency. The next phase is the data era, which deals with the data those platforms have accumulated. Because the accumulated data is usually large, it is called big data. Large-scale data processing was at first mainly offline, which produced the three basic components of Hadoop: HDFS, MapReduce and HBase, which handle large-scale data storage, computing and large-table storage respectively. These basically solved big data computation, and at this stage one could write programs to run large-scale operations. Big data then moved on to real-time processing. The first framework was Storm, which processes individual records in real time and so presents the latest data, but it also raises the question of what to do when you want both the latest and the historical data. So Nathan Marz, the author of Storm, proposed the Lambda architecture, which mainly deals with how to combine offline calculation results with real-time processing results to provide the final result.
2. What characteristics should a big data Lambda architecture have
Start from the requirement: what we want is an architecture that merges online calculation results with offline calculation results. Imagine a credit scenario in which I need all of a user's transactions across every lending institution, say to compute a multi-lender borrowing score from the result. The requirement is to get the latest data in real time: for example, if the user transacts with institution A in one second, a query in the next second must already include that transaction. Historical data must be stored and computed over, which is bound to take a certain amount of time, and the transaction with institution A from the last second will not land in the offline warehouse immediately, so it can only be handled by real-time processing. An architecture that covers both should have the following characteristics.
- Exactly-once processing, at least for the offline path. The environment is sometimes unreliable, and for online systems in particular exactly-once is even harder to guarantee.
- Scalability: if offline computing is too slow, for example, it can be sped up by adding resources.
- Maintainability: a Lambda architecture needs to keep the online and offline computing logic consistent, ideally by implementing both in the same way.
- Queryability: a query interface can be used to query the data calculated both offline and online.
In general, the essence is data recording + query service.
3. Introduction to big data Lambda architecture
From the requirements above, we get the model of data recording + query service. Because data records are written in different ways, the Lambda architecture splits the writing of data records into an offline batch computing layer and an online real-time computing layer.
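In the standard formulation of the Lambda architecture, we have the following formulas:
batch view = function(all data)
realtime view = function(realtime view, new data)
query = function(batch view, realtime view)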
To make queries convenient, the results are usually served as views. Such a Lambda architecture has many implementation options. The batch computing layer can use Spark, Hive and similar tools for offline batch computation over big data. The real-time layer can use a framework such as Flink, or, if the logic is not complex, a plain program can produce the results directly. As for storage, the results of offline and real-time calculation can be stored separately, or they can be stored together in a time-series database. For queries, the data can be merged either by the program or at the view level.
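As a minimal sketch of this split, assume the credit scenario from section 2 and a hypothetical record shape of (user_id, institution). The batch layer recomputes its view from the full history, the real-time layer updates its view one record at a time, and the query merges the two. All function names are illustrative, and the in-memory dictionaries only stand in for whatever Hive/Spark, Flink and the storage layer would provide in a real deployment.

```python
from collections import defaultdict

# Hypothetical record shape: (user_id, institution) for one lending transaction.


def build_batch_view(all_records):
    """Batch layer: recompute the view from the full history."""
    view = defaultdict(set)
    for user_id, institution in all_records:
        view[user_id].add(institution)
    return view


def update_realtime_view(realtime_view, new_record):
    """Real-time layer: fold one new record into the existing view."""
    user_id, institution = new_record
    realtime_view[user_id].add(institution)
    return realtime_view


def query(batch_view, realtime_view, user_id):
    """Query layer: merge the batch view and the real-time view."""
    return batch_view.get(user_id, set()) | realtime_view.get(user_id, set())


# Historical data already in the offline warehouse.
history = [("u1", "A"), ("u1", "B"), ("u2", "A")]
batch_view = build_batch_view(history)

# A transaction that arrived after the last batch run.
realtime_view = defaultdict(set)
update_realtime_view(realtime_view, ("u1", "C"))

print(sorted(query(batch_view, realtime_view, "u1")))  # ['A', 'B', 'C']
```

Using a set union for the merge means a record that shows up in both views is counted only once, which keeps the query result stable while a batch run catches up with the real-time data.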
4. Layering of the Lambda architecture
The Lambda architecture described above has three modules: the offline computing layer, the online computing layer and the query service layer.
The first is the offline computing layer. Because the volume of historical data is large, it is stored on HDFS, and the computation can use the MapReduce (MR) model.
Then there is the query view layer, which merges the offline pre-processed data with the online calculation results in order to serve queries.
5. An example implementation of the Lambda architecture
Hive and Spark are used for offline computing. To stay aligned with the online computing logic, both sides depend on the same JAR, with the offline side invoking that logic from a UDF. An enable_time field distinguishes the time points of online and offline data. Eggroll is an offline KV storage database similar to HBase.
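The idea of sharing one piece of logic between both paths can be sketched as follows. This is only an illustration under stated assumptions: enable_time is treated here as a cutoff timestamp (records before it belong to the offline path, the rest to the online path), and the record fields user_id, institution and ts, as well as every function name, are hypothetical. A plain Python function stands in for the shared JAR and the UDF.

```python
from datetime import datetime


def distinct_institutions(records):
    """Shared business logic, called unchanged by both paths."""
    seen = {}
    for rec in records:
        seen.setdefault(rec["user_id"], set()).add(rec["institution"])
    return seen


def split_by_enable_time(records, enable_time):
    """Assumed rule: records before enable_time are offline, the rest online."""
    offline = [r for r in records if r["ts"] < enable_time]
    online = [r for r in records if r["ts"] >= enable_time]
    return offline, online


def merged_count(records, enable_time, user_id):
    """Run the same logic on each side of the cutoff, then merge the results."""
    offline, online = split_by_enable_time(records, enable_time)
    offline_view = distinct_institutions(offline)  # stands in for the offline UDF
    online_view = distinct_institutions(online)    # stands in for the online service
    merged = offline_view.get(user_id, set()) | online_view.get(user_id, set())
    return len(merged)


records = [
    {"user_id": "u1", "institution": "A", "ts": datetime(2024, 1, 1, 23, 0)},
    {"user_id": "u1", "institution": "B", "ts": datetime(2024, 1, 2, 9, 0)},
    {"user_id": "u1", "institution": "A", "ts": datetime(2024, 1, 2, 10, 0)},
]
enable_time = datetime(2024, 1, 2, 0, 0)

print(merged_count(records, enable_time, "u1"))  # 2
```

Because both sides call the same function, a change to the business rule only has to be made once, and the cutoff guarantees that each record is processed by exactly one of the two paths.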
6. Thoughts on the Lambda architecture
The Lambda architecture has gone through many years of development. Its advantages are stability, a controllable computing cost for real-time computing, and the ability to run full batch calculations at night, which separates the peak of real-time computing from that of offline computing. This architecture supported the early development of the data industry, but it also has some fatal disadvantages, and in the era of Big Data 3.0 it is less and less suited to the needs of data analysis. The disadvantages are as follows:
- Inconsistent metrics caused by differences between real-time and batch results: because batch and real-time calculation go through two different computing frameworks and programs, the results often differ. It is common to see one figure on the day itself and then find the next day that yesterday's figure has changed.
- Batch calculation cannot finish within the computing window: in the IoT era the volume of data keeps growing, and teams often find that the nightly window of only 4 or 5 hours is not enough to process the more than 20 hours of data accumulated during the day. Making sure the data is delivered before work starts in the morning has become a headache for every big data team.
- Complexity of development and maintenance: the Lambda architecture requires programming the same business logic twice against two different APIs (Application Programming Interfaces), one for the batch ETL system and one for the streaming system. This creates two code bases for the same business problem, each with its own defects. Such systems are very difficult to maintain.
- Large server storage: the typical data warehouse design produces a large number of intermediate result tables, which leads to rapid data growth and increased storage pressure on the servers.