By Yan Fei, quality guaranteed ~
About the author: Yan Fei, a big data veteran at Kyligence, has more than 15 years of experience in big data and data warehousing, with in-depth research and hands-on practice in construction planning, architecture design, technical systems, methodology, and the products and solutions of mainstream big data and data warehouse vendors.
In the previous big data article, “say, I just need a meal”, I used something close to everyone's daily life, cooking, to introduce big data and data analysis projects. It should have given everyone some understanding of data analysis, an industry that looks very specialized. I am glad that the article also resonated with many data professionals and prompted a lot of interaction.
In this article, we will continue along the same lines and talk a little more about data analysis architecture.
What is data analysis architecture? In plain terms, it is how functional modules are divided (a specialized division of labor) across the data analysis process: data collection (buying the groceries), data modeling (prepping the ingredients), data processing (cooking the dishes), and data analysis (eating the dishes). Only with such a division can the data analysis (food) needs of the vast number of data consumers (foodies) be met conveniently, flexibly, and at scale.
Take eating as an example. We can cook in our own kitchen, go to a restaurant, or order takeout. These ways of eating reflect the evolution of how people live, and each meets the needs of foodies in different periods and at different levels through a different specialized division of labor.
Data analysis is a rather more specialized matter than eating, so it also needs process design and a specialized division of labor to meet broader data consumption needs. This is what we usually call architecture design.
Without further ado, let me break down the history of data analysis architecture to date into three stages:
Data Analysis Phase 1.0: Business reporting
This is the initial stage of data analysis. With the emergence of database technology, enterprises began building information systems, and digitized business processes accumulated large amounts of business data. In fact, the demand for data analysis has always been there: as soon as data accumulated, the demand for report statistics and data analysis on that data naturally appeared.
In the 1.0 stage, data analysis began to sprout. Data processing and report statistics were carried out directly in the business system: data was generated and analyzed in the same system, so there was no data collection step yet.
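As a minimal sketch of what "reporting directly in the business system" means, the example below uses an in-memory SQLite database. The `orders` table, its columns, and the figures are hypothetical and only serve as illustration; the point is that the report query runs against the same tables that the business process writes to.

```python
import sqlite3

# Hypothetical business system: one database that both records orders
# and answers reporting queries (table and values are illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# Business process: transactions write rows as they happen.
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)
conn.commit()

# Report statistics: an aggregate query run against the very same tables,
# sharing database resources with the business workload.
report = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(report)
```

Because the report and the business transactions share one database, a heavy report can slow down normal business operations, which is the pressure that leads to the next stage.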
Phase 1.0 is just like cooking at home. As you can imagine, with limited ingredients (data), kitchen (database resources), and skills (professional ability), the dining experience will not be very good (the typical experience is shown below). It mainly satisfies the need to fill one's stomach (report statistics).
Data Analysis Phase 2.0: Data marts
Doing data analysis directly in the business system is a poor experience and may also interfere with normal business processes. Meanwhile, enterprise demand for data analysis kept growing. Business people naturally wanted an independent system, outside the business system, dedicated to data analysis: one that could support analysis without affecting normal business processes. And so data marts came into being.
With data marts, data analysis began to emerge as a proper industry. The need arose to collect and transfer data (buy the groceries) from business systems to the data mart, and specialized roles and practitioners in data processing, data analysis, and related areas began to appear.
This is just like the emergence of restaurants, which brought specialization to the business of eating and created the catering industry. Restaurants have people who specialize in buying groceries, prepping ingredients, and cooking dishes, and chefs began to appear. This approach serves the majority of foodies well in terms of convenience, choice of dishes, and taste. The experience is naturally great.
Data Analysis Phase 2.5: Data warehouse
With enterprise data analysis in full swing, more and more data marts were built. The same data processing logic and the same indicators were inevitably computed repeatedly across the scattered marts. This wasted computing resources and often produced inconsistent statistics, leaving managers unsure which numbers to trust.
This is just like what happens when many restaurants open: the same dishes inevitably appear on different menus, yet the same “fish-flavored shredded pork” inevitably tastes different from one restaurant to the next. Foodies are bound to wonder which version is the most authentic, and they also want to know which one is the most delicious.
At this time, the concept of data warehouse came into being.
To solve the inconsistent data and wasted resources caused by building data marts in a decentralized way, the data warehouse advocates a single centralized platform for data collection, data cleaning, and data processing, which then provides various data analysis products and services to its consumers.
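The inconsistency problem and the centralized remedy can be sketched in a few lines. This is only an illustration under stated assumptions: the order records, the two mart functions, and the "revenue excludes refunds" rule are all hypothetical, chosen to show how the same indicator name can carry different definitions in different marts.

```python
# Hypothetical order records available to every team (illustrative data).
ORDERS = [
    {"customer": "a", "amount": 100.0, "refunded": False},
    {"customer": "b", "amount": 40.0, "refunded": True},
    {"customer": "a", "amount": 60.0, "refunded": False},
]


def sales_mart_revenue(orders):
    # The sales mart counts every order, refunded or not.
    return sum(o["amount"] for o in orders)


def finance_mart_revenue(orders):
    # The finance mart excludes refunds: same indicator name, different definition.
    return sum(o["amount"] for o in orders if not o["refunded"])


def warehouse_revenue(orders):
    # Warehouse approach: one agreed definition, computed once and reused.
    return sum(o["amount"] for o in orders if not o["refunded"])


# Decentralized marts: each team computes "revenue" on its own.
mart_results = {
    "sales": sales_mart_revenue(ORDERS),
    "finance": finance_mart_revenue(ORDERS),
}

# Centralized warehouse: every team reads the same computed value.
shared = warehouse_revenue(ORDERS)
warehouse_results = {"sales": shared, "finance": shared}

print(mart_results)
print(warehouse_results)
```

In the mart version the two teams report different revenue for the same orders, and the logic runs twice. In the warehouse version the indicator is defined and computed in one place, so every consumer sees the same number.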
The data warehouse opened a true era in the history of data analysis and made lasting contributions to the development and maturity of the data analysis industry:
- It gave rise to data warehouse technologies such as Massively Parallel Processing (MPP), and to a large number of specialized vendors, to meet the need to store, process, and analyze large volumes of data centrally
- It developed a systematic methodology and best practices for building data warehouse systems
- It trained a large number of data warehouse practitioners (DWers)
Since the data warehouse era played such an important role in the history of data analysis and continues to have a profound influence today, a question arises.
Why is the data warehouse phase only 2.5 and not 3.0?
First of all, from an architectural point of view, I personally think there is no essential difference between a data warehouse and a data mart. This can also be seen in the diagram of the three stages of “data analysis architecture” development above: the data mart and data warehouse architectures are very similar. A data warehouse can simply be regarded as a super data mart, and the only difference is scale. It is as if, to standardize the quality of dishes and let people eat a variety of dishes in one stop, we opened a super restaurant. The restaurant is very big, but it is still a restaurant.
Secondly, the data warehouse aims to solve the scattered data and inconsistent metric definitions of data marts, and proposes the Single View of Business to create a unified, enterprise-level view of the business. Its construction method emphasizes the standardization of data collection, data management, and data processing. This approach is very valuable from a data management perspective and has produced many mature data management standards and data governance methodologies.
But…
From the perspective of data analysis, data warehouse systems have met business analysis needs to some extent. But the traditional data warehouse construction method falls short when it comes to flexibly supporting varied data requirements, responding to analysis requests with agility, and spreading a data-driven analysis culture across the enterprise.
There are technical and cost reasons for this, but tight architectural coupling and a rigid construction method are also important causes. For example:
- The centralized architecture of the data warehouse supports both data processing and data services on a single platform, which inevitably leads to resource contention, so neither can be served well. It is like a restaurant where the kitchen takes up too much space and the dining room is cramped, limiting the number of customers who can be served at the same time.
- Data processing in a data warehouse proceeds layer by layer, in tightly linked steps with a strict process, and involves the cooperation of multiple roles. Any data analysis requirement, from the moment it is raised to its final delivery, takes several weeks at best and several months at worst. Naturally, this cannot keep up with a rapidly changing business. It is like a restaurant where, if a customer orders a dish that is not on the menu, the kitchen has to go through every step, buying, washing, prepping, and stir-frying, so the customer waits at least two or three hours, or even until the next day. No customer can tolerate that.
- Most data warehouses are built in a data-driven way: whether or not the data is needed, it is loaded into the warehouse first, on the assumption that it will be useful someday. This makes the warehouse grow rapidly and fill up with data that produces no output, and operation and maintenance become very costly and difficult. It is like running a restaurant and buying every ingredient available no matter what the customers like to eat. Cost aside, the work of shipping, cleaning, and storing it all is exhausting.
- Data warehouse construction has a mature and complete body of data governance theory: metadata management, data standards management, data quality management, and so on. But when put into practice, these theories usually end up as paper specifications that are never organically integrated into the warehouse construction process. In the end it becomes "you write your specifications, I build my system", or "I build the system first, you write the specifications afterward". As the system grows, nobody knows exactly what is in the warehouse, and the whole warehouse becomes hard to manage and use.
Therefore, although data warehouses have been developing for decades and many enterprises have spent a great deal of manpower and money building them, the platforms lack agile construction methods, offer little room for self-service, and respond slowly, so data consumers of all kinds are not very satisfied.
Gradually, the data warehouse systems of many enterprises have come to resemble the ancient imperial kitchen. It gathers all kinds of ingredients and enforces strict specifications on ingredients, process, and presentation, fully guaranteeing the quality and standard of the dishes. But serving speed, table turnover, and the number of diners it can serve are all limited, so it can only provide a range of specific dishes for a specific group (the royal family).
So although the data warehouse has developed mature methodologies for data storage, data collection, data processing, and data governance (the equivalent of professional restaurant kitchen management theory), it has long been criticized for how poorly it meets flexible, agile, and widely accessible data analysis needs.
In today's era of big data, this weakness is even more obvious.
The challenge brought by the wave of big data is not only the explosive growth of data volume. Individuals, enterprises, and governments now attach unprecedented importance to data and data analysis, and society's demand for data analysis is growing explosively as well. This is why Gartner put forward the concept of the Citizen Data Scientist, and why more and more vendors and industry leaders have taken up the slogan “everyone is a data analyst”.
How can companies meet the data analysis needs of thousands of internal employees? How can enterprises meet the data analysis needs of tens of millions of external customers? How can governments meet the data analysis needs of hundreds of millions of people? These are the questions that data architects in the era of big data need to answer.
It can be said that the contradiction between users' ever-growing demand for data analysis and lagging data service capability has become the main contradiction of the big data era.
The data warehouse emphasizes the data processing flow while neglecting the efficiency of data services, its construction method is too strict and complicated, and development is disconnected from data management. These problems make it hard to scale out quickly and leave it unable to cope with explosive demand for data analysis and data services. Even setting aside technical and cost constraints, the traditional data warehouse construction methodology clearly cannot resolve the main contradiction of the big data era.
So, in the era of big data, where is the way forward for big data analysis architecture? What kind of data platform construction method is the most effective? Can the mature construction methodology of the data warehouse be adapted to cope with explosive demand for data analysis?
Takeout! That's all for this time. Next time, we'll keep chatting over takeout ~