This question comes up often, and many of the explanations out there are not very intuitive, so I’ll try to use an example that everyone understands.
What is a data warehouse?
If you want to pick up a piece of furniture yourself at IKEA, you usually write down the code on the item’s tag.
To the customer, this code has no meaning at all. Looking at it, you cannot tell what product it refers to.
To the warehouse manager, however, the code is meaningful: it tells them exactly which shelf and which location the item is on.
Of course, customers can also find goods in the warehouse by following the shelf and location, but that is never very intuitive, and the actual choosing still has to happen upstairs.
So a data warehouse is like the warehouse on the first floor of IKEA. There, the data (furniture) is organized according to a specific model, such as FS-LDM (shelves and locations). This kind of model is not friendly to customers (business people, the end users of the data), but it is relatively friendly to technical staff (warehouse managers, IKEA employees). Because the data (furniture) is managed under stricter rules, storage is centralized and orderly, and retrieving data (picking up goods) does not require searching the whole warehouse, so lookups are more efficient.
So what is a data mart?
As mentioned above, the data warehouse is not very friendly to business people. Likewise, you can’t have customers wandering through the warehouse, can you? To meet customers’ needs, furniture is grouped by type and displayed in the different rooms of a home, just like IKEA’s upstairs showroom.
It has been mocked as a maze, but overall the shopping experience is definitely better than walking through the warehouse.
So the data mart is like IKEA’s upstairs showroom. As the name “mart” suggests, it is a data marketplace that faces the end user (the customer). Here the data (furniture) is combined in ways that business people (customers) find easier to accept. These combinations can vary a great deal, because the needs of business people (customers) keep changing. We therefore need to adjust the mart’s calculation definitions regularly (the way the showroom is laid out) and often create new data marts (decorate new showrooms).
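The relationship between the two layers can be sketched in a few lines. This is only an illustration, not a reference design: the table names, columns and sample rows are hypothetical, and an in-memory SQLite database stands in for a real warehouse. The normalized tables play the role of the warehouse shelves, and the view plays the role of a showroom built on top of them.

```python
import sqlite3

# In-memory SQLite stands in for a real warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Warehouse layer: normalized tables organized by a model (hypothetical schema).
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    category   TEXT NOT NULL
);
CREATE TABLE sale (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    sale_date  TEXT NOT NULL,
    amount     REAL NOT NULL
);
INSERT INTO product VALUES (1, 'Bookcase', 'Living room'),
                           (2, 'Sofa', 'Living room'),
                           (3, 'Bed frame', 'Bedroom');
INSERT INTO sale VALUES (1, 1, '2024-01-05', 60.0),
                        (2, 2, '2024-01-06', 400.0),
                        (3, 3, '2024-01-06', 250.0),
                        (4, 1, '2024-01-07', 60.0);

-- Mart layer: one business-facing view, pre-joined and pre-aggregated.
CREATE VIEW mart_sales_by_category AS
SELECT p.category    AS category,
       COUNT(*)      AS items_sold,
       SUM(s.amount) AS revenue
FROM sale AS s
JOIN product AS p ON p.product_id = s.product_id
GROUP BY p.category;
""")

# A business user queries the mart without knowing the warehouse model.
rows = cur.execute(
    "SELECT category, items_sold, revenue "
    "FROM mart_sales_by_category ORDER BY category"
).fetchall()
for row in rows:
    print(row)
```

When the business changes how it wants to slice the numbers, only the view (the showroom) has to be redefined; the underlying tables (the shelves) stay as they are.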
Understanding the concepts of the data warehouse and the data mart helps answer some related questions, such as why you still need to build a data mart when you already have a data warehouse.
So what is a data lake?
So far there is no standard definition of a data lake. Among the various definitions, the most widely shared one is that a data lake stores raw data, including all kinds of structured and unstructured data. Let’s continue with the example above.
As we all know, IKEA furniture has to be assembled by the customer, so IKEA customers have some hands-on ability. Suppose they suddenly think: could all the furniture be broken down into parts and stored that way, so that customers can pick parts and assemble them according to their actual needs?
The data lake, then, is a store that holds all of the enterprise’s raw data (furniture parts), and this brings a series of problems. Storing processed data is already complicated, and raw data depends even more on management functions; otherwise the data becomes too voluminous and messy to manage, and the data lake degenerates into a data swamp. In addition, raw data cannot be assembled without a unified data standard, just as furniture parts with different connectors do not fit together.
Therefore, a data lake must have thorough data management functions, and it also depends on unified data standards and good data quality management.
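The “different connectors” problem can be made concrete with a small sketch. Everything here is hypothetical: the three source systems, their field names and the target schema are invented for illustration. The point is only that raw records can be combined once each source is mapped to a shared standard, and that anything without a mapping is effectively unusable.

```python
import json
from typing import Optional

# Raw records as they might land in a lake (hypothetical sources and field names).
raw_lake = [
    '{"source": "crm", "cust_id": "42", "amt": "19.90"}',
    '{"source": "web", "customerId": 42, "amount_cents": 500}',
    '{"source": "pos", "customer": "0042"}',
]


def to_standard(record: dict) -> Optional[dict]:
    """Map one raw record to the shared schema, or return None if no mapping exists."""
    source = record.get("source")
    try:
        if source == "crm":
            return {
                "customer_id": int(record["cust_id"]),
                "amount": float(record["amt"]),
            }
        if source == "web":
            return {
                "customer_id": int(record["customerId"]),
                "amount": record["amount_cents"] / 100,
            }
    except (KeyError, ValueError, TypeError):
        # A known source whose record is malformed is also rejected.
        return None
    # Unknown source: no standard has been agreed for it yet.
    return None


standardized = []
rejected = []
for line in raw_lake:
    record = json.loads(line)
    mapped = to_standard(record)
    if mapped is None:
        rejected.append(record)
    else:
        standardized.append(mapped)

print(standardized)
print(f"{len(rejected)} record(s) could not be mapped")
```

In this sketch the `pos` record is kept in the lake but cannot be joined with anything, which is how a lake starts drifting toward a swamp when standards and quality checks are missing.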
So what is the data center?
There is no precise definition of the data center (here meaning a data middle platform rather than a server facility), and it is difficult to explain with the IKEA example. Let’s look at the various data platforms instead.
In traditional data applications, the more business-friendly data becomes, the less timely it is. Our goal, obviously, is data that is both fast and good. Since each department has different needs, why not let the business analyze the data itself? That gives us the target state in the upper right. But there is a huge gap between this ideal state and how we currently use data. What can fill it? The answer is the data center.
We can distinguish a narrow and a broad sense of the data center. In the narrow sense, the data center is a set of data applications and tools, including distributed ETL, data label management, asset management, a data sandbox, a self-service analysis platform, metadata management, data quality management, and so on. Underneath, the existing data warehouse and big data platform serve as data sources. Together these give the enterprise the ability to manage its data assets, to keep mining the value of its data, and to keep providing data intelligence services.
In the broad sense, the data center builds on the narrow sense and adds top-level data strategy, a data governance system, data management and operations, the cultivation of a data culture, and organizational support. It is a system of continuous management and operation.
It can be said that the data center in the narrow sense is designed to achieve the data center’s mission, which has two parts. One is to make data processing and integration faster, for example with distributed ETL tools.
As traditional data systems are gradually replaced by big data platforms, ETL tools need to keep pace with the times and adapt to those platforms: supporting distributed computing and elastic computing, and reducing the amount of development work.
The other is to make data generate business value more effectively, for example through data label management and a self-service analysis platform. Every company has data labels, but enterprises that use them in depth always find that building labels that are genuinely easy to use is difficult. Without a label management system, there is no control over whether labels are processed redundantly, how often they are used, or how accurate they are. And when a business unit wants a new label for a recent marketing campaign, it still has to go through the development process, with no guarantee on timeliness.
A data label management system is set up to solve these problems with using labels. A self-service analysis platform makes it convenient for business personnel to analyze, process and explore data by themselves. Combined with a data sandbox, it gives business personnel direct access to production data with private information removed, so that data can generate value faster and support key decisions.
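As a rough picture of what label management means in practice, here is a minimal sketch of a label registry. It is not modeled on any particular product: the class names, the rule format and the normalization step are all assumptions made for illustration. It covers two of the control points mentioned above, namely avoiding duplicate label definitions and tracking how often a label is used.

```python
from dataclasses import dataclass


@dataclass
class Label:
    name: str
    rule: str           # the label definition, e.g. a filter expression
    owner: str
    usage_count: int = 0


class LabelRegistry:
    """Hypothetical in-memory registry; a real system would persist this."""

    def __init__(self) -> None:
        self._by_name: dict = {}
        self._by_rule: dict = {}

    @staticmethod
    def _normalize(rule: str) -> str:
        # Naive normalization (assumption): ignore case and extra whitespace.
        return " ".join(rule.lower().split())

    def register(self, name: str, rule: str, owner: str) -> Label:
        key = self._normalize(rule)
        if key in self._by_rule:
            # Same definition already exists: reuse it instead of rebuilding.
            return self._by_rule[key]
        if name in self._by_name:
            raise ValueError(f"label name already taken: {name}")
        label = Label(name=name, rule=rule, owner=owner)
        self._by_name[name] = label
        self._by_rule[key] = label
        return label

    def use(self, name: str) -> Label:
        # Record each use so unused labels can be found and retired later.
        label = self._by_name[name]
        label.usage_count += 1
        return label


registry = LabelRegistry()
first = registry.register("high_value", "total_spend > 1000", "marketing")
second = registry.register("big_spender", "TOTAL_SPEND  >  1000", "sales")
registry.use("high_value")

print(second is first)      # the second request reused the existing label
print(first.usage_count)
```

With a registry like this, a business unit that asks for a “new” label for a campaign may discover that an equivalent one already exists, which is faster than waiting for a development cycle.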
The data center in the broad sense is the mechanism that helps the narrow data center achieve its mission. Although it looks rather abstract, it is a necessary guarantee for the data center to be put into practice successfully.
Does all this have to be done?
That depends on the specific situation of the enterprise. The general principle is that supporting business development comes first: do not build infrastructure for the sake of building infrastructure; the ultimate goal must be solving business needs.
“The Mythical Man-Month” declared long ago that there is no silver bullet. Naturally, the data warehouse, data mart, data lake and data center are not silver bullets either. Do not assume that once they are built, digital transformation will complete itself.
In short, using a range of snazzy new technologies does not necessarily make a company a digital leader, and going without them does not necessarily make it a backward workshop from the classical Internet era. The key is to recognize one’s own digital status quo, set digital goals, plan a digital path, optimize the scenarios, and realize value.
New technologies and the various kinds of data infrastructure are only one feasible set of action plans along this road. They reorganize banks’ past digitalization attempts under a systematic, structured methodology and give them an up-to-date technical framework for implementation.