A First Acquaintance with the Data Lake

Why has everyone been talking about the Data Lake these past two years?

In fact, it is still user demand that drives data services. The fundamental reason people have started to pay attention to the Data Lake is that user demand has changed qualitatively: the old data-warehouse model and its components can no longer keep up with ever-advancing user needs.

The data lake concept was born out of challenges companies faced, such as how data should be processed and stored. In the beginning, the way enterprises managed their wide variety of applications went through a fairly natural evolution cycle.

So what are the needs and challenges that drive technological change and give birth to new technologies?

Definition of a data lake

Wikipedia says a data lake is a system or repository that stores data in its natural/raw format, usually as object blobs or files, including copies of raw data from source systems as well as data transformed for various tasks. This covers structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video).

AWS defines a data lake as a centralized repository that allows you to store all structured and unstructured data at any size.

Microsoft’s definition is vaguer: rather than stating clearly what a Data Lake is, it defines one by what it enables. A Data Lake gives developers, data scientists, and analysts a simpler way to store and process data, letting users store data of any size, any type, and any ingestion speed, and run all kinds of analysis and processing across platforms and languages.

But as big data technologies converge, that early definition may no longer be so precise. The data lake is evolving to bring together many technologies, including data warehousing, real-time and high-speed data streaming, data mining, deep learning, and distributed storage. It has gradually developed into a unified data management platform that can store all structured and unstructured data at any scale, run different kinds of big data tools, and perform big data processing, real-time analysis, and machine learning on the data.

Therefore, the data warehouse is no longer the warehouse it used to be, and the data lake is no longer the “summer rain beside Daming Lake” it used to be. Sorry, it is not that green lake either.

Trend

Here’s an important trend: real-time data.

Of course there are many other trends, such as low cost and cloud-native design, but overall I think real-time data is the hottest and most visible trend of recent years, and the one whose benefits are easiest to see.

We probably all know the old data warehouse pattern: the warehouse is layered into ODS, DWD, and DWS; Hive serves as the storage medium; and Spark or MapReduce handles the data cleaning and computation.

The design of this kind of warehouse is clear and the data is easy to manage, so people happily applied this theory and practice for about ten years.
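The ODS → DWD → DWS flow described above can be sketched in plain Python. This is a toy stand-in for the Hive tables and Spark/MR jobs; the event fields are invented for illustration:

```python
# Toy sketch of the classic layered warehouse: ODS (raw) -> DWD (cleaned) -> DWS (aggregated).
# In practice each layer is a Hive table and each arrow a Spark/MR job;
# the event fields here are invented for illustration.

ods_events = [  # raw landing layer: data exactly as ingested
    {"user_id": "u1", "action": "click", "ts": "2024-01-01 10:00:00"},
    {"user_id": "u1", "action": "click", "ts": "2024-01-01 10:05:00"},
    {"user_id": None, "action": "click", "ts": "2024-01-01 10:06:00"},  # dirty row
    {"user_id": "u2", "action": "buy",   "ts": "2024-01-01 11:00:00"},
]

def ods_to_dwd(rows):
    """Cleaning job: drop rows with missing keys."""
    return [r for r in rows if r["user_id"] is not None]

def dwd_to_dws(rows):
    """Aggregation job: per-user action counts, ready for reporting."""
    stats = {}
    for r in rows:
        key = (r["user_id"], r["action"])
        stats[key] = stats.get(key, 0) + 1
    return stats

dwd = ods_to_dwd(ods_events)
dws = dwd_to_dws(dwd)
print(dws)  # {('u1', 'click'): 2, ('u2', 'buy'): 1}
```

Each layer is materialized, so downstream consumers never have to re-derive upstream cleaning logic; that separation is exactly what made the pattern so durable.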

Over those ten years, the mainstream data applications at Internet companies have not changed much: user profiles for recommendation, product tags in e-commerce, data flowing through social graphs, risk-control systems in finance. Seen from a higher vantage point, a user-profile table built ten years ago is still the table you need if you build a recommendation service today. What is the result? Ten years of accumulated talent, knowledge, and experience in the Internet industry have made these things easier to do. For example, ten years ago it was hard to recruit people who understood recommendation data; now that skill is just the industry average.

Now that these things are easier to do and talent is plentiful, we expect to be more refined in what we do. From a business point of view, needs like recommending short videos or getting users to buy things are endless and can be worked on forever. In the past, T+1 was enough: I learned what a user liked the next day. Now that this is easy to meet, I want a user who arrived ten seconds ago to already be telling me what they like. In the past we did coarse-grained operations, such as campaigns targeted at the whole population; now we need to change our thinking toward finer-grained operations, providing personalized results for each individual user.

Technology evolution – Real-time

Real-time data is fine, but what about the technology? Should we build, in real time, a data system and layering pattern similar to the offline data warehouse?

Yes, many companies really do divide the real-time data flow into layers, the “real-time data warehouse” we talk about. The overall layering ideas are similar to those of the offline warehouse, but the carrier of the real-time data is not Hive or HDFS; instead a more real-time message queue such as Kafka is chosen, which brings many problems, such as:

  • Message queues retain data only for a limited time
  • Message queues have no query or analysis capability
  • Backtracking (replaying) data is less efficient than on a file system

Besides the data carrier, introducing a real-time data warehouse also raises the problem of unifying it with the offline data warehouse:

  • Should metadata and permission management for the real-time warehouse be built as a separate system?
  • How do we keep the calculation logic of real-time and offline metrics consistent?
  • Two parallel data systems mean serious waste; don’t costs just go up?

To take a realistic example: suppose we build a real-time metric and then discover a calculation error, so we need to correct yesterday’s data. In that case we usually write a separate offline task that reads from the offline warehouse, recomputes, and writes the result back to storage. This means that for every real-time requirement we implement, we also write an offline task, which is a huge cost for engineers.
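The “write everything twice” cost is visible in a small sketch: the same metric has to exist once as streaming code and once as batch code. The function and field names below are invented for illustration:

```python
# The same metric implemented twice: once incrementally for the
# streaming path, once as a full recompute for the offline correction
# path. Any bug fix or logic change must now be made in both places.
# Function and field names are invented for illustration.

def streaming_order_count(state, event):
    """Real-time path: update a running count per incoming event."""
    if event["type"] == "order":
        state["orders"] = state.get("orders", 0) + 1
    return state

def batch_order_count(events):
    """Offline path: recompute yesterday's metric from the warehouse."""
    return sum(1 for e in events if e["type"] == "order")

events = [{"type": "order"}, {"type": "click"}, {"type": "order"}]

# Streaming result, built event by event:
state = {}
for e in events:
    state = streaming_order_count(state, e)

# The offline correction recomputes the same number from scratch:
assert state["orders"] == batch_order_count(events) == 2
```

Two functions, one definition of “order count”: the moment the definition changes in one place and not the other, the real-time and offline numbers diverge, which is exactly the calculation-caliber problem above.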

Technology evolution – Cost reduction

The cost of real-time systems is so high that many companies are afraid of real-time requirements. So the idea of building a separate real-time warehouse is not going to fly: I would need to hire twice as many people (probably more) and spend twice as much time on a feature that grows the business by only 10%. From a technical point of view, it is the difference between the two technology stacks that prevents the engineering from being unified. The Data Lake exists to solve exactly this problem: for example, a single task can generate both real-time and offline indicators, similar to the following figure:

The most important premise is that my data source is real-time, which poses a new challenge to our big data storage, mainly HDFS and S3: real-time data updates. When existing technologies or components cannot meet a requirement, new technologies are born, driven by that requirement.
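The missing primitive is a record-level update on append-only storage. Lake table formats such as Apache Hudi, Iceberg, and Delta Lake add a merge/upsert operation on top of HDFS/S3; the sketch below shows only the semantics of such a merge, with a plain dict standing in for the table, not any real API:

```python
# Semantics of an "upsert" (merge) on a keyed table: the operation that
# plain HDFS/S3 files lack and that lake table formats (Hudi, Iceberg,
# Delta Lake) provide. A dict stands in for the table; not a real API.

def merge_upsert(table, updates, key="id"):
    """For each incoming row: update if the key exists, else insert."""
    for row in updates:
        table[row[key]] = row
    return table

table = {
    1: {"id": 1, "status": "pending"},
    2: {"id": 2, "status": "paid"},
}
updates = [
    {"id": 1, "status": "paid"},     # late-arriving correction -> update
    {"id": 3, "status": "pending"},  # new record -> insert
]
merge_upsert(table, updates)
print(sorted(table))  # [1, 2, 3]; record 1 is now "paid"
```

With this primitive, the streaming job can keep correcting a table in place, and the offline job can read the same table, instead of each path owning its own storage.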

Beyond the computing layer, can data management also be unified, such as schema management for intermediate tables and data permission management? With a unified architecture, we can minimize the redundancy between real-time and offline when handling real-time requirements, perhaps even to the point of almost no extra cost.

We are actively exploring this as well. The mainstream practice at domestic Internet companies is still at the “Technology evolution – cost reduction” stage; we believe that with everyone’s efforts, excellent and successful practices will soon emerge.

Technology evolution – De-structuring

James Dixon, Pentaho’s CTO, coined the concept of the “Data Lake” in 2010. Facing the challenges of big data, he declared: don’t think of a “warehouse” of data, think of a “lake” of data. The key difference is that data in a data warehouse must be classified before it enters the warehouse for future analysis. That was common in the OLAP era, but for offline analysis it makes little sense to discard large amounts of raw data, and cheap storage now makes keeping it all possible.

A data warehouse is a highly structured architecture: data cannot be loaded into it before being transformed, so that users can directly access analysis-ready data. In a data lake, data is loaded directly and then transformed according to the needs of each analysis.

A data lake is schema-on-read. In my opinion, though, what has changed is not so much the schema as the attitude toward data. In the past we extracted into the warehouse only the data that was currently useful. Now we want the lake to hold all the data, even data with no current use, so that it is there the moment we want it.
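Schema-on-read can be shown with raw JSON lines: every field is kept at write time, and a schema is applied only when a particular analysis reads the data. The event payloads below are invented for illustration:

```python
import json

# Schema-on-read: the lake stores each event exactly as received; a
# reader projects only the fields its analysis needs, deciding the
# schema at read time. The payloads are invented for illustration.

raw_lines = [
    '{"user": "u1", "action": "click", "device": "ios", "ab_bucket": 7}',
    '{"user": "u2", "action": "buy", "price": 9.9}',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: project fields, defaulting missing ones to None."""
    return [{f: json.loads(line).get(f) for f in fields} for line in lines]

# Today's analysis only needs user and action...
today = read_with_schema(raw_lines, ["user", "action"])
# ...but "ab_bucket" was stored anyway, so a future analysis can use it.
later = read_with_schema(raw_lines, ["user", "ab_bucket"])
print(today[0], later[0])
```

Had the ingest pipeline projected events down to `(user, action)` at write time, the later `ab_bucket` question would simply be unanswerable; keeping the raw lines is what preserves the option.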

In fact, “de-structuring” is a slightly inaccurate word here, but I can’t think of a better one; none of the alternatives quite capture the feeling either. Let it stand, and you will understand what I mean.

Differences between a data lake and a data warehouse

The data warehouse is a mature, stable technical architecture. It stores structured data processed by ETL to support holistic decision-making. A data warehouse combines data into an aggregated, summarized form for enterprise-wide use and writes metadata and schema definitions as the data is written. Data warehouses usually have a fixed configuration; they are highly structured and therefore less flexible and agile. Their costs come from processing all data before storage, and the cost of their bulk storage is relatively high.

The data lake is a relatively new technology with an evolving architecture. Data lakes store raw data in any form (structured or unstructured) and any format (text, audio, video, images). By definition, a data lake does not impose data governance, but experts agree that good data management is essential to prevent a data lake from becoming a data swamp. Data lakes create the schema during data reads, so they are less structured, more flexible, and more agile than data warehouses. No processing is required before data is retrieved, and data lakes deliberately use cheaper storage.

A point-by-point comparison:

  • Data types. Data lake: handles all types of data (structured, semi-structured, unstructured); the type depends on the raw format of the source system. Data warehouse: only structured data, which must conform to a model defined in advance.
  • Schema. Data lake: the schema is designed at read time and the original raw data is stored. Data warehouse: the schema is designed at write time and the data is stored after processing.
  • Processing. Data lake: has sufficient computing power to process and analyze all types of data, and analysis results are stored for users. Data warehouse: processes structured data into multidimensional cubes or reports for subsequent advanced reporting and analysis needs.
  • Scope. Data lake: usually contains more data, which has a good chance of being accessed later and can be used to uncover new operational requirements for the enterprise. Data warehouse: typically used to store and maintain long-term data, accessed on demand.

The differences between data lakes and data warehouses are obvious. However, their roles in the enterprise are complementary, and the data lake should not be considered a replacement for the data warehouse; after all, the two play entirely different roles.

  1. Data value: The data warehouse stores structured data, while the data lake can store both raw and structured data, ensuring users can obtain data at every stage. Because the value of data is strongly tied to particular businesses and users, the same data may be meaningless to user A but significant to user B, so all data needs to be stored in the data lake.

  2. Real-time data: The data lake supports ETL for real-time, high-speed data streams, which helps fuse sensor data from IoT devices with other data sources. Intuitively, the data lake architecture ensures the integration of multiple data sources while letting schemas evolve freely, without sacrificing data accuracy. A data lake can serve real-time analysis as well as the batch data mining a warehouse handles, and it makes it easier for data scientists to find inspiration in the data.

  3. Data fidelity: The data lake stores a complete, identical copy of the data in the business system. Unlike a data warehouse, a data lake must keep a copy of the original data; neither the format, the schema, nor the content should be modified. In this regard, the data lake emphasizes preserving business data “as-is”, while also being able to store data of any type and format.

  4. Data flexibility: Data lakes provide flexible, task-oriented data binding without requiring data models to be defined in advance. “Schema-on-write” versus “schema-on-read” is essentially a question of when schema design takes place. Schema design is essential for any data application; even for some “schema-free” databases such as MongoDB, best practice still suggests that records share the same or similar structure.

    The logic behind the data warehouse’s “schema-on-write” is that before data is written, the schema must be determined from the access patterns of the business, and data is then imported according to that established schema. The benefit is a good fit between data and business. But it also means a high upfront cost of ownership; when the business model is unclear and the business is still exploratory, the warehouse lacks flexibility.

    The underlying logic behind the data lake’s “schema-on-read” is that business uncertainty is the norm: we cannot anticipate business changes, so we deliberately defer schema design and make the whole infrastructure capable of tailoring data “on demand” to the business. In my opinion, therefore, “fidelity” and “flexibility” go hand in hand: since there is no way to predict business changes, keep the data in its original state and process it as needed. The data lake is thus better suited to innovative enterprises and businesses that change and grow rapidly. Its users are also more demanding: data scientists and business analysts (equipped with visualization tools) are the data lake’s target customers.
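The timing difference can be made concrete in a toy contrast: schema-on-write validates (and rejects) before data lands, while schema-on-read accepts everything and defers interpretation. This is a sketch of the two attitudes, not any particular system’s behavior:

```python
# Toy contrast between the two schema timings. Not any real system's
# behavior; the field names are invented for illustration.

SCHEMA = {"user", "action"}  # the model fixed up front by the warehouse

def schema_on_write(store, record):
    """Warehouse style: a record must fit the predefined model or it is rejected."""
    if set(record) != SCHEMA:
        raise ValueError(f"record does not match schema: {record}")
    store.append(record)

def schema_on_read(store, record):
    """Lake style: accept anything now; interpret it when a reader asks."""
    store.append(record)

warehouse, lake = [], []
new_style_event = {"user": "u1", "action": "click", "ab_bucket": 7}

try:
    schema_on_write(warehouse, new_style_event)  # fails: unknown field
except ValueError:
    pass
schema_on_read(lake, new_style_event)            # always succeeds

print(len(warehouse), len(lake))  # 0 1
```

The warehouse’s rejection is not a defect; it is the price of guaranteeing that every stored row fits the business model. The lake pays a different price: interpretation is deferred to every reader.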

Conclusion

Offline infrastructure has been popular for decades. Decades of Internet technology accumulation and business growth have placed new demands on data. Real-time computing arose to meet people’s demand for real-time data, but it has not yet satisfied Internet engineers’ stubborn pursuit of low cost and high performance. Technology comes in waves: if you missed the sun and the mountains, please don’t miss the stars and the sea.

Of course, the data lake architecture has its critics. Some say that collecting all kinds of messy data just produces a data swamp. Martin Fowler has also raised doubts about the security and privacy of data in the data lake. History shows that every new technology meets setbacks and doubts at birth, but has it ever let you down?

Knives and swords roam the world; flick fame and fortune from your sleeves. The mountains are high and the waters far.