1. The concept of the data lake
The concept of the data lake was first proposed in 2011 by Dan Woods, CTO of CITO Research. The analogy goes like this: if we compare data to water in nature, then streams and rivers carry unprocessed water that continuously converges into the lake. The concept has since become widely understood and defined in the industry: “A data lake is a platform for centrally storing massive, multi-source, multi-type data and for rapidly processing and analyzing it. In essence, it is an advanced enterprise data architecture.”
The core value of the data lake lies in giving the enterprise a data-platform operating mechanism. With the advent of the DT (data technology) era, enterprises urgently need to transform: they need the tools of information technology, digitalization, and other new technologies to build a platform system that empowers their people and business and responds quickly to challenges. The data foundation for all of this is what the data lake provides.
2. Characteristics of the data lake
The data lake itself has the following characteristics:
1) Raw data
Massive amounts of original data are stored centrally, without processing. A data lake is typically a single store for all of an enterprise’s data, including raw copies of source-system data as well as transformed data used for tasks such as reporting, visualization, analysis, and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (email, documents, PDFs), and binary data (images, audio, video). It is the data lake that brings these different kinds of data together.
2) On demand
Users can process data on demand without moving it. Data lakes typically provide a variety of computing engines for users to choose from; common examples include batch processing, real-time query, stream processing, and machine learning (see the sketch after this list).
3) Delayed binding
Data lakes provide flexible, task-oriented data orchestration without the need to define a data model in advance.
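To make “on demand” and “delayed binding” concrete, here is a minimal PySpark sketch. It assumes a hypothetical raw file at hdfs:///lake/raw/events.json with user_id and amount fields; the same untouched files are served to a batch-style DataFrame job and to an ad hoc SQL query, with the schema inferred only at read time.

```python
# On-demand processing with delayed binding: the same raw files are
# consumed by two different APIs without first loading them into a
# predefined model. Paths and field names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("on-demand-demo").getOrCreate()

# Schema is inferred at read time (delayed binding), not declared up front.
events = spark.read.json("hdfs:///lake/raw/events.json")

# Batch-style processing with the DataFrame API...
totals = events.groupBy("user_id").agg(F.sum("amount").alias("total"))

# ...and ad hoc SQL over the very same raw data, with no data movement.
events.createOrReplaceTempView("events")
top = spark.sql("""
    SELECT user_id, COUNT(*) AS n
    FROM events GROUP BY user_id ORDER BY n DESC LIMIT 10
""")

totals.show()
top.show()
```

Because nothing forces a model onto the files up front, each consumer binds its own interpretation at the moment of use.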
3. Advantages and disadvantages of the data lake
Every coin has two sides, and data lakes have advantages as well as disadvantages.
Advantages include:
- The data in the data lake is closest to its native form. This is very convenient for data-exploration needs, since you can work directly with the original data.
- The data lake unifies the data of every business system in the enterprise, solving the problem of information silos and opening the possibility of data applications that span multiple systems.
- The data lake provides a global, unified, enterprise-level view of data, which benefits data quality, data security, and data governance as a whole, even down to the level of individual data assets.
- The data lake changes the original way of working: it encourages everyone to understand and analyze data rather than relying on a dedicated data team to “feed” them results, which improves the efficiency of data operations, improves customer interaction, and encourages data innovation.
The main disadvantages are:
- The data is hardly processed at all: it is too much of a “raw material,” and too redundant, for users who want to consume it directly. This problem can be solved by “data access + data processing + data modeling.”
- The data lake places high demands on the performance of its underlying infrastructure, and data processing must rely on high-performance servers. This is mainly caused by the massive data volume, the heterogeneous and diverse data, and the delayed-binding mode.
- High data-processing skills are required of users, again mainly because the data is raw.
4. The data lake and related concepts
4.1 Data Lake vs Data Warehouse
In essence, the idea of building a data lake subverts the traditional data warehouse methodology, which emphasizes integration, subject orientation, and layered, hierarchical design. The two concepts are not equivalent so much as nested: the data warehouse can be seen as one “data application” of the data lake. The two can be compared along the following dimensions:
1) Types of data stored
- A data warehouse stores cleaned, processed, reliable, and well-structured data.
- A data lake stores a large amount of raw data, including structured, semi-structured, and unstructured data. In the real world, most data is raw, messy, and unstructured, and as this “messy data” grows, so does the interest in understanding it, deriving value from it, and making decisions based on it. That calls for a flexible, agile, economical, and relatively easy solution, which is not the strength of the data warehouse; when new requirements come up, a traditional data warehouse is difficult to change quickly.
2) Data processing methods
- To load data into a data warehouse, we first need to define the data’s schema; this is called schema-on-write.
- With a data lake, you simply load the raw data and give it a definition only when you are ready to use it; this is called schema-on-read (see the sketch below).
These are two very different ways of processing data. Because the data lake defines the model structure at the point of use, it makes data-model definition more flexible and can serve different upper-layer businesses with more efficient analysis.
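As an illustration of the two approaches, here is a hedged PySpark sketch; the dw.orders table, the file path, and the column names are hypothetical, and the DDL is generic Spark SQL rather than any specific warehouse’s dialect.

```python
# Contrast of schema-on-write and schema-on-read. All table, path,
# and column names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Schema-on-write (warehouse style): the model is declared before any
# data is loaded, and incoming rows must conform to it.
spark.sql("CREATE DATABASE IF NOT EXISTS dw")
spark.sql("""
    CREATE TABLE IF NOT EXISTS dw.orders (
        order_id BIGINT,
        user_id  BIGINT,
        amount   DOUBLE,
        ts       TIMESTAMP
    )
""")

# Schema-on-read (lake style): raw files are loaded as-is; a schema is
# inferred and applied only at query time, so different readers can bind
# different interpretations to the same bytes.
raw = spark.read.json("hdfs:///lake/raw/orders/")
raw.select("order_id", "amount").where("amount > 100").show()
```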
3) Ways of working and collaborating
- The traditional data warehouse working model is centralized: business staff hand requirements to the data team, which processes the data and builds dimension tables to those requirements for the business team to query through BI reporting tools.
- The data lake model is more open and self-service: data is open to everyone, the data team provides tools and an environment for the business teams to use (though centralized dimension-table construction is still needed), and the business teams do their own development and analysis.
4) Other
There are many other aspects, which we can compare briefly in the figure below.
4.2 Data Lake vs Big Data
The technical realization of the data lake is closely tied to big data technology:
- Thanks to Hadoop’s low storage cost, massive amounts of original data, landed data, and transformed data can all be kept in Hadoop. Having all data in one place provides the basis for subsequent management, reprocessing, and analysis.
- Data is processed on the big data platform with low-cost compute (compared with an RDBMS) such as Hive and Spark. In addition, Storm and Flink support specialized computing models such as stream processing.
- Because Hadoop scales out easily, storing the full data set is practical; combined with data life-cycle management, the data can be controlled across its entire time span. A sketch of this division of labor follows the list.
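The division of labor described above can be sketched as follows: HDFS holds the cheap, full-fidelity raw data, while Hive and Spark supply the compute. The lake.clicks table, the HDFS path, and the columns are hypothetical, and the snippet assumes a Hive metastore is available.

```python
# HDFS for low-cost storage, Hive/Spark for processing. Names are
# hypothetical; requires a Hive metastore.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lake-on-hadoop")
         .enableHiveSupport()   # use the Hive metastore for table metadata
         .getOrCreate())

# Raw CSV files already landed on HDFS are exposed as an external table:
# only metadata is registered; the data itself never moves.
spark.sql("CREATE DATABASE IF NOT EXISTS lake")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS lake.clicks (
        user_id BIGINT, url STRING, ts TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'hdfs:///lake/raw/clicks/'
""")

# Spark then provides scale-out compute over that storage.
spark.sql("SELECT url, COUNT(*) AS hits FROM lake.clicks GROUP BY url").show()
```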
4.3 Data Lake vs Cloud Computing
Cloud computing uses virtualization and multi-tenancy to maximize the utilization of basic resources such as servers, networks, and storage, reducing the cost of IT infrastructure and bringing significant economic benefits to enterprises. At the same time, cloud computing enables rapid provisioning and use of compute and storage resources, which also makes management easier. Cloud computing can therefore play a large role in building the infrastructure of a data lake. In addition, companies such as AWS, Microsoft, and EMC offer cloud-based data lake services.
4.4 Data Lake vs Artificial Intelligence
Artificial intelligence has developed rapidly in recent years. Training and inference need to handle large data sets, sometimes several at once, and these are usually unstructured data such as video, images, and text drawn from many industries, organizations, and projects. Collecting, storing, cleaning, and converting that data and extracting features from it is a long and complicated pipeline of work. The data lake should provide AI programs with a platform for rapid data collection, governance, and analysis, along with high bandwidth, access to massive numbers of small files, multi-protocol interoperability, and data sharing, all of which can greatly accelerate data mining and deep learning.
4.5 Data Lake vs Data Governance
Traditionally, data governance has been done in the data warehouse; the need for governance is actually even stronger after an enterprise-level data lake is built. Unlike the “model-first” data warehouse, the data in the lake is more scattered, disordered, and irregular, and it takes governance to bring it to a “usable” state. Otherwise, the data lake is likely to degenerate into a data swamp, wasting large amounts of IT resources. Whether a platform-style data lake architecture can drive enterprise business development hinges on data governance, which is also one of the biggest challenges in building a data lake.
4.6 Data Lake vs Data Security
The data lake contains large amounts of raw and processed data, and unsupervised access to it can be very dangerous. Data security and privacy issues must be considered, and these are capabilities the data lake has to provide. From another perspective, though, centralizing data in the lake is actually good for security: it is better than having data scattered across the enterprise.
5. Data lake architecture
5.1 Data Access
In terms of data access, the lake needs an adaptive way of connecting to multi-source, heterogeneous data resources, providing the channel through which enterprise data is extracted and aggregated. It should provide the following capabilities (a configuration sketch follows the list):
- Data source configuration: Supports multiple data sources, including but not limited to databases, files, queues, and protocol packets.
- Data collection: Supports data collection operations for corresponding data sources, including structure analysis, cleaning, and standardized formats.
- Data synchronization: Supports data synchronization to other data sources, including necessary cleaning, processing, and conversion.
- Data distribution: Supports data sharing and distribution, publishing data in various forms (objects, APIs, etc.).
- Task scheduling: Task management, monitoring, logging, and policies.
- Data processing: Supports encryption, masking (desensitization), normalization, standardization, and other processing logic.
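As a rough illustration of how such an access layer might be configured, here is a hypothetical Python sketch; none of the field names, transform names, or URIs come from a real product.

```python
# A hypothetical source/task configuration for a data-access layer.
# Every name below is illustrative only.
from dataclasses import dataclass, field

@dataclass
class SourceConfig:
    name: str            # e.g. an upstream MySQL database or a Kafka topic
    kind: str            # "database" | "file" | "queue" | "protocol"
    uri: str
    options: dict = field(default_factory=dict)

@dataclass
class IngestTask:
    source: SourceConfig
    schedule: str        # cron expression consumed by the task scheduler
    transforms: list     # e.g. ["mask_pii", "normalize_ts"], applied in order
    target: str          # where the cleaned data lands in the lake

orders_task = IngestTask(
    source=SourceConfig("orders_db", "database",
                        "mysql://orders-host:3306/orders"),
    schedule="0 2 * * *",                  # pull nightly at 02:00
    transforms=["mask_pii", "normalize_ts"],
    target="hdfs:///lake/raw/orders/",
)
print(orders_task)
```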
5.2 Data Storage
Many enterprises overlook the value of accumulating data. Data needs to be continuously collected and stored from every part of the enterprise before it becomes possible to mine it for valuable information, guide business decisions, and drive the company’s development. Storage is therefore one of the core capabilities a data lake must provide. A unified pool of storage can effectively solve the problem of data “chimneys” (silos) in an enterprise, provide a unified namespace with multi-protocol interoperability, enable efficient sharing of data resources, and reduce data movement. Of course, data cannot be dumped into the lake in disorder: there needs to be a notion of a data life cycle, with feasible storage solutions designed for the value and cost of data at each stage. A sketch of such a tiering rule follows.
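A minimal sketch of life-cycle-driven tiering, assuming made-up thresholds and tier names; a real policy would also weigh data value, compliance, and cost.

```python
# Map a dataset's last-access time to a storage tier. Thresholds and
# tier names are illustrative only.
from datetime import datetime, timedelta
from typing import Optional

def choose_tier(last_access: datetime, now: Optional[datetime] = None) -> str:
    """Hotter (recently used) data goes on faster, costlier storage."""
    now = now or datetime.utcnow()
    age = now - last_access
    if age < timedelta(days=30):
        return "hot"    # e.g. SSD-backed storage, frequently queried
    if age < timedelta(days=365):
        return "warm"   # e.g. HDD or erasure-coded storage
    return "cold"       # e.g. object storage / archive

print(choose_tier(datetime.utcnow() - timedelta(days=90)))  # -> warm
```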
5.3 Data Computing
The data lake needs to provide a variety of analysis engines to meet different computing needs, covering scenarios such as batch, real-time, and streaming. It also needs to sustain massive, highly concurrent data access to keep real-time analysis efficient. The streaming case is sketched below.
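The streaming scenario might look like the following Spark Structured Streaming sketch; the broker address, topic, and paths are hypothetical, and the job assumes the Spark Kafka connector package is on the classpath.

```python
# Continuously land a stream of click events into the lake so that later
# batch jobs can analyze them. Broker, topic, and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-streaming").getOrCreate()

# Read click events from Kafka as they arrive...
clicks = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clicks")
          .load()
          .selectExpr("CAST(value AS STRING) AS line"))

# ...and append them to lake storage, with a checkpoint for fault tolerance.
query = (clicks.writeStream
         .format("parquet")
         .option("path", "hdfs:///lake/raw/clicks_stream/")
         .option("checkpointLocation", "hdfs:///lake/checkpoints/clicks/")
         .outputMode("append")
         .start())

query.awaitTermination()
```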
5.4 Data Application
In addition to basic computing capabilities, data lakes need to provide upper-layer applications such as batch reporting, ad hoc query, interactive analysis, data warehousing, and machine learning, as well as self-service data exploration; an ad hoc query is sketched below.
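For instance, a self-service ad hoc query over the hypothetical lake.clicks table from the earlier sketch and a similarly hypothetical lake.orders table might look like this; no ETL ticket or pre-built report is involved.

```python
# A one-off analytical question answered directly against lake tables.
# Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ad-hoc")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    SELECT c.url, COUNT(DISTINCT o.user_id) AS buyers
    FROM lake.clicks c
    JOIN lake.orders o ON o.user_id = c.user_id
    GROUP BY c.url
    ORDER BY buyers DESC
    LIMIT 20
""").show()
```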
Author: Han Feng
First published on the WeChat official account “Han Feng Channel”.
Source: CreditEase Institute of Technology