
Preface

This article is a translation of a series of technical articles published by Databricks on Delta Lake. Databricks is known for leading hot open source big data technologies such as Apache Spark, Delta Lake, and MLflow, and Delta Lake, as a core storage engine solution for data lakes, brings many advantages to enterprises. This series of articles covers Delta Lake in detail.

In addition, Alibaba Cloud has partnered with Databricks, the original creator of Apache Spark and Delta Lake, to launch Databricks Data Insight, a fully managed enterprise Spark product on Alibaba Cloud. The product natively integrates the enterprise edition of the Delta Engine, providing high-performance computing capabilities without additional configuration. Interested readers can search for "Databricks Data Insight" or "Alibaba Cloud Databricks", or visit the Alibaba Cloud official website, to learn more.

Han Zongze, technical expert in the Alibaba Cloud Computing Platform Business Division, is responsible for R&D in the open source big data ecosystem enterprise team.

Delta Lake Technology Series – Lakehouse (Lake and Warehouse in One)

Getting the best of both data lakes and data warehouses

Contents

  • Chapter-01: What is a Lakehouse?
  • Chapter-02: Diving into the Inner Workings of Lakehouse and Delta Lake
  • Chapter-03: Exploring the Delta Engine

Content of this article

The Delta Lake ebook series, published by Databricks and translated by the big data ecosystem enterprise team of the Alibaba Cloud Computing Platform Division, aims to help leaders and practitioners understand the full capabilities of Delta Lake and the scenarios it fits. This installment, Delta Lake Technology Series – Lakehouse, focuses on the Lakehouse architecture.

Looking ahead

By the end of this article, you will understand not only what features Delta Lake offers, but also how those features can lead to substantial performance improvements.

What is Delta Lake?

Delta Lake is a unified data management system that brings data reliability and fast analytics to cloud data lakes. Delta Lake runs on top of existing data lakes and is fully compatible with the Apache Spark APIs.

Within Databricks, we have seen how Delta Lake brings reliability guarantees, performance optimizations, and lifecycle management to the data lake. Delta Lake can be used to solve problems such as malformed data, compliance-driven deletions, or modifications to individual records. At the same time, with Delta Lake, high-quality data can be written quickly to the data lake and served through secure, scalable cloud services to improve data utilization.
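
To make "runs on top of an existing data lake, fully compatible with the Spark API" concrete, here is a minimal sketch that builds a local Spark session with the open source delta-spark package and converts a Parquet directory into a Delta table. The paths, app name, and the assumption that delta-spark is installed are all illustrative, not from the article.

```python
# Minimal sketch, assuming `pip install delta-spark pyspark` and that
# /data/lake/events_raw is an existing Parquet directory (both hypothetical).
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-on-a-data-lake")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# The ordinary Spark DataFrame API is used end to end; only the format changes.
raw = spark.read.parquet("/data/lake/events_raw")
raw.write.format("delta").mode("overwrite").save("/data/lake/events")
events = spark.read.format("delta").load("/data/lake/events")
```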

Chapter-01: What is a Lakehouse?

Over the past few years, the Lakehouse has emerged independently across many Databricks users and applications as a new data management paradigm. In this chapter, we explain this new paradigm and its advantages over previous approaches.

Data warehousing has a long history in decision support and business intelligence applications. Data warehouse technology has evolved since its creation in the late 1980s, and MPP architectures allow these systems to handle ever larger volumes of data.

Although data warehouses are well suited to structured data, many modern enterprises must also deal with unstructured and semi-structured data, and with data of high variety, velocity, and volume. Data warehouses are not suitable for many of these scenarios, nor are they the most cost-effective option.

As companies began to collect large amounts of data from many different sources, architects began to envision a single system that could accommodate the data produced by many different analytics products and workloads.

About 10 years ago, we started building data lakes: repositories that store raw data in many formats. Data lakes, while good at storing data, lack some key capabilities: they do not support transactions, do not enforce data quality, and lack consistency and isolation guarantees, making it almost impossible to mix appends and reads, or batch and streaming jobs. For these reasons, many of the promised capabilities of data lakes never materialized, and in many cases their benefits were lost.

The need for flexible, high-performance systems that serve a wide range of data applications, including SQL analytics, real-time monitoring, data science, and machine learning, has not diminished for most companies. Most of the recent advances in AI are based on models that better handle unstructured data (text, images, video, audio), yet these are exactly the data types that data warehouses are not optimized for. A common approach is to combine a data lake, several data warehouses, and other systems such as streaming, time-series, graph, and image databases. However, maintaining this collection of systems is very complex and relatively expensive. In addition, data professionals often need to move or copy data across systems, which introduces delays.

The Lakehouse combines the advantages of data lakes and data warehouses

The Lakehouse is a new paradigm that combines the benefits of data lakes and data warehouses while addressing the limitations of data lakes. The Lakehouse uses a new system design: data structures and data management functions similar to those found in data warehouses are implemented directly on the low-cost storage used for data lakes. If you were to redesign the data warehouse today, with inexpensive and highly reliable storage (in the form of object storage) available, the Lakehouse is what you would end up with.

Lakehouse has the following key features:

  1. **Transaction support:** In enterprise applications, many data pipelines read and write data concurrently, often through SQL. By supporting ACID transactions, the Lakehouse guarantees consistency when multiple parties read or write data at the same time (a minimal sketch follows this list).
  2. **Schema enforcement and governance:** The Lakehouse should support schema enforcement and evolution, including DW schema paradigms such as star/snowflake schemas. The system should be able to reason about data integrity and should have robust governance and auditing mechanisms.
  3. **BI support:** The Lakehouse lets BI tools work directly on the source data. This reduces staleness and wait time, improves data recency, and avoids the cost of operating two copies of the data in both a data lake and a data warehouse.
  4. **Separation of storage and compute:** In practice, this means that storage and compute run on separate clusters, so these systems can scale to more concurrent users and larger data volumes. Some modern data warehouses also have this property.
  5. **Openness:** The storage formats the Lakehouse uses are open and standardized, such as Parquet, and it provides a variety of APIs, including machine learning and Python/R libraries, so that many tools and engines can access the data directly and efficiently.
  6. **Support for diverse data types, from unstructured to structured:** The Lakehouse can be used to store, refine, analyze, and access the data types required by many new data applications, including images, video, audio, semi-structured data, and text.
  7. **Support for diverse workloads:** including data science, machine learning, and SQL analytics. These workloads may rely on multiple tools, but they all depend on the same data repository.
  8. **End-to-end streaming:** Real-time reporting is an everyday need in many enterprises. Support for stream processing eliminates the need for a separate system dedicated to serving real-time data applications.
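
As a rough illustration of items 1 and 2, the sketch below reuses the hypothetical spark session and /data/lake/events table from the earlier example: each write is an atomic commit, and an append with a mismatched schema is rejected unless schema evolution is explicitly requested.

```python
from pyspark.sql import functions as F

# 1. Transaction support: this overwrite is a single atomic commit;
#    concurrent readers see either the old snapshot or the new one.
spark.range(0, 5).withColumnRenamed("id", "user_id") \
    .write.format("delta").mode("overwrite") \
    .option("overwriteSchema", "true").save("/data/lake/events")

# 2. Schema enforcement: appending a DataFrame whose schema does not match
#    (column "id" instead of "user_id") raises an error instead of silently
#    corrupting the table.
try:
    spark.range(0, 5).write.format("delta").mode("append").save("/data/lake/events")
except Exception as err:
    print("append rejected by schema enforcement:", type(err).__name__)

# Schema evolution is an explicit opt-in (mergeSchema) rather than the default.
spark.range(0, 5).withColumnRenamed("id", "user_id") \
    .withColumn("country", F.lit("US")) \
    .write.format("delta").mode("append") \
    .option("mergeSchema", "true").save("/data/lake/events")
```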

These are the key features of a Lakehouse. Enterprise-grade systems need additional functionality: security and access control tools are basic requirements; especially in light of recent privacy regulations, data governance functions including auditing, retention, and lineage have become critical; and data discovery tools such as data catalogs and data usage metrics also need to be enabled. With a Lakehouse, all of these enterprise features only need to be deployed, tested, and managed on a single system.

Read the following paper to learn more: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores

Abstract:

Cloud object stores (such as Alibaba Cloud OSS) are among the largest and most cost-effective storage systems available, making them the primary choice for storing large data warehouses and data lakes. Their limitation is that their implementation as key-value stores makes it difficult to achieve ACID transactions and high performance, because metadata operations (such as listing objects) are expensive and consistency guarantees are limited. In this paper, we introduce Delta Lake, an open source ACID table storage layer over cloud object stores, originally developed at Databricks. Delta Lake uses a transaction log compacted into Apache Parquet format to provide ACID properties, time travel, and fast metadata operations for large tabular datasets (for example, quickly searching billions of table partitions for those relevant to a query). It also leverages this design to provide advanced features such as automatic data layout optimization, upserts, caching, and audit logging. Delta Lake tables can be accessed from Apache Spark, Hive, Presto, Redshift, and other systems. Delta Lake is deployed at thousands of Databricks customers who process exabytes of data per day, with the largest instances managing exabyte-scale datasets and billions of objects.

Authors: Michael Armbrust, Tathagata Das, Liwen Sun, Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hovell, Adrian Ionescu, Alicja Łuszczak, Michał Szafrański, Xiao Li, Takuya Ueshin, Mostafa Mokhtar, Peter Boncz, Ali Ghodsi, Sameer Paranjpye, Pieter Senster, Reynold Xin, Matei Zaharia

Inner Workings of the Lakehouse

Early examples

The Databricks Unified Data Platform architecturally supports the Lakehouse. Alibaba Cloud's DDI service (Databricks Data Insight), which integrates Databricks, implements a similar Lakehouse model. Other managed services, such as BigQuery and Redshift Spectrum, have some of the Lakehouse features listed above, but they primarily target BI and other SQL applications. Companies that want to build and implement their own systems can refer to the open source file formats suitable for building a Lakehouse (Delta Lake, Apache Iceberg, Apache Hudi).

Combining data lakes and data warehouses into one system means data teams can move faster, because they can use data without having to access multiple systems. In these early Lakehouses, SQL support and integration with BI tools were usually sufficient for most enterprise data warehousing needs. Materialized views and stored procedures are available, but users may need to adopt mechanisms that differ from those found in traditional data warehouses. The latter is particularly important for "lift-and-shift" scenarios, which require the system's semantics to be nearly identical to those of older commercial data warehouses.

What about support for other types of data applications? Lakehouse users can use a variety of standard tools (Apache Spark, Python, R, machine learning libraries) for non-BI workloads such as data science and machine learning. Data exploration and refinement are standard for many analytics and data science applications. Delta Lake is designed to let users gradually improve the quality of the data in the Lakehouse until it is ready for consumption.

Although distributed file systems can be used for the storage tier, object storage is a better fit for the Lakehouse. Object storage provides low-cost, highly available storage that excels at massively parallel reads, which is a basic requirement of modern data warehouses.

From BI to AI

The Lakehouse is a new data management architecture that radically simplifies enterprise data infrastructure and accelerates innovation in an era when machine learning is being adopted across industries. In the past, most of the data that fed a company's products or decisions was structured data from operational systems; today, many products incorporate AI in the form of computer vision and speech models, text mining, and more. Why use a Lakehouse instead of a data lake for AI? A Lakehouse provides data versioning, governance, security, and ACID properties, even for unstructured data.

Today's Lakehouses reduce cost, but their performance still lags behind dedicated systems (such as data warehouses) that have had years of investment and real-world deployment. Users may also prefer certain tools (BI tools, IDEs, notebooks), so the Lakehouse needs to improve its UX and its connectors to popular tools to attract more users. As the technology matures and develops, these issues will be resolved: the Lakehouse will close these gaps while retaining its core attributes of being simpler, more cost-effective, and better suited to serving a wide variety of data applications.

Chapter-02: Diving into the Inner Workings of Lakehouse and Delta Lake

Databricks wrote a blog post outlining the growing adoption of the Lakehouse model among enterprises. The post generated a lot of interest among technology enthusiasts. While many praised it as a next-generation data architecture, some regarded the Lakehouse as no different from a data lake. Recently, several of our engineers and founders wrote a research paper describing some of the core technical challenges and solutions that set the Lakehouse architecture apart from the data lake. The paper, "Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores", was accepted and presented at the International Conference on Very Large Data Bases (VLDB) 2020.

More than a decade ago, the cloud opened up new directions for data storage. Cloud object stores like Amazon S3 have become some of the largest and most cost-effective storage systems in the world, making them attractive platforms for data warehouses and data lakes. However, their nature as key-value stores makes it difficult to achieve the ACID transactional guarantees that many companies require. In addition, expensive metadata operations (such as listing objects) and limited consistency guarantees hurt performance.

Given the characteristics of cloud object storage, there have been three approaches:

Data lakes

Data lakes store a table as a directory of files, a collection of objects usually in a columnar format such as Apache Parquet. This approach is appealing: because a table is just a set of objects, it can be accessed from many tools without any additional data storage system. But it also causes performance and consistency problems: hidden data corruption from failed transactions, inconsistent queries, long wait times, and the absence of basic management functions such as table versioning and audit logs.

Custom Storage Engines

The second approach is a custom storage engine, such as a proprietary system built for the cloud, like the Snowflake data warehouse. These systems can provide a single source of truth and avoid the consistency challenges of a data lake by managing metadata in a separate, strongly consistent service. However, all I/O operations must go through this metadata service, which increases cloud resource costs and reduces performance and availability. In addition, the engineering work required to implement connectors for existing computing engines such as Apache Spark, TensorFlow, and PyTorch can be a challenge for data teams that use a variety of computing engines. Unstructured data exacerbates these challenges, because such systems are usually optimized for traditional structured data types. Worst of all, proprietary metadata services lock customers into a specific provider, leaving them facing consistently high prices and time-consuming migration costs if they decide to adopt a new service later.

Lakehouse

Delta Lake is an open source ACID table storage layer on top of cloud object storage. In a sense, we set out to build a car rather than a faster horse. The Lakehouse is a new architecture that combines the advantages of data lakes and data warehouses. It not only offers better data storage performance, but also fundamentally changes how data is stored and used. Its system design supports the Lakehouse: data structures and data management functions similar to those in a data warehouse are implemented directly on the low-cost storage used for data lakes. If you were to design a new storage engine today, with inexpensive and highly reliable storage (in the form of object storage) available, this is the kind of design you would want.

Delta Lake maintains information about which objects belong to a table in an ACID manner, using a write-ahead log, itself stored in the cloud object store, that is compacted into Parquet format. This design allows clients to update many objects at once, or replace a subset of objects with another set in a serializable manner, while still achieving high parallel read/write performance. The log also provides significantly faster metadata operations for large tabular datasets.
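
To give a feel for this design, the sketch below (plain Python, no Spark) walks the _delta_log directory of the hypothetical /data/lake/events table from the earlier examples and prints which data objects each JSON commit adds or removes; periodic Parquet checkpoint files compacting these commits would sit alongside them.

```python
# Sketch: inspect the write-ahead log of the hypothetical /data/lake/events
# table. Each versioned JSON file in _delta_log/ is one commit, containing
# one action per line ("add", "remove", "metaData", "commitInfo", ...).
import glob
import json
import os

log_dir = "/data/lake/events/_delta_log"
for commit in sorted(glob.glob(os.path.join(log_dir, "*.json"))):
    version = int(os.path.basename(commit).split(".")[0])
    with open(commit) as fh:
        for line in fh:
            action = json.loads(line)
            if "add" in action:
                print(f"v{version} add    {action['add']['path']}")
            elif "remove" in action:
                print(f"v{version} remove {action['remove']['path']}")
```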

Delta Lake also provides time travel (data versioning with rollback support), automatic optimization of small files, upsert support, caching, and audit logs. Together, these capabilities improve the manageability and performance of data processing on cloud object storage, ultimately opening the door to the Lakehouse architecture, which combines the key functions of data warehouses and data lakes to create a better and simpler data architecture.
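
A minimal sketch of two of these capabilities, time travel and the audit log, again reusing the hypothetical spark session and table from the earlier examples:

```python
from delta.tables import DeltaTable

# Time travel: query the table as of an earlier version (or a timestamp),
# which is also how an accidental write can be rolled back.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/data/lake/events")

# Audit log: every commit is recorded with its operation and parameters.
DeltaTable.forPath(spark, "/data/lake/events").history() \
    .select("version", "timestamp", "operation").show(truncate=False)
```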

Today, Delta Lake is used by thousands of Databricks customers as well as many organizations in the open source community, processing exabytes of structured and unstructured data every day. These use cases cover a wide range of data sources and applications. The data types stored include change data capture (CDC) logs from enterprise OLTP systems, application logs, time-series data, graph data, aggregate tables for reporting, and image or feature data for machine learning. The applications include SQL analytics (the most common), business intelligence, stream processing, data science, machine learning, and graph analytics. Overall, Delta Lake has proven to be a good fit for most data lake applications that would use structured storage formats (such as Parquet or ORC) and for many traditional data warehouse workloads.

Across these use cases, we found that customers often use Delta Lake to significantly simplify their data architecture by running more workloads directly against cloud object storage. More often, by creating a Lakehouse with both data lake and transactional capabilities, they replace some or all of the functionality previously provided by message queues (e.g. Apache Kafka), data lakes, or cloud data warehouses (e.g. Snowflake, Amazon Redshift).

In the paper referenced above, the authors also cover:

• Features and challenges of object storage

• Storage format and access protocol for Delta Lake

• Current characteristics, strengths and limitations of Delta Lake

• Core use cases and specialized use cases in common use today

• Performance experiments, including TPC-DS performance

In this article, you’ll get a better understanding of Delta Lake and how it enables DBMS-like performance and management capabilities for data in low-cost cloud storage. You’ll also learn how Delta Lake’s storage format and access protocol help make it easy to operate, highly available, and capable of providing high-bandwidth access to object storage.

Chapter-03: Exploring the Delta Engine

Delta Engine ties together a 100% Apache Spark-compatible vectorized query engine, which takes advantage of modern CPU architectures, with optimizations to Spark 3.0's query optimizer and caching capabilities. These features are available as part of Databricks Runtime 7.0. Taken together, they significantly improve query performance on data lakes, especially those backed by Delta Lake, making it easier for customers to adopt and scale the Lakehouse architecture.

Scaling execution performance

One of the biggest hardware trends of the past few years is that CPU clock speeds have plateaued. The exact reasons are beyond the scope of this chapter, but the implication is that we must find ways to process data faster than raw compute speed alone allows. One of the most effective approaches is to increase the amount of data processed in parallel. However, data processing engines need to be designed specifically to take advantage of this parallelism.

In addition, as the pace of business accelerates, R&D teams have less time for careful data modeling, and modeling that is sacrificed for business agility leads to poor query performance. This is not an ideal state, so we want to find ways to maximize both agility and performance.

Introducing Delta Engine for high query performance

Delta Engine improves the performance of Delta Lake's SQL and DataFrame workloads through three components: an improved query optimizer, a caching layer that sits between the execution layer and the cloud object store, and a native vectorized execution engine written in C++.

The improved query optimizer extends the functionality already available in Spark 3.0 (the cost-based optimizer, adaptive query execution, and dynamic runtime filters) with more refined statistics, delivering up to 18x faster performance on star schema workloads.
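
Delta Engine's own optimizer extensions are proprietary, but the open source Spark 3.0 features it builds on are exposed through ordinary configuration. A hedged sketch, reusing the earlier hypothetical spark session (settings only; defaults vary by Spark version):

```python
# Sketch: toggling the Spark 3.0 optimizer features mentioned above.
# Delta Engine's additional statistics and optimizations are not shown here.
spark.conf.set("spark.sql.cbo.enabled", "true")                                # cost-based optimizer
spark.conf.set("spark.sql.adaptive.enabled", "true")                           # adaptive query execution
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")  # dynamic runtime filters
```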

Delta Engine's caching layer automatically selects which input data to cache for the user and transcodes it into a more CPU-efficient format, to better exploit the high storage speed of NVMe SSDs. This improves scan performance by up to 5x for almost all workloads.
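
On Databricks Runtime this caching layer is controlled through documented Spark settings; a minimal sketch, only meaningful on a Databricks cluster with local SSDs (the sizing values below are illustrative, not recommendations):

```python
# Sketch: enabling the Databricks disk (Delta) cache on Databricks Runtime;
# outside Databricks these settings have no effect.
spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "50g")     # illustrative size
spark.conf.set("spark.databricks.io.cache.maxMetaDataCache", "1g")  # illustrative size
```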

In fact, Delta Engine's biggest innovation is its native execution engine, which addresses the challenges facing today's data teams. We call it Photon (yes, an engine within an engine). This fully rebuilt Databricks execution engine was designed to maximize performance from the latest developments in modern cloud hardware. It delivers performance improvements across all workload types while remaining fully compatible with the open source Spark APIs.

Introduction to the Delta Engine

By tying these three components together, customers can more easily see how Databricks combines multiple strands of improvement to dramatically increase the performance of analytics workloads on the data lake.

We are excited about the value Delta Engine brings to our customers. It delivers significant savings in time and cost. More importantly, within the Lakehouse pattern it enables data teams to design simpler and more unified data architectures, and it opens the way for many new advances.

For more details on Delta Engine, see the Keynote at Spark + AI Summit 2020: Delta Engine: High-performance Query Engine for Delta Lake.

What's next

Now that you have learned about Delta Lake, its features, and how to optimize performance, you can explore the rest of this series:

  • Delta Lake Technology Series – Basics and Performance
  • Delta Lake Technology Series – Features
  • Delta Lake Technology Series – Streaming
  • Delta Lake Technology Series – Customer Use Cases

Original link

This article is original content from Alibaba Cloud (Aliyun) and may not be reproduced without permission.