Takeaway

OPPO is a smart terminal manufacturer with hundreds of millions of end users who generate large amounts of unstructured data, such as text, pictures, audio and video, every day. Mining the full value of this data at low cost and with high efficiency, while still meeting requirements for data connectivity, real-time access and data security governance, has become a major challenge for any company with massive data. The currently popular solution in the industry is the data lake. CBFS, the data lake storage developed by OPPO and introduced in this article, can solve these pain points to a large extent.

▌A brief description of the data lake

Data lake definition: a centralized storage repository that keeps data in its original format, usually as binary blobs or files. A data lake is typically a single store holding raw data as well as transformed data used for reporting, visualization, advanced analytics, machine learning, and so on.

1. Value of data lake storage

Compared to traditional Hadoop architecture, data lakes have the following advantages:

  • Highly flexible: data can be read, written and processed easily, and all raw data is preserved
  • Multiple analytics: supports workloads including batch, streaming, interactive queries, machine learning, and more
  • Low cost: storage and compute resources scale independently; object storage with hot/cold data separation further lowers cost
  • Easy management: complete user authentication and management, compliance and auditing; both the storage and the use of data can be traced

2. OPPO Data Lake overall solution

OPPO builds its data lake along three dimensions: the bottom layer is the lake storage, CBFS, a low-cost storage that supports the S3, HDFS and POSIX file access protocols; the middle layer is the real-time data storage format, for which we use Iceberg; the top layer supports a variety of different compute engines.

3. OPPO data Lake architecture characteristics

In early big data architectures, streaming data and batch data were stored in separate systems. The upgraded architecture unifies metadata management and integrates batch and stream computing. It also provides unified interactive queries, a friendlier interface, second-level response, high concurrency, and support for Upsert change operations on data sources. At the bottom layer, large-scale, low-cost object storage serves as a unified data base that is shared by multiple engines, improving data reuse.

4. Data lake storage CBFS architecture

Our goal is to build a data lake storage that can support EB-level data and solve the cost, performance and experience challenges of data analysis. The entire data lake storage is divided into six subsystems:

  • Protocol access layer: supports multiple protocols (S3, HDFS, and POSIX file). Data written via any one protocol can be read via the other two
  • Metadata layer: exposes both the hierarchical namespace of the file system and the flat namespace of object storage. The metadata is fully distributed, supports sharding, and scales linearly
  • Metadata cache layer: Manages the metadata cache and provides metadata access acceleration
  • Resource management layer: The Master node in the figure is responsible for managing physical resources (data nodes, metadata nodes) and logical resources (volumes/buckets, data fragments, metadata fragments)
  • Multi-copy layer: supports both append writes and random writes, and is friendly to both large and small objects. One function of this subsystem is multi-replica persistent storage; the other is to serve as a data caching layer that supports elastic replicas to accelerate data lake access, which is described in more detail later.
  • Erasure code storage layer: significantly reduces storage cost, supports deployment across multiple availability zones and different erasure code modes, and easily scales to EB-level storage

Next, I will highlight the key technologies used in CBFS, including high-performance metadata management, erasure code storage, and lake acceleration

▌CBFS key technology

1. Metadata management

The file system presents a hierarchical namespace view. The logical directory tree of the entire file system is partitioned into multiple metadata shards. As shown in the figure on the right, each MetaNode contains hundreds of meta partitions, and each partition consists of an InodeTree (BTree) and a DentryTree (BTree). Each dentry represents a directory entry and consists of a parentId and a name; in the DentryTree, (parentId, name) is used as the index for storage and retrieval, while in the InodeTree the index is the inode ID. The multiRaft protocol ensures high availability and consistent data replication: each node set contains a large number of partition groups, each partition group corresponds to one Raft group, and each partition group belongs to a volume, covering one metadata range (a segment of inode IDs) of that volume.

The metadata subsystem scales dynamically by splitting. When a partition group's resources (performance, capacity) approach a threshold, the resource manager service estimates an end point and tells the node to serve only the data before that point; at the same time, a new group of nodes is selected and dynamically added to the service. A single directory supports a capacity of one million entries, and metadata is kept entirely in memory to guarantee excellent read and write performance; in-memory metadata shards are persisted to disk via snapshots for backup and recovery.

Object storage, by contrast, provides a flat namespace. To access an object with key /bucket/a/b/c, the key must be resolved level by level starting from the root directory: first /bucket, then a, then b, and finally c. This involves multiple interactions between nodes, and the deeper the path, the worse the performance. We therefore introduced a PathCache module to accelerate ObjectKey resolution. The simple approach is to cache, in PathCache, the dentry of the ObjectKey's parent directory (/bucket/a/b). Analysis of our online clusters shows that the average directory size is about 100; assuming a storage cluster on the order of 100 billion objects, there are only about 1 billion directory entries, so a single-machine cache is highly efficient, and read performance can be further improved by spreading the cache across multiple nodes.

By supporting both "flat" and "hierarchical" namespace management in a single design, CBFS is simpler and more efficient than comparable systems in the industry: a single copy of data can be accessed through multiple protocols without any conversion and without data consistency issues.
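
As a concrete illustration of the (parentId, name) lookup and the level-by-level object-key resolution described above, here is a minimal Go sketch. The type and field names are hypothetical and simplified; the real CBFS keeps dentries and inodes in BTrees inside each meta partition, replicated via Raft, and a PathCache would short-circuit most of this walk by caching the parent directory's dentry.

```go
package main

import (
	"fmt"
	"strings"
)

// Dentry mirrors the directory entry described in the text: it is indexed by
// (parentId, name) and points to an inode.
type Dentry struct {
	ParentID uint64 // inode ID of the parent directory
	Name     string // entry name within that directory
	Inode    uint64 // inode ID this entry points to
}

type dentryKey struct {
	parent uint64
	name   string
}

// MetaPartition stands in for one metadata shard; the map replaces the BTree.
type MetaPartition struct {
	dentries map[dentryKey]Dentry
}

// ResolveKey walks an object key such as "a/b/c" level by level from the root
// inode, the way a flat object key must be resolved against the hierarchical
// namespace; the deeper the path, the more lookups are needed.
func (mp *MetaPartition) ResolveKey(rootInode uint64, key string) (uint64, error) {
	cur := rootInode
	for _, part := range strings.Split(strings.Trim(key, "/"), "/") {
		d, ok := mp.dentries[dentryKey{parent: cur, name: part}]
		if !ok {
			return 0, fmt.Errorf("entry %q not found under inode %d", part, cur)
		}
		cur = d.Inode
	}
	return cur, nil
}

func main() {
	mp := &MetaPartition{dentries: map[dentryKey]Dentry{
		{1, "a"}:  {ParentID: 1, Name: "a", Inode: 10},
		{10, "b"}: {ParentID: 10, Name: "b", Inode: 20},
		{20, "c"}: {ParentID: 20, Name: "c", Inode: 30},
	}}
	ino, err := mp.ResolveKey(1, "a/b/c")
	fmt.Println(ino, err) // 30 <nil>
}
```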

2. Erasure code storage

One of the key technologies for reducing storage cost is erasure coding (EC). The principle of EC is that k pieces of original data are encoded to generate m pieces of new data; as long as no more than m of the k + m pieces are lost, the original data can be recovered by decoding (the principle is similar to RAID). Compared with traditional multi-copy storage, EC has lower data redundancy but higher data durability. There are many different implementations, most of which are based on XOR operations or Reed-Solomon (RS) codes; CBFS also uses RS codes.

Calculation steps: (1) encoding: the data is multiplied by a generator matrix whose upper k rows form the identity matrix I and whose lower m rows form the coding matrix; (2) recovery: when a block is lost, delete the corresponding row from the matrix B to obtain a new matrix B', then multiply by the inverse of B' on the left to recover the lost block. The detailed calculation process can be found in the relevant literature.

Conventional RS coding has a drawback. Suppose X1 to X6 and Y1 to Y6 are data blocks and P1 and P2 are parity blocks. If any one of them is lost, 12 of the remaining blocks must be read to recover the data, which costs a lot of disk I/O and data-recovery bandwidth. The LRC coding proposed by Microsoft addresses this by introducing local parity blocks: as shown in the figure, two local parity blocks PX and PY are added on top of the original global parity blocks P1 and P2. If X1 is damaged, only the six blocks in its local group (X2 to X6 and PX) need to be read to repair it. Statistics show that, within a given period in a data center, the probability that a stripe sees a single-disk failure is about 98%, while the probability of two simultaneous disk failures is about 1%, so LRC dramatically improves data repair efficiency in most scenarios. Its drawback is that it is not a maximum distance separable (MDS) code, so unlike global RS coding it cannot recover from the loss of an arbitrary m blocks.
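
To make the k + m idea concrete, below is a minimal sketch using the open-source Go library github.com/klauspost/reedsolomon, chosen here purely for illustration; the text does not say which RS implementation CBFS uses. It encodes 6 data shards into 3 parity shards (the single-AZ example mode mentioned later), drops three shards, and reconstructs them.

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// k = 6 data shards, m = 3 parity shards.
	enc, err := reedsolomon.New(6, 3)
	if err != nil {
		log.Fatal(err)
	}

	data := bytes.Repeat([]byte("hello data lake "), 1024)

	// Split the payload into 6 data shards; parity shards are allocated empty.
	shards, err := enc.Split(data)
	if err != nil {
		log.Fatal(err)
	}

	// Compute the 3 parity shards from the 6 data shards.
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate losing up to m = 3 shards (any mix of data and parity).
	shards[0], shards[4], shards[7] = nil, nil, nil

	// Reconstruct the missing shards from the surviving 6.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	ok, err := enc.Verify(shards)
	fmt.Println("recovered:", ok, err) // recovered: true <nil>
}
```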

EC types

1. Offline EC: after the k data units of a stripe are fully written, the m parity blocks are generated. 2. Online EC: as data is received, it is split, parity blocks are computed in real time, and the k data blocks and m parity blocks are written out together.

CBFS Multi-mode Online EC across AZs

CBFS supports online EC storage with multi-mode stripes across AZs. Different coding modes can be configured flexibly according to machine-room conditions (1, 2 or 3 AZs), object size, service availability and data durability requirements. For example, single-AZ RS mode uses six data blocks and three parity blocks; 2AZ-RS mode uses six data blocks and ten parity blocks across two AZs, giving a data redundancy of 16/6 ≈ 2.67; 3AZ-LRC mode uses six data blocks, six global parity blocks and three local parity blocks. Different encoding modes can coexist in the same cluster.
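
As a quick sanity check of the redundancy figures above, the small sketch below simply computes (data + parity) / data for each example mode; the mode names and block counts are taken directly from the text, nothing else is implied.

```go
package main

import "fmt"

// ecMode describes one of the example coding modes quoted in the text.
type ecMode struct {
	name   string
	data   int // data blocks
	global int // global parity blocks
	local  int // local parity blocks (LRC only)
}

// redundancy returns total blocks divided by data blocks.
func redundancy(m ecMode) float64 {
	return float64(m.data+m.global+m.local) / float64(m.data)
}

func main() {
	for _, m := range []ecMode{
		{"1AZ-RS", 6, 3, 0},  // 9/6  = 1.50
		{"2AZ-RS", 6, 10, 0}, // 16/6 ≈ 2.67
		{"3AZ-LRC", 6, 6, 3}, // 15/6 = 2.50
	} {
		fmt.Printf("%-8s redundancy %.2f\n", m.name, redundancy(m))
	}
}
```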

Online EC storage architecture

The architecture contains several modules:

  • Access: the data access layer, which also provides EC encoding and decoding capability
  • CM: the cluster management layer, which manages resources such as nodes, disks and volumes, and performs migration, repair, balancing and inspection; a cluster supports different EC coding modes
  • Allocator: allocates volume space

Online EC write

The write flow is: (1) the data stream is received; (2) the data is sliced into multiple data blocks and the parity blocks are computed; (3) volume space is requested; (4) the data and parity blocks are dispatched concurrently to different storage nodes. Writes use a simple NRW (quorum) protocol: as long as the minimum number of copies is written successfully, the request is not blocked by individual node or network failures, which guarantees availability. Data receiving, splitting and parity encoding run as an asynchronous pipeline, which also ensures high throughput and low latency.
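
The sketch below illustrates the quorum behaviour just described, under stated assumptions: a hypothetical Node interface stands in for the real storage-node RPC, all n = k + m blocks are dispatched concurrently, and the call returns as soon as w acknowledgements arrive, so a single slow or failed node does not block the request.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is an illustrative stand-in for a storage node; not the CBFS API.
type Node interface {
	StoreBlock(block []byte) error
}

// quorumWrite dispatches all n = k+m blocks concurrently and succeeds once w
// acknowledgements arrive, the NRW behaviour described in the text.
func quorumWrite(nodes []Node, blocks [][]byte, w int) error {
	if len(nodes) != len(blocks) {
		return errors.New("need exactly one node per block")
	}
	acks := make(chan error, len(nodes)) // buffered: stragglers never block
	for i := range nodes {
		go func(i int) { acks <- nodes[i].StoreBlock(blocks[i]) }(i)
	}
	succeeded, failed := 0, 0
	for range nodes {
		if err := <-acks; err != nil {
			failed++
		} else {
			succeeded++
		}
		if succeeded >= w {
			return nil // quorum reached; remaining writes finish in the background
		}
		if failed > len(nodes)-w {
			return fmt.Errorf("quorum not reached: %d node failures", failed)
		}
	}
	return errors.New("quorum not reached")
}

// fakeNode always succeeds; it exists only so the sketch runs end to end.
type fakeNode struct{}

func (fakeNode) StoreBlock([]byte) error { return nil }

func main() {
	nodes := []Node{fakeNode{}, fakeNode{}, fakeNode{}, fakeNode{}}
	blocks := [][]byte{{1}, {2}, {3}, {4}} // e.g. k=2 data + m=2 parity blocks
	fmt.Println(quorumWrite(nodes, blocks, 3)) // succeeds once any 3 of 4 land
}
```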

Online EC read

Data reads also follow the NRW model. Taking the k = m = 2 encoding mode as an example, as long as any two blocks (whether data or parity blocks) are read correctly, the original data can be obtained through fast RS decoding. In addition, to improve availability and reduce latency, Access preferentially reads from nearby or lightly loaded storage nodes. As can be seen, online EC combined with NRW ensures strong data consistency while delivering high throughput and low latency, which suits the big data service model well.
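
Below is a simplified, sequential sketch of that read path: candidate block locations are ordered by load (standing in for "prefer nearby or lightly loaded nodes") and reading stops once any k blocks have arrived, after which the blocks would be fed to an RS decoder such as the Reconstruct call shown earlier. A real implementation would issue these reads concurrently; the blockSource type and readBlock function are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// blockSource stands in for one readable location of a stripe block (data or
// parity); the load field and readBlock function are illustrative only.
type blockSource struct {
	load      float64
	readBlock func() ([]byte, error)
}

// quorumRead orders candidates by load and stops once any k blocks are read.
func quorumRead(sources []blockSource, k int) ([][]byte, error) {
	sort.Slice(sources, func(i, j int) bool { return sources[i].load < sources[j].load })
	var got [][]byte
	for _, s := range sources {
		b, err := s.readBlock()
		if err != nil {
			continue // this block is unavailable; try the next candidate
		}
		got = append(got, b)
		if len(got) == k {
			return got, nil // enough blocks for RS decoding
		}
	}
	return nil, errors.New("fewer than k blocks readable")
}

func main() {
	ok := func() ([]byte, error) { return []byte{42}, nil }
	down := func() ([]byte, error) { return nil, errors.New("node down") }
	srcs := []blockSource{{0.9, ok}, {0.1, down}, {0.2, ok}, {0.5, ok}}
	blocks, err := quorumRead(srcs, 2) // k = m = 2 stripe: any 2 blocks suffice
	fmt.Println(len(blocks), err)      // 2 <nil>
}
```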

3. Data lake access acceleration

One of the significant benefits of the data lake architecture is cost savings, but the separation of storage and compute also brings bandwidth bottlenecks and performance challenges, so we provide a series of access acceleration technologies.

Local cache: deployed on the same machine as the compute node, it supports both metadata and data caching, and supports different media such as memory, PMem, NVMe and HDD; access latency is low but capacity is small.

Distributed cache: the number of replicas is elastic, it provides location awareness, and it supports active preheating and passive caching at the user/bucket/object level; the data eviction policy is also configurable. Multi-level cache policies have shown a good acceleration effect in our machine learning training scenarios.

In addition, the storage layer also supports predicate pushdown, which significantly reduces the volume of data moved between storage and compute nodes, lowering resource overhead and improving computing performance.

There is still a lot of detailed work to do on data lake acceleration, and we are continuously improving it.
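
The sketch below shows one possible shape of such a tiered read path, purely as an illustration: the local cache is consulted first, then the distributed cache, and finally the underlying (erasure-coded) store, with results promoted back into the faster tiers. The interfaces and names are hypothetical, not the CBFS API.

```go
package main

import "fmt"

// cache is a minimal get/put abstraction for one cache tier.
type cache interface {
	Get(key string) ([]byte, bool)
	Put(key string, val []byte)
}

// mapCache is an in-memory stand-in so the sketch runs end to end.
type mapCache map[string][]byte

func (m mapCache) Get(k string) ([]byte, bool) {
	v, ok := m[k]
	return v, ok
}
func (m mapCache) Put(k string, v []byte) { m[k] = v }

// tieredReader chains the local cache, the distributed cache and the backend.
type tieredReader struct {
	local, distributed cache
	backend            func(key string) ([]byte, error) // e.g. the EC storage layer
}

func (t *tieredReader) Read(key string) ([]byte, error) {
	if v, ok := t.local.Get(key); ok {
		return v, nil // same-machine cache hit: lowest latency
	}
	if v, ok := t.distributed.Get(key); ok {
		t.local.Put(key, v) // promote to the local tier
		return v, nil
	}
	v, err := t.backend(key) // miss on both tiers: fetch from lake storage
	if err != nil {
		return nil, err
	}
	t.distributed.Put(key, v)
	t.local.Put(key, v)
	return v, nil
}

func main() {
	tr := &tieredReader{
		local:       mapCache{},
		distributed: mapCache{},
		backend:     func(key string) ([]byte, error) { return []byte("payload for " + key), nil },
	}
	b, _ := tr.Read("bucket/a/b/c")
	fmt.Println(string(b))
}
```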

▌ Future Prospects

CBFS 2.x is currently open source. Version 3.0, which supports key features such as online EC, lake acceleration and multi-protocol access, is expected to be open-sourced in October 2021. In the future, CBFS will add features such as directly mounting existing HDFS clusters (without data relocation) and intelligent hot/cold data tiering, so that existing data in current big data and AI architectures can be moved onto lake storage smoothly.

About the author:

Xiaochun, OPPO storage architect

For more content, follow the official WeChat account [OPPO Digital Intelligence Technology].