Introduction: At the Cloud-Native Enterprise Data Lake session of the 2021 Cloud Computing Conference, Zhou Hao, a senior solutions architect at Alibaba Cloud Intelligence, shared best practices for enterprise data lakes.

This article covers the core capabilities of the data lake and several best-practice cases.

Here are the highlights:

I. Unified data storage, multi-engine access, and compute-storage separation

Before we start this section, let’s review some of the core capabilities of the data lake:

  • Centralized storage with multi-engine access

All types of data are stored centrally in OSS and connect seamlessly to computing engines such as EMR, supporting the open-source computing ecosystem.

  • Data is stored as-is, without upfront processing

Multiple data sources can be connected, providing convenient channels for data ingestion and consumption. Many types of data can be stored in their original form and processed on demand. Compared with the schema-on-write constraints of a traditional data warehouse, this suits scenarios where the business evolves rapidly.

  • A more flexible architecture through compute-storage separation

Decoupling compute from storage opens up a more flexible design space: compute and storage resources scale independently, resource utilization improves, operation and maintenance become far simpler, and TCO is optimized. This is also a major reason the customers in the following cases chose a data lake solution.

II. Best-practice cases

Yeahmobi – mobile Internet advertising practice case

Yeahmobi is a technology-driven international intelligent-marketing service company whose business load fluctuates heavily from day to day. With a traditional architecture, resources must be provisioned for peak traffic, leaving a large share of CPU capacity idle the rest of the time. This is a common pain point for intelligent-marketing Internet companies, and it is why many of them choose a data lake solution.

  • Storage and compute are decoupled, so compute capacity can grow and shrink with online traffic while only a small pool of resident resources is kept
  • Multiple types of computing engines readily meet the analysis needs of the various online-advertising scenarios
  • With the data lake solution, overall TCO improved by 30%, making the business more competitive

Once the data lives in the data lake, compute resources can be created and scaled dynamically as the business changes, with only a minimal set of resident compute resources to maintain. In this case, combining the elastic compute and analysis capabilities of EMR in both semi-managed and fully managed modes greatly reduced the operational burden. This is why many intelligent-marketing companies choose this data lake solution; after adopting it, Yeahmobi’s TCO dropped by 30 percent.
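The elasticity logic described above can be sketched in a few lines. The function and its numbers below are purely illustrative assumptions, not Yeahmobi’s actual scaling policy:

```python
import math

def desired_nodes(current_load_tps: float, tps_per_node: float, resident_min: int) -> int:
    """Number of compute nodes to run for the current load.

    Compute scales with traffic, but never drops below a small
    resident floor that keeps always-on jobs running.
    """
    needed = math.ceil(current_load_tps / tps_per_node)
    return max(needed, resident_min)

# Off-peak: 500 TPS at 200 TPS/node -> 3 nodes (above the floor of 2).
# Peak: 10,000 TPS -> 50 nodes; overnight lull: floor of 2 nodes.
```

With compute and storage coupled, the cluster would instead have to be sized for the 50-node peak around the clock; decoupling is what makes the floor-plus-elastic pattern possible.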

Shuhe Technology – Internet finance practice case

Shuhe Technology is an Internet fintech company. Because of the nature of its industry and its business scenarios, it has high requirements for data security, reliability, and fine-grained access control. Shuhe serves a large number of internal and external users, is sensitive to data security, and requires strict isolation of data permissions. Its business growth also demands very high compute and storage throughput.

Early in its development, Shuhe took the most common approach: building its own big-data clusters on physical servers. It soon found that this could not keep up with rapid business growth. First, storage costs rose sharply: a standard HDFS cluster keeps three replicas of every block, and the cost grows further once the utilization water level and the overhead of the file system itself are taken into account. Second, frequently adding nodes to an HDFS cluster can affect service availability.

For these reasons, Shuhe chose the Alibaba Cloud data lake solution. The data lake uses OSS object storage as its foundation, so there is no need to worry about capacity expansion or small-file growth. Rapid growth in file count puts great pressure on the NameNode of an HDFS cluster, whereas object storage handles it comfortably even at trillions of objects. After adopting the data lake, splitting data across multiple buckets combined with the Alibaba Cloud RAM system enabled very fine-grained access control. Moreover, JindoFS, the software-layer optimization jointly developed for OSS and EMR, delivers Tbps-level throughput to support the whole business, and its real-world performance exceeds self-built HDFS. Finally, elastic cloud resources let tasks scale on demand across thousands of nodes, reducing cost and increasing efficiency.
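The storage-cost argument above can be made concrete with a little arithmetic. Three-way replication is standard HDFS behavior; the 80% utilization water level is an illustrative assumption:

```python
def hdfs_raw_capacity_tb(logical_tb: float, replication: int = 3,
                         max_utilization: float = 0.8) -> float:
    """Raw disk an HDFS cluster must provision for a given logical dataset.

    Every byte is multiplied by the replication factor, and clusters
    are typically kept below a utilization "water level" for headroom.
    """
    return logical_tb * replication / max_utilization

# e.g. 100 TB of logical data needs 100 * 3 / 0.8 = 375 TB of raw disk,
# versus ~100 TB billed on object storage, whose internal redundancy
# is handled and priced by the service itself.
```

This ratio is why storage cost is usually the first pressure point that pushes self-built clusters toward a data lake.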

A classic data lake scenario – hot/cold data tiering

Scenario characteristics

  • Large amounts of cold data accumulate as application and service systems run over the long term, putting growing pressure on the cluster’s existing storage space
  • Storage space must be allocated for cold data while capacity is reserved for performance optimization of frequently accessed hot data
  • Long-term storage of cold data should cost far less than hot data, yet the cold data must remain easy to read back

Tiering hot and cold data is a classic use of data lakes. Long-running application and service systems generate large amounts of cold data, which puts great pressure on the operation and maintenance of the whole cluster. On one hand there is the pressure of scale: servers in a typical big-data cluster are fairly homogeneous, leaving little room to optimize for cold data, and introducing high-density or special-purpose storage models greatly increases the difficulty of cluster operations in practice. On the other hand, in an IDC environment, rapid expansion of a physical cluster is constrained by many factors. This is why many customers migrate from a traditional big-data cluster architecture to a data lake.

Many customers already embrace the data lake and use OSS across the board; others start by moving their warm and cold data to OSS. OSS has been fully integrated with the Hadoop ecosystem since as early as 2016, and Hadoop 3.0 can access OSS directly, so existing jobs run without modification, which greatly eases migration. After migration, intelligent lifecycle management on OSS lets you configure a simple lifecycle policy that moves cold data into the Archive and Cold Archive storage classes according to rules, reducing costs further.
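The tiering rule described above boils down to an age-based decision. In practice this is configured declaratively as an OSS lifecycle policy rather than written as code, and the class names and thresholds below are illustrative:

```python
from datetime import date

# Age thresholds (days since last access) -> storage class.
# Checked from coldest to hottest; values here are illustrative.
TIERS = [
    (180, "ColdArchive"),
    (60, "Archive"),
    (30, "IA"),        # infrequent access
    (0, "Standard"),
]

def storage_class(last_access: date, today: date) -> str:
    """Pick a storage class from how long ago an object was last accessed."""
    age_days = (today - last_access).days
    for threshold, tier in TIERS:
        if age_days >= threshold:
            return tier
    return "Standard"
```

A lifecycle policy encodes exactly this table once, and the service then transitions objects automatically, with no job to run or cluster to maintain.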

Educational technology platform practice case

Customer value

  • OSS provides multiple storage classes and data lifecycle management to optimize the long-term storage cost of cold data. With cold data carried on the cloud, the self-built IDC cluster need not be expanded, solving the machine-room space problem
  • The high scalability of the OSS data lake helps the customer solve the performance problems of big-data storage and avoids the metadata-node bottleneck of a self-built HDFS file system
  • The customer plans to use elastic cloud resources to expand compute capacity and reduce one-off resource investment

This is a practical case of hot/cold tiering. The education platform’s business includes collecting various logs to help students improve their learning. The customer faced a problem: large-scale log collection put heavy pressure on storage space, and because the customer ran its own IDC, physical expansion could not be completed quickly, so a data lake solution was chosen. A dedicated line connected the IDC to Alibaba Cloud, extending the existing IDC with cloud resources; offline cold data was migrated over the line into the data lake, freeing space for offline workloads and relieving the pressure. Furthermore, many application logs are now written directly into the lake, where object storage and its multi-version capability provide stronger guarantees of data reliability, and the Cold Archive capability is used to sink warm data further and reduce costs. Data in the lake is brought back over the dedicated line for local computation. If the customer later wants to expand compute capacity with cloud resources, there is no need to purchase offline compute servers up front, which reduces costs further.

Globalized online game practice case

Customer value

  • Through the log service, application logs are collected and delivered to a real-time computing engine, providing data support for user heat maps, user trajectories, and login and online statistics
  • The OSS data lake stores all log data long term; combined with an offline analysis engine, it supports in-depth analysis of the log data
  • Unified global architecture deployment: a globalized game can use the same deployment model in any region of the world, simplifying operations and deployment

A global game generally serves players worldwide, which calls for a globally unified architecture to reduce operational complexity. The Alibaba Cloud data lake uses the same deployment model in every region of the world, fully matching this need. Log collection is also critical in the game industry: for example, the large-screen display of concurrent player counts is driven by collected application-server logs. For this customer we adopted Alibaba Cloud Log Service (SLS), which collects logs from thousands of application servers in real time, pushes them to Flink for real-time computation, and writes the results to ClickHouse for real-time queries. OSS acts as the permanent log store in this scenario: SLS periodically delivers the collected logs to OSS, and some application logs are uploaded directly through the OSS SDK and command-line tools. Logs stored in OSS can be analyzed further offline with engines such as Spark and Hive, with the results of that deep analysis written to ClickHouse to serve more analytical queries.
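The concurrent-online statistic mentioned above reduces to aggregating login and logout events. The sketch below shows only the core logic; in production this would run continuously as a streaming job (for example in Flink) rather than over a finished list:

```python
def peak_concurrent(events):
    """Peak concurrent-online count from (timestamp, delta) events.

    delta is +1 for a login and -1 for a logout. Events are sorted
    by timestamp so the running total is evaluated in order.
    """
    online = peak = 0
    for _, delta in sorted(events):
        online += delta
        peak = max(peak, online)
    return peak

# Two players overlap between t=2 and t=3, so the peak is 2:
# peak_concurrent([(1, +1), (2, +1), (3, -1), (4, +1), (5, -1), (6, -1)])
```

The streaming engine maintains the running total as state and emits it on a window, which is what feeds the real-time large-screen display.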

XPeng Motors – autonomous driving practice case

Data lakes integrate seamlessly with other storage products. In this autonomous-driving case, we provided a complete solution from acquisition to storage to analysis. Lightning Cube provides in-vehicle deployment and solves the problem of storing the large volumes of road data collected every day in autonomous-driving scenarios: the collected data is uploaded through the nearest access point into the OSS data lake, solving the last-mile problem. Once the data is in OSS, the various Alibaba Cloud computing engines, including EMR and MaxCompute, can clean, annotate, and analyze it directly. CPFS is Alibaba Cloud’s storage product for massively parallel computing, offering very high throughput and POSIX semantics. Seamless data flow between the OSS data lake and CPFS lets training data move into CPFS, be processed on GPUs, and be written back to OSS for long-term storage.

Data lakes are now widely used not only across the Internet industry but also in fields such as autonomous driving and high-performance computing. We hope more users will bring the Alibaba Cloud data lake into their production business.


This article is original content of Alibaba Cloud and may not be reproduced without permission.