Abstract: During the 618 promotion, Xiao Zhang ran into a thorny problem: it called for a joint analysis of the past year's revenue of the company's e-commerce department and the operating data of its offline stores within a week.
What kind of data challenges does this create?
- Data silos: The e-commerce department's data is stored in warehouse A, while the stores' operating revenue data is stored in warehouse B. How can a joint analysis across multiple warehouses be done conveniently?
- PB-scale data volume: Multiple e-commerce platforms plus offline stores nationwide generate TB-scale data every day, so the annual data volume reaches the PB scale!
He immediately contacted the group's CTO, hoping to have each department's data exported to him within a day.
At this point, the CTO was stumped:
The company's existing resource pool handles TB-scale data comfortably, but Xiao Zhang's data is roughly estimated at the PB scale. That is far beyond what the existing pool can handle, so the export could only be done at the cost of time. And expanding the company's resource pool for such an unusual scenario would cost too much overall.
Faced with Xiao Zhang's problem, Yunhuhu recommended Huawei Cloud's powerful tool for big data query and analysis: the Data Lake Exploration (DLI) service. DLI can run joint queries over EB-scale data for only 0.35 yuan per CU per hour (1 CU = 1 core and 4 GB of memory), and a monthly package for 1 CU costs only 150 yuan.
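To put the two prices quoted above side by side, here is a minimal sketch of the arithmetic. Only the 0.35 yuan per CU-hour and 150 yuan per CU-month figures come from the text; the function names and the 16-CU, 3-hour job are hypothetical, and real bills may include items this sketch ignores.

```python
# Prices quoted in the text (yuan); everything else here is illustrative.
PAY_PER_USE_YUAN_PER_CU_HOUR = 0.35
MONTHLY_PACKAGE_YUAN_PER_CU = 150.0


def pay_per_use_cost(cus: int, hours: float) -> float:
    """Cost of running `cus` compute units for `hours` hours, billed hourly."""
    return cus * hours * PAY_PER_USE_YUAN_PER_CU_HOUR


def break_even_hours() -> float:
    """Hours per month at which hourly billing for 1 CU equals the monthly package."""
    return MONTHLY_PACKAGE_YUAN_PER_CU / PAY_PER_USE_YUAN_PER_CU_HOUR


if __name__ == "__main__":
    # Hypothetical job: 16 CUs for 3 hours.
    print(f"16 CUs x 3 h: {pay_per_use_cost(16, 3):.2f} yuan")
    print(f"Break-even: {break_even_hours():.1f} hours per CU per month")
```

Under these assumptions, hourly billing is the cheaper option as long as a CU is busy for fewer than about 429 hours in a month; beyond that, the monthly package costs less.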
Data Lake Exploration (DLI) Service 2.0 is a serverless big data computing and analysis service that is fully compatible with the Spark and Flink ecosystems. Users can query and analyze heterogeneous data sources using standard SQL or programs.
How does DLI solve Xiao Zhang's problem?
DLI service architecture – Serverless
DLI is a serverless big data query and analysis service. Its advantages are as follows:
(1) Pay-per-use billing: charges are based on actual usage (data scanned or CUs), and nothing is charged when no jobs are running.
(2) Automatic scaling: The system automatically scales computing resources out and in based on service load.
The DLI serverless architecture easily addresses tight budgets, insufficient resources, and ad hoc business requirements.
1. Spark + Flink, the core DLI engines
Spark is a unified analytics engine for large-scale data processing, focused on query and computational analysis. DLI builds on open-source Spark with extensive performance optimization and service-oriented adaptation. It is compatible with the Apache Spark ecosystem and interfaces, and its performance is 2.5 times higher than that of open-source Spark. It can query and analyze EB-scale data within hours. DLI also provides a Flink engine for real-time processing.
2. DLI's trump card: cross-source analysis
DLI connects to a variety of cloud services, self-built databases, and offline databases, and can run cross-database analysis directly over multiple data sources to build a unified view of the enterprise.
Once Xiao Zhang connects offline warehouse A and warehouse B to DLI, he can run joint queries directly in DLI. This avoids migrating the data and rebuilding a warehouse just for the joint query, so cross-database queries become easy.
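The idea of joining two separately stored data sets with one SQL statement can be tried locally. The sketch below is only an analogy built on Python's standard sqlite3 module, not DLI's own API: two in-memory SQLite databases stand in for warehouse A and warehouse B, and the table names, column names, and figures are all hypothetical.

```python
import sqlite3

# Local stand-in only: two in-memory SQLite databases play the roles of
# "warehouse A" (e-commerce) and "warehouse B" (offline stores).
# Table and column names are hypothetical; this is not DLI's API.
conn = sqlite3.connect(":memory:")  # the main database acts as warehouse A
conn.execute("ATTACH DATABASE ':memory:' AS warehouse_b")

conn.execute("CREATE TABLE ecommerce_revenue (region TEXT, revenue REAL)")
conn.execute("CREATE TABLE warehouse_b.store_revenue (region TEXT, revenue REAL)")

conn.executemany(
    "INSERT INTO ecommerce_revenue VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0)],
)
conn.executemany(
    "INSERT INTO warehouse_b.store_revenue VALUES (?, ?)",
    [("north", 30.0), ("south", 45.0)],
)

# One standard SQL statement joins both sources; no data is copied
# from one database into the other first.
rows = conn.execute(
    """
    SELECT e.region,
           e.revenue AS online_revenue,
           s.revenue AS store_revenue,
           e.revenue + s.revenue AS total_revenue
    FROM main.ecommerce_revenue AS e
    JOIN warehouse_b.store_revenue AS s ON s.region = e.region
    ORDER BY e.region
    """
).fetchall()

for row in rows:
    print(row)
conn.close()
```

The point of the analogy is the shape of the query: each source is addressed by its own name inside a single statement, so the join happens where the query runs rather than after a migration step.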
Other benefits of the Data Lake Exploration (DLI) service
- Pure SQL operation: Provides standard SQL interfaces, enabling users to query and analyze massive data using only SQL.
- Separation of storage and compute: Storage and compute are decoupled, so each can be provisioned and billed separately, which reduces costs and improves resource utilization.
- Enterprise multi-tenancy: Computing resources are isolated by tenant, and data permissions are controlled down to the queue and job level, helping enterprises share data between departments and manage permissions.
- O&M-free and highly available: Users do not need to deal with underlying O&M, upgrades, cross-AZ HA, or cross-AZ active-active (HyperMetro) setup.
Application scenarios of the Data Lake Exploration (DLI) service
1. Database analysis + DLI 2.0: one-click warehouse building that keeps the ease of use of a database
Pain points:
(1) Most databases cannot analyze the full data set
(2) Complex relationships across databases cannot be queried
(3) Analysis workloads affect other online data services
Solution:
Big data query and analysis can be completed using only standard SQL.
2. Precision marketing + DLI 2.0: intelligent e-commerce recommendation with cross-database, cross-source queries over massive data in seconds
Pain points:
(1) There are too many data sources to analyze jointly
(2) Intelligent recommendations must be produced within a short time
Solution:
DLI's cross-source capability easily breaks down data silos. It currently supports 10 types of data sources as well as offline self-built data sources.
3. Log analysis + DLI 2.0: a must-have scenario for every company, where pay-per-use billing cuts costs
Pain points:
(1) Log analysis covers a long time span
(2) Large amounts of resources sit idle, so utilization is low
Solution:
DLI bills by usage, at only 0.35 yuan per CU per hour.
4. Real-time risk control + DLI 2.0: reducing risk events in finance, O&M, and other real-time scenarios
Pain points:
(1) Data is not refreshed in time, and risk events occur frequently
(2) Real-time data analysis requires an in-depth understanding of Flink's underlying architecture
Solution:
Risk control systems demand strong real-time performance. DLI uses high-performance computing resources, and a single CPU can process 1,000 to 20,000 messages per second.
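The per-CPU throughput range above translates into a rough capacity estimate. The following is a back-of-the-envelope sketch: only the 1,000 to 20,000 messages per second range comes from the text, while the helper name `cpus_needed` and the peak load of 150,000 messages per second are hypothetical. Actual throughput depends on message size and job logic.

```python
import math

# Per-CPU throughput range quoted in the text (messages per second).
MIN_MSGS_PER_CPU = 1_000
MAX_MSGS_PER_CPU = 20_000


def cpus_needed(peak_msgs_per_sec: int, msgs_per_cpu: int) -> int:
    """Smallest whole number of CPUs that covers the peak message rate."""
    if peak_msgs_per_sec < 0 or msgs_per_cpu <= 0:
        raise ValueError("peak rate must be >= 0 and per-CPU rate must be > 0")
    return math.ceil(peak_msgs_per_sec / msgs_per_cpu)


if __name__ == "__main__":
    peak = 150_000  # hypothetical peak load, messages per second
    print("slowest case:", cpus_needed(peak, MIN_MSGS_PER_CPU), "CPUs")
    print("fastest case:", cpus_needed(peak, MAX_MSGS_PER_CPU), "CPUs")
```

The wide gap between the two results shows why the per-CPU figure should be measured for a specific job before sizing a queue.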
Serverless big data services are the shape of the future. As today's obstacles are overcome one by one, their share of big data analytics will grow year by year, truly turning big data analytics into an accessible tool that every enterprise can afford, just like water and electricity. The Huawei Cloud Data Lake Exploration (DLI) service enables enterprises to easily complete batch and stream processing over heterogeneous data sources, and to mine and explore the value of their data.
For more information, visit the official website of the Huawei Cloud Data Lake Exploration (DLI) service.