In recent years, with the continuous development of IT, big data, machine learning, and algorithms, more and more enterprises have realized the value of data, managing it as a precious asset and using big data and machine learning capabilities to mine, identify, and utilize their data assets.

If an enterprise lacks the ability to design an effective overall data architecture, the business layer will find it hard to use big data directly, and a huge gap opens up between big data and the business. This gap leads to a series of problems in the course of using big data: data becomes opaque, requirements are hard to fulfill, and data is hard to share. This article introduces some data platform design ideas to help enterprises reduce the pain points and difficulties in data development.

I researched 10 companies for this article.

I. Big data technology stack

The overall big data process involves many modules, each of which is complicated. The following figure lists these modules and components as well as their functions and features. Later articles will introduce the domain knowledge of the relevant modules in detail, such as data collection, data transmission, real-time computing, offline computing, and big data storage.



II. Lambda architecture and Kappa architecture

At present, almost all big data architectures are based on the Lambda and Kappa architectures, and different companies design their data architectures according to these two patterns. The Lambda architecture enables developers to build large-scale distributed data processing systems.

It is flexible and extensible, and it tolerates hardware failures and human error. Many articles about the Lambda architecture are available online. The Kappa architecture, by contrast, eliminates the cost of operating the two separate data processing systems that the Lambda architecture requires. This is also the current direction of stream-batch unification research, and many enterprises have begun to adopt this more advanced architecture.
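To make the contrast concrete, here is a minimal, purely illustrative sketch (all names and numbers are hypothetical, not from any specific system) of how a Lambda-style serving layer answers a query by merging a precomputed batch view with a speed-layer view; a Kappa-style system would instead keep a single view derived from the stream:

```python
# Illustrative Lambda-style serving: merge a batch view (recomputed
# periodically from the master dataset) with a speed-layer view
# (incrementally updated from the stream). All names are hypothetical.

batch_view = {"page_a": 1000, "page_b": 250}   # counts up to the last batch run
speed_view = {"page_a": 12, "page_c": 3}       # increments since the batch run

def lambda_query(key: str) -> int:
    """Serving layer: combine batch and speed views at query time."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

# A Kappa-style system would maintain only one stream-derived view,
# so the query would read a single store instead of merging two.
print(lambda_query("page_a"))  # 1012
```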

Lambda architecture



Kappa architecture



III. Big data architecture under the Kappa and Lambda architectures

At present, major companies basically use either the Kappa or the Lambda architecture pattern, and in the early stage of development the overall big data architecture under these two patterns may look as follows:



IV. End-to-end data pain points

Although this architecture appears to bring a variety of big data components together under integrated management, anyone who has done data development will feel strongly that developing directly on this bare architecture requires paying attention to many underlying tools in addition to the business data itself. There are many pain points and difficulties in actual data development, embodied in the following aspects:

  1. There is no data development IDE to manage the entire development workflow, so long-lived pipelines cannot be managed.
  2. There is no standard data modeling system, so different data engineers understand the same indicators differently.
  3. Developing directly against big data components demands deep expertise; problems easily arise when technical components such as HBase and ES are used to serve ordinary business requests.
  4. Basically every company's big data pipeline is very complicated and involves many links, so when a problem occurs it is difficult to locate it and find the responsible owner.
  5. Data silos are hard to break down: sharing data across teams and departments is difficult, and teams do not know what data the others have.
  6. Two computing models, batch and stream, must be maintained, which makes development hard to get started with; a unified stream-batch SQL layer is needed.
  7. Because there is no company-level metadata planning, the same data is hard to reuse between real-time and offline computing, and every development task requires sorting things out all over again.

Basically, most companies have these problems and pain points in data platform governance and openness. Under a complex data architecture, every link that is unclear or unfriendly to data consumers makes an already complex chain even more complicated. To address these pain points, each piece must be polished so that the technical components connect seamlessly and the business can use data end to end as easily as writing SQL queries against a single database.

V. Excellent overall big data architecture design

Such an architecture provides a variety of platforms and tools to support the data platform: a data acquisition platform for multiple data sources, a one-click data synchronization platform, a data quality and modeling platform, a metadata system, a unified data access platform, real-time and offline computing platforms, a resource scheduling platform, and a one-stop development IDE.



VI. Metadata: the cornerstone of the big data system

Metadata connects data sources, the data warehouse, and data applications, and records the complete path of data from generation to consumption. Metadata includes static table, column, and partition information (the MetaStore); dynamic mappings between tasks and the tables they depend on; data warehouse model definitions and data life cycles; and ETL task scheduling information with its inputs and outputs. Metadata is the foundation of data management, data content, and data applications. For example, metadata can be used to build a data graph linking tasks, tables, columns, and users; to construct the task DAG and arrange the task execution order; to build task profiles and manage task quality; and to give an individual or a BU an overview of asset management and computing resource consumption.
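As a minimal illustration (the record fields and table names below are assumptions, not a specific MetaStore API), the sketch models tables and ETL tasks as metadata records and derives the task DAG and execution order from their declared inputs and outputs:

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # Python 3.9+ standard library

@dataclass
class TableMeta:
    name: str
    columns: list[str]
    partition_keys: list[str] = field(default_factory=list)

@dataclass
class TaskMeta:
    name: str
    inputs: list[str]    # table names the task reads
    outputs: list[str]   # table names the task writes

tables = [TableMeta("ods_orders", ["order_id", "spu_id", "amount"], ["dt"]),
          TableMeta("dwd_orders", ["order_id", "spu_id", "amount"], ["dt"]),
          TableMeta("ads_gmv_daily", ["dt", "gmv"])]

tasks = [TaskMeta("clean_orders", inputs=["ods_orders"], outputs=["dwd_orders"]),
         TaskMeta("agg_gmv", inputs=["dwd_orders"], outputs=["ads_gmv_daily"])]

# Task B depends on task A if B reads a table that A writes.
producers = {out: t.name for t in tasks for out in t.outputs}
deps = {t.name: {producers[i] for i in t.inputs if i in producers} for t in tasks}

# Derive the execution order from the metadata-built DAG.
print(list(TopologicalSorter(deps).static_order()))  # ['clean_orders', 'agg_gmv']
```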

It is fair to say that the entire data flow of a big data system is governed by metadata. Without a complete metadata design, data becomes hard to trace, permissions hard to control, resources hard to manage, and data hard to share, among other problems.

Many companies rely on Hive to manage metadata, but I think that beyond a certain stage of development, an enterprise still needs to build its own metadata platform to match its architecture.

VII. Unified stream-batch computing

Maintaining two computing engines, such as Spark for offline computing and Flink for real-time computing, creates great difficulty for users, who have to learn both stream computing and batch computing domain knowledge.

If you use Spark or Hadoop for offline computing and Flink for real-time computing, you can develop a custom DSL to bridge the syntax of the different computing engines. Upper-layer users then do not need to care about the execution details of the underlying engines; they only need to master one DSL, which can target Spark, Hadoop, Flink, and other computing engines.
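A minimal sketch of the idea follows (the DSL structure, function names, and generated SQL are illustrative assumptions, not a real platform's API): one logical pipeline definition is rendered into engine-specific SQL, so the user writes the logic once and the platform targets Spark for batch or Flink for streaming:

```python
# Hypothetical unified DSL: one logical definition, rendered per engine.
pipeline = {
    "source": "orders",
    "filter": "amount > 0",
    "group_by": ["spu_id"],
    "select": {"spu_id": "spu_id", "gmv": "SUM(amount)"},
    "sink": "ads_gmv_by_spu",
}

def to_sql(p: dict, engine: str) -> str:
    """Render the logical pipeline as SQL for the given target engine."""
    cols = ", ".join(f"{expr} AS {alias}" for alias, expr in p["select"].items())
    sql = (f"INSERT INTO {p['sink']} "
           f"SELECT {cols} FROM {p['source']} "
           f"WHERE {p['filter']} GROUP BY {', '.join(p['group_by'])}")
    # In a real platform the engine differences go well beyond a comment:
    # e.g. Flink needs watermarks/windows for streaming aggregation,
    # while Spark batch reads a bounded partition of the table.
    return f"-- target engine: {engine}\n{sql}"

print(to_sql(pipeline, "spark"))   # batch execution
print(to_sql(pipeline, "flink"))   # streaming execution
```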

VIII. Real-time and offline ETL platform

ETL stands for Extract-Transform-Load, which describes the process of extracting data from a source, transforming it, and loading it into a destination. The term ETL is most commonly used in the context of data warehouses, but its scope is not limited to them. Generally speaking, an ETL platform plays an important role in data cleaning, data format conversion, data completion, data quality management, and other aspects. As an important intermediate layer for data cleaning, an ETL platform generally provides at least the following capabilities:

  1. Support multiple data sources, such as message systems and file systems.
  2. Support a variety of operators: filtering, splitting, transformation, output, and operators that query a data source to complete missing data (a minimal sketch follows this list).
  3. Support dynamic logic changes. For example, the operators above can be submitted as dynamically loaded JARs so that changes can be published continuously.
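Here is a minimal sketch of such an operator-based pipeline (the operator names and record shapes are illustrative assumptions; in a JVM-based platform the "dynamic" part would be a dynamically loaded JAR rather than a Python function):

```python
from typing import Callable, Iterable

# Each operator maps a record stream to a record stream, so new operators
# can be plugged in or swapped at runtime without changing the pipeline runner.
Record = dict
Operator = Callable[[Iterable[Record]], Iterable[Record]]

def filter_op(predicate: Callable[[Record], bool]) -> Operator:
    return lambda records: (r for r in records if predicate(r))

def transform_op(fn: Callable[[Record], Record]) -> Operator:
    return lambda records: (fn(r) for r in records)

def run_pipeline(records: Iterable[Record], operators: list[Operator]) -> list[Record]:
    for op in operators:
        records = op(records)
    return list(records)

source = [{"spu_id": "1", "amount": 30}, {"spu_id": "2", "amount": -5}]
ops = [
    filter_op(lambda r: r["amount"] > 0),               # data cleaning
    transform_op(lambda r: {**r, "currency": "CNY"}),    # data completion
]
print(run_pipeline(source, ops))
# [{'spu_id': '1', 'amount': 30, 'currency': 'CNY'}]
```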

IX. Intelligent unified query platform

Most data queries are demand-driven: one or several interfaces are developed for each requirement, documented, and opened to the business side for invocation. In a big data system, this mode has many problems:

  1. The architecture is simple, but the interface granularity is very coarse, flexibility is low, extensibility is poor, and the reuse rate is low. As business requirements grow, the number of interfaces increases sharply and maintenance costs become high.
  2. Development efficiency is also low. For a massive data system this obviously causes a great deal of repeated development; data and logic are hard to reuse, which seriously degrades the experience of business consumers.
  3. Without a unified query platform, databases such as HBase are exposed directly to services, which makes subsequent data operation and maintenance difficult. Accessing big data components directly is also painful for business consumers and can cause all kinds of problems.

A unified intelligent query platform can solve the big data query pain points described above.
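One possible shape for such a platform (the class and table names below are purely illustrative assumptions) is a single query facade that accepts a logical query and routes it to the right storage engine, so the business side never talks to HBase or ES directly:

```python
from abc import ABC, abstractmethod

class QueryBackend(ABC):
    """One adapter per storage engine; the business side never sees these."""
    @abstractmethod
    def query(self, table: str, filters: dict) -> list[dict]: ...

class HBaseBackend(QueryBackend):
    def query(self, table, filters):
        # In reality this would translate filters into row-key scans.
        return [{"source": "hbase", "table": table, **filters}]

class ESBackend(QueryBackend):
    def query(self, table, filters):
        # In reality this would build an Elasticsearch query DSL request.
        return [{"source": "es", "table": table, **filters}]

class UnifiedQueryService:
    """Single entry point: routes a logical query by table registration."""
    def __init__(self, routing: dict):
        self.routing = routing

    def query(self, table: str, filters: dict) -> list[dict]:
        return self.routing[table].query(table, filters)

svc = UnifiedQueryService({"user_profile": HBaseBackend(), "item_search": ESBackend()})
print(svc.query("user_profile", {"user_id": "42"}))
```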


X. Data warehouse modeling standards

As business complexity and data scale grow, chaotic data invocation and copying, resource waste from redundant construction, ambiguity from inconsistent indicator definitions, and an ever-higher threshold for using data all emerge. To give a real example from business event tracking and warehouse usage I have witnessed: some table fields are named good_id, others spu_id, and there are many other names for the same commodity, which is very confusing for anyone who wants to make use of this data. Without a complete big data modeling system, data governance becomes very difficult, which is embodied in the following aspects:

  1. Data standards are inconsistent: even indicators with the same name have different definitions of scope. UV alone, for example, has a dozen definitions, which raises the questions: they are all called UV, so which one should I use? They are all called UV, so why do the numbers differ?
  2. Every engineer has to understand every detail of the development process from beginning to end, and everyone steps into the same pitfalls again and again, wasting the time and energy of development staff. This is also a pain point for the author, because actually developing and extracting data under these conditions is far too difficult.
  3. There is no unified standard management, which leads to resource waste such as duplicate computation, while unclear table hierarchy and granularity also make redundant storage a serious problem.

Therefore, big data development and data warehouse table design must adhere to design principles. A data platform can build tooling to constrain unreasonable designs, such as Alibaba's OneData system. In general, data development follows the guidelines below:


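As a minimal example of what platform-level enforcement can look like (the naming rules and canonical field dictionary below are illustrative assumptions, not Alibaba's OneData specification), a proposed table design can be checked against layer prefixes and standard field names before it is created:

```python
# Illustrative naming rules: layer prefixes and one canonical name per concept.
LAYER_PREFIXES = ("ods_", "dwd_", "dws_", "ads_")
CANONICAL_FIELDS = {"good_id": "spu_id"}   # non-standard name -> standard name

def validate_table(name: str, columns: list) -> list:
    """Return human-readable violations for a proposed table design."""
    issues = []
    if not name.startswith(LAYER_PREFIXES):
        issues.append(f"table '{name}' must start with one of {LAYER_PREFIXES}")
    for col in columns:
        if col in CANONICAL_FIELDS:
            issues.append(f"column '{col}' should be named '{CANONICAL_FIELDS[col]}'")
    return issues

print(validate_table("order_detail", ["good_id", "amount"]))
# flags the missing layer prefix and the non-standard 'good_id' column name
```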

The above is my own perspective, and I am happy to be corrected. If you have questions or better ideas, feel free to comment or send me a private message.


Author: IT technology management stuff


Source: Today's Headline (Toutiao)