What exactly a data center is has been debated for several years.
In the author's view, the data center is not a single system or software tool but an architecture together with a pattern of data flow.
The data center collects data as the raw material for processing and modeling and stores it by category. It then builds data services (including data application platforms) around actual business scenarios, so that data can support the business more quickly.
Realizing this process requires supporting systems and products. Which systems or products, then, should a basic data center consist of?
We can first look at the data center architectures of several enterprises.
Although each enterprise's data center differs because of its business, the overall architecture is largely the same. Data passes through four stages: “data collection and access”, “processing and storage”, “unified management”, and “service application”.
The author believes that the data platform architecture summarized in the book “Data Platform Product Manager: From data system to Data platform combat” is fairly general. Both Internet companies and traditional industries can adapt it to design and build their own platform architecture.
In general, the functional architecture of the data center consists of a big data platform, a data asset management platform, and a data service platform. Within the data service platform, the self-service analysis platform and the label management system are the most widely used applications.
1. Big data platform
The big data platform is the foundation of the data center; it can also be called the big data development platform. It must provide big data development capabilities, including data storage, data cleaning and computation, data query and display, and permission management. How should these functions and services be built? And does having them amount to a successful big data platform?
In fact, the system architectures of the big data platforms at different companies are largely the same. They include data collection components, data storage components, data computing engines, data permission and security components, and cluster management and monitoring components.
Apart from a few enterprises such as Alibaba, which invested heavily in building its own “Apsara” (Feitian, literally “Flying Sky”) system, most enterprises base their underlying components on the Hadoop ecosystem and rely on open source components that they optimize, improve, and extend through secondary development. For example, the data storage component can be HBase or Hive, and the data computing engine can be Spark or Flink.
If everyone chooses the same or similar components, why do the service capabilities of each enterprise's big data platform end up differing? The situation resembles buying parts to assemble a desktop computer: you do not need the most expensive parts, only the ones that best suit your actual needs.
A good big data platform must be able to solve problems for its users. The big data platform of a data center is therefore not judged by how many new technologies it introduces or how many technical components it covers. What matters is whether it can cope with the complex data landscape that data center construction faces, whether it can serve as the technical guarantee that lets the data center break down data barriers, whether it provides concise and effective data processing tools (such as self-configurable data collection and data cleaning tools), and whether it delivers additional value.
Building the big data platform within the data center avoids the waste of resources that occurs when the technical team of each business division builds its own big data cluster. A unified, mature big data platform cannot be achieved overnight; an enterprise has to build its big data platform ecosystem step by step through continuous iteration.
2. Data asset management platform
The data asset management platform mainly manages data resources. Data assets are spread across various big data components, such as Hive tables, HBase tables, Druid datasources, and Kafka streams. The management and control systems of these components do not interoperate easily, so a unified data asset management service is needed to coordinate the management of big data resources.
Once the big data platform is in place, the data system of the data center can be built. By classifying and integrating the data of the various business lines, we can build data subject domains, store data in a standardized way, form data assets, and then manage those assets.
In the data center system, the data asset management platform consists mainly of metadata management and data model management. Let us look at each in turn.
- Metadata management
To discuss metadata management, we first need to understand what metadata is.
Metadata is usually defined as data about data. It describes data and information resources. Metadata is the most important data of all.
Here is the most common example. When we go to a library to borrow a book, finding it among tens of thousands of volumes is naturally difficult. But if you enter the title, author, publisher, and similar information into the library's inquiry system, you get the book's exact location. The title, author, and so on can be understood as metadata, while the book's storage location, borrowing history, and the like are ordinary data in the system.
In a database, the metadata of each data table includes the table name, creation information (creator, creation time, department), modification information, table fields (field name, field type, field length, etc.), and the relationship between the table and other tables.
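The table-level metadata listed above can be pictured as a small record type. The following is only a minimal sketch: the class names, the attribute names, and the sample table `dw.orders` are illustrative assumptions, not part of any particular metadata system.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class FieldMetadata:
    """Metadata of one table field (hypothetical structure)."""
    name: str
    data_type: str
    length: Optional[int] = None  # None when the type has no fixed length


@dataclass
class TableMetadata:
    """Metadata of one data table, following the items listed in the text."""
    table_name: str
    creator: str
    department: str
    created_at: datetime
    modified_at: Optional[datetime] = None       # modification information
    fields: list = field(default_factory=list)   # list of FieldMetadata
    related_tables: list = field(default_factory=list)

    def field_names(self):
        return [f.name for f in self.fields]


# Illustrative entry for an assumed table "dw.orders".
orders_meta = TableMetadata(
    table_name="dw.orders",
    creator="alice",
    department="data-platform",
    created_at=datetime(2023, 1, 5, 9, 30),
    fields=[
        FieldMetadata("order_id", "bigint"),
        FieldMetadata("user_id", "bigint"),
        FieldMetadata("status", "varchar", length=16),
    ],
    related_tables=["dw.users"],
)

print(orders_meta.field_names())
```

A real platform would usually harvest these values from the storage components rather than have them typed in by hand.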
In fact, metadata can be classified in various ways. The author prefers to classify it by use, into three categories: business metadata, technical metadata, and management metadata.
► Business metadata: describes the business meaning and rules of data, including business rules, the data dictionary, and security standards. Clear business metadata gives everyone a shared understanding of the data and removes ambiguity, so that business users who are unfamiliar with the database can still understand what a data table contains.
► Technical metadata: describes data source information, data transformations, and data structures. It mainly serves data developers, letting them see a table's structure and the upstream and downstream tasks it depends on. It includes database table and field information (storage location, database table, field length, and field type), data models, ETL scripts (scheduling information), SQL scripts, and so on.
► Management metadata: describes who owns and manages the data, including business ownership, system ownership, operation and maintenance ownership, and data permission ownership. It is the basis of data security management.
Some people therefore say that metadata records the whole life of data from its creation onward. It is like a “dictionary” of the data, in which we can look up the meaning and origin of each field, and also like a “map”, on which we can trace the path by which the data was produced.
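The “map” role of metadata can be sketched as a lineage lookup: given a table, walk back through the tables it was built from. The lineage dictionary and the table names below are invented for illustration, and a real system would derive them from ETL and SQL scripts.

```python
from collections import deque

# Hypothetical lineage: each table maps to the tables it is built from.
LINEAGE = {
    "ads.daily_sales_report": ["dws.sales_by_day"],
    "dws.sales_by_day": ["dwd.orders", "dwd.payments"],
    "dwd.orders": ["ods.orders_raw"],
    "dwd.payments": ["ods.payments_raw"],
}


def trace_upstream(table, lineage):
    """Return every table that `table` depends on, nearest first (breadth-first)."""
    seen = []
    queue = deque(lineage.get(table, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also protects against cycles
        seen.append(current)
        queue.extend(lineage.get(current, []))
    return seen


print(trace_upstream("ads.daily_sales_report", LINEAGE))
```

Tables with no recorded sources, such as the assumed `ods.*` tables here, simply end the walk.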
Through the construction of the data system, the metadata of the data center brings together the data information of every business line and system in the enterprise. This gives the data center a view of data assets across the whole domain and achieves the goal of unified query and access to data assets.
Metadata management includes creating, deleting, and editing metadata, version management, metadata statistical analysis, and meta-model management. These functional modules allow the data system to be implemented in a planned way, giving metadata structure and a model. This avoids disordered and redundant metadata and makes it easier for users to query and locate data.
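As a rough illustration of the add, delete, edit, and version management functions just listed, the sketch below keeps every saved version of a metadata entry in memory. `MetadataRegistry` and its method names are hypothetical, and persistence and meta-model handling are deliberately left out.

```python
import copy


class MetadataRegistry:
    """In-memory store that keeps every version of each metadata entry."""

    def __init__(self):
        self._history = {}  # entry name -> list of versions, oldest first

    def upsert(self, name, attributes):
        """Add a new entry or record an edited version; return the version number."""
        versions = self._history.setdefault(name, [])
        versions.append(copy.deepcopy(attributes))
        return len(versions)

    def get(self, name, version=None):
        """Return the latest version, or a specific 1-based version."""
        versions = self._history[name]
        if version is None:
            version = len(versions)
        if not 1 <= version <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        return copy.deepcopy(versions[version - 1])

    def delete(self, name):
        """Remove the entry together with its history."""
        del self._history[name]

    def stats(self):
        """Simple statistic: number of recorded versions per entry."""
        return {name: len(versions) for name, versions in self._history.items()}


registry = MetadataRegistry()
registry.upsert("dw.orders", {"owner": "alice", "fields": ["order_id"]})
registry.upsert("dw.orders", {"owner": "alice", "fields": ["order_id", "status"]})
print(registry.get("dw.orders"))             # latest version
print(registry.get("dw.orders", version=1))  # first version
```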
- Data model management
When introducing metadata, we mentioned that technical metadata includes data models. A data model is the work product of data modeling that uses metadata.
Metadata can be derived from how the underlying data is used, for example from the associations between data tables and from SQL scripts (data aggregation and query information). This metadata helps abstract the business and improves modeling efficiency.
A data model is an effective means of data integration. It defines the mapping between data sources and provides the “construction drawings” for building data subjects.
At the same time, specifying data standards during data modeling ensures data consistency and eliminates redundant data.
Data model management means that, during data modeling, data models are created, deleted, modified, and queried through an established data model management system, while complying with data standardization and unification requirements to ensure data quality.
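One way to picture model management that enforces data standards is a catalog that refuses to save a model when it breaks a naming rule. The sketch below is an assumption-laden example: the layer prefixes (`ods`, `dwd`, `dws`, `ads`), the snake_case rule, and the `ModelCatalog` class are illustrative choices, not standards taken from the text.

```python
import re

# Assumed naming standards, for illustration only.
TABLE_NAME_RULE = re.compile(r"^(ods|dwd|dws|ads)_[a-z][a-z0-9_]*$")
FIELD_NAME_RULE = re.compile(r"^[a-z][a-z0-9_]*$")


def check_standards(name, fields):
    """Return a list of standard violations (empty when the model complies)."""
    problems = []
    if not TABLE_NAME_RULE.match(name):
        problems.append(f"table name '{name}' violates the naming standard")
    for field_name in fields:
        if not FIELD_NAME_RULE.match(field_name):
            problems.append(f"field '{field_name}' is not snake_case")
    return problems


class ModelCatalog:
    """Create, read, update, and delete data models, validating on every save."""

    def __init__(self):
        self._models = {}

    def save(self, name, fields):
        """Create or modify a model; reject it when it breaks a standard."""
        problems = check_standards(name, fields)
        if problems:
            raise ValueError("; ".join(problems))
        self._models[name] = dict(fields)

    def get(self, name):
        return dict(self._models[name])

    def remove(self, name):
        del self._models[name]

    def names(self):
        return sorted(self._models)


catalog = ModelCatalog()
catalog.save("dwd_orders", {"order_id": "bigint", "status": "varchar"})
try:
    catalog.save("Orders", {"OrderID": "bigint"})
except ValueError as error:
    print("rejected:", error)
print(catalog.names())
```

Validating at save time keeps non-compliant models out of the catalog entirely, which is simpler than cleaning them up afterwards.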
3. Data service platform
- Self-service analysis platform
The self-service analysis platform is also known as the business intelligence (BI) platform. A BI platform is already standard equipment at many enterprises, and competition in the commercial BI market is increasingly fierce. Participants fall into the following three categories:
► Domestic BI vendors, typified by FanRuan (Fansoft), which has held first place in the domestic market for many consecutive years
► Foreign BI vendors, such as Tableau
► BI platforms incubated in-house by large Internet companies
The BI platform is the main provider of the data center's service capabilities. To realize the full value of the data center, building a BI platform is essential, so the BI platform should be planned as part of the data center system. In summary, a BI platform should have the following capabilities.
(1) Data access
In addition to the data center's own data sources, the BI platform also needs to support access to external data sources. There are three main access modes.
► File upload: supports uploading Excel file data.
► Data connection: supports databases such as MySQL and Oracle, and big data platforms such as Hadoop and Spark (the data center's own big data platform also falls into this category).
► API reading: third-party system data can be obtained through APIs.
Figure: data sources supported by the FanRuan (Fansoft) BI platform
(2) Data processing
The BI platform needs to provide data modeling tools that help users create target data (data sets). Basic operations include dragging and dropping table fields, automatic identification of dimensions and indicators, custom view statements, data preview, virtual fields, function calculation, and parameter settings, along with data processing functions such as JOIN and UNION across heterogeneous sources.
Figure: FineBI self-service data set data processing interface
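A JOIN across heterogeneous sources can be sketched with the Python standard library alone. In this hedged example an in-memory SQLite database stands in for a relational source and a CSV string stands in for an uploaded file; the table, the column names, and the helper `join_orders_with_city` are assumptions made for illustration and say nothing about how FineBI implements the feature.

```python
import csv
import io
import sqlite3

# Source 1: a relational database (an in-memory SQLite database stands in here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, 11, 40.0), (3, 10, 15.0)],
)
columns = ("order_id", "user_id", "amount")
orders = [
    dict(zip(columns, row))
    for row in conn.execute(
        "SELECT order_id, user_id, amount FROM orders ORDER BY order_id"
    )
]
conn.close()

# Source 2: an uploaded file (a CSV string stands in for the file content).
USERS_CSV = "user_id,city\n10,Beijing\n11,Shanghai\n"
users = {
    int(row["user_id"]): row["city"]
    for row in csv.DictReader(io.StringIO(USERS_CSV))
}


def join_orders_with_city(order_rows, city_by_user):
    """Left join: keep every order, adding the city when the user is known."""
    return [
        {**order, "city": city_by_user.get(order["user_id"])}
        for order in order_rows
    ]


dataset = join_orders_with_city(orders, users)
for row in dataset:
    print(row)
```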
(3) Data analysis and visualization
On top of data processing, the BI platform also needs to offer a rich set of charts and online analytical processing (OLAP) operations, so that users can complete data analysis and data visualization on the front-end page.
The workflow is as follows: users select a processed data set, filter dimensions and indicators, and then analyze the business question through operations such as roll-up and drill-down, chart linkage, and report jumps. The BI platform also provides visual chart components so that users can complete the design of the visual content.
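Roll-up and drill-down can be illustrated with a single aggregation helper applied at different levels of detail. The sample rows, the `region` and `city` dimensions, and the `aggregate` function are made up for this sketch.

```python
from collections import defaultdict

# Illustrative data set with two dimensions (region, city) and one indicator.
SALES = [
    {"region": "North", "city": "Beijing", "amount": 120},
    {"region": "North", "city": "Tianjin", "amount": 80},
    {"region": "East", "city": "Shanghai", "amount": 200},
    {"region": "East", "city": "Hangzhou", "amount": 50},
]


def aggregate(rows, dimensions, measure="amount"):
    """Sum `measure` for every combination of the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# Roll-up: summarize at the coarser "region" level.
by_region = aggregate(SALES, ["region"])

# Drill-down: open one region and look at its cities.
north_rows = [row for row in SALES if row["region"] == "North"]
north_by_city = aggregate(north_rows, ["region", "city"])

print(by_region)
print(north_by_city)
```

Fewer dimensions give a roll-up, while adding a dimension after filtering gives a drill-down; a BI platform typically exposes both as clicks on a chart.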
(4) Content distribution and basic services
The BI platform needs to distribute visual content and control both viewing permissions and data permissions. The main distribution channels include the BI platform itself, mobile BI (app), large-screen data dashboards, email, link access, and third-party embedding.
The BI platform also needs basic functions such as operation management, role management, a help center, and message push.
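The difference between viewing permissions and data permissions can be shown in a few lines: the first decides whether a role may open a report at all, the second decides which rows it sees. The roles, the report name, and the region-based row scope below are assumptions chosen for the example.

```python
ALL_REGIONS = "*"

# Viewing permission: which roles may open which report (hypothetical setup).
REPORT_VIEWERS = {"sales_overview": {"analyst", "north_manager"}}

# Data permission: which rows each role may see inside a report.
ROW_SCOPES = {"analyst": ALL_REGIONS, "north_manager": {"North"}}

ROWS = [
    {"region": "North", "amount": 120},
    {"region": "North", "amount": 80},
    {"region": "East", "amount": 200},
]


def visible_rows(report, role, rows):
    """Apply the viewing check first, then filter rows by the role's scope."""
    if role not in REPORT_VIEWERS.get(report, set()):
        raise PermissionError(f"role '{role}' may not view '{report}'")
    scope = ROW_SCOPES.get(role, set())  # a role without a scope sees nothing
    if scope == ALL_REGIONS:
        return list(rows)
    return [row for row in rows if row["region"] in scope]


print(visible_rows("sales_overview", "analyst", ROWS))
print(visible_rows("sales_overview", "north_manager", ROWS))
```

Defaulting to an empty scope means a misconfigured role sees no data rather than all of it.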
Only a BI platform that provides the functions above, with multidimensional analysis, data visualization, and large-screen dashboard capabilities, can maximize its value in the data center system and help analysts and operations teams work more efficiently.
- Label management system
Besides the BI platform, the label management system is another important application of data services. Business departments now face a large number of precision marketing scenarios. Recommendations and pushes tailored to each individual user depend on complete and accurate user portraits, which in turn need a large and comprehensive set of user labels.
As the basic data of personalized business applications, label data must be credible and effective; these qualities have become a key indicator of the maturity of a user portrait.
We can regard the label management system as the foundation of the user portrait system. Built on the data system created by the data center, it can break down the data barriers in label governance, establish an enterprise-level, commonly recognized user label system, and thus create an enterprise-level user portrait system.
The label management system of the data center has the following main functions.
(1) Identification of user uniqueness
In many enterprises, each line of business has its own independent user identification system. At 58 Group, for example, identification methods include the 58 device fingerprint, the Anjuke unique user, the recruitment natural person, and the financial natural person. Most of these methods serve a single line of business, and the labels within each line of business are developed against that line's own user identity.
The label management system of the data center can provide a unified user identification service that associates and unifies the independent user identifiers of each business line. This connects the separate identifiers and gives the whole enterprise a scheme for converting identifiers and labels between business lines.
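One common way to associate identifiers from different business lines is a union-find (disjoint-set) structure: whenever two identifiers are observed to belong to the same person, their groups are merged. This is a sketch of that general technique, not a description of 58 Group's actual scheme; the `IdentityResolver` class and the identifier values are hypothetical.

```python
class IdentityResolver:
    """Union-find over (system, id) pairs; each group shares one unified ID."""

    def __init__(self):
        self._parent = {}

    def _find(self, identifier):
        self._parent.setdefault(identifier, identifier)
        root = identifier
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node directly at the root.
        while self._parent[identifier] != root:
            next_identifier = self._parent[identifier]
            self._parent[identifier] = root
            identifier = next_identifier
        return root

    def link(self, id_a, id_b):
        """Record that two identifiers belong to the same user."""
        root_a, root_b = self._find(id_a), self._find(id_b)
        if root_a != root_b:
            # Keep the smaller root so the unified ID is deterministic.
            self._parent[max(root_a, root_b)] = min(root_a, root_b)

    def unified_id(self, identifier):
        return self._find(identifier)


resolver = IdentityResolver()
resolver.link(("device", "fp-001"), ("anjuke", "u-77"))
resolver.link(("anjuke", "u-77"), ("recruit", "p-5"))

print(resolver.unified_id(("device", "fp-001")))
print(resolver.unified_id(("recruit", "p-5")))
print(resolver.unified_id(("finance", "f-9")))  # never linked, stays separate
```

Labels developed against any one identifier can then be looked up through the unified ID and shared across business lines.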
(2) Label system management
The main job of the label management system is to define the scheme for exchanging label data and information, and to break down the information and data barriers between user portrait development and service. It provides label access, visual display of label information with access control, visual targeted extraction and analysis of users, visual look-alike audience expansion (Lookalike), and other functions.
(3) Label data service
The label management system needs to provide the label extraction and query services used in user portrait development and application. It offers them to all business parties through standardized service interfaces (APIs) and supports business parties in building personalized services for their own business lines on top of the data center's capabilities.
Beyond BI and label management, enterprises also need to mine the value of data applications as fully as possible according to the characteristics of their own industries.
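The label query and extraction services described above might expose two basic operations: look up the labels of one user, and extract the users who match a set of label conditions. The functions below are a minimal in-memory sketch; the names `query_labels` and `extract_audience`, the label names, and the sample users are all assumptions, and a real service would sit behind an API and a label store.

```python
# Hypothetical label store keyed by unified user ID.
USER_LABELS = {
    "u1": {"city": "Beijing", "intent": "rent", "active_30d": True},
    "u2": {"city": "Shanghai", "intent": "buy", "active_30d": True},
    "u3": {"city": "Beijing", "intent": "buy", "active_30d": False},
}


def query_labels(user_id, names=None):
    """Return all labels of a user, or only the requested label names."""
    labels = USER_LABELS.get(user_id, {})
    if names is None:
        return dict(labels)
    return {name: labels[name] for name in names if name in labels}


def extract_audience(conditions):
    """Return the IDs of users whose labels match every condition."""
    return sorted(
        user_id
        for user_id, labels in USER_LABELS.items()
        if all(labels.get(name) == value for name, value in conditions.items())
    )


print(query_labels("u1", names=["city", "intent"]))
print(extract_audience({"city": "Beijing"}))
print(extract_audience({"intent": "buy", "active_30d": True}))
```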