Takeaway: This article shares the journey of Baidu's Aipanfan team in building real-time and offline foundational data platforms amid rapid business iteration, so that big data can better empower the business and efficiently deliver valuable data products and services to users. It covers how the team met business, technical, and organizational challenges, and the thinking and practice behind solving real pain points along the way.

The full text is 9,911 words; expected reading time is 24 minutes.

1. Foreword

As a one-stop public- and private-domain intelligent marketing and sales accelerator, Aipanfan not only carries lead data from the promotion platforms inside Baidu's ecosystem (for example, leads formed in search, information feeds, and the Jimuyu site-building platform through business chats, inquiries, form submissions, and user-behavior collection), together with the capabilities to nurture, control, follow up on, and convert those leads; it also automates lead ingestion from external public-domain advertising platforms and from tenants' self-built private-domain systems, covering many kinds of user behavior and attribute data, and offers self-service lead import as well. On this basis the team formed its thinking and practice for building a big data analysis system centered on valuable data such as business data, user data, and behavioral events.

How to continuously deliver data analysis services that are valuable to customers, and that meet their timeliness, accuracy, and stability requirements under agile iteration, is a long-term question every data team must think about. A sound strategic architecture design is therefore a crucial foundation. This article describes the hands-on experience and lessons of the Baidu Aipanfan data analysis team in putting data-driven technology into practice.

1.1 Definitions of Terms

Term/abbreviation — Description
Watt — Data-flow integration platform used to ingest Binlogs from MySQL, BaikalDB, and other databases; supports extensions such as multi-table Join and UDFs.
Fengge (Phoenix Pavilion) platform — The business-oriented data asset center and data R&D and governance center built by the commercial platform R&D department. It manages the business side's core data and core scenarios in a unified way to improve efficiency, and makes full use of the company's standardized data warehouse engine, resource scheduling, and data transmission capabilities, providing the business side with process hosting, data application management, lineage analysis, resource planning, monitoring, governance, and other functions.
Bigpipe/BP — Distributed data transmission system. Beyond the real-time delivery of messages and commands that a traditional message queue provides, it can also transport log data in real time. It decouples communication between modules, guarantees that messages are neither lost nor duplicated, and helps unify operations and traffic management.
AFS — Big data file storage system, similar in function to open-source HDFS.
Palo — An MPP data warehouse based on Apache Doris (an analytical database engine originally developed at Baidu). It supports high-concurrency, low-latency queries on datasets at PB scale and above, and effectively supports online real-time data analysis.

2. Challenges and Pain Points

2.1 [Business]

To give customers a clear, comprehensive, multi-perspective view of multi-dimensional indicators, including the detailed funnel from lead creation through assignment and follow-up to closing orders in the Aipanfan system, and to let them understand how employees follow up on leads, statistical data analysis is required to provide valuable data products for every link in the chain.

1) Value: how to provide customers with genuinely valuable data products and services is the key question throughout the requirements review stage.
2) Timeliness: for example, when a customer assigns a lead to an employee, or an employee records follow-up details, they expect it to appear in the system within seconds.
3) Richness: a single Count or Sum is easy; what is hard is delivering data analysis products that genuinely guide customers' lead management, follow-up, and public- and private-domain marketing activities.

2.2 [Technical aspects]

Business data, user behavior event data, and internal and external business data from the Baidu ecosystem all need to be systematically ingested into the data warehouse for unified management, processing, and output.

1) With rapid business growth, data volumes forced MySQL into sharded databases and tables, so OLAP analysis of core sales-domain lead data can no longer query the databases directly.
2) Metadata for collection sources and business data sources is scattered across various codebases and cannot be managed uniformly; extraction breaks whenever a database is relocated, a password changes, or a business goes offline.
3) The same data is stored and used redundantly in multiple places; shared data urgently needs governance to cut maintenance cost and compute and storage cost.
4) The data team has few R&D staff, who must cover operations and maintenance, internal business support, and customer-facing analysis products; how can such limited headcount deliver the most value?
5) How do we keep hundreds of scheduled online tasks stable across data extraction, logical processing, dump synchronization, and the other links?

2.3 [Organizational]

In the Aipanfan business unit, R&D, marketing, commercialization, customer success, and other staff are organized into roughly 15 agile teams. Each team has its own clear business goals, and the data team must support all of them in an agile way.

1) Each team has monthly, quarterly, and annual internal OKRs, plus monitoring of customer growth, business growth, and operations for its own business; how does the data team respond quickly?
2) One-off data pulls are common, and their return on effort is low.
3) Aipanfan's core business indicators must stay consistent; how do we unify their definition and management and expose data as a service and a platform?


3. Practice and Experience Sharing

Before the architecture described here was implemented, business data was handled with ad-hoc, point-solution engineering, which could not address the state of our data applications comprehensively or systematically. We kept exploring, studying, and testing classical methodology in practice, and eventually landed it in the technical architecture. Throughout, we weighed R&D efficiency, operations efficiency, and compute and storage costs so as not to reinvent the wheel. The build therefore uses the big data components and solutions of "Baidu on cloud" (a mix of the internal private cloud and the external public cloud) and "Baidu Intelligent Cloud" (which provides artificial intelligence, big data, and cloud computing services and tools).

Below, the overall data technology architecture and practical experience are shared across integration, storage, computing, scheduling, governance, query, analysis, and insight into customer value.

3.1 Data Architecture

3.1.1 What Is a Data Architecture

Any discussion of data architecture has to start with Google. To process massive advertising, search, and marketing data efficiently, Google published three landmark papers beginning in 2003: the Google File System (GFS), a distributed file system for scalable, data-intensive applications; MapReduce, a simple data processing engine for large clusters; and BigTable, a distributed storage system for structured data. On top of these big data storage and computing building blocks, later architects systematically designed the two mainstream big data solutions, the Kappa and Lambda architectures. There is also a Unified architecture with its own limitations, which I will not cover here; the important point is that no single architecture solves every problem.

3.1.1.1 Lambda architecture

The Lambda architecture, proposed by Nathan Marz, is today's common architecture for combined real-time and offline processing. It is designed for both low-latency and offline data processing while supporting linear scale-out and fault tolerance.

Lambda is a data processing architecture that exploits the respective strengths of batch and stream processing, balancing latency, throughput, and fault tolerance. Batch processing generates stable, aggregation-friendly views of the data, while stream processing provides online, real-time analysis. Before data is served, the outermost layer merges the real-time layer's results with the offline layer's. This merge is central to Lambda: fusing the two result sets guarantees eventual consistency.
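
As a concrete illustration of that merge step, here is a minimal sketch (hypothetical, not Aipanfan's actual code) in which the batch layer's pre-aggregated counts are combined with the speed layer's fresh deltas at query time:

```python
# Hypothetical Lambda-style merge: the batch view covers data up to the
# last batch run, the speed view covers only events that arrived since.

def merge_views(batch_view, speed_view):
    """Combine per-key counts from the batch and speed layers."""
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

batch_view = {"lead_created": 1000, "lead_assigned": 800}   # from batch layer
speed_view = {"lead_created": 7, "lead_followed": 3}        # from speed layer

print(merge_views(batch_view, speed_view))
# {'lead_created': 1007, 'lead_assigned': 800, 'lead_followed': 3}
```

The serving layer answers queries from the merged result, so readers see batch-grade completeness plus the latest streamed increments.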

Wikipedia lists Lambda's three main components: the batch layer, the speed (real-time) layer, and the serving layer that answers queries.

3.1.1.2 Kappa architecture

The Kappa architecture, proposed by Jay Kreps, focuses on stream computing alone. It builds on the Lambda architecture and is not meant to replace it unless it perfectly fits your use case. In Kappa, data is captured as a stream, and the real-time layer writes its results into the serving layer for query.

Kappa unifies real-time processing and continuous reprocessing in a single first-class stream engine, and requires that the collected data streams be replayable, quickly, from a chosen position or from the very beginning. For example, when the computation logic changes, a new streaming job (the dashed line in the figure) is started, historical data is rapidly replayed and recomputed, and the new results are synchronized to the serving layer.
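
The replay idea can be sketched as follows; the in-memory log, the offsets, and the "v2" counting logic are all illustrative stand-ins for a real replayable stream such as a Kafka topic:

```python
# Illustrative sketch of the Kappa idea: all data lives in a replayable
# log, so a logic change is handled by re-running a new stream job from
# the beginning (or a chosen offset) and swapping the serving table.

def replay(log, compute, from_offset=0):
    """Re-run `compute` over every event at or after `from_offset`."""
    state = {}
    for offset, event in enumerate(log):
        if offset < from_offset:
            continue
        compute(state, event)
    return state

log = [{"user": "u1", "action": "click"},
       {"user": "u2", "action": "view"},
       {"user": "u1", "action": "click"}]

# "v2" logic: count clicks per user (a hypothetical v1 might have
# counted all events); rebuilding state just means replaying the log.
def count_clicks(state, event):
    if event["action"] == "click":
        state[event["user"]] = state.get(event["user"], 0) + 1

print(replay(log, count_clicks))  # {'u1': 2}
```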

3.1.2 Architecture selection

Because the earlier point-solution approach to big data could not address our data applications' situation and pain points comprehensively and systematically, we urgently needed to design and build a data processing system fit for Aipanfan's scale and stage of development.

3.1.2.1 A Comprehensive, Systematic Inventory

[Data forms] Large volumes of business data, behavioral event data, and file data, including user behavior events, user attributes, lead management, IM communication, marketing campaigns, account management, prospect booking, and more, covering both private and public domains.

[Data integration] The data team does not produce data itself, so it must integrate data from business lines, client terminals, internal and external channels, and other sources to carry out OLAP work in a unified way.

[Data storage] Offline T+1 data can be stored in AFS, while data with strict freshness requirements goes into an MPP-architecture analytical database.

[Data computing] Covers both real-time and offline scenarios: Spark Streaming or Flink for real-time computing, and MR or SparkSQL (as recommended by the infrastructure department) for offline computing.

[Data governance and monitoring] Includes platform stability, metadata management, basic information and lineage management, scheduling management, data source management, exception handling mechanisms, and so on.

[Data R&D] Includes whether R&D and operations staffing is sufficient, data reuse, operating norms, and the standard path from business modeling to logical modeling to physical modeling.

[Data business scenarios] Analysis of online business operation flows; statistics on users' campaign participation, clicks, and page views; unifying user identity attribute data (ID-Mapping) for precision marketing; internal and external decision-making indicators and reporting; ad-hoc query and download; general data services; OpenAPI; and more.

3.1.2.2 Fast and slow

Given Aipanfan's data forms and business requirements, after several rounds of internal discussion and alignment we settled on two parallel tracks: "short, quick" support for urgent customer business needs on one side, and building the systematic data architecture with long-term goals on the other.

When construction began in September 2020, it coincided with the sharding of the sales domain's databases and tables (covering today's lead management and IM communication). Business growth forced multiple systems sharing one business database to be split apart, and some business tables holding hundreds of millions of rows had to be sharded, ending up as 128 sub-tables keyed by tenant ID. Both the online business and the OLAP business faced rebuilding and upgrading at once. After several technical review meetings, we decided to replace the original SparkSQL direct-pull approach to OLAP data extraction with one of Sqoop (open source), DTS (internal), or Watt (internal).
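
The tenant-ID sub-table scheme described above can be illustrated with a hypothetical routing function (the article does not give the real sharding key or naming rules):

```python
# Hypothetical illustration of tenant-ID sharding into 128 sub-tables:
# the sub-table index is derived from the tenant ID, so both the online
# service and the extraction job can locate a tenant's rows.

NUM_SHARDS = 128

def sub_table_name(base, tenant_id):
    """Map a tenant to one of 128 sub-tables, e.g. lead_105."""
    return f"{base}_{tenant_id % NUM_SHARDS:03d}"

print(sub_table_name("lead", 1001))  # lead_105
print(sub_table_name("lead", 128))   # lead_000
```

A Binlog-based platform such as Watt then has to merge the change streams of all 128 sub-tables back into one logical table for downstream OLAP use, which is why good sub-table support mattered in the selection below.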

The survey concluded in favor of extracting from the business databases via the Watt platform, thanks to its good support for sub-table access, highly timely Binlog reads, thorough monitoring, BNS load balancing, multi-table Join, a rich UDF library, and the platform's full-time operations support staff.

Over the next two-plus months, three R&D engineers implemented version 1.0 of the architecture shown above. It met the urgent needs, but whole-link stability was shaky and internal and external complaints were frequent. The team even built itself a phone-alert feature, and so regularly got up in the middle of the night to repair jobs.

In retrospect, the data processing architecture did not stabilize on version 1.0 until January 2021. Apart from its interaction with CDC files, it was, as we compared it against industry best practice, essentially a variant of the Kappa architecture.

3.2 Service Requirements and Architecture Evolution

Software products normally face two kinds of demand: an endless stream of customer needs driving product iteration, and features planned from a professional perspective because they will be valuable to customers. Aipanfan, in its growth phase, felt both pressures strongly.

3.2.1 Pursuing Timeliness

Customers reported serious data delays in online products such as lead analysis and follow-up analysis. Worse, when courting new customers, sales staff would keep chatting after an operation to cover for the loading time, so as not to expose this weakness of the product.

According to the online stability reports from that period, the worst delay before data appeared was 18 minutes. We made three improvements in response.

[Measure 1] Solve resource preemption among Spark Streaming jobs by migrating them to an independent cluster and isolating the relevant resources;

[Measure 2] Implement cross-region disaster recovery for Bigpipe: under normal conditions the Suzhou data center's Bigpipe serves traffic, with an immediate switch to the Beijing data center on failure and data compensation during the failover;

[Measure 3] Use Watt's ability to join multiple Binlogs to push the more complex computation forward onto the Watt platform, while keeping part of the processing in Spark Streaming. Compared with the original online real-time computation, this effectively adds two ETL stages to the real-time stream.

With these three measures, OLAP analysis scenarios now return results within 10 to 15 seconds.

3.2.2 Requirements for BI Scenarios

To plan strategy more clearly, sense the market more sharply, and gain insight faster, our marketing, operations, commercial sales, and customer success teams generate endless demands, far more than the data team's few R&D staff can absorb.

  • For urgent core demands, the data team distilled the common requirements and turned them into products.

  • For periodic data requirements, we set up scheduled push tasks, for example via automated emails.

3.2.3 Common Data Warehouse

As of March 2021, we still had more data-access and data-consumption scenarios to support, for example: unified data sources across business domains, reuse of data models, self-service access platforms versus one-off pulls, and other capability outputs. Having kept studying industry best practice, we concluded that the technical architecture needed adjusting to Aipanfan's situation.

  • The 1.0 version described earlier was hard to extend, especially for ad-hoc query, platformized data modeling and development, and common data governance.

  • Talking with the commercial platform R&D department, we found that combining our existing Watt experience with the Fengge platform would make dumps convenient; the product design behind Fengge saves users a great deal of work, and could resolve both our technical and our business demands.

  • Meanwhile, product and operations staff urgently needed a self-service data retrieval platform and ad-hoc query, and this was written into the team's OKRs.

After completing a POC, we decided to evolve the architecture from version 1.0 to 2.0, and worked with the Fengge team to migrate our offline data onto Fengge. It took a month and a half, and besides satisfying the strong demand for ad-hoc query, it also gave us solid data asset governance: layered warehouse management, metadata, sub-topics, data source management, lineage, and status monitoring.

Once the 2.0 architecture was rolled out, it gave the operations support team a unified data foundation and freed each agile team from spending its own headcount on bottom-layer data development. More valuable still, with the Fengge platform's visual development tools, standards, and processes, an ordinary R&D engineer can, after brief training, build each team's personalized monthly, quarterly, and annual OKR reports, as well as the customer growth, business growth, and operations reports their monitoring requires, on top of the existing detailed base data. This released a great deal of the data R&D team's labor.


3.3 Warehouse modeling process

On the Fengge platform, per-topic data modeling mainly follows the Kimball dimensional modeling methodology: first identify the conformed dimensions and business processes and produce the bus matrix, then determine each topic's facts and declare their grain before starting development.

3.3.1 Layered ETL

The quality of a data warehouse is judged not only by richness, completeness, standardization, and reuse, but also by resource cost, how task volume scales, query performance, and ease of use. Our warehouse is therefore planned in layers, avoiding a chimney-style (siloed) architecture that would drag down the whole.

3.3.2 Model selection

In data model design and implementation, the star schema is the primary choice. Below, the facts and dimensions of the lead follow-up process illustrate the details from logical model to physical model.
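
To make the star-schema idea concrete, here is a toy model under assumed table and column names (the article's actual lead follow-up model is not shown), using SQLite so the example is self-contained:

```python
# A toy star schema: one fact table at the grain of
# (employee, channel), joined to two dimension tables and aggregated
# the way an OLAP report would be. Table and column names are assumed.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_employee (employee_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_channel  (channel_id  INTEGER PRIMARY KEY, channel TEXT);
CREATE TABLE fact_follow_up (
    employee_id INTEGER,   -- foreign key to dim_employee
    channel_id  INTEGER,   -- foreign key to dim_channel
    follow_ups  INTEGER    -- additive measure
);
INSERT INTO dim_employee VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO dim_channel  VALUES (10, 'search'), (11, 'feed');
INSERT INTO fact_follow_up VALUES (1, 10, 5), (1, 11, 2), (2, 10, 4);
""")

cur.execute("""
SELECT e.name, SUM(f.follow_ups)
FROM fact_follow_up f
JOIN dim_employee e ON e.employee_id = f.employee_id
GROUP BY e.name ORDER BY e.name
""")
result = cur.fetchall()
print(result)  # [('Alice', 7), ('Bob', 4)]
```

The star shape keeps the fact table narrow and additive while dimensions carry descriptive attributes, which is what makes multi-perspective aggregation cheap.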

3.4 Data Governance

Broadly, data governance is the management of the full data lifecycle: traditional data integration and application work such as acquisition, storage, cleaning, and computation, plus data asset catalogs, data standards, data quality, data security, data development, data value, and data services and applications. All business, technical, and management activities across the data lifecycle fall under the data governance umbrella.

Data governance can be divided into governance of data assets and governance of data quality.

3.4.1 Data Asset Governance

Data asset governance means establishing standards for data development and application, achieving internal and external data sharing, and applying data as a valuable organizational asset in business, management, and strategic decision-making, so that data assets deliver their full value.

3.4.1.1 Topic Management

Distinguish data topics and classify managed data content so users can quickly find the data they want.

3.4.1.2 Basic Information and Lineage

Lineage reflects data's ownership, multi-source origins, traceability, and layering, clearly showing each dataset's provider, consumers, and other detailed information.

3.4.1.3 Permission Control and Self-service

After product, operations, and R&D staff are granted data permissions, they can conveniently query and download data on the platform; Fengge data is also exposed in the "One Pulse" platform and other ad-hoc query tools, where fields can be selected flexibly by drag and drop.

3.4.2 Data Quality Management

After the architecture upgrade, operations and maintenance moved up the agenda: daily incremental data difference monitoring, bad data caused by blocked job links, cluster stability monitoring, and compensating and recovering data lost to network or component jitter all urgently needed improvement.

Through operations scripts and tools, the scope of long-term monitoring and routine inspection is shown in the figure below.
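
One of those routine checks, the daily incremental count comparison between source and warehouse, might look like this sketch; the tolerance, names, and numbers are illustrative, not the team's actual values:

```python
# Sketch of a daily data-quality check: compare a day's row counts
# between the source database and the warehouse, alerting when the
# difference ratio exceeds a tolerance. Threshold is illustrative.

def check_increment(source_count, warehouse_count, tolerance=0.001):
    """Return (ok, diff_ratio) for a day's incremental row counts."""
    if source_count == 0:
        return warehouse_count == 0, 0.0
    ratio = abs(source_count - warehouse_count) / source_count
    return ratio <= tolerance, ratio

ok, ratio = check_increment(1_000_000, 999_600)
print(ok, round(ratio, 4))  # True 0.0004 (within the 0.1% tolerance)
```

A failing check would trigger the phone-alert path mentioned earlier and kick off data compensation for the affected partition.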


3.5 Easy to Expand

After the 2.0 data architecture stabilized, a new goal arrived: with the launch of the new one-stop intelligent marketing system, private-domain features such as tenant marketing campaigns and customer journeys went live, but customers had no intuitive, quantitative view of marketing effectiveness in those scenarios. We therefore had to extend the 2.0 architecture in every direction.

3.5.1 Analysis of marketing effect

Private-domain marketing is built on the data in CDP's Impala and Kudu, which holds event data and user identity attributes, so our first analyses used Impala directly as the query engine. We then found that the online table structures had been designed with little thought for analysis workloads, and Impala's limited concurrency could not meet Aipanfan's two-second stability criterion. The first few requirements barely made it online, and as analysis scenarios and features grew, customer requests rose sharply.

We therefore moved the marketing effect analysis workload from the Impala+Kudu solution to Palo (Doris) as the storage for analysis scenarios. Along the way we also evaluated other mainstream analytical MPP databases before settling on Palo; the detailed comparison follows:

3.5.2 Real-time capability improvement

On top of the 2.0 real-time link, we discussed the following analysis scenarios with the data architecture platform department, ran a POC and load tests in advance to confirm feasibility, and then added the data links below for real-time analysis:

Because the freshness requirement tightened to under 2 seconds, the architecture above was added without disturbing the original Broker Load import of business data. Working with CDP, the processing link delivers detailed data to Kafka as a real-time stream, and we then import into Palo in two ways, matching Doris's design: Stream Load, where Flink consumes Kafka's business data topics and writes to Palo over HTTP; and Routine Load, where a resident import job is submitted in Palo that continuously subscribes to and consumes JSON-formatted messages from Kafka.
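
The Stream Load path can be made concrete with a sketch of the request shape; per the Doris documentation it is an HTTP PUT to an FE with a unique label for idempotency. This sketch only assembles the pieces (no network call is made), and the host, database, and table names are placeholders:

```python
# Sketch of a Doris/Palo Stream Load request: HTTP PUT to the FE with a
# unique label (one label = one load, giving idempotent retries) and a
# JSON body. Host/db/table names here are placeholders, not real ones.
import json

def build_stream_load(db, table, rows, label):
    url = f"http://fe-host:8030/api/{db}/{table}/_stream_load"
    headers = {
        "label": label,               # replaying the same label is rejected
        "format": "json",
        "strip_outer_array": "true",  # body is a JSON array of rows
        "Expect": "100-continue",
    }
    body = json.dumps(rows)
    return url, headers, body

url, headers, body = build_stream_load(
    "marketing", "event_fact",
    [{"user_id": 1, "event": "click"}],
    label="load_20210901_0001")
print(url)
```

Routine Load, by contrast, needs no client at all: a `CREATE ROUTINE LOAD` job resident in Palo keeps consuming the Kafka topic on its own.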

Palo, chosen as the final query engine and storage for analysis scenarios, consists of FE and BE modules. Users connect to an FE node over the MySQL protocol with client tools; the FE interacts with the BE nodes, one BE acts as coordinator to gather all BE nodes' computation results, and the answer is returned to the client.

3.5.2.1 Palo Mechanism

Data import in Palo is implemented with an LSM-tree style algorithm: writes land on disk sequentially, and compaction runs asynchronously in the background, merge-sorting files without blocking writes. On the query side, it supports wide tables and multi-table joins and has features such as a CBO optimizer, handling complex analysis scenarios well.
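
The merge-sort at the heart of compaction can be shown with a toy last-write-wins merge of sorted runs; real Doris/Palo compaction is of course far more involved than this sketch:

```python
# Toy LSM-style compaction: several sorted runs are merge-sorted into
# one, keeping only the newest version of each key (last write wins).
import heapq

def compact(runs):
    """runs: lists of (key, version, value), each sorted by key then version."""
    merged = {}
    for key, version, value in heapq.merge(*runs):
        # heapq.merge yields versions of a key in ascending order,
        # so later assignments overwrite earlier ones.
        merged[key] = value
    return sorted(merged.items())

run1 = [("a", 1, "x"), ("b", 1, "y")]
run2 = [("a", 2, "x2"), ("c", 1, "z")]
print(compact([run1, run2]))  # [('a', 'x2'), ('b', 'y'), ('c', 'z')]
```

Because merging happens off the write path, imports stay fast while reads see progressively fewer, larger sorted files.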

3.5.3 Index system construction

The private-domain marketing business can be seen as a B2B2C model. Data product managers built a metrics system so tenants can clearly grasp marketing effectiveness across multiple dimensions; following the private domain's product forms and the calculation rules of its valuable metrics, we produce visual reports and the related real-time analysis services together with the data product managers.

Beyond helping tenants track how private-domain users engage with marketing campaigns, Aipanfan also built a custom metrics system for the tenants' own use of the product, such as DAU, MAU, and daily and weekly PV and UV. This builds a bridge to tenants: we see clearly which features tenants use and depend on every day and should retain and iterate on long term, while product features users do not touch are cleaned up and retired on a regular schedule.
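
Metrics like these can be sketched directly from a raw event log; the event fields below are assumptions for illustration, not Aipanfan's actual schema:

```python
# Sketch of DAU/MAU computed from a raw event log: count distinct users
# per tenant for a day (DAU) or a month (MAU). Fields are assumed.
from datetime import date

events = [
    {"tenant": "t1", "user": "u1", "day": date(2021, 9, 1)},
    {"tenant": "t1", "user": "u2", "day": date(2021, 9, 1)},
    {"tenant": "t1", "user": "u1", "day": date(2021, 9, 2)},
]

def dau(events, tenant, day):
    return len({e["user"] for e in events
                if e["tenant"] == tenant and e["day"] == day})

def mau(events, tenant, year, month):
    return len({e["user"] for e in events
                if e["tenant"] == tenant
                and (e["day"].year, e["day"].month) == (year, month)})

print(dau(events, "t1", date(2021, 9, 1)), mau(events, "t1", 2021, 9))  # 2 2
```

In production these distinct counts would of course run inside the warehouse engine rather than in application code, but the metric definitions are the same.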

Core indicators, business analysis indicators, and general standards are maintained uniformly in the operational reporting system. As the business iterates, the calculation logic is regularly reviewed and organized, and a general indicator service interface provides unified, standardized access for internal and external callers.

3.6 Panorama of data analysis system

By this point, Aipanfan's data analysis system had become a complete overall solution suited to the current stage and easy to extend later. The panorama makes clear that the base layer, platform layer, intermediate processing layer, public service layer, and data transmission are the essential foundation of a big data analysis system.


3.7 Benefits of data analysis system establishment

[Business] Data product managers and analysts go deep into customers' core business scenarios, map the business processes, and define operational analysis indicators for the key ones. For example, the "full-staff promotion" scenario abstracts three business processes: customer acquisition, operation, and re-cultivation. With each business line's product managers, we then refined conformed dimensions such as time, location, tenant, promoting salesperson, access channel, promotion point rules, promotion channel, and fission, yielding the "bus matrix" of the modeling methodology described above. Logical modeling on the bus matrix produces star or constellation models, from which rich indicators can be computed across many perspectives and dimensions. Combined with requirements fed back by the user research (UBS) team, behavioral instrumentation in product features, and customer interviews, these solve our own and our customers' business pain points. We also compute reasonable thresholds from peer-level indicators in each tenant's industry, proactively push business warnings to customers on a schedule, and guide them from those warnings to the product actions that raise their standing in their industry. More broadly, general indicators and labels guide users through the relevant operational tasks, and various forms of direct user contact and other data analysis services address the pain points that data products with business value and guiding significance are meant to solve.

[Technical] Technology serves the product, and the product serves the customer, so the timeliness, stability, and accuracy of data analysis products are demands placed on the technical architecture. As Aipanfan's data analysis architecture and governance system matured through this interplay of product and technology, its earlier problems (missed timeliness targets, inaccurate data, weak support for sharded databases and tables, data scattered everywhere without unified reuse, piled-up data collection demands from business lines, and inconsistent statistical logic) can now be resolved by the architecture described above or through quick trial and error.

[Organization] With the visual development tools, standards, and processes of the Fengge platform, an ordinary R&D engineer can, after brief training, build each team's personalized monthly, quarterly, and annual OKR reports and the customer growth, business growth, and operations reports their monitoring requires, on top of the existing detailed base data; technology thus frees a great deal of the data team's labor to empower business development. With ad-hoc query and download, one-off data pulls become self-service. And with data exposed as services and through platform integration, the division of responsibility between our company-wide operational analysis indicators and each product line's personalized indicators is much clearer.

4. Summary and Outlook

  • Over the past few months we have been designing and implementing real-time versions of indicators and reports that used to be offline (day- or hour-level), including completed real-time collection of user behavior data, real-time analysis of private-domain marketing effectiveness for tenants, and real-time status, time-period, and sales-funnel analysis of public-domain lead data. Next we will consider integrating data analysis with the CDP (internal customer data platform) architecture, combining user behavior events with business data, and further cooperating with global unified user identity (ID-Mapping) technology to achieve refined operations and greater business product value.

  • Lakehouse (integrated lake and warehouse) technology is the future trend. Next we will investigate connecting our offline and real-time warehouses to the data lake solutions of the internal private cloud or Baidu Intelligent Cloud.

  • Design and build platformized data processing so development is more convenient and staff efficiency improves;

  • Further simplify data processing links to raise processing efficiency and the timeliness of data products;

  • Introduce data mid-platform concepts and service capabilities, implement data standards, quantify data health scores, and build intelligent scoring systems that improve reuse, toward the ultimate goal of cutting costs and raising efficiency.

———- END ———-

Baidu Geek Talk, Baidu's official technology public account, is now live!
