Introduction: With the advent of the Marketing 3.0 era, enterprises increasingly rely on powerful CDP capabilities to break down their data silos, warm up leads, and activate customers. But what is a CDP, and what are the key features of a good one? While answering these questions, this article describes in detail the construction of the Aifanfan tenant-level real-time CDP, including component selection under advanced architecture goals, the platform architecture, and the key implementations of its core modules.
The full text is 19,135 words, and the expected reading time is 26 minutes
1. What is a CDP
1.1 CDP origin
The Customer Data Platform (CDP) has been a popular concept in recent years. As times and the environment change and enterprises accumulate more owned media, customer management and marketing become more difficult and the data-silo problem grows more serious; the CDP was born to help market to customers better. Historically, two stages preceded the CDP. In the CRM era, companies interacted with existing and prospective customers through phone calls, text messages, and emails, and analyzed the data to drive retention and sales. In the DMP stage, enterprises ran advertising and media campaigns across various Internet platforms.
CRM, DMP, and CDP differ in their core functions, but comparing them side by side makes the CDP easier to understand. They differ greatly in data attributes, data storage, and data usage.
A few key differences are as follows:
1. CRM vs. CDP
- Customer management: CRM focuses on sales follow-up; CDP focuses on marketing.
- Touchpoints: CRM contacts are mainly phone, QQ, and email; CDP also covers the user accounts associated with tenants' owned media (for example, the enterprise's own website, app, official account, and mini program).
2. DMP vs. CDP
- Data type: DMP is based on anonymous data; CDP is dominated by real-name data.
- Data storage: DMP stores data only short term; CDP stores data long term.
1.2 CDP definition
In 2013, David Raab, a MarTech analyst, first proposed the concept of the CDP, and his CDP Institute later gave an authoritative definition: packaged software that creates a persistent, unified customer database that is accessible to other systems.
The definition has three main parts:
- Packaged software: deployed on the enterprise's own resources, installed and upgraded as a unified software package, with no customized development.
- Persistent, unified customer database: extracts data from the enterprise's many types of business systems, forms a unified customer view keyed on identifiers in the data, stores it long term, and supports personalized marketing based on customer behavior.
- Accessible to other systems: enterprises can use CDP data to analyze and manage customers, and can export the reorganized and processed customer data in various forms.
1.3 Classification of CDPs
The C (Customer) in CDP refers to all customer-related functions, not just marketing. Different scenarios correspond to different types of CDP; the types differ in function, but each builds progressively on the previous one.
There are four main categories:
- Data CDPs: mainly customer data management, including multi-source data acquisition, identity resolution, unified customer storage, access control, and so on.
- Analytics CDPs: in addition to Data CDP functions, they include customer segmentation, sometimes extending to machine learning, predictive modeling, revenue attribution analysis, and more.
- Campaign CDPs: in addition to Analytics CDP features, they include cross-channel customer treatments, such as personalized marketing, content recommendation, and other real-time interactions.
- Delivery CDPs: in addition to Campaign CDP functions, they include message delivery, such as email, on-site messages, apps, advertisements, and so on.
Campaign CDPs and Delivery CDPs are closer to what is called MA (Marketing Automation) in China than Analytics CDPs are. In terms of functional scope, the CDP discussed in this article is an Analytics CDP. Aifanfan also has a dedicated MA system, for which this CDP provides data support.
2. Challenges and goals
2.1 Challenges
With the advent of the Marketing 3.0 era, Aifanfan's private-domain products rely mainly on a powerful CDP to manage online and offline data in a unified way, so that enterprises can run refined customer segmentation and more marketing scenarios (automated marketing, holiday promotion notices, birthday greeting SMS, live-streaming events, and so on). More importantly, enterprises can conduct more personalized, accurate, and timely real-time marketing based on real-time user behavior, helping them warm up leads, activate customers, and improve private-domain conversion. Driving upper-level marketing services well through a real-time CDP (RT-CDP for short) therefore faces many challenges.
[Business Level]
1. Enterprises have multiple data channels and heterogeneous data formats
Besides the official website, files, apps, and its own systems, an enterprise also has a large amount of owned media (such as WeChat official accounts, Douyin enterprise accounts, Baijiahao accounts, and various mini programs), each scenario with a different data structure. How can an enterprise efficiently connect its data to the RT-CDP? This is a systemic problem that thousands of business owners must solve when converging customer data.
2. Ecosystems are not connected, so there is no 360-degree view of the user
Scattered data makes it difficult to resolve a unique user identity, so no comprehensive, continuously updated user portrait can be built, leaving the understanding of users fragmented and insight insufficient. For example, an enterprise may want to issue coupons to users who visit both its official website and its mini program in order to drive engagement; but because one person's behavior is scattered across channels under different identifiers, cross-channel behavior analysis is impossible and the requirement cannot be met.
3. Crowd segmentation rules are complicated
Every business is different, so enterprises want to label customers according to their own business characteristics. For example, in a marketing campaign, an enterprise may want to segment users by which journey nodes they have passed through, whether they joined a live stream, and other scenario-specific dimensions, and then run more refined marketing against each segment.
4. One platform must serve both B2B2C and B2C enterprises, with little industry experience to draw on
Aifanfan's customers span many industries, some B2C and some B2B2C. Compared with B2C, the complexity of B2B2C business scenarios grows exponentially: besides managing both B and C portraits, the platform must also accommodate upper-level business logic such as identity-fusion strategy and behavior-based segmentation. In addition, many business scenarios have unclear business boundaries.
[Technical level]
1. High requirements for real-time, accurate omni-channel identification
Today a customer's behavior crosses sources, devices, and media, and the behavior trail is badly fragmented. If enterprises want good marketing results, accurate and real-time identification of customers and their behavior trails is an essential prerequisite. Achieving high-performance real-time identification across multiple sources and identities is a great challenge.
2. Massive data must be processed in real time with low latency
Customers today have more choices and their intent is unclear. Real-time marketing based on customer behavior, and real-time secondary interaction based on customer feedback, are the keys to marketing effectiveness. For example, during a campaign run by an enterprise's marketing department, whether a customer clicks, bookmarks, or orders signals different levels of intent, and marketing and sales staff need to follow up on those actions promptly. Only by grasping these changes in real time can campaign conversion be driven efficiently. How, then, to process massive data and drive services in real time?
3. An extensible architecture is needed
In a multi-tenant context, Aifanfan manages massive data for thousands of small and medium-sized enterprises. As the number of enterprises served grows, the platform's service capacity must improve rapidly, which requires a forward-looking technical architecture. Achieving high performance, low latency, scalability, and high fault tolerance at the same time is also a great technical challenge.
4. Multi-tenant features must be balanced against performance
Aifanfan's private-domain products serve SMEs as a SaaS offering, so a multi-tenant CDP is a basic capability. Although an SME's customers generally number from one hundred thousand to one million, data volume grows linearly as marketing activity increases; for large enterprises, customer scale makes data grow even faster. Moreover, different enterprises query along different dimensions, which makes the model hard to pre-warm. Under these constraints, balancing scalability and service performance is a hard problem.
5. Deployment must be flexible
The CDP currently serves SMEs mainly as SaaS, but on-premise (OP) deployment for large clients may be needed in the future. How should components be selected to support both service modes?
2.2 RT-CDP Construction objectives
2.2.1 Key business capabilities
After analysis and business abstraction, we believe that a truly good RT-CDP needs the following key features:
- Flexible data interoperability: it can connect with customer systems of various data structures and data sources, and data can be accessed at any time.
- Support for both B2C and B2B data models: customers in different industries are served by one set of services.
- Unified user/business portraits: including attributes, behaviors, tags (static, dynamic/rule-based, and predictive), intelligent scores, preference models, and more.
- Real-time omni-channel identification and management: breaking data silos and connecting identities across channels is the key to a unified user view and the premise of cross-channel user marketing.
- Strong user segmentation: enterprises can segment users by any multi-dimensional, multi-window combination of attributes, behaviors, identities, tags, and so on, and run precise marketing against each segment.
- Real-time user interaction and activation: in the face of rapidly changing user habits, real-time awareness of user behavior and real-time automated marketing are especially important.
- Secure user data management: long-term, secure data storage is a basic requirement of a data management platform.
2.2.2 Advanced Technology Architecture
While clarifying the platform's business goals, an advanced technical architecture is also a construction goal. For the platform architecture we set the following core goals:
1. Stream-data driven
In a traditional database, data processing is mainly "passive data, active query": data sits still in the database until a user issues a query request, and even when the data changes, the user must actively re-issue the same query to get updated results. As data volumes grow and the demand for timely awareness of data change rises, this model no longer fits the whole paradigm of how we interact with data.
Current system designs therefore prefer architectures that actively drive other systems, such as domain-event-driven business. Data processing likewise should be "data active, query passive".
For example, when an enterprise wants to find users who have visited its mini program and send them SMS messages, how do the two approaches differ? (A small sketch follows the list.)
- Traditional way: user data sits in a storage engine; before the enterprise sends SMS, the query criteria are converted into SQL and the qualifying users are filtered out of massive data.
- Modern way: as user data flows into the data system, the user portrait is enriched, and the judgment of whether the user matches the enterprise's criteria is made on that basis. It is a rule evaluated against a single user's data, rather than a filter over a mass of data.
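To make "data active, query passive" concrete, here is a minimal sketch in plain Java. The Event record and the rule are hypothetical; in production the per-event judgment would run inside a streaming job (for example Flink) rather than a loop.

```java
import java.util.List;
import java.util.function.Predicate;

// The segmentation rule is evaluated per event as data arrives,
// instead of being translated into SQL and run over the full store.
public class QueryPassiveSketch {
    record Event(String userId, String channel, String eventType) {}

    public static void main(String[] args) {
        // Rule: "visited the enterprise mini program" (illustrative only).
        Predicate<Event> visitedMiniProgram =
                e -> "mini_program".equals(e.channel()) && "page_view".equals(e.eventType());

        List<Event> incoming = List.of(
                new Event("u1", "mini_program", "page_view"),
                new Event("u2", "website", "page_view"));

        // A plain loop stands in for the per-event streaming evaluation.
        for (Event e : incoming) {
            if (visitedMiniProgram.test(e)) {
                System.out.println("send SMS to " + e.userId());
            }
        }
    }
}
```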
2. Stream computing
Traditional data processing is mostly offline or batch computing. Offline computing is "data at rest, query in motion"; batch computing accumulates data to a certain point and then processes it with specific logic. Though the two differ in how they process data, both are fundamentally batch processing and carry a natural delay.
Stream computing eliminates the concept of the batch entirely and processes streaming data in real time: continuous computation over unbounded, dynamic data, with millisecond latency. This matters especially for corporate insight in today's competitive era of massive data, where the faster data is mined, the higher its business value.
3. Unified practice
[Unified batch and stream]
In the field of big data processing there are several typical architectures (Lambda, Kappa, and Kappa+). The Lambda architecture combines batch computing and real-time computing, which sometimes means developing two codebases with the same logic; this makes inconsistent data metrics likely and brings maintenance difficulties. The Kappa and Kappa+ architectures are designed to simplify the distributed computing stack, putting a real-time event-processing architecture at the core while covering both batch and stream scenarios. In most real enterprise production architectures the two are still mixed, because a thoroughly real-time architecture has difficulties, such as data storage, and because large-window aggregations are easier to handle with batch computing.
[Unified programming]
In actual business scenarios, batch and stream processing still coexist. Although Apache Flink actively pursues unified batch-stream support, it is not yet fully mature, and distributed processing frameworks will keep evolving as distributed data processing evolves; it is also still common for companies to use multiple computing frameworks. A unified data-processing programming paradigm is therefore an important choice: it improves programming flexibility, supports developing data processing jobs for both batch and stream scenarios, ensures that one set of processors can run on any computing framework, and makes it easier for the platform to switch to a better computing engine later.
4. Scalability as a prerequisite
This mainly refers to the scalability of the architecture. An extensible architecture keeps services stable and controls resource costs reasonably, sustaining rapid business growth.
[Separation of compute and storage]
In today's era of massive data, some scenarios need only high processing capacity while others need only massive storage. A traditional coupled storage architecture requires highly configured service nodes (many cores, large memory, high-performance local disks, and so on) to cover both scenarios, which clearly wastes resources and hurts cluster stability; for example, too many nodes scatter data and reduce data consistency. With compute-storage separation, computing and storage resources are controlled independently per business scenario, allocating resources reasonably. It also helps ensure data consistency, reliability, scalability, and cluster stability.
[Dynamic scaling]
Dynamic scaling mainly improves resource utilization and reduces enterprise costs. In actual business, the platform sometimes needs to absorb short-term peaks and troughs of traffic (real-time message volume) during otherwise stable periods, requiring short-term expansion; for example, around every major holiday a large number of enterprises run marketing campaigns at the same time, causing a surge in message volume. At other times, as the number of enterprises served by Aifanfan keeps growing, message volume increases linearly, requiring long-term expansion. The former is unpredictable and carries high operation and maintenance costs if handled manually. A cluster resource management capability that can dynamically scale in and out based on combined time and load rules is therefore an important consideration in the architecture.
3. Technology selection
There is no universal framework, only appropriate trade-offs; selection should be based on business characteristics and architecture goals. Guided by the RT-CDP construction goals, we researched and settled on components for the following core scenarios.
3.1 A new attempt at identity-relationship storage
In the CDP, ID mapping is the core of the data pipeline service and requires data consistency, real-time freshness, and high performance.
How does traditional ID mapping work?
1. Using a relational database to store identity relationships, generally spread across multiple tables and rows. This scheme has two problems:
- the capacity for high-concurrency data writes is limited;
- identification generally requires multi-hop relationship queries, and a relational database needs multiple joins to find the expected data, so query performance is very low.
2. Using Spark GraphX, with user behavior generally stored in a graph store or Hive: Spark loads the identity information from user behavior into memory in one pass, and GraphX then computes user connectivity based on the cross relationships. This scheme also has two problems:
- it is not real time; it has mostly served offline aggregation and scheduled actions on users;
- as the data volume grows, the computation takes longer and longer, and the latency of the results keeps increasing.
What did we do?
With the development of graph technology in recent years, more and more business problems are being solved with graphs; the product capabilities and ecosystem integration of open-source graph frameworks keep maturing, and their communities are increasingly active. We therefore model identity relationships as a graph and use the graph's native multi-hop query capability for real-time identity judgment and fusion.
Graph framework comparison
You can consult the latest graph-database ranking trends and research the candidates in depth; comparisons of mainstream graph databases are increasingly available, so refer to them as needed. The distributed, open-source graph databases include HugeGraph, DGraph, and Nebula. We have used mainly DGraph and Nebula in production. Because of the cloud-native character of the Aifanfan service, DGraph was selected in the initial stage of platform construction, but it later turned out that its horizontal scaling was limited and we had to migrate from DGraph to Nebula.
There are few online comparisons of DGraph and Nebula, so here is a brief summary of the differences:
- Cluster architecture: DGraph couples compute and storage; its storage is BadgerDB, implemented transparently in Go. Nebula separates reads and writes but defaults to RocksDB for storage (unless you replace the storage engine from source, as some companies do), which brings read/write amplification.
- Data sharding: DGraph shards by predicate (think of it as the point type), which is prone to hot spots; to support multi-tenant scenarios you need to create tenant-granularity predicates dynamically to spread the data as evenly as possible (DGraph Enterprise also supports multi-tenant features, but charges for them and still does not address hot spots). Nebula partitions by edge and VID; there are no hot spots, but you need to budget the number of partitions when creating a Nebula space.
- Full-text search: DGraph supports it; Nebula provides listeners that can connect to ES.
- Query syntax: DGraph has its own query syntax; Nebula has its own query syntax and also supports Cypher (Neo4j's graph query language), which is more expressive for graphs.
- Transaction support: DGraph supports transactions based on MVCC; Nebula does not, and script transactions are only supported in the latest version (2.6.1).
- Synchronous writes: DGraph and Nebula both support asynchronous and synchronous writes.
- Cluster stability: DGraph clusters are more stable; Nebula's stability needs improvement, with occasional crashes under certain operations.
- Ecosystem: DGraph is more mature in ecosystem integration, for example with cloud native tooling; Nebula's integrations cover a wider range (nebula-flink-connector, nebula-spark-connector, and so on), but the maturity of the individual integrations still has room to improve.
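As an illustration of the multi-hop queries that motivated the graph choice, here is a minimal sketch using the nebula-java client. The space name, the identified_as edge type, and the vertex id are hypothetical; the GO ... STEPS statement follows recent nGQL versions.

```java
import com.vesoft.nebula.client.graph.NebulaPoolConfig;
import com.vesoft.nebula.client.graph.data.HostAddress;
import com.vesoft.nebula.client.graph.data.ResultSet;
import com.vesoft.nebula.client.graph.net.NebulaPool;
import com.vesoft.nebula.client.graph.net.Session;
import java.util.List;

public class IdentityLookupSketch {
    public static void main(String[] args) throws Exception {
        NebulaPool pool = new NebulaPool();
        pool.init(List.of(new HostAddress("127.0.0.1", 9669)), new NebulaPoolConfig());
        Session session = pool.getSession("root", "nebula", false);
        try {
            session.execute("USE cdp_identity;");
            // Walk up to 3 hops of identity edges from a device id to collect
            // every identifier connected to the same person in one statement,
            // instead of join after join in a relational store.
            ResultSet rs = session.execute(
                    "GO 1 TO 3 STEPS FROM \"device:abc\" OVER identified_as BIDIRECT "
                  + "YIELD DISTINCT dst(edge) AS connected_id;");
            System.out.println(rs);
        } finally {
            session.release();
            pool.close();
        }
    }
}
```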
3.2 Streaming computing engine selection
For comparisons of mainstream computing frameworks such as Apache Flink, Blink, Spark Streaming, and Storm, there is plenty of material online; please research them yourself, for example via the following link:
Blog.csdn.net/weixin\_394…
Apache Flink selected as the stream-batch computing engine
The widely used Spark still runs streaming computation as micro-batches, whereas Flink is stream-native. Apache Flink is an open-source platform for distributed stream and batch data processing that has developed rapidly in recent years, and it is the distributed computing framework that best fits the Dataflow model. It offers high-performance stream computing with good fault tolerance, a solid state management mechanism, and high availability, and the integration of other components with Flink keeps maturing. So we chose Apache Flink as our stream-batch computing engine.
Apache Beam selected as the programming framework
As distributed data processing technology develops, excellent distributed processing frameworks will keep emerging. Apache Beam is a project Google contributed to the Apache incubator in 2016; its goal is to unify the programming paradigms of batch and stream processing, so that enterprises can develop data processing programs that run on any distributed computing engine. Beam offers great extensibility while unifying the programming model, and its support for new versions of the computing frameworks lands promptly. So we chose Apache Beam as our programming framework.
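A minimal sketch of what that portability buys: the same Beam pipeline below runs locally on the DirectRunner or on Flink just by passing --runner=FlinkRunner. The sample events and the filter condition are illustrative.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamPortabilitySketch {
    public static void main(String[] args) {
        // e.g. --runner=FlinkRunner picks the engine without code changes.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply(Create.of("u1:mini_program", "u2:website"))
         .apply(Filter.by((String e) -> e.endsWith(":mini_program")))
         .apply(MapElements.into(TypeDescriptors.strings())
                           .via(e -> "qualified " + e));
        p.run().waitUntilFinish();
    }
}
```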
3.3 Choice of Massive Storage Engines
Among the storage components in the Hadoop ecosystem, HDFS supports high-throughput batch processing and HBase supports low-latency random reads and writes, but it is difficult to get both capabilities from a single component. In addition, how to update data in real time under stream computing also constrains the choice of storage. Apache Kudu, the columnar storage engine open-sourced by Cloudera, is a typical HTAP (hybrid transactional/analytical processing) system. TiDB and OceanBase also explore the HTAP direction, though their initial focus differs from ours; they are worth comparing as well. Apache Kudu's vision is fast analytics on fast and changing data. Its positioning can be seen in the following figure:
In line with our platform philosophy, real-time, high-throughput data storage and update is the core goal, while complex data queries and data applications carry relatively low QPS (the core business scenario is real-time customer processing on real-time streams). Combined with Cloudera Impala's seamless integration with Kudu, we finally settled on Impala + Kudu as the platform's data storage and query engine.
Analysis enhancement: Doris
With the Impala + Kudu selection, OP deployments are well supported, because each enterprise's data volume and query QPS are limited: an enterprise needs only a very simple architecture to cover its data management needs, which improves the stability and reliability of the platform and reduces the enterprise's operations and resource costs. However, Impala's concurrency capability is limited (Impala 4.0 did introduce multithreading to improve concurrency), while Aifanfan's private-domain business remains focused on SaaS, where high-concurrency, millisecond-level data analysis is hard to achieve with Impala alone. So we introduced Doris as the analysis engine for analysis scenarios. Doris is an OLAP engine based on an MPP architecture. Compared with open-source analysis engines such as Druid and ClickHouse, Doris has the following characteristics:
- supports multiple data models, including the Aggregate, Unique, and Duplicate models;
- supports rollups and materialized views;
- very good query performance on both single-table and multi-table queries;
- supports the MySQL protocol, so access and learning costs are low;
- needs no Hadoop ecosystem integration, so cluster operation and maintenance costs are much lower.
3.4 Rule engine research
The real-time rule engine is mainly used for customer segmentation. Drawing on Meituan's rule-engine comparison, the features of several engines (there are of course others, such as URule and Easy Rules) are as follows:
In the RT-CDP, segment rules involve many classifications and combinations, complex rule computation, many operators, and long or even absent time windows. No open-source rule engine in the industry meets these business needs well, so we chose to develop our own.
4. Platform architecture
4.1 Overall Architecture
Aifanfan's private-domain products are mainly divided into two parts, RT-CDP and MA; combined, the two roughly match the functional scope of a Delivery CDP. The RT-CDP discussed in this article corresponds functionally to an Analytics CDP: simply put, customer data management plus analysis and insight.
The RT-CDP comprises five parts: data sources, data collection, the real-time data warehouse, data applications, and common components. The common components provide horizontal support; the other four parts are the four standard stages from data ingestion to data application:
- Data sources: these include not only customers' first-party data but also owned-media data from various ecosystems, such as WeChat official accounts, WeChat mini programs, WeCom leads, Baidu mini programs, Douyin enterprise accounts, third-party ecosystem behavior data, and so on.
- Data collection: most small and medium-sized enterprises have little or no R&D capability, so how to connect their systems quickly to the Aifanfan RT-CDP is a key consideration at this layer. We therefore packaged a general collection SDK to reduce the enterprise's data-collection cost, compatible with uni-app and other popular front-end frameworks. In addition, because data sources and data structures vary, we built a unified collection service to simplify the onboarding of new sources; it manages new data channels and handles data encryption, cleansing, transformation, and processing, aiming to provide flexible data access and reduce integration cost.
- Real-time data warehouse: after collection, data goes through cross-channel identity resolution and is then transformed into a structured, unified customer portrait. In terms of data management, this layer also holds the fragmented customer data that enterprises load into the CDP for later customer analysis. Processing in this layer produces the cross-channel customer identity graph and the unified portrait, and then a unified view for the data interfaces above. The layer also covers the warehouse's routine concerns: data quality, resource management, operations management, data security, and so on.
- Data applications: this layer mainly provides enterprises with product features for customer management, analysis, and insight, such as rich lead portraits, segments built from freely combinable rules, and flexible customer analysis. It also provides multiple data output methods so that other systems can consume the data conveniently.
- Common components: the RT-CDP relies on Aifanfan's advanced infrastructure, manages services according to cloud-native principles, and operates, maintains, and monitors services with Aifanfan's powerful log platform and distributed tracing. CDP capabilities also iterate rapidly on a complete CI/CD pipeline, with continuous integration and delivery under an agile process from development to deployment.
4.2 Core Modules
Simply put, the RT-CDP collects data in real time and on schedule from multiple channels, identifies it through the Identity Service, then performs data mapping and processing (dimension joins, data aggregation, data layering, and so on), persists it in structured form, and finally serves it outward in real time.
The RT-CDP is mainly divided into six modules: the collection service, Connectors, Identity Service, real-time computing, the unified portrait, and the real-time rule engine. The figure above depicts the interaction between the RT-CDP core modules from the perspective of data form and data flow. Left to right is the main direction of the data, from ingestion into the platform to output toward the external systems that interact with it; in the upper middle, real-time computing exchanges data bidirectionally with Identity Service, the real-time rule engine, and the unified portrait.
The functions of each core module, by data processing stage, are described below:
1. Data sources & collection
By the mode of data interaction between data sources and the RT-CDP, data divides into real-time inflow and batch pulls. For the two scenarios we abstract two modules: the real-time collection service and Connectors.
- Real-time collection service: this module mainly connects the enterprise's owned-media data sources, events from Aifanfan's own business systems, and third-party platforms that cooperate with Aifanfan. The main problems at this layer are the differing API protocols of the media platforms, filling in business parameters when chaining scenario behaviors together, and the ever-growing set of user events. In this module we abstract a data Processor and custom Processor plugins to reduce manual intervention for new scenarios.
- Connectors: this module mainly connects the data sources of the enterprise's own business systems, such as MySQL, Oracle, and PostgreSQL business databases. This data does not need real-time access; scheduling by batch suffices. The main problem here is supporting many data source types, for which we abstract the Connector with extension capabilities, plus general scheduling support. The two scenarios share one question: how to handle fast reading and fast onboarding of diverse data structures? For that we abstracted the data definition model (Schema), described in more detail later.
2. Data processing
- Identity Service: this module provides cross-channel customer identification. It is an exact ID mapping, used to resolve customer data entering the RT-CDP in real time. The service persists the customer identity graph in Nebula, updates Nebula synchronously in real time based on incoming data and identity-fusion strategies, and fills the identification results back into the live messages. Identity Service determines whether customer interactions in a marketing journey behave as expected, and it bounds the RT-CDP's throughput.
- Real-time computing: this module contains all data processing, transformation, and distribution, plus batch operations. We currently abstract a job development framework on Apache Beam and pursue unified batch-stream processing on Flink; some operations jobs still use Spark, which will be phased out gradually.
- Unified portrait: this module mainly persists massive lead portraits. Hot data is stored in Kudu, while warm and cold time-series data is periodically moved to Parquet. Lead portraits include customer attributes, behaviors, tags, segments, and aggregated customer extension data. Although tags and segments are aggregate roots that exist independently, they share a consistent storage mechanism at the storage layer. In addition, a standard RT-CDP should also manage fragmented customer data, so how to unify the interaction between the portrait and the data lake is a focus of subsequent construction.
- Unified query service: in the RT-CDP, customer data is scattered across the graph database, Kudu, the enhanced analysis engine, and the data lake, while users see only business objects such as attributes, behaviors, tags, and segments. How do we support transparent use in the product? We built this unified query service with unified views and cross-source queries, supporting cross-source access to Impala, Doris, MySQL, Presto, ES, and other query and storage engines and APIs.
- Real-time rule engine: this module is mainly built on Flink to provide real-time rule judgment, supporting segmentation, rule-based static tagging, rule tagging, and other rule-driven business scenarios.
3. Data output
Data output already supports many channels, including OpenAPI, webhooks, message subscription, and so on. On one hand, after integrating with the CDP, enterprises can conveniently receive leads' real-time behavior and continue user lifecycle management in their own downstream business systems. On the other hand, it provides the real-time behavior stream for the upper MA layer to drive the marketing loop. In particular, MA journey nodes require many real-time rule judgments across varied criteria, some of which are hard to compute in memory, so the RT-CDP also provides data output that gives MA real-time judgment results.
4.3 Key Implementation
4.3.1 Data definition model
Why is a Schema needed?
As mentioned earlier, an enterprise's multiple channels have different data structures. In addition, different tenants have different business characteristics, and enterprises need to customize and extend their data, so the RT-CDP must be able to define data structures flexibly to connect enterprise data.
The RT-CDP itself also manages two kinds of data: fragmented customer data and the unified user portrait. The former does not need its content modeled relationally; data lake and similar technologies can give enterprises storage, query, and analysis capabilities over it, which is schemaless data management. The latter must support combined queries, segmentation, and analysis along different dimensions, which inherently requires structured data management. Could the latter also be served schemaless? Enumerating the create/read/update/delete scenarios shows the limitations clearly.
What is a Schema?
A Schema is a description of a data structure. Schemas can reference each other, constrain the fields, field types, and values in the data, and support custom fields, so enterprises can access data against a unified standard yet manage their own data flexibly. For example, an enterprise can abstract its business entities and attributes according to its industry characteristics and define a different Schema for each business entity. Information shared across business entities can be extracted into a new Schema, which multiple Schemas then reference; and each Schema can also define its own custom business fields. The enterprise only needs to submit data in the corresponding Schema structure and can then use the data according to a fixed standard.
The following figure illustrates the characteristics of a Schema through these entities:
- Field: a field is the most basic data unit and the smallest-granularity element of a Schema.
- Schema: a collection of fields and sub-Schemas. It can contain multiple customizable fields (field name, type, value list, and so on), and it can also reference one or more other Schemas, including in array form; for example, a Schema can contain multiple Identity structures.
- Behavior: the various behaviors of leads or enterprises, themselves also carried by Schemas; each behavior can additionally define its own unique fields.
As shown in the figure above, after industry abstraction, the Aifanfan RT-CDP has built many Schemas common across industries, including common Identity, Profile, Behavior, and other Schemas. Identity, Profile, Tag, Segment, and so on are all business aggregate roots in the unified lead portrait managed by the Aifanfan RT-CDP. To support both B and C data models, there are also some aggregate roots at B granularity.
How does Schema simplify data access?
Here we need to introduce the concept of a Dataset. A Dataset is a set of data defined by a Schema. Enterprises define different data sources as different Datasets, and during data source management they import structured data per Dataset. One Dataset can correspond to several data sources, or to one type of data within a source, which is the common case. A Dataset can also contain multiple batches of data, so an enterprise can periodically import the same Dataset batch by batch. As the figure below shows, during data access an enterprise binds a different Schema to each Dataset, and each Schema can reference and reuse other sub-Schemas. After the RT-CDP parses the Schema, the data is automatically persisted to the storage engine, landing in different tables. Real-time customer behavior is handled the same way: the data structure is defined by Schemas, and data is then streamed in continuously.
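The sketch below, in plain Java with hypothetical class names (not the platform's actual API), shows the shape of this model: fields compose Schemas, Schemas reference sub-Schemas, and a Dataset binds a tenant's source to a Schema.

```java
import java.util.List;

public class SchemaModelSketch {
    enum FieldType { STRING, LONG, DOUBLE, TIMESTAMP, ARRAY }

    record Field(String name, FieldType type, boolean custom) {}

    // A Schema is a set of fields plus references to sub-Schemas,
    // e.g. a Profile schema reusing a shared Identity schema.
    record Schema(String name, List<Field> fields, List<Schema> refs) {}

    // A Dataset binds a tenant's data source to a Schema; batches of
    // records imported under the Dataset are validated against it.
    record Dataset(String name, String tenantId, Schema schema) {}

    public static void main(String[] args) {
        Schema identity = new Schema("Identity",
                List.of(new Field("type", FieldType.STRING, false),
                        new Field("value", FieldType.STRING, false)),
                List.of());
        // "industry_tier" stands in for a tenant-defined custom field.
        Schema profile = new Schema("Profile",
                List.of(new Field("name", FieldType.STRING, false),
                        new Field("industry_tier", FieldType.STRING, true)),
                List.of(identity));
        Dataset crmLeads = new Dataset("crm_leads", "tenant_42", profile);
        System.out.println(crmLeads);
    }
}
```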
Extension 1: Solving unlimited multi-tenant column growth with field mapping
What is the problem?
The Aifanfan RT-CDP is a multi-tenant platform, and under multi-tenancy each enterprise has its own business data. A small or medium-sized enterprise may have hundreds or thousands of lead data fields, and KA (key account) tenants have even more. As a SaaS service, how can the CDP support the storage and analysis of so many fields in one model? Normally, with an engine that can grow columns without limit, tenant + field could simply be flattened into columns. For structured real-time storage the Aifanfan CDP chose Kudu, but Kudu's official advice is that a single table should not exceed 300 columns, and it supports at most a few thousand, so the flattening approach does not work.
What is our solution?
We solve the problem through field reuse under the premise of tenant isolation. The attr fields appear both in the Schema diagram above and in the actual Profile and Event tables. The key points are (a sketch follows the list):
- the fact table contains only fields with no business meaning;
- during data access and query, data is translated through the mapping between business fields (logical fields) and fact fields before interacting with the front end and tenants.
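A minimal sketch of that field reuse, with illustrative names and pool size: each tenant maps its business (logical) field names onto a fixed pool of meaningless physical columns (attr1..attrN) in the Kudu fact table, so the physical schema never grows.

```java
import java.util.HashMap;
import java.util.Map;

public class FieldMappingSketch {
    // (tenantId, logicalField) -> physical column such as "attr17"
    private final Map<String, String> mapping = new HashMap<>();
    private final Map<String, Integer> nextSlot = new HashMap<>();
    private static final int MAX_ATTRS = 300; // stay within Kudu's column guidance

    public String physicalColumn(String tenantId, String logicalField) {
        return mapping.computeIfAbsent(tenantId + ":" + logicalField, k -> {
            int slot = nextSlot.merge(tenantId, 1, Integer::sum);
            if (slot > MAX_ATTRS) throw new IllegalStateException("attr pool exhausted");
            return "attr" + slot;
        });
    }

    public static void main(String[] args) {
        FieldMappingSketch m = new FieldMappingSketch();
        // Two tenants reuse the same physical columns for different fields.
        System.out.println(m.physicalColumn("tenant_a", "car_model"));   // attr1
        System.out.println(m.physicalColumn("tenant_b", "course_name")); // attr1
        System.out.println(m.physicalColumn("tenant_a", "budget"));      // attr2
    }
}
```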
4.3.2 Identity Service
This service could also be called ID mapping; however, compared with traditional ID mapping, the business scenario differs, and so does the function. ID mapping in the traditional sense works on the anonymous data of advertising scenarios, with offline, complex-model-based probabilistic identification. ID mapping in a CDP is based on more exact identified data, enables more accurate connections, and has higher demands on recall and real-time performance.
We therefore designed an identity relationship model that supports both B2B2C and B2C business. After standardized tenant data is ingested, the identity graph keeps fissioning and growing from the continuously arriving data. At the functional level, we support custom identity types and identity weights, and tenants can customize fusion actions for different identities. In addition, based on our industry analysis, common identities and fusion strategies are built in, which tenants can use directly.
At the architectural level, Identity Service (ID mapping) is built with cloud native + Nebula Graph to provide tenant data isolation, real-time reads and writes, high read/write performance, and horizontal scaling.
1. Cloud native + Nebula Graph
We reduce operational costs by deploying Nebula Graph on K8s. Mainly we:
- automate maintenance of our Nebula cluster on K8s with Nebula Operator;
- manage Nebula's stateful node pods with StatefulSets;
- use local SSDs on each node to guarantee graph storage service performance.
2. Read and write optimization
Generally speaking, Identity Service is read-heavy, but new tenants and new scenarios also demand strong write capability, so both read and write performance matter. Under the premise of concurrent locking, reads and writes need optimization:
- design the data model to minimize the number of Nebula-internal I/O operations;
- make good use of Nebula syntax to avoid redundant memory operations in graphd;
- on query, keep the traversal depth as small as possible; on update, control the write granularity to reduce the impact of the lack of transactions on the business.
Extension 1: How to connect anonymous visitor data
For the scenarios where one person uses multiple devices, or one device is used by multiple people, we rely on offline correction to connect the identities.
4.3.3 Real-time data warehouse
4.3.3.1 Stream computing
The Aifanfan RT-CDP's core capabilities are built on Apache Flink + Kafka: stream computation performed on top of real-time streams, with millisecond data latency.
The core data flow is shown in the figure above; simplified, it mainly consists of the following parts:
- Collected and formatted data is sent uniformly to the CDP-ingest topic.
- The RT-CDP has a unified Entrance Job that cleans, validates, parses Schemas, and identifies the data, and then distributes the data based on tenant attributes. Because it is the RT-CDP's entry job and must support horizontal scaling, it is a stateless job.
- After data distribution, different job groups perform data processing, persistence, aggregation, and other processing logic, enriching the lead portraits on one hand and laying the data foundation for multi-dimensional segmentation on the other.
- Finally, the data is distributed downstream, to external systems, data analysis, the real-time rule engine, policy models, and other business modules, for further real-time drive.
Extension 1: Data routing
Why route?
As a foundational data platform, the Aifanfan RT-CDP serves not only tenants outside Baidu but also Baidu-internal teams and even Aifanfan itself, and not only small and medium-sized enterprises but also large and mid-sized ones. The former requires different levels of service stability: how do we keep internal and external service capacity from interfering with each other? The latter means enterprises of different sizes bring different lead volumes and consume RT-CDP resources differently: how do we avoid unfair resource allocation?
What did we do?
We solve the above problems with a data routing mechanism. We maintain a mapping between tenants and data-flow topics, which can be partitioned by tenant characteristics or adjusted dynamically according to tenant needs. The Entrance Job then distributes data to job groups with different resource quotas according to the tenant mapping. This separates internal and external resources and lets resources be controlled according to tenants' individual needs. (A small sketch follows.)
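A minimal sketch of the tenant-to-topic routing done in the Entrance Job. The mapping would live in configuration or metadata storage and be refreshable at runtime; the tenant ids, topic names, and default pool here are all illustrative.

```java
import java.util.Map;

public class TenantRoutingSketch {
    private static final Map<String, String> TENANT_TO_TOPIC = Map.of(
            "tenant_internal", "cdp-flow-internal", // internal traffic, isolated
            "tenant_ka_001",   "cdp-flow-large",    // large tenant, bigger quota
            "tenant_sme_042",  "cdp-flow-shared");  // SMEs share a pool

    static String route(String tenantId) {
        // Unknown tenants fall back to the shared pool.
        return TENANT_TO_TOPIC.getOrDefault(tenantId, "cdp-flow-shared");
    }

    public static void main(String[] args) {
        System.out.println(route("tenant_ka_001")); // cdp-flow-large
        System.out.println(route("tenant_new"));    // cdp-flow-shared
    }
}
```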
Extension 2: Batch writes via a custom Trigger
Kudu's random read/write performance is worse than HBase's. To reach write capability of hundreds of thousands of TPS, we also applied some logical optimization to Kudu writes: mainly a customized Trigger (two kinds, count and time window) that turns single writes into a batch policy while preserving millisecond latency.
The scheme is as follows: a write is triggered when the batch accumulates more than N records or the time window exceeds M milliseconds.
In general, a tenant's marketing campaign generates a large number of lead behaviors, including system events and real-time user behavior, so this batch writing approach effectively improves throughput.
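A minimal sketch of the count-or-time idea as a Flink window trigger: fire the batch write when either N records have accumulated or M milliseconds have passed, whichever comes first. N and M are illustrative; the real job tunes them per workload.

```java
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class CountOrTimeTrigger extends Trigger<Object, TimeWindow> {
    private static final long MAX_COUNT = 1000;   // N records
    private static final long MAX_DELAY_MS = 200; // M milliseconds

    private final ReducingStateDescriptor<Long> countDesc =
            new ReducingStateDescriptor<>("count",
                    (ReduceFunction<Long>) Long::sum, LongSerializer.INSTANCE);

    @Override
    public TriggerResult onElement(Object element, long timestamp, TimeWindow window,
                                   TriggerContext ctx) throws Exception {
        ReducingState<Long> count = ctx.getPartitionedState(countDesc);
        if (count.get() == null) {
            // First element of the batch: arm the time-based flush.
            ctx.registerProcessingTimeTimer(ctx.getCurrentProcessingTime() + MAX_DELAY_MS);
        }
        count.add(1L);
        if (count.get() >= MAX_COUNT) {
            count.clear();
            return TriggerResult.FIRE_AND_PURGE; // size threshold reached
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx)
            throws Exception {
        ctx.getPartitionedState(countDesc).clear();
        return TriggerResult.FIRE_AND_PURGE; // time threshold reached
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE; // processing-time driven only
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
        ctx.getPartitionedState(countDesc).clear();
    }
}
```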
4.3.3.2 Real-time Storage
The RT-CDP mainly includes three kinds of data: fragmented tenant data, the unified lead portrait, and offline analysis data. We mainly use two clusters for data storage: one cluster stores the unified lead portrait and hot data with time-series attributes, and the other stores cold data and data used for offline computing; each cluster integrates data lake capabilities. We then developed a unified Query Engine that supports cross-source, cross-cluster data queries and is transparent to the underlying storage engine.
Extension 1: Enhanced storage based on data layering
Why layer?
On one hand, Kudu-based data storage is expensive (Kudu clusters need to be built on SSDs for good performance). On the other hand, marketing scenarios pay more attention to customers' real-time behavior changes over short, recent periods (the last month, three months, half a year, and so on), while older history is used very rarely.
Layering
All things considered, and to save resource costs, we chose Parquet as the extended storage and layer the massive time-series data by temperature.
By frequency of use, we divide the data into hot, warm, and cold layers. Hot data is the data tenants use frequently, within a range of three months. Warm data is used at low frequency, generally only for occasional segment selection, covering three months to one year. Cold data is data tenants rarely use, older than one year. To balance performance, we put hot and warm data in the same cluster, and cold data in another cluster (the same cluster that serves the policy models).
The specific scheme (a sketch of the unified view follows):
- a unified view is created over the hot, warm, and cold layers, and the upper layers query data through the view;
- every day, scheduled offline migrations move time-series data from hot to warm and from warm to cold, and the view is updated in real time after each migration.
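For illustration, a minimal sketch of what that unified view could look like, issued over Impala's JDBC interface. The table and view names, the date boundary, and the connection URL are all hypothetical; the real migration job would recompute the boundaries and re-issue the statement right after each daily move.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LayeredViewSketch {
    public static void main(String[] args) throws Exception {
        // Requires an Impala JDBC driver on the classpath.
        try (Connection conn =
                     DriverManager.getConnection("jdbc:impala://impala-host:21050/cdp");
             Statement st = conn.createStatement()) {
            st.execute(
                "ALTER VIEW events_unified AS "
              + "SELECT * FROM events_hot  WHERE event_day >  '2021-09-01' " // Kudu, ~3 months
              + "UNION ALL "
              + "SELECT * FROM events_warm WHERE event_day <= '2021-09-01' " // Parquet, 3-12 months
              + "UNION ALL "
              + "SELECT * FROM events_cold");                                // Parquet, cold cluster
        }
    }
}
```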
Extension 2: Managing fusion-path mappings to avoid data migration
Why manage the mapping?
Lead portraits carry a lot of behavior data, and fusions may happen frequently. If the data were migrated on every lead fusion, the migration cost would be very high; moreover, when a lead's behavior spans warm and cold data, the old data cannot be deleted. For similar situations the industry mostly compromises, for example migrating only a recent period of a user's hot data and leaving older history untouched. That solution is not ideal.
Mapping management mechanism
So we changed the way of thinking and solve the problem by maintaining the lead fusion path.
The specific scheme:
- add a lead fusion relation table (user_change_rela) to maintain the mapping;
- create views over the fusion relation table and the time-series tables (such as Events), so the business layer stays unaware.
For the fusion relation table we apply one strategy optimization: we do not maintain the intermediate relations along the path, only the direct relations from every intermediate point to the end point of the path. That way, even when a fusion path involves many leads, relation query performance does not degrade much.
For example, when a lead fuses twice (affId 1001 fuses into 1002, and then into 1003), user_change_rela changes as shown in the following figure:
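A minimal sketch of that bookkeeping, with hypothetical names, following the 1001 -> 1002 -> 1003 example: on each fusion, every existing source of the absorbed lead is repointed straight at the new end point, so the table only ever holds direct "old id -> current id" rows.

```java
import java.util.HashMap;
import java.util.Map;

public class FusionRelaSketch {
    // user_change_rela: old affId -> current affId
    private final Map<Long, Long> rela = new HashMap<>();

    void fuse(long fromId, long toId) {
        // Any id that previously resolved to fromId now resolves to toId...
        rela.replaceAll((old, cur) -> cur == fromId ? toId : cur);
        rela.put(fromId, toId); // ...and fromId itself points at toId.
    }

    public static void main(String[] args) {
        FusionRelaSketch s = new FusionRelaSketch();
        s.fuse(1001L, 1002L); // {1001=1002}
        s.fuse(1002L, 1003L); // {1001=1003, 1002=1003}: direct links only
        System.out.println(s.rela);
    }
}
```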
4.3.3.3 Analysis enhancement
We chose Apache Doris, open-sourced by Baidu, as the analysis engine for data enhancement, providing customer insight capabilities for Aifanfan, such as journey analysis, crowd analysis, marketing effect analysis, fission analysis, live-streaming analysis, and more.
To make it easy to strip this part out for OP deployments later, we use the data output by the CDP as the data source for enhanced analysis, perform logical processing in Flink jobs (cleansing, dimension joins, data typing, and so on), and finally write the data into Doris with the flink-doris-connector contributed to Apache Doris.
Writing to Doris directly through the connector has two advantages:
- writing data to Doris with flink-doris-connector passes through Kafka one less time than using Routine Load;
- flink-doris-connector also allows more flexible data processing than Routine Load.
flink-doris-connector is implemented on Doris's Stream Load mode and imports data through an FE redirect to a BE. In our actual use of flink-doris-connector, we flush every 10 s and configure each batch to submit at most one million rows of data. For Doris, fewer, larger batches are friendlier than frequent small flushes.
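For illustration, a sketch of that write path, assuming the 1.x flink-doris-connector API; class and option names vary across connector versions, so treat this as a shape rather than exact code. The FE address, table, credentials, and sample record are hypothetical; the flush cadence mirrors the text.

```java
import org.apache.doris.flink.cfg.DorisExecutionOptions;
import org.apache.doris.flink.cfg.DorisOptions;
import org.apache.doris.flink.cfg.DorisSink;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DorisSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DorisOptions options = DorisOptions.builder()
                .setFenodes("doris-fe:8030")          // FE redirects the load to a BE
                .setTableIdentifier("cdp.events_agg") // hypothetical target table
                .setUsername("cdp_writer")
                .setPassword("***")
                .build();
        DorisExecutionOptions execution = DorisExecutionOptions.builder()
                .setBatchSize(1_000_000)     // up to 1M rows per Stream Load batch
                .setBatchIntervalMs(10_000L) // flush every 10 s
                .setMaxRetries(3)
                .build();

        env.fromElements("{\"tenant\":\"t1\",\"event\":\"view\"}")
           .addSink(DorisSink.sink(execution, options)); // each flush is one Stream Load
        env.execute("write-to-doris-sketch");
    }
}
```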
If you want to know more about Doris's practice in Aifanfan, you can read "Architecture and Practice of the Baidu Aifanfan Data Analysis System".
Extension 1: How Routine Load differs from Stream Load
The Routine Load way
It submits a resident import task to Doris and writes data into Doris by continuously subscribing to and consuming JSON-formatted messages from Kafka.
From an implementation point of view, the FE manages the import task, and the task imports data through Stream Load on a BE.
The Stream Load way
It uses a stream computing framework such as Flink to consume Kafka business data and writes to Doris via Stream Load over the HTTP protocol.
From an implementation point of view, in this way the framework synchronizes data to Doris directly through a BE, and the Coordinator BE returns the import state as soon as the data is written successfully. In addition, during import it is best to use the same label for the same batch of data, so that repeated requests for the same batch are accepted only once, which guarantees at-most-once semantics.
4.3.4 Real-time rule engine
In Aifanfan's private-domain products, flexible segmentation is an important product capability. How do we perform real-time segmentation with complex, flexible rules over lead attributes, identities, customer behaviors, and other dimensions? This is where the real-time rule engine comes in. The capability itself is not new: DMPs have something similar, and many CDPs and customer management platforms do too. But real-time, high-throughput rule judgment under multi-tenancy and massive data is a challenge.
In the Aifanfan RT-CDP, on one hand there are many tenants: how does a SaaS service support high-performance multi-tenant segmentation? On the other hand, the RT-CDP aims to judge in real time on the real-time stream. We therefore developed a real-time rule engine based on multi-layer data. Here is a brief introduction; a dedicated article will follow.
What are the problems?
The traditional implementation mainly translates the rules into one complex SQL statement when a tenant triggers a segmentation request, in real time or periodically, and runs an ad hoc SQL query over the tenant's lead data pool. In addition, a layer of inverted indexing is generally built over the leads, and query speed is acceptable when there are few tenants or in an OP deployment. However, implementing a rule engine on a real-time stream must solve the following problems:
- real-time judgment over massive data;
- the memory footprint of window-granularity data aggregation;
- window storms under sliding windows;
- data aggregation for rules without windows;
- updating window data after lead data changes.
Real-time rule engine implementation
Like many products, Aifanfan's segmentation rules are mainly a two-layer And/Or rule combination. By their characteristics, rules fall mainly into the following categories: ordinary attribute operations (P1, P2), ordinary identity operations (I1), behavior judgments over small windows (E1), behavior judgments over large windows (E2), and behavior judgments without windows (E3).
For rule flexibility and efficient data processing, we defined a set of rule-parsing algorithms, and then use Flink's powerful distributed computing and state management to drive the real-time rule engine. The concept of streaming data was covered above; here a lead behavior event illustrates more intuitively how data is enriched in-stream before real-time rule judgment, as shown in the figure below. After the data comes in, Identity Service first fills in the Identity IDs, a data job supplements the lead's corresponding attribute information, and finally the rule is judged in real time against the complete lead record; leads that match the rule are placed into the Segment.
Additionally, the rule engine is a service independent of business objects such as Segment, so it can support segmentation, tagging, MA journey nodes, and other rule-related business scenarios.
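A minimal sketch of the two-layer Or-of-And rule shape evaluated per lead. The leaf predicates stand in for the P/I/E categories above; in the real engine the windowed behavior judgments (E1/E2) would read Flink state rather than an in-memory list, and all names are illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class TwoLayerRuleSketch {
    record Lead(Map<String, Object> attrs, List<String> recentEvents) {}

    // Outer layer: OR across groups. Inner layer: AND within a group.
    static boolean matches(List<List<Predicate<Lead>>> orOfAnds, Lead lead) {
        return orOfAnds.stream()
                .anyMatch(group -> group.stream().allMatch(p -> p.test(lead)));
    }

    public static void main(String[] args) {
        Predicate<Lead> p1 = l -> "beijing".equals(l.attrs().get("city"));  // attribute op
        Predicate<Lead> e1 = l -> l.recentEvents().contains("joined_live"); // windowed behavior
        Predicate<Lead> e3 = l -> l.recentEvents().contains("form_submit"); // no-window behavior

        // (p1 AND e1) OR (e3)
        List<List<Predicate<Lead>>> rule = List.of(List.of(p1, e1), List.of(e3));

        Lead lead = new Lead(Map.of("city", "beijing"), List.of("joined_live"));
        System.out.println(matches(rule, lead)); // true via the first group
    }
}
```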
4.3.5 Scalability
4.3.5.1 Elastic Cluster
The computing and storage clusters of the Aifanfan RT-CDP are built on Baidu Cloud. With the help of cloud capabilities, compute-storage separation and dynamic resource scaling are well realized. We can customize flexible scale-out and scale-in policies that add or remove resources according to message volume, growing the cluster in real time to provide computing capacity at traffic peaks and shrinking it promptly at troughs to reduce costs.
Our cluster is mainly divided into four types of nodes: Master, Core, Task, and Client, as detailed in the figure above.
- Master node: the cluster management node, with NameNode and ResourceManager processes deployed; components migrate automatically on failure.
- Core node: compute and data storage node, with DataNode and NodeManager processes deployed.
- Task node: a pure compute node that supplements the Core nodes' computing power, with NodeManager and similar processes deployed; it stores no data and supports dynamic on-demand scaling in and out.
- Client node: an independent cluster management and job submission node.
4.3.5.2 Full-link Monitoring
RT-CDP provides complete full-link monitoring to detect cluster and data-flow problems in real time, enabling timely intervention and handling and ensuring better data services for tenants. It also builds full-link log collection and analysis, greatly reducing the cost of troubleshooting service problems.
As shown in the figure above, relying on Aipanfan's strong technical service capabilities, we have built cross-platform log collection & alerting and whole-link delay monitoring:
- Log collection: based on the full-link service log collection that Aipanfan contributed to Skywalking Satellite, we collect logs from micro-services running under K8s as well as from Flink Jobs, forming a log platform that gathers logs across the entire service chain; log query and analysis are then performed in Grafana.
- Service metric collection: metrics from the various micro-services and from computing/storage clusters such as Apache Flink, Impala, and Kudu are pushed to Prometheus through PushGateway, enabling real-time service monitoring and alerting (see the sketch after this list).
- Full-link delay monitoring: we also collect RT-CDP full-link instrumentation data through Skywalking Satellite, then analyze delays on a self-developed instrumentation analysis platform, achieving full-link data-delay visualization and threshold-based alarms.
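For the metric collection path, pushing to Prometheus through PushGateway looks roughly like the following sketch using the official Java simpleclient; the gateway address, metric name, and job name are illustrative:

```java
import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.PushGateway;

/** Minimal sketch of pushing a service metric to Prometheus via
 *  PushGateway; address and metric name are illustrative. */
public class MetricsPushSketch {
    public static void main(String[] args) throws Exception {
        CollectorRegistry registry = new CollectorRegistry();
        Gauge lag = Gauge.build()
                .name("rt_cdp_pipeline_lag_ms")  // hypothetical metric name
                .help("End-to-end processing delay in milliseconds.")
                .register(registry);
        lag.set(42);

        // Push once; a real service would push on a schedule or per batch.
        PushGateway gateway = new PushGateway("pushgateway.example.com:9091");
        gateway.pushAdd(registry, "rt_cdp_monitoring_demo");
    }
}
```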
V. Achievements of the Platform
5.1 Digitalization of Assets
RT-CDP solves the problem of data isolation within enterprises and helps them make their data assets multi-party, digitized, intelligent, and secure.
- Multi-party: integrate first-party data, connect second-party data, and make use of third-party data, opening up data across parties for more accurate and in-depth customer insight.
- Digitization: manage customer information digitally through custom attributes, tags, model attributes, and more.
- Security: protect enterprise data assets through data encryption, privacy-preserving computation, and multi-party computation.
- Intelligence: enrich customer portraits and serve more marketing scenarios through intelligent models.
5.2 Efficiently Supporting Services
1. Flexible data definition capabilities
RT-CDP provides flexible data definition capabilities at the business level to meet enterprises' personalized requirements:
- Rich custom APIs for defining the reporting structure of Schemas, attributes, events, and more (see the sketch after this list);
- Custom identity types, so enterprises can specify potential-customer identifiers according to their own data;
- Data with different structures from different enterprises can be connected at zero development cost.
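To illustrate what this could look like from a tenant's side, the sketch below registers a custom event schema over HTTP; the endpoint and JSON fields are invented for illustration and do not reflect the actual RT-CDP API:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical sketch of registering a custom event schema via a REST
 *  call; the URL and payload shape are illustrative only. */
public class SchemaRegistrationSketch {
    public static void main(String[] args) throws Exception {
        String schemaJson = """
                {
                  "event": "quote_requested",
                  "attributes": [
                    {"name": "product_line", "type": "string"},
                    {"name": "budget",       "type": "double"}
                  ],
                  "identityTypes": ["phone", "wechat_openid"]
                }""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://cdp.example.com/api/v1/schemas")) // illustrative URL
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(schemaJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```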
2. Various marketing services for enterprises in different industries
Relying on RT-CDP's strong data management capabilities, Aipanfan's marketing products have served thousands of companies across dozens of industries, including legal and business services, education and training, electronics and electrical engineering, machinery and equipment, financial services, real estate, healthcare, beauty, home and living, building materials, printing and packaging, agriculture, forestry and fishing, logistics and transportation, food, and more, helping enterprises solve many marketing problems. The list of success stories goes on.
5.3 Advanced Architecture
At present, we have completed the construction of RT-CDP 1.0 and achieved good results on some core indicators:
5.3.1 Real-time high throughput
- Identity Service supports relationship queries at hundreds of thousands of QPS and identity fission at tens of thousands of TPS.
- Real-time computing processes hundreds of thousands of TPS with real-time persistence and millisecond-level delay.
- Massive enterprise data can be analyzed in real time at millisecond level with high concurrency.
- Rule judgment runs on truly real-time stream data, supporting private-domain tagging, real-time rule judgment, circle groups, and other real-time business scenarios, so marketing can reach customers within milliseconds.
5.3.2 High Scalability
The platform architecture separates storage and computation and scales horizontally:
- A dynamically scalable graph storage cluster based on Cloud Native + Nebula.
- Flexible, scalable storage and computing clusters built with Baidu Cloud native capabilities such as CCE and BMR.
- Computing clusters scale dynamically, saving enterprises resource costs.
5.3.3 High stability
The stability of each module and each cluster has remained above 99.99% over the long term.
VI. Future Prospects
1. [Business level] Middle-platform capabilities closer to the industry
- The platform already provides good definition capabilities for business support. Next, we will combine them with the key businesses of each enterprise's industry and build in more industry business objects, further reducing the cost of enterprise data access.
- We will make more business attempts on the B2B2C data model to better serve ToB enterprises.
2. [Business level] Richer AI models
- RT-CDP already provides intelligent potential-customer scoring, allowing enterprises to flexibly define scoring rules. In the era of AI, we will continue to add more AI models to help enterprises manage, gain insight into, and market to customers.
3. [Architecture level] More intelligent governance, operation and maintenance
- At present, Flink jobs are still managed on Yarn, operated through APIs and script-based processes (such as operations involving CK), and monitored via IM, SMS, and telephone alarms. Going forward, we will invest more in job management and operations, such as managing Flink jobs on K8s and improving operability by combining the Webhook capabilities of IM tools.
- Driven by streaming data, the changed data processing mechanism makes data governance and data inspection more challenging. Much remains to be done to provide more reliable data services.
4. [Architecture level] From integrated lakehouse to intelligent lakehouse
- Domestic Internet companies have many practical cases of data lake technology, which can indeed solve some pain points of the original data warehouse architecture, such as data not supporting update operations and quasi-real-time queries being unavailable. We are also working on integrating Flink with Iceberg/Hudi and will roll this out gradually.
VII. The Author
Jimmy: a senior engineer on the Aipanfan team.