On May 6, 2017, Huang Zhenxian, data architect at Meizu, delivered a talk titled “Introduction to Meizu’s Big Data User Insight Platform” at the eighth session of Meizu Technology Open Day, “Data Insight”. IT Tycoon Said (ID: ITdakashuo), the exclusive video partner, publishes this write-up with the review and authorization of the organizer and the speaker.
suo.im/4HBM1x
Abstract
Meizu DMP (the user insight platform) aggregates, cleans, and intelligently processes three-party audience data to build a large, accurate crowd data center that provides rich user portrait (profile) data and real-time scene recognition. Internally, it connects seamlessly to the data applications of various business platforms, such as the advertising platform, push notifications, and personalized recommendation, and establishes a data channel that supports company-level precision marketing and timely message delivery. Externally, it improves the data management and output process and offers standard, accurate crowd tags to industry practitioners through open interfaces, helping them optimize delivery and improve marketing results. The goal is to reach the audience accurately and release the real value of the data. This article introduces the architecture of the user insight platform, discusses the technical difficulties encountered and how they were resolved, and reviews the shortcomings of the current architecture and the direction of future improvement.
General introduction
By aggregating, cleaning, and intelligently processing three-party audience data, the platform builds a large, accurate crowd data center that provides rich user portrait data and real-time scene recognition.
It connects seamlessly to the data applications of various business platforms, such as the advertising platform, push notifications, and personalized recommendation, establishing a data channel that supports company-level precision marketing, timely message delivery, and similar services.
It evaluates marketing effectiveness; the feedback data can be further processed and used to improve the quality of portrait tags.
Core requirements
User insight has four core requirements.
Tag generation: Internet services change rapidly and tag requirements change frequently, so the system must respond quickly to new tag requirements.
Crowd insight: filter and aggregate on any tag across the full user base and respond to queries within 1-2 seconds.
Audience distribution: integrate seamlessly with the various business systems to enable efficient, real-time, accurate marketing.
Tag query: query user profile details by user ID. Queries from the advertising business must return within a stricter limit of 50 ms.
Overall architecture
Offline computing tasks are configured and run on the job scheduling system of the integrated development platform. The streaming platform (AnyStream) handles real-time tag calculation. Rules generated by the management module are stored in MySQL and consumed by the tag generation tasks (Hive, MR, and the streaming platform). The user portrait (tag) wide table is stored in ES, while HBase and Redis serve KV queries. The development platform (OpenAPI) provides the external interfaces.
Tag generation
Tags fall into two types according to how they are computed. The first type is the statistical tag. Indicators are first computed from user behavior; the tag generation rules then take these statistical indicators as input and decide, for example, which consumption level each user corresponds to.
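The two steps of a statistical tag can be sketched as follows. This is a minimal illustration, not the platform's actual rule engine: the indicator (total purchase amount), the thresholds, and the names `LevelRule`, `compute_indicator`, and `apply_rules` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelRule:
    """Hypothetical rule row: indicator in [lower, upper) maps to a tag value."""
    lower: float
    upper: float
    value: str


# Hypothetical rules for a "consumption level" tag (thresholds are invented).
CONSUMPTION_RULES = [
    LevelRule(0, 50, "low"),
    LevelRule(50, 500, "medium"),
    LevelRule(500, float("inf"), "high"),
]


def compute_indicator(events):
    """Step 1: aggregate raw behavior events into a statistical indicator."""
    return sum(e["amount"] for e in events if e["type"] == "purchase")


def apply_rules(indicator, rules):
    """Step 2: map the indicator to a tag value using the configured rules."""
    for rule in rules:
        if rule.lower <= indicator < rule.upper:
            return rule.value
    return None  # no rule matched, so the user gets no value for this tag


events = [
    {"type": "purchase", "amount": 30.0},
    {"type": "view", "amount": 0.0},
    {"type": "purchase", "amount": 45.0},
]
spend = compute_indicator(events)
print(apply_rules(spend, CONSUMPTION_RULES))  # medium
```

Keeping the rules as data rather than code is what allows them to live in a configuration store and change without redeploying the job.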
Algorithm-based tag calculation
The second type is the algorithm-based tag.
High-confidence data (such as user registration information) and user behavior data are selected as input for model training. The trained model is then used to predict the attribute for other users.
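The train-then-predict flow can be illustrated with a deliberately simple stand-in. The talk does not say which model is used; the sketch below swaps in a 1-nearest-neighbor classifier over sets of behavior features, and the feature names, age groups, and function names are all hypothetical.

```python
def jaccard(a, b):
    """Similarity between two sets of behavior features."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def train(labeled_users):
    """'Training' for 1-nearest-neighbor simply stores the labeled examples."""
    return [(frozenset(features), label) for features, label in labeled_users]


def predict(model, features):
    """Predict an attribute from the most similar high-confidence user."""
    features = frozenset(features)
    best_label, best_score = None, -1.0
    for known_features, label in model:
        score = jaccard(features, known_features)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# High-confidence labels (e.g. from registration data) paired with behavior.
labeled = [
    ({"game_a", "game_b", "video"}, "18-24"),
    ({"news", "finance", "stocks"}, "35-44"),
]
model = train(labeled)
print(predict(model, {"game_a", "video", "music"}))  # ('18-24', 0.5)
```

Returning the similarity score alongside the label gives a rough confidence value that a downstream job could threshold before writing the tag.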
Single-valued and multi-valued tags
A single-valued tag holds exactly one value per user, such as a consumption level.
A multi-valued tag can hold several values per user. For example, a user can have multiple interests. The existence of multi-valued tags affects both the choice of storage and query engine and the design of the storage structure.
Tag generation process
The advantage of this approach is configuration-driven management: a web UI manages the tag life cycle, tags are generated from the configuration, and the tag wide table data is 100% consistent with the metadata.
The remaining disadvantage is that configuration management currently covers only the generation of the final tag wide table and is disconnected from the upstream indicator statistics and algorithms. The upstream computation is developed separately, and the indicator definition is only a description of the data kept in a separate configuration, so inconsistencies are possible. After a tag is taken offline (retired), its upstream task dependencies must be retired separately; otherwise useless jobs are left behind and waste computing resources.
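The leftover-job problem described above becomes mechanical to detect once tag-to-job dependencies are recorded in one place. The sketch below assumes such a mapping exists (the talk says it currently does not); the tag and job names are hypothetical.

```python
def orphaned_jobs(tag_dependencies, active_tags):
    """Return upstream jobs that no active tag depends on any more."""
    all_jobs = set()
    needed = set()
    for tag, jobs in tag_dependencies.items():
        all_jobs.update(jobs)
        if tag in active_tags:
            needed.update(jobs)
    return sorted(all_jobs - needed)


# Hypothetical dependency metadata: tag name -> upstream jobs it reads from.
deps = {
    "consumption_level": ["job_spend_30d"],
    "game_interest": ["job_app_usage", "job_game_model"],
    "video_interest": ["job_app_usage"],
}

# "game_interest" has been retired; the job shared with "video_interest" stays.
print(orphaned_jobs(deps, active_tags={"consumption_level", "video_interest"}))
# ['job_game_model']
```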
Tag storage
Tag storage overview
Elasticsearch (ES) is an open-source, distributed, RESTful search engine built on Lucene. It offers real-time search and is stable, reliable, and fast. ES powers online screening and aggregation analysis on any tag across the full user base, with responses within seconds. HBase provides high-throughput key/value queries. Key/value queries with stricter performance requirements (the advertising platform) are served by Redis.
Why Elasticsearch (ES)
The legacy solution, Vertica Community Edition, is limited to three nodes and 1 TB of storage. As the data volume and the number of calls grew rapidly, performance bottlenecks occurred. Multi-valued tags could only be stored as CSV in VARCHAR fields, which performed poorly.
Retrieval on multi-valued tags relied on the string LIKE operation; aggregation could be supported with some tricks, but performance was poor.
ES, by contrast, offers real-time search and is stable, reliable, and fast. It supports online (real-time or near-real-time) updates and scales out well horizontally. Array fields are well suited to storing and analyzing multi-valued tags.
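The difference between the two storage shapes can be shown in plain Python, with no database involved. The talk cites performance as the problem with CSV plus LIKE; the sketch below highlights a second issue, that substring matching can also return false positives, whereas a multi-valued field supports exact membership tests and direct aggregation. The user IDs and interest values are invented.

```python
from collections import Counter

# Multi-valued "interests" tag stored as a CSV string (the VARCHAR approach).
users_csv = {
    "u1": "art,music",
    "u2": "martial arts,travel",
}
# The same tag stored as a list of values (the array-field approach).
users_array = {uid: value.split(",") for uid, value in users_csv.items()}


def like_match(csv_value, keyword):
    """Emulates SQL `LIKE '%keyword%'` on a CSV string column."""
    return keyword in csv_value


def array_match(values, keyword):
    """Exact membership test on a multi-valued field."""
    return keyword in values


like_hits = sorted(u for u, v in users_csv.items() if like_match(v, "art"))
array_hits = sorted(u for u, v in users_array.items() if array_match(v, "art"))
print(like_hits)   # ['u1', 'u2']  ("martial arts" matches by accident)
print(array_hits)  # ['u1']

# Aggregation over a multi-valued field needs no string tricks.
counts = Counter(v for values in users_array.values() for v in values)
print(counts["art"])  # 1
```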
HBase and Redis
HBase provides low-cost, high-throughput KV queries and satisfies the queries of general businesses. Its drawback is that query response time is not ideal for the advertising business.
The advertising business requires query latency within 50 ms, which is met with Redis. The Redis store currently serves query calls from the advertising platform only.
Considering cost, HBase is the primary KV query store, with Redis as a supplement for the more demanding businesses.
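The division of labor between the two stores can be sketched as a small router. Plain dictionaries stand in for Redis and HBase so the example runs on its own; the class name, the caller names, and the fall-back to the low-cost tier on a miss are assumptions, not details from the talk.

```python
class ProfileStore:
    """Routes profile lookups to a fast tier or a low-cost tier.

    Dicts stand in for Redis (fast tier) and HBase (bulk tier); a real
    deployment would call their client libraries instead.
    """

    def __init__(self, fast_tier, bulk_tier, fast_callers):
        self.fast_tier = fast_tier
        self.bulk_tier = bulk_tier
        self.fast_callers = set(fast_callers)

    def get_profile(self, caller, imei):
        """Return (profile, tier_used) for the given caller and device ID."""
        if caller in self.fast_callers:
            profile = self.fast_tier.get(imei)
            if profile is not None:
                return profile, "fast"
            # Assumption: on a fast-tier miss, fall back to the bulk tier.
        return self.bulk_tier.get(imei), "bulk"


store = ProfileStore(
    fast_tier={"imei-1": {"consumption_level": "high"}},
    bulk_tier={
        "imei-1": {"consumption_level": "high"},
        "imei-2": {"consumption_level": "low"},
    },
    fast_callers={"ads"},
)
print(store.get_profile("ads", "imei-1"))   # ({'consumption_level': 'high'}, 'fast')
print(store.get_profile("push", "imei-1"))  # ({'consumption_level': 'high'}, 'bulk')
```

Restricting the fast tier to a named set of callers mirrors the cost argument: only the business that needs the tighter latency pays for in-memory storage.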
Platform functions
List of Main Functions
The platform has five main functions: crowd management, crowd screening, portrait insight, audience distribution and portrait query.
Crowd management: a crowd can be created in two ways, either by specifying tag conditions or by importing an IMEI list. Existing crowds can also be modified or deleted.
Crowd screening specifies tag condition options and queries the number of users who meet them.
Portrait insight takes two steps: first select the user group by specifying tag condition options, then specify the tags to analyze and derive user characteristics through aggregation.
Audience distribution pushes the designated crowd to downstream marketing channels (advertising platform, push platform, OTA, etc.) through appropriate technical means.
Portrait query provides a query interface to downstream systems: the caller specifies the user identity (IMEI) and receives that user's portrait tags.
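Crowd screening and the two-step portrait insight map naturally onto an Elasticsearch search body: a `bool` filter selects the crowd and a `terms` aggregation analyzes one tag. The sketch below only builds the request body as a dictionary; the field names (`gender`, `interests`, `city`) and the helper name are hypothetical, and the actual index layout is not described in the talk.

```python
import json


def build_insight_query(conditions, analyze_field, top_n=10):
    """Build a search body: filter by tag conditions, then aggregate one tag."""
    filters = []
    for field, value in conditions.items():
        if isinstance(value, list):
            # A list means "matches any of these values" (terms query).
            filters.append({"terms": {field: value}})
        else:
            filters.append({"term": {field: value}})
    return {
        "size": 0,  # no individual documents needed, only counts and buckets
        "query": {"bool": {"filter": filters}},
        "aggs": {
            "distribution": {"terms": {"field": analyze_field, "size": top_n}}
        },
    }


body = build_insight_query(
    {"gender": "female", "interests": ["games", "music"]},
    analyze_field="city",
)
print(json.dumps(body, indent=2))
```

The total hit count in the response answers the crowd screening question (how many users match), and the aggregation buckets give the distribution used for portrait insight.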
That’s all for today’s talk. Thank you.