I. Business background
In the real world, a waybill goes through a full process that starts when the user places an order and ends when the rider delivers the meal to the user. Along the way it passes through a series of indoor and outdoor scenes: the user places the order and the system dispatches it; the rider cycles outdoors, gets off the bike at the destination, and walks to the merchant; the rider waits indoors for the food; the merchant hands over the meal; the rider walks out of the shop with the food, gets back on the bike, and cycles outdoors to the user's destination; the rider gets off the bike, walks upstairs to deliver the food, and waits indoors for the user to collect it; finally the user collects the food and the rider leaves. The feature platform accumulates the basic rider data involved in this process and, through data mining, provides feature data that helps the algorithm platform better compute rider order dispatch, delivery fees, and tilt.
II. Overall architecture and service split
Overall architecture
The overall architecture of the feature platform follows these basic specifications:
- Process standardization: the pipeline is standardized end to end, from data input, through processing and computation, to data output.
- Data layering: data is organized in layers so that common parts are consolidated and reused.
- Feature fallback: features have fallback (default) values to reduce risk.
[Figure: overall architecture of the feature platform]
Based on these basic specifications, the overall architecture of the service is shown in the figure above. It is divided into eight parts, six layers plus a management system and monitoring:
- Data source layer: mainly online business tables (waybill tables, rider tables) and offline Hive tables.
- Data layer: cleans and transforms the data into wide tables of real-time features and offline features.
- Computing layer: computes over the wide-table data with standardized SQL.
- Storage layer: stores the feature data output by the computing layer.
- Service layer: provides a unified RPC feature reading service that serves data from the storage layer to the applications in the application layer.
- Application layer: mainly includes the ETA (time prediction) service, the order dispatch engine service, the rider management and control service, and the supply and demand service.
- Management system: manages feature metadata over its whole life cycle, from creation to destruction, as well as each feature's calculation definition, data source, storage format, and default value.
- Monitoring and alarm: monitors the services that the feature production pipeline depends on, visualizes the quality of feature data, and supports manual inspection to find data problems.
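The fallback behaviour implied by the list above (a read service in front of storage, with default values held by the management system) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the platform's actual code: `STORE`, `DEFAULTS`, `read_feature`, and the feature name `rider_finished_orders_today` are all hypothetical, and an in-memory dict stands in for the real storage layer.

```python
import time
from typing import Any, Dict, Optional, Tuple

# Hypothetical in-memory stand-ins (assumptions, not the real components):
# STORE plays the storage layer, DEFAULTS plays the default values that the
# management system keeps per feature.
STORE: Dict[Tuple[str, str], Dict[str, Any]] = {}
DEFAULTS: Dict[str, Any] = {"rider_finished_orders_today": 0}


def read_feature(feature_id: str, object_id: str, now: Optional[float] = None) -> Any:
    """Return the stored value, or the configured default if it is missing or expired."""
    now = time.time() if now is None else now
    record = STORE.get((feature_id, object_id))
    if record is None or record["expire_time"] <= now:
        # Fallback path: serve the default instead of failing the caller.
        return DEFAULTS.get(feature_id)
    return record["feature_val"]


# Example record for one rider; expire_time is an epoch timestamp in seconds.
STORE[("rider_finished_orders_today", "rider_1")] = {
    "feature_val": 7,
    "expire_time": 200.0,
}
```

With this sketch, a caller in the application layer always receives a usable value: the stored one while it is fresh, and the default once the record has expired or when the entity has no record at all.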
Service split
- Feature reading service: exposes a unified, abstract RPC interface for reading features.
- Real-time feature writing service: accepts real-time feature data writes over RPC, HTTP, and MQ.
- Offline feature writing service: relies on the big data platform to load offline data into online storage.
- Feature management platform: manages feature metadata.
III. Metadata model
Feature_source: abstracts the sources that produce features and defines the metadata model of a feature production data source, as follows:
{
  "product_id": "",
  "feature_ids": ""
}
Feature_business: abstracts the business parties that use features and defines the metadata model of a business party, as follows:
{
  "business_id": "",
  "feature_ids": ""
}
Feature metadata: abstracts what features have in common and defines the metadata model of a feature, as follows:
{
  "feature_id": "",
  "feature_name_space": "",
  "object_id": "",  // feature entity ID (rider, shop, user, etc.); for a rider feature it is the rider's ID
  "feature_val": "",
  "expire_time": "",
  "update_time": "",
  "source": ""
}
Relationships among feature data sources, feature business parties, and features:
feature_source vs feature (1:n)
feature_business vs feature (1:n)
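To make the three models and their 1:n relationships concrete, here is a minimal Python sketch. The field names come from the JSON above; everything else is an assumption made for illustration: the class names, the `features_of` helper, and in particular modelling `feature_ids` as a list of IDs (the JSON only shows an empty string).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Feature:
    feature_id: str
    feature_name_space: str
    object_id: str      # entity ID, e.g. the rider's ID for a rider feature
    feature_val: str
    expire_time: str
    update_time: str
    source: str


@dataclass
class FeatureSource:
    product_id: str
    # 1:n, one source produces many features (list form is an assumption)
    feature_ids: List[str] = field(default_factory=list)


@dataclass
class FeatureBusiness:
    business_id: str
    # 1:n, one business party uses many features (list form is an assumption)
    feature_ids: List[str] = field(default_factory=list)


def features_of(owner: Union[FeatureSource, FeatureBusiness],
                catalog: Dict[str, Feature]) -> List[Feature]:
    """Resolve an owner's feature_ids against a catalog, skipping unknown IDs."""
    return [catalog[fid] for fid in owner.feature_ids if fid in catalog]


# Hypothetical example data.
catalog = {"f1": Feature("f1", "rider", "rider_1", "7", "", "", "waybill")}
source = FeatureSource(product_id="p1", feature_ids=["f1", "f2"])
```

Here `features_of(source, catalog)` resolves only `f1`, because `f2` is not registered in the catalog.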
IV. Real-time features
This section describes the entire process of real-time features from data source to storage.
[Figure: real-time feature flow from data source to storage]
The diagram above shows the entire flow of real-time features from data source to storage. The architecture can be divided into four layers.
- Feature management platform: manages the whole life cycle of features from creation to destruction, as well as the other business abstractions related to feature management, including real-time and offline feature production tasks.
- Task scheduling platform: uses the open source Airflow distributed scheduling platform. By abstracting what feature production tasks have in common and defining custom DAG tasks, it produces real-time features at second-level or minute-level intervals and writes the produced feature data to MQ.
- Flink data synchronization task: uses Flink to synchronize online business data to index tables and detail tables in real time.
- Consumer service: accepts feature writes over RPC, MQ, and HTTP, and writes feature data to storage according to the unified data model defined by the feature management platform.
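As a rough illustration of the last step, the sketch below shows how a consumer might validate one MQ message against the unified data model before writing it to storage. It is an assumption-laden sketch: `handle_message`, `REQUIRED_FIELDS`, the JSON message encoding, and the dict used as storage are all hypothetical, and the real service may validate and store data quite differently.

```python
import json
import time
from typing import Any, Dict, Optional, Tuple

# Fields a message must carry to be accepted (an assumed subset of the
# feature metadata model).
REQUIRED_FIELDS = ("feature_id", "object_id", "feature_val", "expire_time")

Store = Dict[Tuple[str, str], Dict[str, Any]]


def handle_message(raw: str, store: Store, now: Optional[float] = None) -> bool:
    """Validate one JSON-encoded message and upsert it into the store.

    Returns True if the record was written, False if it was rejected.
    """
    now = time.time() if now is None else now
    record = json.loads(raw)
    if any(name not in record for name in REQUIRED_FIELDS):
        return False  # does not match the unified model; drop it
    record["update_time"] = now
    # Key by (feature, entity) so a newer message overwrites the older value.
    store[(record["feature_id"], record["object_id"])] = record
    return True
```

Keying the store by `(feature_id, object_id)` keeps writes idempotent: replaying the same message leaves one record per feature and entity.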
V. Offline features
This section describes the process of offline features from data source to storage.
[Figure: offline feature flow from data source to storage]
The figure above shows the whole process of offline features from data source to storage. The architecture can be divided into five parts.
- Task scheduling platform: here this refers to the big data offline task scheduling platform, which is responsible for the whole life cycle of offline task scheduling.
- Feature management platform: manages the whole life cycle of features from creation to destruction, as well as the other business abstractions related to feature management, including real-time and offline feature production tasks.
- Feature extraction task: an abstract, unified feature extraction task that relies on the configuration in the feature management platform, generates standard Hive SQL, and writes the extracted data to the shared feature table (the feature wide table).
- Feature aggregation task: generates feature data for specific services based on the configuration in the feature management platform.
- Feature synchronization task: loads the result data produced by the aggregation task into online storage using the offline-to-online data loading tool.
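The idea of generating Hive SQL from configuration, as in the feature extraction task above, can be sketched as follows. The configuration keys (`source_table`, `target_table`, `entity_column`, `dt`, `features`) and the function name are hypothetical, since the source does not show the platform's real configuration format. The sketch also assumes the configuration is trusted, so it interpolates values directly into the SQL text.

```python
from typing import Any, Dict, List


def build_extraction_sql(config: Dict[str, Any]) -> str:
    """Render one Hive SQL statement that fills a partition of the wide table.

    `config["features"]` maps an output column name to an aggregate
    expression over the source table.
    """
    columns: List[str] = [
        f"{expr} AS {name}" for name, expr in config["features"].items()
    ]
    select_list = ",\n  ".join([config["entity_column"]] + columns)
    return (
        f"INSERT OVERWRITE TABLE {config['target_table']} "
        f"PARTITION (dt = '{config['dt']}')\n"
        f"SELECT\n  {select_list}\n"
        f"FROM {config['source_table']}\n"
        f"WHERE dt = '{config['dt']}'\n"
        f"GROUP BY {config['entity_column']}"
    )


# Hypothetical configuration for one rider feature.
example_config = {
    "source_table": "dw.waybill_detail",
    "target_table": "feature.rider_wide",
    "entity_column": "rider_id",
    "dt": "2024-01-01",
    "features": {"finished_orders": "SUM(is_finished)"},
}
example_sql = build_extraction_sql(example_config)
```

Adding a feature then means adding one entry under `features` rather than writing a new job, which is the kind of reuse the configuration-driven design aims for.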
VI. Thoughts on feature operators
For example, suppose there is a rider completed-order count feature a (T-1, that is, as of the previous day) and a feature b for the rider's completed orders in real time on the current day. We now need a feature c for the rider's completed-order count, where c = a + b. In this case, c can be computed by reusing a and b. More generally, a generic operator can be designed to reuse features: with just a few configuration items, the computation and storage can be completed without writing code. For example, the expression operator above can support the following forms of "calculation":
[Figure: calculation forms supported by the expression operator]
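A minimal sketch of such an expression operator is shown below. It evaluates a configured expression such as `a + b` over existing feature values, using Python's `ast` module so that only plain arithmetic is accepted. The function name and the supported operator set (`+`, `-`, `*`, `/`) are assumptions made for illustration; the forms the platform actually supports are not listed in the text.

```python
import ast
import operator
from typing import Dict, Union

Number = Union[int, float]

# Assumed operator set; the real operator may support other forms.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression: str, features: Dict[str, Number]) -> Number:
    """Evaluate an arithmetic expression whose variables are feature names."""

    def walk(node: ast.AST) -> Number:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Name):
            return features[node.id]  # look up an existing feature value
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        # Anything else (function calls, attribute access, ...) is rejected.
        raise ValueError(f"unsupported expression element: {ast.dump(node)}")

    return walk(ast.parse(expression, mode="eval"))


# c = a + b, with a the T-1 count and b today's real-time count.
c = evaluate("a + b", {"a": 120, "b": 7})
```

Because the expression is just a configuration string, a new derived feature only needs its expression and the names of the features it reuses.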