background
Meituan-dianping has been systematically supported in offline scenarios. Through offline data collection and mining, target users can be reached by T+1, and the conversion rate can be improved to a certain extent by sending Push to target users. However, due to the delay of T+1 itself, users cannot be contacted in real time when they have specific behaviors, so they cannot give full play to the value of data and achieve better operation results.
In this context, operational businesses need to start mining real-time data of user behavior, such as real-time browsing, ordering, refund, search, etc., to reach users meeting operational needs in real time and maximize the effect of operational activities.
The business scenario
The following typical business scenarios exist in the real-time access requirements of operations:
- Within 30 minutes, the user performs behavior A for at least three times
- The user is a regular guest of Meituan Hotel, that is, the user has purchased meituan hotel products
- User A does not perform user B’s behavior within 24 hours prior to user A’s behavior
- The user does not have B behavior within 30 minutes after A behavior (eliminate the effect of spontaneous B behavior within 30 minutes and reduce the deviation of results)
Taking the typical real-time operation scenario as an example, this paper focuses on how to design a system that can support the efficient and stable operation of business requirements.
Early plan
There were only a few activities in the early stage of operation real-time contact requirements. We provided quick support for operation real-time contact requirements by developing a set of Storm topological codes for each requirement and hard-coding operation activity rules, as shown in Figure 1:
Problems with earlier schemes
The early scheme was a Case By Case solution and could not form a complete system. With the development of real-time operation business, the number of related operation activities increases rapidly, and multiple sets of similar code are maintained online. On the one hand, the PRINCIPLE of DRY (Don’t Repeat Yourself) is broken, and on the other hand, online maintenance cost increases linearly, and the system gradually cannot support the current demand.
challenge
In order to solve the problems in the early solution, the following challenges were presented to the system construction:
- Hard-coding activity rules results in a lot of duplicate code, high development costs, and long demand response times.
- It is difficult to modify service rules. Code modification and online topology restart are required to adjust operating conditions.
- There are many Storm topologies online, resulting in low resource utilization and system throughput, and high unified maintenance cost.
- Lack of a perfect monitoring and alarm mechanism, it is difficult to detect the stability problems in the system and data before business.
In view of the above challenges, combined with the characteristics of business rules, the meituan-Dianping data intelligence team investigated and designed a real-time access system for hotel operation.
Technology research
The need for a rules engine
To improve flexibility, business rules and system code should be decoupled from each other. The coupling of rules and code directly leads to the increase of duplicate code and the difficulty of modifying business rules. So how do you decouple business rules from system code? We thought of using the rules engine to solve this problem.
A rule engine is an engine that processes complex sets of rules. By inputting some basic events, the final execution result can be obtained by deduction or induction. The core function of the rule engine is to extract complex and changeable rules from the system and use flexible and variable rules to describe business requirements. Because many business scenarios, including the real time contact scenario of hotel operation, the input or trigger condition of rule processing is event, and the relationship between events is dependent or time sequence, so the rule engine is often used in combination with CEP (composite event processing).
CEP analyzes and deals with a number of simple events in combination, and finds out meaningful events by using the correlation of events to draw conclusions. The business scenario mentioned in the previous background of the article combines a single event into a composite event with business meaning through multiple rule processing, thus improving the order probability of browsing only unordered users. It can be seen that the rule engine and CEP can meet the specific requirements of business scenarios, and their introduction can improve the flexibility of the system in the face of demand changes.
Rule Engine research
Before designing the rules engine, we investigated existing rules engines in the industry, mainly including Esper and Drools.
Esper
Esper is designed as a lightweight solution for CEP and can be easily embedded into services to provide CEP functionality.
advantage
- Lightweight embeddable development, common CEP features are simple to use.
- EPL syntax is similar to SQL and is cheap to learn.
disadvantage
- In a single-node full memory solution, other distributed and storage devices need to be integrated.
- Memory to achieve the time window function, cannot support a longer span of time window.
- Unable to effectively support timed contact (such as user contact payment terms judgment 30 minutes after browsing).
Drools
Drools started with the rules engine and was later introduced into the Drools Fusion module to provide CEP functionality.
advantage
- More perfect functions, such as system monitoring, operating platform and other functions.
disadvantage
- The learning curve is steep and the DRL language introduced is complex, making it difficult for independent systems to carry out secondary development.
- Memory to achieve the time window function, cannot support a longer span of time window.
- Unable to effectively support timed contact (such as user contact payment terms judgment 30 minutes after browsing).
Since business rules have strong dependence on time window function and timed touch function, we choose Aviator, an expression engine with lighter weight than SpEL, to integrate streaming data processing and rule engine into Storm, which ensures the throughput of the system during data processing. When there is a bottleneck in system processing resources, you can adjust the number of workers and executors on the company’s hosting platform to reduce the cost of horizontal system scaling.
Technical solution
After the introduction of rules engine, the design and development of rules engine has become the main focus of system construction. By using the real-time behavior data of users in the real-time data warehouse, meaningful composite events are combined according to the rules of business operation activities, and then sent to the downstream business system to reach the main body of the event, namely users. Abstract the system into the following functional modules, as shown in Figure 2:
Overall, the system modules and functions are as follows:
- Rules engine: integrated into the Storm topology, it performs specific rules converted into operational activity conditions and responds accordingly.
- Time window module: sliding time window function with optional time span, provides time window factor for rule judgment.
- Timing contact module: set the execution time of rule judgment, and execute subsequent rules when the set time is reached.
- Custom functions: Extend rule engine functionality on top of the basic functions of the Aviator expression engine.
- Alarm module: regularly check the amount of messages processed by the system, and send alarm information for the person in charge when abnormal.
- Rule Configuration console: Provides a configuration page for adding scenarios and rules on the console.
- Configuration loading module: periodically loads configuration information such as active rules for the rule engine to use.
The rule engine consists of the minimum function set of the core component and the extension function provided by the extension component. Because the rules engine decouples the business rules and the system code, it abstracts the real-time data in the process of time variation, and puts forward higher requirements for data monitoring and alarm. Next, we will introduce the core components of rule engine, extension components of rule engine, monitoring and alarm respectively.
Rules engine core components
The core component of rule engine is the minimum set of rule engine to support the completion of basic rule judgment.
Rules engine core process
After the rules engine is introduced, the business requirements are translated into specific scenarios and rules for execution, as shown in Figure 3:
During the execution of rules, the rules engine involves the following data models:
- Scenario: An abstraction of business requirements. A business requirement corresponds to a scenario, and a scenario consists of several rules. Compose timing and dependencies with different rules to fulfill complete business requirements.
- Rules: RULES consist of rule conditions and factors triggered by events that route to the scene to which they belong. Rules consist of rule conditions, factors, and rule responses.
- Rule condition: A rule condition is a Boolean expression consisting of factors. The execution result of a rule condition directly determines whether the rule response is executed.
- Factor: Factor is the basic component of rule conditions, which can be divided into basic factor, time window factor and third party factor according to different sources. The basic factor comes from the event, the time window factor comes from the time window data obtained by the time window module, and the third party factor comes from the third party service, such as user portrait service.
- Rule response: Action after the rule is successfully executed, for example, sending a compound event to the service system or sending an asynchronous event for subsequent rule judgment.
- Event: An event is a basic data unit of the system. It can be divided into synchronous events and asynchronous events. After the synchronization events are routed according to rules, the timing contact module is not called, but is executed sequentially. The asynchronous event invokes the timed contact module and executes later.
Time window module
The time window module is an important part of the rule engine in the real-time access system of hotel operation, which provides the time window factor for the rule engine. The time window factor can be used to count the number of browsing behaviors within the time window and query the time of placing the first order, etc. Table 1 lists the types of time window factors that need to be supported in the operation of real-time contact activities:
type | The sample | Factor to constitute |
---|---|---|
count | Browse POI more than Y times in recent X minutes | count(timeWindow(event.id, event.userId, X * 60)) |
distinct count | Browse different POIS more than Y times in recent X minutes | count(distinct(timeWindow(event.id, event.userId, X * 60))) |
first | The first hotel paid in the last X days | first(timeWindow(event.id, event.userId, X * 60)) |
last | Last searched hotel in the last X days | last(timeWindow(event.id, event.userId, X * 60)) |
Table 1 Time window factor types
According to the type of time window factor, the time window factor has the following characteristics:
- Time window details need to be stored in a List to support aggregation and detail requirements.
- Time window factors require day-grained persistence and support for EXPIRE.
- The time window factor is an important component of many rules in multiple application scenarios. The service is under great pressure and the response time needs to be at the MS level.
For the above features, based on the evaluation of the application scenario and the level of access data, we chose the KV storage service Cellar based on Tair to store the time window data. After the test, TP99 can guarantee the storage time window in 2ms under the request of 20K QPS, and has a high cost performance in storage, which can meet the requirements of the system.
In actual operations, the number of times of a user’s behavior in the time window is usually less than five times. Considering this business scenario, to avoid excessive Value affecting read and write response time, set a threshold when updating data in the time window and truncate the part exceeding the threshold. The time window data update and truncation process is shown in Figure 4:
In the service scenario mentioned in the previous background of the article, 1. The number of times that A behavior occurs within 30 minutes is greater than or equal to 3. The user behavior has not occurred within 24 hours before B behavior, 4. Users in A behavior has not occurred within 30 minutes after B behavior (exclude users spontaneously B behavior within 30 minutes, the effect of lowering the results caused by the deviation), use the time window module within the sliding time window on the behavior of users in the statistics, with time window factor as the basis of the rules execution judgment.
Rule engine extension component
The rule engine extension component enhances the rule engine functionality on top of the core components.
Custom function
Custom functions can extend Aviator functionality, and the rules engine can perform factors and rule conditions through custom functions, such as calling third-party services such as user portraits. The customized functions in the system to support the expansion of operational requirements are shown in Table 2:
The name of the | The sample | meaning |
---|---|---|
equals | equals(message.orderType, 0) | Check whether the order type is 0 |
filter | filter(browseList, ‘source’, ‘dp’) | Filter browse list data on the review side |
poiPortrait | poiPortrait(message.poiId) | Obtain merchant portrait data according to poiId, such as merchant star attributes |
userPortrait | userPortrait(message.userId) | Obtain user profile data according to userId, such as the user’s usual residence city, new and old user attributes |
userBlackList | userBlackList(message.userId) | Determine whether a user is blacklisted based on the userId |
Table 2 Examples of custom functions
In the business scenario mentioned in the first background of the article, in 2. The user is a regular guest of Meituan Hotel, that is, the user has purchased products of Meituan Hotel, the custom function is used to determine whether the user is a regular guest of Meituan Hotel, and the user portrait service is called to determine the user portrait label.
Timing contact module
The timing contact module supports setting the timing execution time for rules and postponing the execution of certain rules to meet the operation activity rules. In the business scenario mentioned in the first background of the article, under the condition of 4. The user does not have B behavior within 30 minutes after A behavior (excluding the effect of spontaneous B behavior within 30 minutes and reducing the deviation of results), it is necessary to determine whether the user has B behavior 30 minutes after the occurrence of A behavior. In order to exclude the impact of user spontaneous B behavior on the activity effect.
The data flow diagram involved in timing contact module is shown in Figure 5:
Early business requirements required a short latency time and a small total number of activities, which supported timed contact requirements by maintaining a pure memory DelayQueue. As the number of related operation activities increases and the timing contact time increases, the memory usage of pure memory mode becomes larger and larger, and all timing data will be lost after the system restarts. When optimizing the solution, we learned that the message middleware group of the company supports message granularity delay in THE Mafka message queue, which is very suitable for our usage scenarios. Therefore, we adopted this feature to replace the pure memory mode and realize the timed touch module.
Monitoring and alarm
Compared with offline data, real-time data is difficult to detect when problems occur during use. Since the operation activities targeted by the system are directly oriented to the C end, if there is any system anomaly or data quality anomaly, if it is not detected in time, it will directly cause the waste of operation cost and seriously affect the activity conversion rate and other evaluation indicators of activity effect. In view of the system stability problem, we solve it from two perspectives of monitoring and alarm.
monitoring
The existing products of the company’s data platform are utilized to report real-time events processed by the system according to their event ID and aggregate them in time granularity. After the data is reported, the amount of various events can be viewed in real time to evaluate whether the activity rules and effects are normal by the amount of messages. The display effect of reported data is shown in Figure 6:
Call the police
The monitoring can only be used as a Dashboard for display and viewing, and cannot realize automatic alarm. Since the aggregated data used to monitor the reported data is stored in the timing database OpenTSDB, we customized the alarm module based on the OPEN HTTP API of OpenTSDB to schedule and pull data. Different alarm rules and thresholds such as sequential and absolute values are applied to different events according to the event magnitude, activity importance and other indicators. When the threshold is exceeded, alarm information will be sent through company IM. As shown in Figure 7, the event shows a decrease in the magnitude of data from the previous month. After receiving the alarm, the relevant person in charge can track the problem in time:
Summary and Prospect
The real-time contact system for hotel operation has been online and running stably for more than one year. It is a very important link in business operation and serves as a link between the preceding and the following. It has achieved good results in terms of system processing capacity and contribution to business:
- The average daily real-time message processing volume is nearly 1 billion.
- Peak event QPS 14,000.
- Helped the hotel, tourism, transportation and other business lines to carry out a variety of operational activities.
- It significantly promoted the conversion rate, GMV and pull new indexes.
While the current system addresses business requirements, there are still some practical pain points:
- Real-time data access is not automated.
- Rule engine capabilities need to be generalized.
- Scenario and rule registration is not open to operational PMS and can only be done by RDS.
Looking into the future, we still have a lot of ways to go to solve the pain points. In the future, we will continue to build the system from both technical and business aspects to make it easier to use and more efficient.
Author’s brief introduction
Xiaoxing, system engineer of Meituan Platform Technology Department — Data center — Data Intelligence Group, graduated from Beijing Institute of Technology in 2014, engaged in Java background system and data service construction. Joined Meituan-Dianping in 2017, engaged in big data processing.
Wei Bin, System engineer of Meituan Platform Technology Department — Data Center — Data Intelligence Group, graduated from Dalian University of Technology in 2015 and joined Meituan Dianping in the same year, focusing on big data processing technology and high concurrency services.
recruitment
The technical department of meituan platform — data center — data intelligence group is looking for talents in data mining algorithm, big data system development and Java background development. If you are interested, please send your resume to lishangqiang#meituan.com.
If you are interested in our team, you can follow uscolumn.