With the continuous development of Meituan's takeout business, the takeout advertising engine team has carried out engineering exploration and practice in many areas and has achieved some results so far, which we plan to share as a series. This article is the first installment of "Meituan Takeout Advertising Engineering Practice". It introduces some of the thinking and practice behind platformizing the Meituan takeout advertising engine with the goal of improving business efficiency.
1 Introduction
Takeout has become one of Meituan's most important businesses, and commercial monetization is an important part of the whole takeout ecosystem. After years of development, the advertising business has grown to cover multiple product lines, such as list ads in the form of feed streams, display ads for KA and large merchants, search ads based on user queries, and innovative ads in various emerging scenarios, corresponding to more than a dozen segmented business scenarios.
At the technical level, the processing of an advertising request can be divided into the following main steps: triggering, recall, prediction (fine ranking), creative selection, and mechanism strategies. As shown in the figure below, triggering infers the user's intent, recall then produces the advertising candidate set, prediction scores and ranks the merchants in the candidate set, creative selection is performed for the top merchants, and finally the advertising results are produced through a series of mechanism strategies.
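To make the flow concrete, below is a minimal sketch of the request pipeline described above; the stage and class names are illustrative assumptions, not the team's actual code.

```java
// Illustrative sketch of the ad request flow: trigger -> recall -> prediction
// -> creative selection -> mechanism strategy. All names are hypothetical.
import java.util.ArrayList;
import java.util.List;

// Hypothetical request context carrying candidates between stages.
class AdContext {
    final List<String> candidates = new ArrayList<>();
}

// Each stage consumes and enriches the context.
interface AdStage {
    void apply(AdContext ctx);
}

class AdPipeline {
    private final List<AdStage> stages;

    AdPipeline(List<AdStage> stages) {
        this.stages = stages;
    }

    AdContext serve(AdContext ctx) {
        for (AdStage stage : stages) {
            stage.apply(ctx); // stages run in the fixed order listed above
        }
        return ctx;
    }
}
```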
2 Status Analysis
In the process of business iteration, as new business scenarios were continuously onboarded and the functions of existing scenarios kept iterating, the system became more and more complex, and the response to business requirements gradually slowed down. In the early stage of business development, we refactored the architecture of individual modules, such as the mechanism strategy and recall services. Although this improved efficiency to some extent, the following problems remained:
- Low reuse of business logic: advertising business logic is complex. Take the central control service module as an example: its main function is to provide decision-making for ad bidding and ranking mechanisms, and it supports a dozen online business scenarios, each differing in many ways, such as recall modes, charging modes, ranking, pricing mechanisms, and budget control. On top of that there is a large amount of business-specific logic. Since this logic is the focus of algorithm and business iteration, it has many developers distributed across different engineering and strategy groups, which leads to inconsistent abstraction granularity of business logic and low reuse between businesses and scenarios.
- High learning cost: because the code is complex, it is expensive for new engineers to become familiar with it and hard for them to get started. In addition, the online service was migrated to microservices very early, and the number of online modules now exceeds 20. For historical reasons, the frameworks used by different modules differ considerably, so developing on an unfamiliar module carries a learning cost. In cross-module projects, it is difficult for a single engineer to finish independently, so staff efficiency is not fully utilized.
- Difficulty for PMs (product managers) to obtain information: with so many business scenarios and such complex logic, it is hard for most engineers to understand all of the business logic. When a PM needs to confirm the relevant logic during product design, the only option is to have an engineer check the code first and then confirm, which makes information hard to obtain. In addition, since PMs are not familiar with the design of the relevant modules, they often have to consult the engineers offline, which hurts the efficiency of both sides.
- Difficult QA (test) evaluation: QA's assessment of a feature's scope depends entirely on the engineers' technical plan, and the scope and boundary of a change are mostly confirmed through verbal communication. This not only affects efficiency but also easily leads to missed test coverage.
3 Goals
In view of the above problems, we launched the platformization project of the Meituan takeout advertising engine in early 2020, aiming to achieve the following goals:
- Improve production-research efficiency
  - Increase functional reuse and improve development efficiency.
  - Reduce the collaboration cost among RD, PM, and QA, and improve the efficiency of production-research collaboration.
- Improve delivery quality
  - Make QA test coverage precise to improve delivery quality.
- Empower the business
  - Through visual platform pages, let PMs understand the capabilities of other product lines, empower each other, and help product iteration.
4 Overall Design
4.1 Overall Idea
At present, there has been a lot of research on "platformization" in the industry. For example, Alibaba's TMF is positioned in the field of platformizing pan-transaction systems; its main construction ideas are layered process orchestration plus domain extension, a plug-in architecture that separates business packages from the platform, and the separation of the management domain from the runtime domain. Alibaba's AIOS, in contrast, is positioned in the field of search and recommendation platforms; it relies on five core underlying components and quickly composes and deploys components in the mode of customizable operator flow graphs, achieving fast business delivery.
When the Meituan takeout advertising platformization project started, we determined the project's key goal based on our own business scenarios and pain points: use a platform design philosophy suited to us to build technical capability, transform the existing takeout advertising business system and production-research process into a platform model, and quickly support the delivery of multiple advertising businesses. Drawing on mature platformization ideas in the industry, we settled on the overall approach of "standardize business capabilities, build the technical framework, and upgrade the production-research process to the platform model" to safeguard the platform construction. The approach can be divided into three parts: standardization of business capabilities, the technical framework, and the new platform-based production-research process.
- Standardization of business capabilities: by sorting out the existing logic and transforming it to a standard, we provide the basic guarantee for code reuse across business scenarios and modules.
- Technical framework: it provides composition and orchestration capabilities that connect the standardized logic, scheduling and executing it through an engine, and also provides capability visualization to help users obtain information quickly.
- New platform-based production-research process: to ensure that R&D iteration improves overall after the project launches, we also optimized some mechanisms of the R&D process, mainly involving engineers, PM, and QA.
In short, standardization guarantees reuse, the framework carries the platform implementation, and the operating mechanism of the new production-research process ensures that the overall efficiency gains are sustained. All modules of the advertising engine service follow the platform approach and support each upstream product scenario, as shown in the figure below:
4.2 Business Standardization
4.2.1 Business Scenario and Process Analysis
Improving efficiency is one of the most important goals of platformization, and the most important means to that end is maximizing the reuse of functions in the system. We first made a unified analysis of the current business-line scenarios and traffic of takeout advertising and reached the following two conclusions:
First, the main processes of each business line are basically similar, including pre-processing, recall, prediction, mechanism strategy, ranking, creative, result assembly, and other major steps; at the same time, the same steps of different businesses contain many similar functions as well as business-line-specific functions. Second, in theory these functions could be reused as a whole, but in reality they are concentrated within individual business lines, and the degree of reuse differs between business lines and between groups. The main reasons for this are as follows:
- Different businesses are at different stages of development and iterate at different paces.
- There is a natural "isolation" in the organizational structure; for example, recommendation and search are split into two different business teams.
Therefore, the main obstacle to further improving reuse in takeout advertising is the lack of overall standardization: there is no unified standard across the business lines. So the problem of standardization construction should be solved first.
4.2.2 Standardization Construction
The breadth and depth of standardization construction determine the level of reuse the system can achieve, so this standardization should cover all aspects. For all services of the advertising system, we start from three dimensions of business development — the functions implemented, the data used by those functions, and the call processes that combine the functions — to carry out the unified standardization of advertising. As a result:
- At the level of individual development: developers no longer need to pay attention to process scheduling; they only need to focus on implementing new functions, so development efficiency is higher.
- From the perspective of the system as a whole: common functions of each service no longer need to be developed repeatedly, overall reuse is higher, and a lot of development time is saved.
4.2.2.1 Standardization of Functions
For the standardization of functions, we first divide functions into two parts according to whether they are related to business logic: business-logic-related and business-logic-independent.
① Functions unrelated to business logic are built uniformly through a two-layer abstraction
- The standardized form built uniformly for all business lines is a two-layer abstraction. A single, simple function point is abstracted into the tool layer; a function that can be implemented and deployed independently, such as the creative capability, is abstracted into the component layer. The tool layer and component layer provide services in the form of JAR packages, and all projects use the related functions by referencing the unified JAR packages, avoiding repeated construction, as shown in the figure below:
② Functions related to business logic are reused in layers
- Business-logic-related functions are the core of this standardization construction, with the goal of maximizing business reuse. We therefore abstract the smallest indivisible unit of business logic into the basic unit developed by business engineers, called an Action. According to the scope of reuse, Actions are divided into three layers: basic Actions that can be reused by all businesses, module Actions that can be reused by multiple business lines, and business Actions customized for a single business, i.e., extension points. All Actions derive from the Base Action, which defines the basic capabilities of every Action.
- Different Action types are developed by different kinds of engineers. Basic Actions and module Actions, whose influence scope is relatively large, should be developed by engineers with rich engineering experience; business Actions or extension points that affect only a single business can be developed by engineers with relatively less engineering experience.
- At the same time, we abstract a combination of multiple Actions into a Stage, a business module formed by combining different Actions. Its purpose is to hide details, simplify the business logic flow chart, and provide coarse-grained reuse, as the sketch below illustrates.
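A minimal sketch of the Action/Stage layering described above; the class and method names are illustrative assumptions, not the team's actual framework.

```java
// Illustrative layering: Base Action -> basic / module / business Actions,
// with a Stage combining several Actions. All names are hypothetical.
import java.util.List;

abstract class BaseAction {
    // The Base Action defines capabilities shared by every Action,
    // e.g., common pre/post handling and monitoring hooks.
    public final void run(Context ctx) {
        process(ctx);
    }

    protected abstract void process(Context ctx);
}

// Layered by reuse scope:
abstract class BasicAction extends BaseAction {}     // reusable by all businesses
abstract class ModuleAction extends BaseAction {}    // reusable by several business lines
abstract class BusinessAction extends BaseAction {}  // single-business extension point

// A Stage combines several Actions into a coarse-grained business module.
class Stage {
    private final List<BaseAction> actions;

    Stage(List<BaseAction> actions) {
        this.actions = actions;
    }

    void execute(Context ctx) {
        for (BaseAction action : actions) {
            action.run(ctx);
        }
    }
}

class Context {} // placeholder; see the standardization of data below
```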
4.2.2.2 Standardization of Data
Data is the basic element for implementing functions, and the data sources of different businesses are almost the same. Without a standardized data design, it is impossible to standardize functions or maximize data reuse. We classify data from the two aspects of data source and data usage: the input data, intermediate data, and output data of business capabilities are handled through a standardized data context; meanwhile, third-party external data and internal data such as word tables are accessed through unified container storage and interfaces.
① Use a Context to describe the environment dependencies of Action execution
- Each Action execution requires certain environment dependencies, including input dependencies, configuration dependencies, environment parameters, and dependencies on the execution state of other Actions. We abstract the first three into the Context of business execution, constraining the use of Actions by defining a uniform format and usage.
- Considering that the data dependencies used by Actions at different levels narrow from large to small, and following the same layered design, we designed three layers of Context containers related by inheritance, and store the three kinds of dependency data into the corresponding Context in a standardized way.
- The advantage of passing data through a standardized Context is that an Action can customize its input data and subsequent extension is convenient. The disadvantage is that it cannot mechanically restrict an Action's data access completely, and the Context may become bloated with subsequent iterations. After weighing the pros and cons, we still use the standardized Context mode at this stage, as sketched below.
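A minimal sketch of the three-layer Context inheritance, assuming a simple key-value container; the class names are illustrative.

```java
// Illustrative three-layer Context hierarchy mirroring the Action layering.
import java.util.HashMap;
import java.util.Map;

// Base context shared by all Actions: request-level environment parameters.
class BaseContext {
    private final Map<String, Object> data = new HashMap<>();

    public void put(String key, Object value) {
        data.put(key, value);
    }

    @SuppressWarnings("unchecked")
    public <T> T get(String key) {
        return (T) data.get(key);
    }
}

// Module-level context: adds data shared by the Actions of one module.
class ModuleContext extends BaseContext {}

// Business-level context: adds data visible only to one business line.
class BusinessContext extends ModuleContext {}
```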
② Unified processing of third-party external data
- Using third-party external data requires mature engineering experience to evaluate in advance factors such as call volume, load, performance, and whether to batch or unpack requests. Therefore, we uniformly encapsulate access to all third-party external data as basic Actions, which each business then uses as needed.
③ Full life-cycle management of word table data
- A word table is KV data generated according to business rules or strategies that needs to be loaded into memory for use. Before standardization, word table capabilities were missing to varying degrees in generation, pulling, loading, memory optimization, rollback, degradation, and so on. We therefore designed a word table management framework based on message notification, which implements version management, custom loading, scheduled cleanup, and process monitoring covering the full life cycle, and defines a standardized access mode for businesses.
4.2.2.3 Standardization of the Call Process
Finally, it is the business call process that combines functions and data, and a unified process design pattern is the core means of reusing business functions and improving efficiency. The best way to unify process design is to standardize business processes. For example, calls to third-party interfaces are unified through centralized encapsulation written by the framework engineers, and the timing of interface calls is standardized on the principle of performance first while accounting for load, so that no repeated calls occur.
In practice, we first sort out the standardized functions used by the business logic, then analyze the dependencies between these functions, and finally complete the standard design of the whole business logic process, following the principle of performance first while taking load into account and avoiding repeated calls.
From the horizontal dimension, by comparing the similarities of different business logic processes, we have also distilled some practical experience. Taking the central control module as an example:
- Third-party data of the user dimension is uniformly encapsulated and called after initialization.
- For third-party data of the merchant dimension, data available through batch interfaces is uniformly encapsulated and called after recall, while data without batch interfaces is uniformly encapsulated and called after fine-ranking truncation.
4.3 Technical Framework
4.3.1 Introduction to the Overall Framework
The platform consists of two parts: the platform front end and the platform development framework package. The front end is a Web console used by the three roles of R&D, PM, and QA; its main function is to integrate the services of the engine's platform development framework package for visual interaction. We also gave this platform a name, Camp: it evokes a base camp, with the meaning of empowering the business to climb to its peak. The platform development framework package is integrated by the engine's backend services; it provides engine scheduling, isolation, capability accumulation, information reporting, and other functions, while ensuring that every module keeps the same standard framework and business capability style.
Each online service needs to introduce the platform development framework package, so balancing service performance against platform generality is also something we must focus on. Introducing a platform framework adds to the existing code paths, and in heavy-traffic C-side scenarios, the more general the platform framework and the more diverse its underlying functions, the more performance is compromised compared with simple "bare-metal" code. A balance therefore has to be struck between performance overhead and platform abstraction capability. Based on the characteristics of our own business, we set a safety threshold of at most 5ms TP999 loss for sinking the common capabilities of each business into the framework that serves the upper online services.
To sum up, the overall system architecture design is as follows:
① The Camp platform provides management, control, and display functions and is composed of the following sub-module packages:
- The business visualization package provides static information about the capabilities of each backend system, including name, function description, configuration information, etc., which is used in the requirements assessment stage and the business development stage.
- The full-graph orchestration package: business development is carried out by visually dragging and dropping existing capabilities; the service automatically generates the optimal parallel execution process through full-graph derivation, which is then adjusted for the differences of specific business scenarios. The result is a directed acyclic graph in which nodes represent business capabilities and edges represent the dependencies between them. This graph is dynamically delivered to the corresponding backend service for the execution framework to parse and execute.
- The statistics and monitoring package provides statistics and exception information, such as service capability and dictionary metrics, collected at runtime. It is used to view the performance and exception status of each business capability and to visualize their running state.
② The platform development framework package is introduced by multiple services of the advertising engine to execute the orchestrated business processes and provide services externally. It consists of the following sub-module packages:
- The core package provides two functions. The first is scheduling: it executes the process orchestration files delivered by the platform, runs each business capability serially or in parallel according to the execution order and conditions defined in the DAG, and provides the necessary isolation and reliable performance guarantees, while reporting monitoring data and exceptions. The second is capability collection and reporting: it scans and collects the business capabilities within the system and reports them to the platform's Web service for business orchestration and capability visualization.
- The capability package is the collection of business capabilities. It was defined in section 4.2.2.1 above: "the smallest indivisible unit of business logic is abstracted into the basic unit developed by business engineers, called an Action", also called a capability.
- The component package is the collection of business components, also defined in section 4.2.2.1: "a function that can be implemented and deployed independently, such as the creative capability, is abstracted into a component".
- The toolkit provides the basic functions required by business capabilities, such as the dictionary tool, experiment tool, and dynamic degradation tool used by the engine. Tools are also defined in section 4.2.2.1: a single, simple, non-business function module is abstracted into a tool.
A typical development process is shown in the figure above. After a developer implements a business capability (1), the static information of the capability is collected on the Camp platform (2); meanwhile, the optimal DAG (3) is derived through full-graph dependency derivation. While the online service is running, the engine fetches the latest DAG and serves the latest business process externally (4, 5), and the dynamic information of business execution is reported to the Camp platform (6).
The following sections describe several key technical points in detail, including capability collection and reporting, which supports visualization, and full-graph orchestration and the scheduling engine, which support DAG execution. Finally, this article introduces the unified dictionary encapsulation work in the platform, which is strongly associated with advertising.
4.3.2 Capability Collection and Reporting
To facilitate the management and query of existing business capabilities, the platform development framework package scans the @LppAbility and @LppExtension annotations at compile time and reports the metadata to the Camp platform. Business engineers can then query and visually drag and drop the existing capabilities on the Camp platform.
```java
// Atomic capability (Action)
@LppAbility(name = "POI/Plan/Unit data flattening",
        desc = "aggregation and flattening before budget filtering",
        param = "AdFlatAction.Param", response = "List",
        prd = "no product requirement",
        func = "POI/Plan/Unit data aggregation and flattening capability",
        cost = 1)
public abstract class AdFlatAction extends AbstractNotForceExecuteBaseAction {
}

// Extension point
@LppExtension(name = "data aggregation flattening extension point",
        func = "POI/Plan/Unit data aggregation and flattening",
        diff = "default extension point, no difference between business lines",
        prd = "none",
        cost = 3)
public class FlatAction extends AdFlatAction {
    @Override
    protected Object process(AdFlatAction.Param param) {
        // do something
        return new Object();
    }
}
```
4.3.3 Full-Graph Orchestration
In the advertising engine services, the DAG of each business contains dozens or even hundreds of Actions, and it is difficult to achieve optimal parallelism through traditional manual orchestration or business-experience-driven orchestration. The platform framework package therefore takes a data-driven approach: based on the data dependencies between Actions, the program automatically derives the optimally parallelized DAG — this is full-graph orchestration. Business engineers then make custom adjustments according to their business and traffic scenarios, and the result is dynamically delivered to the service nodes for the scheduling engine to execute. In this way, the optimal parallelism for a scenario is achieved through automatic derivation plus scenario tuning.
① The basic principle of full-graph automatic orchestration
We define the input set of an Action X as the set of fields that X reads during execution:

Input(X) = { field f | Action X reads f during execution }

and the output set of an Action Y as the set of fields that Y writes:

Output(Y) = { field f | Action Y writes f during execution }

We consider Action X to depend on Action Y in either of two situations:
- Input(X) ∩ Output(Y) ≠ ∅, i.e., some input of Action X is produced by Action Y.
- Output(X) ∩ Output(Y) ≠ ∅, i.e., Action X and Action Y write the same field.
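A minimal sketch of how these rules might turn read/write sets into dependency edges; the data structures are illustrative, not the ASM-based implementation described in the next subsection.

```java
// Illustrative: derive "X depends on Y" edges from Action read/write sets.
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ActionMeta {
    final String name;
    final Set<String> inputs;  // fields the Action reads
    final Set<String> outputs; // fields the Action writes

    ActionMeta(String name, Set<String> inputs, Set<String> outputs) {
        this.name = name;
        this.inputs = inputs;
        this.outputs = outputs;
    }
}

class DependencyAnalyzer {
    /** Returns edges x -> {y, ...} meaning "x depends on y". */
    static Map<String, Set<String>> analyze(List<ActionMeta> actions) {
        Map<String, Set<String>> deps = new HashMap<>();
        for (int i = 0; i < actions.size(); i++) {
            ActionMeta x = actions.get(i);
            for (int j = 0; j < actions.size(); j++) {
                if (i == j) continue;
                ActionMeta y = actions.get(j);
                // Rule 1: some input of X is produced by Y.
                boolean readsOutput = !Collections.disjoint(x.inputs, y.outputs);
                // Rule 2: X and Y write the same field, so they must be
                // serialized; this sketch breaks the tie by list order to
                // keep the graph acyclic (a simplifying assumption).
                boolean writesSameField =
                        j < i && !Collections.disjoint(x.outputs, y.outputs);
                if (readsOutput || writesSameField) {
                    deps.computeIfAbsent(x.name, k -> new HashSet<>()).add(y.name);
                }
            }
        }
        return deps;
    }
}
```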
② Overall design of full-graph automatic orchestration
Full-graph automatic orchestration is divided into two modules: the parsing module and the dependency analysis module.
- Parsing module: parses the input and output sets of each Action through bytecode analysis.
- The bytecode analysis uses the open-source tool ASM to simulate the Java runtime stack and maintain the Java runtime local variable table, parsing out the fields that each Action execution depends on and produces.
- Dependency analysis module: uses a reverse analysis method based on tri-color marking to analyze the dependencies between Actions, and prunes the generated graph.
- Dependency pruning: to reduce the complexity of the graph, it is pruned without changing its semantics; for example, a direct edge A→C can be removed when a path A→B→C already implies the same ordering.
③ Benefits of full-graph automatic orchestration
It automatically corrects erroneous manual orchestration and maximizes parallelism. A before-and-after DAG comparison from a real business scenario is shown below:
The two Actions marked in blue operate on the same Map, so executing them concurrently carries a thread-safety risk. Because the method call stack is deep, it was difficult for business engineers to notice this problem, resulting in incorrect parallelization. After full-graph analysis, they are orchestrated into serial execution.
The three groups of Actions marked in green, red, and yellow each contain two Actions with no data dependency between them, which business engineers had orchestrated serially. After full-graph analysis, they are orchestrated in parallel.
4.3.4 Scheduling Engine
The core function of the scheduling engine is to schedule the DAG delivered as described above. To do so, the engine needs the following two capabilities:
- Composition: generate a concrete DAG template graph from the orchestration configuration of the Actions.
- Scheduling: when a traffic request arrives, execute the Actions according to the correct dependencies.
The working principle of the whole scheduling engine is shown as follows:
For the sake of performance, the scheduling engine adopts "static composition + dynamic scheduling" rather than building the graph for each traffic request in real time.
- Static composition: when the service starts, the scheduling engine initializes the Graph templates from the delivered DAG orchestrations and loads them into memory; after startup, the templates of multiple DAGs are resident in memory. When the Web platform dynamically delivers a new graph, the engine composes the latest graph and replaces the old one completely.
- Dynamic scheduling: when a traffic request arrives, the service side selects the corresponding DAG and hands it, together with the context information, to the scheduling engine. The engine executes according to the Graph template, completes graph and node scheduling, and records the whole scheduling process.
Since the advertising engine serves C-side users, it has high requirements for service performance, availability, and scalability, and these three aspects are exactly where the difficulties of scheduling engine design lie. We briefly elaborate on them below.
4.3.4.1 High Performance Practices
The process engine serves C-side services, so compared with traditional hard-coded scheduling, the engine's scheduling performance should be at least equal or within an acceptable loss threshold. Below, we describe our performance practices from two representative aspects: scheduler design and scheduling-thread tuning.
① Scheduler design
Scheduler design answers the question of how nodes are driven to execute one by one: when a node finishes, how do the other nodes learn of it and start executing? In the figure below, how does node A notify nodes B and C to execute after it completes? A common idea is hierarchical (layered) scheduling of nodes, which has the following characteristics:
- A layering algorithm (such as breadth-first traversal) computes in advance the nodes to be executed at each layer; nodes are scheduled batch by batch, without any notification or driving mechanism.
- When a batch contains multiple nodes, a straggler effect easily occurs because the nodes take different amounts of time: the slowest node gates the whole batch.
- It has a performance advantage in scheduling graphs dominated by serial nodes.
Another common idea is a pipeline-style, queue-notification-driven model:
- When a node finishes, it immediately sends a signal to the message queue, and the consumer side executes the successor nodes upon receiving the signal. For example, in the DAG shown above, after B finishes, D and E are notified to start executing, without caring about C's status.
- Since the execution states of sibling nodes are not waited on, the straggler effect of hierarchical scheduling does not occur.
- It has very good parallel performance in scheduling graphs dominated by parallel nodes. However, in graphs dominated by serial nodes, performance is slightly worse due to the extra overhead of thread switching and queue notification.
As shown in the figure above, the scheduling engine currently supports both scheduling models: a hierarchical scheduler is recommended for graphs dominated by serial nodes, and a queue pipeline scheduler for graphs dominated by parallel nodes.
Hierarchical scheduler
Using the layering algorithm mentioned above, nodes are executed batch by batch; serial nodes are executed on a single thread, and parallel nodes are executed in a thread pool, as the sketch below illustrates.
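A minimal sketch of hierarchical scheduling, with layers computed by a breadth-first (Kahn-style) pass over node in-degrees; this is illustrative, not the production engine.

```java
// Illustrative layered DAG scheduler: each batch is the set of nodes whose
// parents have all finished; the slowest node in a batch gates the next one.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class LayeredScheduler {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    /** graph: node -> successors; indegree: node -> number of unfinished parents. */
    void run(Map<String, List<String>> graph, Map<String, Integer> indegree,
             Map<String, Runnable> tasks) throws Exception {
        Deque<String> layer = new ArrayDeque<>();
        indegree.forEach((node, d) -> { if (d == 0) layer.add(node); });
        while (!layer.isEmpty()) {
            List<String> batch = new ArrayList<>(layer);
            layer.clear();
            if (batch.size() == 1) {
                tasks.get(batch.get(0)).run(); // serial node: no thread switch
            } else {
                List<Future<?>> futures = new ArrayList<>();
                for (String node : batch) futures.add(pool.submit(tasks.get(node)));
                for (Future<?> f : futures) f.get(); // wait for the whole batch
            }
            for (String node : batch)
                for (String succ : graph.getOrDefault(node, List.of()))
                    if (indegree.merge(succ, -1, Integer::sum) == 0) layer.add(succ);
        }
        pool.shutdown();
    }
}
```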
Queue pipeline scheduler
Both the outer graph tasks (GraphTask) and the inner node tasks (NodeTask) are executed in thread pools.
Node scheduling mechanism
- The scheduling mechanism covers the process between the consumer side receiving a message and the node being executed. In the DAG below, after receiving the message, a node must complete the following steps: check the DAG execution status, check the parent node status, check the node's execution conditions, modify the execution status, and execute the node, as shown in the figure below:
- There are usually two ways to execute these steps: one is centralized scheduling, where a unified component processes them; the other is decentralized scheduling, where each successor node handles them on its own.
- We adopt centralized scheduling: after a node finishes, messages are sent to the queue, and a task dispatcher on the consumer side is responsible for unified consumption and task distribution.
The reasons are as follows:
- As shown in the figure above, when the three nodes A, B, and C finish at the same time, a series of operations must still happen before node D actually executes. Without lock control in this process, node D would be executed three times, so locks are needed for thread safety. The centralized task dispatcher uses a lock-free queue design, which guarantees thread safety while avoiding the performance overhead of locking.
- When a node has multiple children, some common operations (checking graph/parent node status, exception detection, etc.) would be performed once by each child node, bringing unnecessary system overhead. The centralized task dispatcher instead handles the common operations uniformly and then distributes the child-node tasks.
- In decentralized scheduling, the responsibility of a node is too broad: it must execute the core business code and also handle message consumption, so its responsibility is not single and maintainability is poor.
Therefore, weighing implementation difficulty, maintainability, performance, and other factors, we finally chose centralized scheduling in the actual development of the project, as the sketch below illustrates.
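A minimal sketch of the centralized dispatch idea: a single dispatcher drains a lock-free queue of "node finished" events, performs the common checks once, and releases each successor exactly once. The names are illustrative, not the production code.

```java
// Illustrative centralized dispatcher built on a lock-free queue.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class CentralDispatcher {
    private final ConcurrentLinkedQueue<String> finishedEvents = new ConcurrentLinkedQueue<>();
    private final Map<String, List<String>> successors;       // node -> children
    private final Map<String, AtomicInteger> pendingParents;  // node -> unfinished parents
    private final Map<String, Runnable> tasks;
    private final ExecutorService nodePool = Executors.newFixedThreadPool(8);

    CentralDispatcher(Map<String, List<String>> successors,
                      Map<String, AtomicInteger> pendingParents,
                      Map<String, Runnable> tasks) {
        this.successors = successors;
        this.pendingParents = pendingParents;
        this.tasks = tasks;
    }

    void onNodeFinished(String node) {
        finishedEvents.offer(node); // lock-free enqueue
    }

    // Runs on a single dispatcher thread, so the common checks happen only
    // once per event and no locks are required. (A real loop would block on
    // the queue instead of exiting when it is momentarily empty.)
    void dispatchLoop() {
        String done;
        while ((done = finishedEvents.poll()) != null) {
            for (String child : successors.getOrDefault(done, List.of())) {
                // decrementAndGet is atomic: even if A, B, and C finish
                // together, exactly one event drives the counter to zero,
                // so node D is submitted exactly once.
                if (pendingParents.get(child).decrementAndGet() == 0) {
                    nodePool.submit(() -> {
                        tasks.get(child).run();
                        onNodeFinished(child);
                    });
                }
            }
        }
    }
}
```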
② Scheduling thread tuning
The scheduling engine provides callers with two APIs for DAG execution:
- Asynchronous invocation: the GraphTask is executed by a thread pool, and the Future of the outermost GraphTask is returned to the business side, which can then precisely control the maximum execution time of the DAG. Takeout advertising currently has scenarios that process different advertising businesses within the same request, and with the asynchronous interface the business side can freely combine the scheduling of subgraphs.
- Synchronous invocation: the biggest difference from the asynchronous call is that a synchronous call returns to the caller only after the graph execution has finished or timed out.
The underlying layer currently provides both kinds of schedulers; the details are shown in the figure below, and a minimal sketch of the two call styles follows.
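A minimal sketch of the two call styles; the interface and class names are illustrative assumptions.

```java
// Illustrative engine API offering both call styles.
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class GraphRequest {}
class GraphResult {}

interface GraphEngine {
    Future<GraphResult> executeAsync(GraphRequest request);        // caller bounds the latency
    GraphResult executeSync(GraphRequest request, long timeoutMs); // returns on finish/timeout
}

class Caller {
    void demo(GraphEngine engine) throws Exception {
        // Async: combine subgraphs freely and bound total latency at the call site.
        Future<GraphResult> future = engine.executeAsync(new GraphRequest());
        GraphResult r1 = future.get(50, TimeUnit.MILLISECONDS);

        // Sync: the engine itself enforces the deadline before returning.
        GraphResult r2 = engine.executeSync(new GraphRequest(), 50);
    }
}
```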
As can be seen, the scheduling engine uses thread pools in multiple places for internal task execution. In CPU-intensive services, too many requests or too many nodes will inevitably cause a large number of thread switches and affect the overall performance of the service. For the queue-notification scheduler, we therefore made some scheduling optimizations to bring performance back as close as possible to the level before the scheduling engine was introduced.
- Scheduling thread model tuning
  - In a synchronous call, the main thread does not return immediately but waits for the DAG to complete. The scheduling engine takes advantage of this by having the main thread itself execute the outermost GraphTask, eliminating one thread switch per request.
- Serial node execution optimization
  - As shown in the DAG diagram above, some nodes form serial chains (such as the one-way chain A→B→C→D). When executing these four serial nodes, the scheduling engine does not switch threads; instead, one thread completes the tasks in succession.
  - When serial nodes are executed, the scheduling engine also schedules them serially instead of through queue notification, minimizing system overhead.
4.3.4.2 High Availability Practices
In terms of high availability, we briefly describe our practices in isolation and monitoring; their core principles are shown in the figure below:
① Service Isolation
In advertising scenarios, one service often hosts multiple business lines, and the logic of each line corresponds to one DAG. To isolate the business lines within the same service, we use a single-instance multi-tenant approach. This is because:
- With the process engine running inside the same process, a single-instance solution is easier to manage.
- The internal implementation of the process engine already provides some multi-tenant isolation at the granularity of the graph, so a single-instance solution is preferred for external use.
Except for DAG scheduling and Node scheduling, which are static code, the multi-tenant isolation idea is applied to the storage of graphs, the selection and execution of DAGs, the selection and execution of Node nodes, and the node notification queue of each DAG.
② Scheduling task isolation
Scheduling tasks are mainly divided into graph tasks (GraphTask) and node tasks (NodeTask). One GraphTask corresponds to multiple NodeTasks, and its execution state depends on all of its NodeTasks. The scheduling engine isolates the execution of GraphTasks and NodeTasks through two levels of thread pools.
The reasons for this isolation are:
- Each thread pool has a single responsibility and executes a single kind of task, which makes the corresponding process monitoring and dynamic adjustment more convenient.
- If one thread pool were shared, an instantaneous QPS surge could fill the pool entirely with GraphTasks, making it impossible to submit NodeTasks and deadlocking the scheduling engine.
Therefore, two-level thread pool scheduling is superior to single-level scheduling in both fine-grained thread management and isolation, as the sketch below illustrates.
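A minimal sketch of the two-level isolation; pool sizes and names are illustrative.

```java
// Illustrative two-level thread pools: graph tasks and node tasks never share
// a pool, so a burst of graphs cannot occupy every thread and starve the
// node tasks they are waiting on (the deadlock described above).
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class TwoLevelEngine {
    private final ExecutorService graphPool = Executors.newFixedThreadPool(4);  // outer GraphTasks
    private final ExecutorService nodePool = Executors.newFixedThreadPool(16);  // inner NodeTasks

    Future<?> submitGraph(List<Runnable> nodeTasks) {
        return graphPool.submit(() -> {
            List<Future<?>> futures = new ArrayList<>();
            for (Runnable node : nodeTasks) futures.add(nodePool.submit(node));
            for (Future<?> f : futures) {
                try {
                    f.get(); // the graph's state depends on all of its nodes
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        });
    }
}
```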
③ Process monitoring
The monitoring of DAG scheduling falls into three categories: exceptions, timeouts, and statistics, as follows:
- Exceptions: graph/node execution throws an exception; configurable retry and custom exception handling are supported.
- Timeouts: graph/node execution times out; degradation is supported.
- Statistics: graph/node execution counts and elapsed times, providing data reports for optimization.
4.3.4.3 High Scalability Practices
Advertising business logic is complex, with many experiments, branch judgments, conditional executions, and so on, and advertising services iterate and release very frequently. Therefore, the scheduling engine's first considerations for scalability are how to schedule conditional nodes and how to make orchestration configuration take effect quickly without a release.
① Conditional node execution
For conditional execution of nodes, a Condition expression is declared when the DAG is configured. The scheduling engine dynamically evaluates the expression before executing the node, and the node is executed only when its execution condition is met. A minimal sketch follows.
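A minimal sketch of conditional node execution, modeling the Condition as a predicate over the request context; the names are illustrative assumptions.

```java
// Illustrative conditional node: the engine evaluates the condition per
// request and simply skips the node when it is not met.
import java.util.function.Predicate;

class RequestContext {
    boolean inExperiment; // e.g., set by experiment parsing for this request
}

class ConditionalNode {
    private final Predicate<RequestContext> condition; // evaluated dynamically
    private final Runnable body;

    ConditionalNode(Predicate<RequestContext> condition, Runnable body) {
        this.condition = condition;
        this.body = body;
    }

    void execute(RequestContext ctx) {
        if (condition.test(ctx)) {
            body.run();
        }
    }
}

// Usage: a node that runs only for experiment traffic, e.g.
// new ConditionalNode(ctx -> ctx.inExperiment, bidTask).execute(ctx);
```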
② Dynamic configuration delivery
- As shown earlier, composition and scheduling are decoupled through the intermediate Graph template, so the orchestration configuration can be edited on the Web platform and delivered dynamically to the service.
- Because the scheduling engine uses thread pools in many places during scheduling, we use the company's common components for dynamic thread pool configuration and monitoring, enabling dynamic thread pool updates.
4.3.4.4 Summary of the Scheduling Engine
① Functional aspects
DAG core scheduling
- The scheduling engine provides implementations of two common schedulers, giving better support to different business scenarios.
- The scheduling engine adopts the classic two-level scheduling model, making DAG graph/node scheduling more isolated and controllable.
Node condition execution
- A condition check was added before node scheduling; nodes that do not meet their conditions are not executed. The scheduling engine dynamically determines a node's execution condition based on the context and the traffic.
Timeout handling
- Timeout handling is supported at the DAG, Stage, and Node levels, simplifying the timeout control of internal business logic and handing the initiative to the framework for unified processing, which improves the processing efficiency of internal logic while guaranteeing performance.
Configurable nodes
- The same Node can be used in different business scenarios, but the processing logic of each scenario differs. For this situation, a node configuration function was added: the framework passes a node's configuration into its logic, making nodes configurable.
② Performance
- In DAG scenarios dominated by serial nodes, performance is basically the same as in the original bare-coded mode.
- In DAG scenarios dominated by parallel nodes, pooling introduces some performance loss from thread pool contention and switching. After repeated tuning and CPU hotspot governance, the TP999 loss is controlled within 5ms.
4.3.5 Business Component Layer Accumulation
As defined in section 4.2.2.1, business function modules that can be implemented and deployed independently are abstracted into business components. Extracting highly cohesive, loosely coupled business components from business logic is an important means of improving code reuse. In practice, we found that the logic contained in different business components varies greatly, as do their implementations, designs, and code styles. Therefore, to unify the design and implementation of business components, we built a standardized component framework to reduce the repetitive work of developing new components and to lower the learning and integration costs for users.
The left side of the figure shows the overall framework of business components: a unified common domain and common dependencies at the bottom, the standard implementation process for business components above them, and aspect capabilities supporting the business logic. The right side shows an example of the intelligent bidding component built on this framework. The roles of the framework are:
① Unified common domain and dependency management
- A common domain object is a business entity used by different business components. We extract the common domain objects of the business and provide them to other business components as basic components, reducing the duplication of domain objects across components.
- Business components have many internal and external dependencies. We sorted out and screened the common dependencies uniformly, weighed the various factors, and determined reasonable usage, finally forming a complete and mature dependency framework.
② Unified interface and process
- We abstract a business component into three phases: Prepare for data and environment preparation, Process for the actual computation, and Post for post-processing. Generic template interfaces are designed for each phase, and different combinations of the interfaces implement the different business processes within a component. All interface designs provide both synchronous and asynchronous invocation, as the sketch below illustrates.
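A minimal sketch of the three-phase template; the generics and the asynchronous variant are illustrative assumptions.

```java
// Illustrative Prepare/Process/Post component template.
import java.util.concurrent.CompletableFuture;

interface Prepare<I, D> { D prepare(I input); }  // data & environment preparation
interface Process<D, R> { R process(D data); }   // the actual computation
interface Post<R> { R post(R result); }          // verification / post-processing

abstract class BizComponent<I, D, R> implements Prepare<I, D>, Process<D, R>, Post<R> {
    // Synchronous template: the phase order is fixed, the phase
    // implementations vary per component.
    public final R invoke(I input) {
        return post(process(prepare(input)));
    }

    // Asynchronous variant built on the same phases.
    public final CompletableFuture<R> invokeAsync(I input) {
        return CompletableFuture.supplyAsync(() -> invoke(input));
    }
}
```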
③ Unified aspect capabilities
- All service modules currently use Spring as the development framework, and we use its AOP capability to develop a series of aspect extensions, including log collection, latency monitoring, degradation and rate limiting, data caching, and other functions. These functions are designed as non-intrusive code, reducing the coupling between aspect capabilities and business logic; new business components can fully reuse them through configuration.
The intelligent bidding component is a business component developed on the above framework. It is an abstract aggregation of advertising bidding strategies, including PID, CEM, and other algorithms. The data that the bidding strategies depend on, such as user feature acquisition and experiment information parsing, is implemented with the Prepare template; the specific PID and CEM algorithms are implemented with the Process template; and the Post template verifies the bidding results and monitors the parameters. The common domain objects and third-party dependencies used by the whole component are also centrally managed by the framework.
4.3.6 Tool Package – Dictionary Management
Tools are also defined in section 4.2.2.1: a single, simple, non-business function module is abstracted into a tool. The construction of the toolkit is an important foundation for the effectiveness of advertising platformization; its main role is to handle auxiliary general processes or functions unrelated to business logic. For example, the advertising system needs to load a large amount of KV data into memory for use; this data is called word table files. To manage the whole life cycle of word table files, the advertising platform designed and developed the dictionary management tool, which has accumulated good practical results in business use.
① The design of dictionary management
The figure above shows the overall architecture of the dictionary management platform, which adopts a layered design with five layers from top to bottom:
- Storage layer: mainly used for data storage and flow. Within Meituan, S3 stores the word table files in the cloud, and ZooKeeper stores the word table version information; online services obtain the latest version update events by listening.
- Component layer: each component is an independent functional unit that provides a common interface to the upper layer.
- Plug-in layer: business plug-ins mainly provide unified plug-in definitions with flexible custom implementations. For example, the loader provides a uniform word-table loading and storage function, and each word table can dynamically configure its loader type.
- Module layer: the module layer looks at each link of the word table file's different processes from the business perspective, and modules interact with each other through an event-notification mechanism. For example, the dictionary management module includes word table version management, event listening, word table registration, word table loading/unloading, word table access, and so on.
- Process layer: we define a complete word-table business behavior as a process. The whole life cycle of a word table can be divided into processes such as adding, updating, retiring, and rolling back a word table. A minimal sketch of the core serving idea follows.
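A minimal sketch of the core serving idea: listen for version updates, load the new word table off the serving path, and swap it in atomically so reads never block; the listener and loader interfaces are illustrative, not the real platform API.

```java
// Illustrative word table manager: version listening + atomic swap.
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

interface DictLoader {
    Map<String, String> load(String version) throws Exception;
}

class DictManager {
    private final AtomicReference<Map<String, String>> current =
            new AtomicReference<>(Map.of());
    private final DictLoader loader;
    private volatile String servingVersion = "none";

    DictManager(DictLoader loader) {
        this.loader = loader;
    }

    // Called by the version listener (e.g., on a ZooKeeper watch event).
    void onNewVersion(String version) {
        try {
            Map<String, String> fresh = loader.load(version); // off the serving path
            current.set(fresh);                               // atomic swap
            servingVersion = version;
        } catch (Exception e) {
            // Load failed: keep serving the old version (degradation/rollback).
        }
    }

    String get(String key) {
        return current.get().get(key);
    }

    String servingVersion() {
        return servingVersion;
    }
}
```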
② The business benefits of dictionary management
The main benefits of the platform's dictionary management tool in business practice are:
- A more flexible service architecture: word-table processes are transparent to users, who do not need to pay attention to the word-table flow and access it through a unified API.
- Unified business capabilities: a unified version management mechanism, a unified storage framework, and unified word tables and loaders.
- High system availability: rapid recovery and degradation capabilities, resource and task isolation, multi-priority processing, and other supporting functions.
4.4 New Production-Research Process
As mentioned above, because the advertising business involves many business lines with many upstream and downstream systems, the existing business logic has become extremely complex after several years of rapid engineering and strategy iteration, and some process problems have gradually become prominent in daily iteration.
① Difficulty for PMs to obtain information
During product research and design, PMs are not very clear about the current logic of the modules involved and often resolve this by consulting developers offline, hurting the efficiency of both sides. Moreover, pure product design documents describe things from the business perspective and by process; at every review, it is difficult for QA and R&D staff to grasp the points and scope of change intuitively, and a lot of time is spent communicating to confirm boundary compatibility with the existing logic.
② R&D functional evaluation depends entirely on experience
During scheme design, it is difficult for engineers to directly learn whether horizontally related modules already have similar function points (reusable or extensible), resulting in a low reuse rate. Meanwhile, project scheduling relies entirely on personal experience with no unified reference standard, which often leads to project delays caused by inaccurate workload estimates.
③ Inefficient QA testing and evaluation
QA relies entirely on the engineers' (RD's) technical plan when evaluating functional scope, and the scope and boundary of a change are mostly confirmed through verbal communication. Besides affecting efficiency, some test problems are pushed to late in the project cycle, affecting the project's progress. Meanwhile, the management of the platform's basic JAR packages depends entirely on manual work, and some Actions, especially basic Actions, have no unified test standard. The above problems can be summarized as follows:
4.4.1 Goals
With the help of the platform, the new production-research process is implemented across the whole project delivery cycle (as shown in the figure below) to solve the problems encountered by product, R&D, and test staff during iteration, empower the business, and improve the overall delivery efficiency and quality of projects.
4.4.2 Thinking and Implementation
We implemented the platform-based new production-research process, that is, using Stage/Action to drive the delivery of the whole project, as shown in the figure below:
- For PM (product): build Stage/Action visualizations and apply them in project design.
- For RD (R&D): uniformly adopt the new Stage/Action-based mode for scheme design, development, and schedule estimation.
- For QA (testing): unify the communication and collaboration language around Stage/Action, and drive improvements in the related testing methods and tools.
4.4.2.1 Product Side
The figure below shows the application and practical effect of the production-research function construction. The first two images show the visualization of business capabilities, giving PMs a visual way to understand each business's latest process and detailed Action capabilities. The third shows the survey and function description of the related business in product design. (For data security reasons, the screenshots below use examples from non-real projects.)
4.4.2.2 R&D Side
According to the stages of R&D work in the project development cycle, we formulated process specifications for before and after code development to ensure that engineers make full use of the platform's capabilities for design and development throughout the development cycle.
Before development
- Technical design: based on the existing Action functions involved in each business and the Action DAG visualization capability, conduct research into and reuse evaluation of horizontal businesses, and carry out the technical design of new or changed Action functions.
- Project scheduling: standardize the evaluation of development workload based on the new, changed, and reused Action capabilities in the technical design, as well as the Action hierarchy.
After development
- Action accumulation: the system reports and periodically evaluates the reuse and extension of platform Action capabilities.
- Process feedback: track each project on the platform, quantitatively report the relevant indicators of the delivery process, and collect feedback from project staff.
4.4.2.3 Test Side
- Stage/Action as the unified communication and collaboration language: Stage/Action is adopted as the language for function description and design in multi-party project activities such as requirement design and review, scheme design and review, and test case writing and review. This surfaces problems as early as possible in the process and makes the changes and test contents clearer to all participants, supporting QA in better evaluating the test scope and thus better ensuring the quality of project testing.
- Promote full UT coverage of basic Actions: build unit tests for basic Actions, automatically trigger the unit test pipeline when code is merged, output the success rate and coverage of the test run, and evaluate the indicator baseline to ensure the efficiency and quality of sustainable testing.
- Improve JAR management tools and automated analysis and testing: first-level Actions are written into the platform JAR package, which is managed like a public JAR package, with dedicated management and maintenance tools. This solves the automated unit-test coverage problem of public JAR upgrades, and replaces the inefficient manual analysis and maintenance of tests on every JAR version upgrade with end-to-end integration test automation.
5 Effects
① Production-research efficiency improvements
System capability accumulation
- All business lines of takeout advertising have completed the platform architecture upgrade and continue to run and iterate on this architecture.
- 50+ basic business capabilities, 140+ module common capabilities, and 500+ product-line common capabilities have been accumulated.
Staff efficiency improvements
- R&D efficiency: after all business lines migrated to the platform architecture, there were 20+ major business iterations, and business iteration efficiency improved by 28+% overall compared with before. The effect is especially obvious for new business onboarding, where the same functions no longer need to be developed repeatedly:
  - capabilities were reused 500+ times, with a capability reuse ratio of 52+%;
  - in new service onboarding scenarios, Action reuse reached 65+%.
- Test automation indicators: with JAR automated analysis, integration testing, and process coverage construction, the coverage of ad automated testing increased by 15%, test efficiency increased by 28%, and the comprehensive automation score also improved significantly.
② Delivery quality improvements and product empowerment
- Action-based change descriptions and a clear visualization of business links helped QA assess the scope of impact more accurately, with the numbers of in-process and online issues both decreasing by about 10%.
- Visualizing system capabilities increased the transparency of the system, effectively helping product staff understand the system's existing capabilities during product research and reducing problems such as business consultation overhead and cross-product-line knowledge barriers (see 4.4.2.1 for details).
6 Summary and Outlook
This article has introduced the thinking and implementation of takeout advertising platformization in construction and practice from three aspects: standardization, the framework, and the new production-research process. After two years of exploration, construction, and practice, the Meituan takeout advertising platform has begun to take shape and effectively supports the rapid iteration of multiple business lines.
In the future, platformization will refine the granularity of standardization to reduce business development costs, and deepen the framework's capabilities while continuously improving stability, performance, and ease of use. In addition, we will continue to optimize the user experience, improve the operating mechanism, and keep improving the production-research iteration process.
The above are some explorations and practices of takeout advertising in business platformization. Explorations in other fields, such as advertising engineering architecture, will follow; please look forward to the next articles in the series.
7 About the Authors
Le Bin, Guoliang, Yulong, Wu Liang, Lei Xing, Wang Kun, Liu Yan, Siyuan, and others, all from the Meituan takeout advertising technology team.
Recruitment information
The Meituan takeout advertising technology team continues to recruit for a large number of positions and sincerely seeks advertising backend/algorithm development engineers and experts, based in Beijing. If you are interested, you are welcome to send your resume to [email protected] (email subject: Meituan takeout advertising technology team).
This article was produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication; please credit "Content reproduced from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial use, please email [email protected] to request authorization.