Serverless is currently a hot topic in technology, and major cloud platforms and Internet companies are actively building Serverless products. This article shares practical experience from rolling out Meituan’s Serverless product, covering technology selection, detailed system design, stability optimization, the surrounding product ecosystem, and adoption within Meituan. Although every company’s background is different, some ideas and methods carry over, and we hope they offer some inspiration or help.

1. Background

The term “Serverless” was first proposed in 2012 and became widely known in 2014 with the rise of Amazon’s AWS Lambda serverless computing service. The name literally means “without servers.” Serverless computing lets users build and run applications without having to think about servers. The application still runs on servers, but all server administration is handled by the Serverless platform: requesting machines, releasing code, handling machine failures, scaling instances out and in, data center disaster recovery, and so on are all completed automatically by the platform. Business developers only need to consider the implementation of business logic.

Looking back at the evolution of the computing industry, infrastructure has moved from physical machines to virtual machines and then to containers, while service architecture has moved from traditional monolithic applications to SOA and then to microservices. Viewed along these two main lines, the overall trend is the same: both are evolving from large to small, from monolithic to micro. The underlying driver of this evolution is solving one of two problems, resource cost or R&D efficiency. Serverless is no exception and is designed to address both:

  • Resource utilization: Serverless products support fast elastic scaling, which helps the business improve resource utilization. When business traffic peaks, computing capacity expands automatically to carry more user requests; when traffic drops, the resources in use shrink accordingly to avoid waste.
  • R&D and operations efficiency: On Serverless, developers only need to fill in the code path or upload the code package, and the platform completes the build and deployment. Developers do not deal with machines directly. Machine management, machine health, and whether capacity must be adjusted for traffic peaks and troughs are all handled by the Serverless product. This frees developers from tedious operations work, lets them move from DevOps to NoOps, and lets them focus on implementing business logic.

Although AWS launched its first Serverless product, Lambda, in 2014, adoption of Serverless technology in China remained lukewarm for years. In the past two or three years, however, driven by containers, Kubernetes, and cloud-native technology, Serverless technology has developed rapidly, and the major domestic Internet companies are actively building Serverless products and exploring how to put the technology into practice. In this context, Meituan started building its Serverless platform in early 2019 under the internal project name Nest.

The Nest platform has now been under construction for two years. Looking back, the work went through three main stages:

  • Rapid verification, landing the MVP version: Through technology selection, product and architecture design, and iterative development, we quickly delivered the basic capabilities of a Serverless product, such as build, release, elastic scaling, connecting to event sources, and function execution. After launch, we onboarded a few pilot businesses to help verify and polish the product.
  • Optimize core technology to ensure business stability: The pilot businesses soon exposed stability problems in the product, including the stability of elastic scaling, cold start speed, system and business availability, and container stability. We carried out targeted optimization of the technical points behind each problem.
  • Improve the technology ecosystem and realize the benefits: After the core technical points were optimized, the product became mature and stable, but it still faced ecosystem problems such as missing R&D tools, missing integration with upstream and downstream products, and insufficient platform openness, all of which hindered promotion and use. We therefore kept improving the ecosystem around the product, removed barriers to onboarding and use, and realized the product’s business benefits.

2. Rapid verification, landing the MVP version

2.1 Technical selection

The first problem in building the Nest platform was technology selection. Three key choices were involved: evolution route, infrastructure, and development language.

2.1.1 Evolution route

Initially, Serverless services mainly comprised FaaS (Function as a Service) and BaaS (Backend as a Service). In recent years, the Serverless product range has expanded to include application-oriented Serverless services.

  • FaaS: A function service that runs in a stateless computing container. Functions are usually event-driven, short-lived (possibly a single invocation), and managed entirely by a third party. FaaS products in the industry include AWS Lambda and Alibaba Cloud Function Compute.
  • BaaS: Backend services built on the cloud service ecosystem. BaaS products in the industry include AWS S3 and DynamoDB.
  • Application-oriented Serverless services: For example, Knative provides full hosting from code package to image build and deployment, together with elastic scaling. Public cloud products include Google Cloud Run (based on Knative) and Alibaba Cloud SAE (Serverless Application Engine).

Inside Meituan, BaaS products are in effect the internal middleware and underlying services, which have become rich and mature after years of development. Meituan’s Serverless evolution therefore focuses on two directions: function computing services and application-oriented Serverless services. Which should come first? Our main consideration was that FaaS function computing was more mature and better established in the industry than application-oriented Serverless services. We therefore settled on the route of “building FaaS function computing services first, and then building application-oriented Serverless services.”

2.1.2 Infrastructure

Since elastic scaling is a required capability of a Serverless platform, Serverless inevitably involves scheduling and managing the underlying resources. This is why many open source Serverless products (such as OpenFaaS, Fission, Nuclio, and Knative) are built on Kubernetes, which lets them take full advantage of its infrastructure management capabilities. Meituan’s internal infrastructure product is Hulk. Although Hulk is built on top of Kubernetes, for reasons including rollout difficulty it did not use Kubernetes in the native way when it was first introduced, and it adopted the rich container model at the container layer.

Given this history, we faced two options for infrastructure: use the company’s Hulk (non-native Kubernetes) as Nest’s infrastructure, or use native Kubernetes. We judged that native Kubernetes is the mainstream trend in the industry, and that using it natively lets us make full use of Kubernetes’ built-in capabilities and reduce duplicated development. As a result, we adopted native Kubernetes as our infrastructure.

2.1.3 Development Language

Golang is the dominant language in the cloud-native space, and overwhelmingly so in the Kubernetes ecosystem. At Meituan, however, Java is the most widely used language and has a better internal ecosystem than Golang, so we chose Java. When Nest development began, the Kubernetes community’s Java client was incomplete, but it has matured as the project progressed and is now fully functional. Along the way, we also contributed some pull requests back to the community.

2.2 Architecture Design

Based on the evolution route, infrastructure, and development language chosen above, we designed the architecture of the Nest product.

In the overall architecture, traffic reaches the Nest platform from event sources (such as Nginx, application gateways, scheduled tasks, message queues, and RPC calls). The Nest platform routes the traffic to a specific function instance according to the traffic’s characteristics and triggers execution of the function. The function’s code can call the company’s various BaaS services, and the function finally completes and returns the result.

In terms of technical implementation, the Nest platform uses Kubernetes as its foundation and borrows some of Knative’s design ideas where appropriate. Its internal architecture consists of the following core parts:

  • Event gateway: Its core responsibility is routing traffic from external event sources to function instances. The gateway also collects traffic statistics for each function, which provide the data behind the scaling decisions of the elastic scaling module.
  • Elastic scaling: Its core responsibility is scaling function instances out and in. It computes the target number of instances for a function from the function’s runtime traffic data and the configured per-instance threshold, then adjusts the number of instances through Kubernetes’ resource control capabilities.
  • Controller: Its core responsibility is implementing the control logic for the Kubernetes CRDs (Custom Resource Definitions).
  • Function instance: A running instance of a function. When traffic arrives from the event gateway, the corresponding function code is executed inside the function instance.
  • Governance platform: A user-facing platform responsible for building, versioning, and releasing functions, and for managing function metadata.

2.3 Process design

How does Nest’s CI/CD process differ from the traditional model? To illustrate, consider the overall life cycle of a function on the Nest platform. There are four phases: build, version, deploy, and scale.

  • Build: The developed code and configuration are built into an image or executable file.
  • Version: The image or executable produced by the build, together with the release configuration, forms an immutable version.
  • Deploy: The version is released, which completes the deployment.
  • Scale: Function instances are scaled out and in according to their traffic and load.

Across these four phases, the essential differences between Nest and a traditional CI/CD process lie in deployment and scaling. Traditional deployment is machine-aware and typically publishes the code package to specific machines, whereas Serverless hides machines from the user (at deployment time, the function may still have zero instances). In addition, the traditional model generally has no dynamic scaling, whereas a Serverless platform scales capacity dynamically according to the business’s own traffic. Elastic scaling is covered in detail in later sections, so only deployment design is discussed here.

The core question in deployment is how to hide machines from the user. To solve this, we abstracted machines away and introduced the concept of a group, which consists of three pieces of information: SET (the unitized-architecture identifier, carried on the machine), swimlane (the test-environment isolation identifier, carried on the machine), and region (Shanghai, Beijing, and so on). Users deploy by operating on the appropriate groups and never touch specific machines. The Nest platform manages machine resources on the user’s behalf, and each deployment initializes the corresponding machine instances in real time based on the group information.
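To make the grouping idea concrete, here is a minimal Python sketch, not Nest's actual data model. The names `Group`, `Deployment`, and `Platform` are hypothetical; the only facts taken from the text are that a group combines SET, swimlane, and region, that users deploy against groups rather than machines, and that a function may have zero instances right after deployment.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Group:
    """Hypothetical grouping key: users target this, never a machine."""
    set_name: Optional[str]   # unitized-architecture (SET) identifier
    swimlane: Optional[str]   # test-environment isolation identifier
    region: str               # e.g. "shanghai" or "beijing"


@dataclass
class Deployment:
    function: str
    version: str
    instances: int = 0        # a fresh deployment may have no instances yet


class Platform:
    """Toy stand-in for the platform's deployment bookkeeping."""

    def __init__(self) -> None:
        self._deployments: Dict[Tuple[str, Group], Deployment] = {}

    def deploy(self, function: str, version: str, group: Group) -> Deployment:
        # The platform, not the user, decides which machines back the group.
        deployment = Deployment(function, version)
        self._deployments[(function, group)] = deployment
        return deployment

    def lookup(self, function: str, group: Group) -> Optional[Deployment]:
        return self._deployments.get((function, group))
```

Because `Group` is a frozen dataclass, two groups with the same SET, swimlane, and region compare equal and can serve as dictionary keys, which is all the deployment record needs.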

2.4 Function Trigger

Function execution is triggered by events. Triggering a function involves the following four steps:

  • Traffic ingestion: The event gateway’s information is registered with the event source so that traffic is directed to the event gateway. For an MQ event source, for example, MQ traffic is brought into the event gateway by registering an MQ consumer group.
  • Traffic adaptation: The event gateway adapts the traffic coming in from the event source.
  • Function discovery: The process of obtaining function metadata (function instance information, configuration information, and so on), similar to service discovery for microservices. Event traffic received by the event gateway must be sent to a specific function instance, which requires discovering the function. In essence, discovery here means reading the information stored in Kubernetes built-in resources or CRD resources.
  • Function routing: The process of routing event traffic to a specific function instance. To support traditional routing logic (SET, swimlane, region) as well as version routing, we adopted multi-layer routing: the first layer routes to the group (SET, swimlane, region) and the second layer routes to a specific version. Among instances of the same version, a load balancer selects the specific instance. With version routing in place, canary and blue-green releases are easy to support.
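The multi-layer routing described in the last step can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not Nest's gateway code: the `Router` class, the per-version weights, and round-robin instance selection are all hypothetical choices made for the example. The text only states that routing goes group first, then version, then a load-balanced instance.

```python
import random
from typing import Dict, List, Optional, Tuple

GroupKey = Tuple[str, str, str]  # (SET, swimlane, region)


class Router:
    """Hypothetical router: group first, then version, then instance."""

    def __init__(self, rng: Optional[random.Random] = None) -> None:
        self._rng = rng or random.Random()
        self._weights: Dict[GroupKey, Dict[str, int]] = {}
        self._instances: Dict[Tuple[GroupKey, str], List[str]] = {}
        self._cursor: Dict[Tuple[GroupKey, str], int] = {}

    def set_version_weights(self, group: GroupKey, weights: Dict[str, int]) -> None:
        # Assumed mechanism: traffic share per version, e.g. 90/10 for a canary.
        self._weights[group] = dict(weights)

    def register(self, group: GroupKey, version: str, address: str) -> None:
        self._instances.setdefault((group, version), []).append(address)

    def route(self, group: GroupKey) -> str:
        # Layer 1: the request's group selects the candidate versions.
        weights = self._weights[group]
        versions = list(weights)
        # Layer 2: pick a version according to its traffic weight.
        version = self._rng.choices(
            versions, weights=[weights[v] for v in versions]
        )[0]
        # Within a version, spread requests round-robin over its instances.
        key = (group, version)
        instances = self._instances[key]
        index = self._cursor.get(key, 0)
        self._cursor[key] = index + 1
        return instances[index % len(instances)]
```

Under this sketch, a canary release would set weights such as `{"v1": 90, "v2": 10}`, and a blue-green switch would flip them from `{"v1": 100, "v2": 0}` to `{"v1": 0, "v2": 100}` in one step.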

2.5 Function Execution

Functions differ from traditional services: a traditional service is an executable program, whereas a function is a code snippet that cannot run by itself. So how is the function executed once traffic reaches the function instance?

The first problem in function execution is the environment the function runs in. Since the Nest platform is based on Kubernetes, a function must run in a Kubernetes Pod (instance). Inside the Pod is the container, and inside the container is the runtime. The runtime is the entry point that receives function traffic, and it is the runtime that finally triggers execution of the function. Everything looked straightforward, but we still ran into difficulties during implementation. The main one was letting developers seamlessly use the company’s components within functions, such as Octo (service framework), Celler (caching system), and DB.

In Meituan’s technology architecture, after years of accumulated technology, it is difficult to run the company’s business logic in a pure container (one with no other dependencies), because the company’s containers carry many environment and service governance capabilities, such as the service governance Agent, instance environment configuration, and network configuration.

Therefore, so that businesses can seamlessly use company components within functions, we reuse the company’s container system to reduce the cost of writing functions. Reusing it was not easy, because nobody in the company had tried this before: Nest is the company’s first platform based on native Kubernetes, and the first mover always hits some pitfalls. We could only deal with each obstacle as we met it and solve them one by one. In summary, the key was to integrate with the CMDB and other technical systems during container startup, so that a container running a function is no different from a machine that developers would normally request.

2.6 Elastic scaling

There are three core problems of elastic scaling: when, how much, and how fast? That is, scaling timing, scaling algorithm, scaling speed.

  • Scaling timing: The expected number of instances of a function is calculated in real time from traffic metrics. The traffic metrics come from the event gateway, which mainly records each function’s concurrency; the elastic scaling component actively pulls these metrics from the event gateway once per second.
  • Scaling algorithm: expected number of instances = concurrency / single-instance threshold. Based on the collected metrics and the threshold configured by the business, the algorithm computes the expected number of instances, which is then set through the Kubernetes interface. The algorithm looks simple, but it is very stable and robust.
  • Scaling speed: This depends on the cold start time, which will be explained in more detail in the next section.
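The scaling formula above amounts to a single division. The following Python sketch illustrates it; rounding up to a whole instance is an assumption made here (the text gives only the ratio), and the function name is hypothetical.

```python
import math


def expected_instances(concurrency: float, single_instance_threshold: float) -> int:
    """Expected instances = concurrency / single-instance threshold.

    Rounding up is an assumption of this sketch, so that any non-zero
    load gets at least one instance.
    """
    if single_instance_threshold <= 0:
        raise ValueError("single_instance_threshold must be positive")
    return math.ceil(concurrency / single_instance_threshold)
```

For example, with a single-instance threshold of 100, a measured concurrency of 250 yields 3 expected instances, and a concurrency of 0 yields 0.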

In addition to basic scale-out and scale-in, we also support scaling to zero and configuring maximum and minimum numbers of instances (minimum instances are reserved instances). Scaling to zero is implemented with an activator module inside the event gateway. When a function has no instances, its request traffic is cached inside the activator, and the traffic metrics immediately drive the elastic scaling component to scale out. Once the new instance has started, the activator replays the cached requests to it to trigger function execution.
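The scale-to-zero flow can be sketched as a small buffer-and-replay component. This is a simplified, single-threaded Python sketch under assumptions: the `Activator` class, its method names, and the `invoke` and `request_scale_up` callbacks are hypothetical, and a real gateway would also need timeouts and concurrency control that are omitted here.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, List, Optional

Invoke = Callable[[str, Any], Any]  # (instance address, request) -> result


class Activator:
    """Hypothetical activator: buffers requests while a function has no instances."""

    def __init__(self, request_scale_up: Callable[[str], None]) -> None:
        self._request_scale_up = request_scale_up
        self._pending: Dict[str, Deque[Any]] = {}

    def handle(self, function: str, request: Any,
               instances: List[str], invoke: Invoke) -> Optional[Any]:
        if instances:
            # Normal path: an instance exists, so invoke it directly.
            return invoke(instances[0], request)
        # Scale-from-zero path: cache the request and ask for capacity.
        self._pending.setdefault(function, deque()).append(request)
        self._request_scale_up(function)
        return None

    def on_instance_ready(self, function: str, instance: str,
                          invoke: Invoke) -> List[Any]:
        # Replay cached requests, in arrival order, to the new instance.
        queue = self._pending.pop(function, deque())
        return [invoke(instance, request) for request in queue]
```

In this sketch the caller is told nothing until the instance is ready; how the real platform holds the caller's connection during the cold start is not described in the text.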

3. Optimize core technologies to ensure business stability

3.1 Elastic scaling optimization

The scaling timing, scaling algorithm, and scaling speed described above are all idealized models. Scaling speed in particular is a limit: current technology simply cannot scale capacity within milliseconds. In real online scenarios, elastic scaling may therefore fall short of expectations, for example by scaling instances too frequently or by not scaling out in time, which makes the service unstable.

  • To address frequent scaling of instances, the elastic scaling component maintains a sliding window of metric data and smooths the metric by taking its mean, and it mitigates flapping by delaying scale-in while scaling out in real time. We also added a scaling strategy based on the QPS metric, because QPS is more stable than concurrency.
  • To address scaling out too late, we scale out in advance: expansion is triggered when load reaches 70% of the single-instance threshold, which alleviates the problem considerably. We also support hybrid scaling on multiple metrics (concurrency, QPS, CPU, memory), scheduled scaling, and other strategies to meet various business requirements.
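The two mitigations above, mean smoothing over a sliding window and scaling out early at a fraction of the threshold, can be combined in a short sketch. This Python example is illustrative only: the class name, the window length, and rounding up are assumptions, and the delayed scale-in logic mentioned in the text is deliberately left out to keep the sketch small.

```python
import math
from collections import deque
from typing import Deque


class SmoothedScaler:
    """Hypothetical scaler: sliding-window mean plus early scale-out."""

    def __init__(self, threshold: float, utilization: float = 0.7,
                 min_instances: int = 0, window: int = 5) -> None:
        # Scale out once an instance reaches `utilization` of its threshold.
        self._capacity = threshold * utilization
        self._min_instances = min_instances
        self._samples: Deque[float] = deque(maxlen=window)

    def observe(self, concurrency: float) -> int:
        """Record one per-second sample and return the target instance count."""
        self._samples.append(concurrency)
        mean = sum(self._samples) / len(self._samples)
        return max(self._min_instances, math.ceil(mean / self._capacity))
```

With a threshold of 100, a utilization of 0.7, and a minimum of 4 instances, low traffic keeps the target at 4, while a brief dip after a spike does not immediately pull the target down because the window mean changes slowly.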

The following figure shows a real case of online elastic scaling (configured with a minimum of 4 instances, a single-instance threshold of 100, and a threshold utilization of 0.7). The upper part shows the business’s requests per second, and the lower part shows the scaling decisions for the number of instances. The business handled the traffic peak smoothly while maintaining a 100% success rate.

3.2 Cold start optimization

Cold start refers to the steps on the function call path that precede execution: resource scheduling, image or code download, container startup, runtime initialization, user code initialization, and so on. Once the cold start is complete and the function instance is ready, subsequent requests are executed by the function directly. Cold start is crucial in the Serverless space, because the time it takes determines the speed of elastic scaling.

As the martial-arts saying goes, speed is the one skill that cannot be beaten, and this holds for Serverless as well. If an instance could be brought up within milliseconds, almost every function could be scaled down to zero and scaled out only when traffic arrives, which would greatly reduce machine costs for businesses whose traffic has pronounced peaks and troughs. The ideal is appealing, but reality is harsher: reaching the millisecond level is almost impossible. Still, the shorter the cold start, the lower the cost, and a very short cold start also greatly benefits availability and stability during scaling.

Cold start optimization is a step-by-step process. Ours went through three stages: image startup optimization, resource pool optimization, and core path optimization.

  • Image startup optimization: We made targeted optimizations to the time-consuming steps of image startup (container startup and runtime initialization), mainly the container I/O rate limit, the startup time of some special Agents, and data copying between the startup disk and the data disk. This reduced the system’s startup time from 42s to about 12s.

  • Resource pool optimization: At 12s, image startup had almost reached its bottleneck, with little room for further optimization. Could we bypass the time-consuming image startup altogether? We adopted a fairly simple “trade space for time” approach, a resource pool: some already started instances are cached, and when scale-out is needed, instances are taken directly from the pool, bypassing the step of starting a container from the image. The effect was obvious: system startup time dropped from 12s to 3s. Note that the resource pool itself is managed by a Kubernetes Deployment, so instances taken out of the pool are replenished automatically and immediately.

  • Core path optimization: On top of the resource pool, we optimized the code download and decompression steps of startup. We used high-performance compression and decompression algorithms (LZ4 and Zstd) together with parallel download and decompression, with very good results. We also support sinking common logic (middleware, dependency packages, and so on) into the platform so that it can be preloaded. Together these brought the end-to-end startup time down to 2s, which means scaling out a function instance takes only 2s (including function startup). Excluding the function’s own initialization time, the platform-side time is already at the millisecond level.
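The resource pool stage can be illustrated with a small warm-pool sketch. This Python example is a simplification under assumptions: the `WarmPool` class and the `start_instance` callback are hypothetical, and replenishment here is done synchronously inside `acquire`, whereas the text says the real pool is refilled by a Kubernetes Deployment.

```python
from collections import deque
from typing import Callable, Deque


class WarmPool:
    """Hypothetical pool of pre-started instances ("space for time")."""

    def __init__(self, size: int, start_instance: Callable[[], str]) -> None:
        self._size = size
        self._start_instance = start_instance   # the slow, cold path
        self._idle: Deque[str] = deque()
        self._replenish()

    @property
    def idle_count(self) -> int:
        return len(self._idle)

    def acquire(self) -> str:
        # Fast path: hand out an already started instance if one is idle;
        # fall back to a cold start only when the pool is empty.
        instance = self._idle.popleft() if self._idle else self._start_instance()
        self._replenish()
        return instance

    def _replenish(self) -> None:
        # Stand-in for the controller that keeps the pool at its target size.
        while len(self._idle) < self._size:
            self._idle.append(self._start_instance())
```

The design trade-off is the one the text names: idle instances cost resources, but scale-out no longer waits for image startup.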

3.3 High Availability Guarantee

High availability usually refers to the availability of the platform itself, but for Nest it also covers the functions hosted on the platform. Nest’s high availability guarantees therefore address both the platform and the business functions.

3.3.1 High availability of the platform

For platform high availability, Nest provides guarantees at four levels: the architecture layer, the service layer, the monitoring and operations layer, and the business perspective layer.

  • Architecture layer: For stateful services such as the elastic scaling module, we use a master-slave architecture in which a slave node takes over immediately when the master node fails. We have also implemented several layers of isolation in the architecture. Horizontal geographic isolation: the Kubernetes clusters in the two regions are strongly isolated, while the service clusters (event gateway, elastic scaling) in the two regions are weakly isolated (elastic scaling in Shanghai is only responsible for scaling businesses in the Shanghai Kubernetes cluster, whereas the event gateway needs to make cross-region calls and must access both Kubernetes clusters). Vertical business line isolation: services are strongly isolated by business line, with different business lines using different service clusters, while resources at the Kubernetes layer are weakly isolated between business lines using namespaces.

  • Service layer: This mainly concerns the event gateway service. Since all function traffic passes through the event gateway, its availability is particularly important. At this layer we support rate limiting and asynchronous processing to keep the service stable.
  • Monitoring and operations layer: We improve system monitoring and alerting, review the core call paths, and push the parties we depend on to carry out governance. We also review our SOPs regularly and run fault injection drills on the fault drill platform to uncover hidden problems in the system.
  • Business perspective layer: We developed an always-on, real-time inspection service that simulates request traffic to user functions and thereby detects in real time whether the system’s core path is working normally.

3.3.2 High availability of the business

For business high availability, Nest provides guarantees at two levels: the service layer and the platform layer.

  • Service layer: Supports business degradation and rate limiting. When a backend function fails, a degradation configuration can be used to return a degraded result. For abnormal function traffic, the platform can limit the traffic so that backend function instances are not overwhelmed.
  • Platform layer: Supports instance keep-alive, multi-level disaster recovery, and rich monitoring and alerting. When a function instance becomes abnormal, the platform automatically isolates it and immediately scales out a new instance. The platform supports multi-region deployment, and function instances within one region are spread across different data centers as far as possible. When a host, data center, or region fails, new instances are immediately rebuilt on an available host, data center, or region. The platform also automatically monitors business metrics such as latency, success rate, instance scaling, and request count, and triggers an alert to the business developers and administrators when these metrics do not meet expectations.
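The service-layer protections, rate limiting and degradation, can be sketched as follows. The text does not say which limiting algorithm Nest uses; the token bucket here is an assumed stand-in, and `TokenBucket`, `invoke_with_protection`, and the `RATE_LIMITED` marker are hypothetical names.

```python
import time
from typing import Any, Callable

RATE_LIMITED = "rate_limited"  # hypothetical marker for rejected requests


class TokenBucket:
    """Assumed limiter: refills `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


def invoke_with_protection(function: Callable[[Any], Any], request: Any,
                           limiter: TokenBucket, degraded_result: Any) -> Any:
    if not limiter.allow():
        # Abnormal traffic is cut off before it reaches function instances.
        return RATE_LIMITED
    try:
        return function(request)
    except Exception:
        # The function failed: return the configured degraded result.
        return degraded_result
```

The clock is injectable so that the limiter can be exercised deterministically in tests without sleeping.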

3.4 Container stability optimization

As mentioned earlier, Serverless differs from the traditional model in its CI/CD process. In the traditional model, machines are prepared in advance and the application is then deployed onto them. Serverless instead scales instances out and in elastically, in real time, following the peaks and troughs of traffic, and a newly scaled-out instance starts handling business traffic immediately. This sounds fine, but it caused problems in the rich container ecosystem: we found that newly scaled-out machines were under very heavy load, causing some business requests to fail and affecting business availability.

Analysis showed that the main cause was that, after a container starts, the operations tooling performs Agent upgrades, configuration changes, and similar operations that consume a great deal of CPU. Running in the same rich container, these operations compete with the function process for resources and make the user process unstable. The problem is made worse by the fact that a function instance is typically allocated far fewer resources than a traditional service machine. Following industry practice and working with the container infrastructure team, we therefore introduced lightweight containers: all operations Agents are placed in a Sidecar container, and the business process runs alone in the App container. Container isolation then protects the stability of the business. At the same time, we pushed a container slimming plan to remove unnecessary Agents.

4. Improve the ecosystem and realize the benefits

Serverless is a systems engineering effort. Technically, it involves Kubernetes, containers, the operating system, the JVM, the runtime, and more. In terms of platform capability, it touches every part of the CI/CD process.

To give users the best possible development experience, we provide development tools such as a CLI (Command Line Interface) and a WebIDE. To solve the problem of interacting with existing upstream and downstream technology products, we integrated with the company’s existing technology ecosystem so that developers can use it easily. To make it easy for downstream integration platforms to connect, we opened up the platform’s API so that Nest can empower each downstream platform. And because containers are heavy and carry high system overhead, which leaves low-frequency business functions with low resource utilization, we support merged deployment of functions, which multiplies resource utilization.

4.1 Provide R&D tools

Development tools lower the cost of using the platform and help developers run the CI/CD process quickly. Nest currently provides a CLI that helps developers quickly create applications, build and test locally, debug, and release remotely. Nest also offers a WebIDE, an online one-stop environment for modifying, building, releasing, and testing code.

4.2 Integrate with the technology ecosystem

Supporting these R&D tools alone is not enough. Once the project was promoted more widely, we soon found that developers had new demands on the platform. For example, functions could not be operated from the Pipeline or from the offline service instance orchestration platform, which became an obstacle to adoption. We therefore integrated with the company’s mature technology ecosystem, connected with platforms such as the Pipeline, and fit Nest into the existing upstream and downstream technology system to remove these concerns for users.

4.3 Open Platform Capability

Nest has many downstream solution platforms, such as SSR (Server Side Render) and the service orchestration platform, which further free up productivity by integrating with Nest’s OpenAPI. For example, users can quickly create, publish, and host an SSR project or an orchestration program from scratch, without developers having to request, manage, and operate machine resources themselves.

In addition to opening up the platform’s API, Nest lets users customize the resource pool. With this capability, developers can define their own resource pool, customize the machine environment, and even sink some common logic into it to further optimize cold start.

4.4 Support for merged deployment

Merged deployment means deploying multiple functions within a single machine instance. There are two main motivations for it:

  • Containers are currently heavyweight and carry high system overhead of their own, which leads to low resource utilization by business processes (especially for low-frequency businesses).
  • When the cold start time cannot meet a business's latency requirements, we reserve instances to meet them.

Given these two motivations, we decided to support merged deployment: some low-frequency functions are deployed into the same machine instance, which improves the resource utilization of business processes in reserved instances.

For the concrete implementation, we referred to the design of Kubernetes and built a merged function deployment system based on Sandbox (each Sandbox is one function resource). In this system, a Pod plays the role of a Kubernetes Node, a Sandbox plays the role of a Kubernetes Pod, and the Nest Sidecar plays the role of the Kubelet. To give Sandbox its own deployment and scheduling capabilities, we also defined some custom Kubernetes resources (such as SandboxDeployment, SandboxReplicaset and SandboxEndPoints) that support dynamically inserting functions into specific Pod instances.
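
The analogy can be sketched in a few lines of Python. This is only an illustrative model, not Nest's implementation: the `Sandbox` and `Pod` classes, the `memory_mb` and `capacity_mb` fields, and the first-fit `schedule` function are all assumptions introduced here to show how a function Sandbox might be inserted into a Pod that still has room.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sandbox:
    """One function resource (hypothetical fields)."""
    function_name: str
    memory_mb: int


@dataclass
class Pod:
    """A machine instance that hosts several Sandboxes, like a Node hosts Pods."""
    name: str
    capacity_mb: int
    sandboxes: List[Sandbox] = field(default_factory=list)

    def free_mb(self) -> int:
        return self.capacity_mb - sum(s.memory_mb for s in self.sandboxes)


def schedule(sandbox: Sandbox, pods: List[Pod]) -> Optional[Pod]:
    """First-fit placement: insert the Sandbox into the first Pod with enough free memory."""
    for pod in pods:
        if pod.free_mb() >= sandbox.memory_mb:
            pod.sandboxes.append(sandbox)
            return pod
    return None  # no Pod has room; a real scheduler would scale out instead


pods = [Pod("pod-a", capacity_mb=512), Pod("pod-b", capacity_mb=1024)]
placed_f1 = schedule(Sandbox("f1", 400), pods)
placed_f2 = schedule(Sandbox("f2", 200), pods)
```

First-fit is chosen only because it is the simplest policy to read; a production scheduler would also weigh CPU, traffic and isolation constraints.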

In addition, isolation between functions is an unavoidable problem under merged deployment. To limit interference between functions merged into the same instance as far as possible, the Runtime uses different strategies for Node.js and Java according to the characteristics of each language: Node.js functions are isolated in separate processes, while Java functions are isolated through class loading. The main reason for this choice is that a Java process tends to occupy much more memory than a Node.js process.
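
As a rough illustration of the process-based strategy (the one described for Node.js functions), the sketch below runs each function body in its own operating-system process, so a crash in one does not take down the others. It is written in Python purely for illustration; `run_isolated` is a hypothetical helper and not part of the Nest Runtime.

```python
import subprocess
import sys
from typing import Tuple


def run_isolated(source: str, timeout_s: float = 10.0) -> Tuple[int, str]:
    """Run one function's code in a separate interpreter process.

    Returns (exit_code, stdout). A failure in the child stays in the child.
    """
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return proc.returncode, proc.stdout.strip()


# Two hypothetical functions merged into the same instance.
healthy = "print('hello from function A')"
crashing = "raise RuntimeError('function B failed')"

code_b, _ = run_isolated(crashing)     # non-zero exit code, caller unaffected
code_a, out_a = run_isolated(healthy)  # still runs normally afterwards
```

The cost of this approach is one full runtime per function, which is acceptable when each process is small and becomes expensive when each process is memory-heavy.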

5 Adoption scenarios and benefits

Nest is currently most popular in the Node.js field of the Meituan front end, which is also the technology stack where it is most widely deployed. It has been adopted at large scale across the Meituan front end, covering almost all business lines and serving a large amount of core B-side and C-side traffic.

5.1 Adoption scenarios

Specific front-end scenarios include BFF (Backend For Frontend), CSR (Client Side Render) / SSR (Server Side Render), back-office management platforms, scheduled tasks, data processing, and so on.

  • BFF scenario: The BFF layer mainly provides data for front-end pages. With the Serverless model, front-end developers no longer need to handle the operation and maintenance work they are less familiar with, which makes it easy to move from BFF to SFF (Serverless For Frontend).
  • CSR/SSR scenario: CSR and SSR refer to client-side rendering and server-side rendering. Because the Serverless platform removes the operation and maintenance burden, more front-end businesses are trying SSR to speed up display of the first screen.
  • Back-office management platform scenario: The company has many web services for back-office management platforms. Although they are heavier than functions, they can be hosted directly on the Serverless platform and fully benefit from its fast publishing and efficient operation and maintenance.
  • Scheduled task scenario: The company has many periodic tasks, such as pulling data every few seconds, clearing logs at midnight every day, or collecting the full data set every hour and generating reports. The Serverless platform is connected directly to the task scheduling system, so a developer only needs to write the task's processing logic and configure a timer trigger on the platform to onboard the scheduled task, with no machine resources to manage.
  • Data processing scenario: When an MQ Topic is connected to the Serverless platform as an event source, the platform automatically subscribes to the Topic's messages and triggers the function whenever a message arrives. As in the scheduled task scenario, the user only needs to write the data processing logic and configure an MQ trigger on the platform to onboard the MQ consumer, with no machine resources to manage.
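
The scheduled task and data processing scenarios share one shape: the user supplies a handler, and the platform invokes it when a configured trigger fires. A minimal sketch of that shape follows. The `TriggerRegistry` class, the `"timer"` and `"mq"` source names, and the event fields are all hypothetical and are not Nest's actual API.

```python
from typing import Any, Callable, Dict, Tuple

Handler = Callable[[Dict[str, Any]], Any]


class TriggerRegistry:
    """Maps (event source, trigger name) to the user's function handler."""

    def __init__(self) -> None:
        self._handlers: Dict[Tuple[str, str], Handler] = {}

    def bind(self, source: str, name: str, handler: Handler) -> None:
        self._handlers[(source, name)] = handler

    def dispatch(self, source: str, name: str, event: Dict[str, Any]) -> Any:
        # The platform side: look up the bound function and execute it.
        handler = self._handlers.get((source, name))
        if handler is None:
            raise KeyError(f"no function bound to {source}:{name}")
        return handler(event)


# User side: only business logic plus trigger configuration.
def clear_logs(event: Dict[str, Any]) -> str:
    return f"logs cleared at {event['fired_at']}"


def process_order(event: Dict[str, Any]) -> str:
    return event["body"].upper()


registry = TriggerRegistry()
registry.bind("timer", "daily-midnight", clear_logs)
registry.bind("mq", "orders-topic", process_order)

timer_result = registry.dispatch("timer", "daily-midnight", {"fired_at": "00:00"})
mq_result = registry.dispatch("mq", "orders-topic", {"body": "order-42"})
```

In both cases the handler contains no scheduling, subscription or machine management code, which is the point the two scenarios make.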

5.2 Benefits

The benefits of Serverless are clear, especially in the front-end field, where the large number of businesses that have onboarded is the best illustration. The specific benefits fall into two areas:

  • Cost reduction: Through Serverless elastic scaling, the resource utilization of high-frequency businesses can be raised to 40% to 50%. Low-frequency functions can also run at much lower cost through merged deployment.
  • Efficiency improvement: Overall R&D efficiency increased by about 40%.

    • Code development: Nest provides a complete set of R&D tools, including the CLI and WebIDE, which help developers generate code scaffolding, focus on writing business logic, and quickly complete local tests. Business services also get online log viewing and monitoring at no extra cost.
    • Publishing: With the cloud-native model, a business does not need to apply for machines, and publishing and rollback complete within seconds. The business can also use the platform's built-in capabilities together with the event gateway to switch traffic and carry out canary testing.
    • Daily operation and maintenance: A business does not need to deal with problems such as machine failure, insufficient resources or data center disaster recovery. When a business process becomes abnormal, Nest automatically isolates the abnormal instance and quickly brings up a new instance to replace it, reducing the business impact.
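
The canary step mentioned under publishing can be pictured as weighted traffic splitting. The sketch below is an assumption about how such a split could work, not the event gateway's actual logic: `pick_version` and its `canary_percent` parameter are hypothetical names, and hashing the request ID is just one common way to make the split deterministic.

```python
import hashlib


def pick_version(request_id: str, canary_percent: int) -> str:
    """Route a request to 'canary' or 'stable' based on a stable hash bucket.

    The same request_id always lands in the same bucket (0..99), so raising
    canary_percent only ever moves traffic from stable to canary.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


# Send roughly 10% of 1000 hypothetical requests to the new version.
routes = [pick_version(f"req-{i}", canary_percent=10) for i in range(1000)]
canary_share = routes.count("canary") / len(routes)
```

Setting `canary_percent` to 0 rolls everything back to the stable version, and setting it to 100 completes the release.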

6 Future Planning

  • Scenario-based solutions: Many kinds of scenarios onboard to Serverless, such as SSR, back-office management, and BFF. Different scenarios have different project templates and scenario configuration, such as scaling configuration and trigger configuration, and the configuration also differs between languages. This raises the cost of using the platform and creates obstacles for new businesses. We therefore plan to build the platform around scenarios, tying platform capabilities closely to each scenario. The platform will accumulate the basic configuration and resources of each scenario, so that a business can use Serverless in a given scenario with only simple configuration.
  • Serverless for traditional microservices: This is the application-oriented Serverless service mentioned in the route selection discussion. The most widely used development language at Meituan is Java, and the company has a large number of traditional microservice projects that would be impractical to migrate to the function model. If these projects could enjoy the technical dividends of Serverless directly, without being rewritten, the business value would be self-evident. Serverless for traditional microservices is therefore an important direction for our future expansion. On the implementation path, we will consider integrating the service governance system (such as ServiceMesh) with Serverless: the service governance components will supply Serverless with scaling metrics and enable accurate traffic allocation during scaling.
  • Cold start optimization: Cold start optimization for functions has already achieved good results, especially for system startup time on the platform side, where the remaining room for improvement is very limited. The startup time of business code itself, however, remains a prominent problem, especially for traditional Java microservices, which basically start at the minute level. Our subsequent cold start optimization will therefore focus on the startup time of the business itself and aim to reduce it substantially. As specific methods, we will consider technologies such as AppCDS and GraalVM.
  • Other planning

    • Enrich and improve R&D tools, such as IDE plug-ins, to raise R&D efficiency.
    • Connect the upstream and downstream technology ecosystem and integrate deeply into the company's existing technology system, reducing the obstacles caused by upstream and downstream platforms.
    • Lightweight containers: Lighter containers bring better startup time and better resource utilization, so making containers lightweight has always been a goal of Serverless. In practice, we plan to work with the container infrastructure team to move some of the agents inside the container to DaemonSet deployment, sinking them to the host and reducing the container's load.

About the authors

  • Yin Qi, Hua Shen, Fei Fei, Zhi Yang, Yi Kun, etc., from Application Middleware Team, Infrastructure Department.
  • Jia Wen, Kai Xin, Ya Hui, etc., from the big front end team of the financial technology platform.

Recruitment information

The Meituan Infrastructure Team is looking for senior engineers and senior technical experts, based in Beijing or Shanghai. We are committed to building Meituan's company-wide, unified, high-concurrency and high-performance distributed infrastructure platform, covering the major infrastructure fields of databases, distributed monitoring, service governance, high-performance communication, message-oriented middleware, basic storage, containerization and cluster scheduling. Interested candidates are welcome to send their resumes to: tech@meituan.com.

This article was produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication; please credit it as "content reprinted from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial use, please send an email to tech@meituan.com for authorization.