
This article was published in the Cloud + Community column by the Tencent Cloud Serverless team.

ArchSummit Global Architect Summit, an annual technology event, was held at the OCT Intercontinental Hotel in Shenzhen on July 6-7, 2018. More than 100 technical experts from home and abroad gathered in Shenzhen to share best practices across a range of technical architectures. Jerome, from the Architecture Platform Department of Tencent's Technology and Engineering Group, gave a talk titled "Serverless Practice Based on Elastic Computing"; the following is a transcript of the speech.

According to Gartner and McKinsey, average server CPU utilization worldwide is only 6 to 12 percent. At large domestic Internet companies such as Alibaba and Tencent, average CPU utilization is around 10%, for three main reasons:

• It is difficult for a single service to use different resource types in a balanced way, so some resources sit idle 99% of the time;

• The nature of online services means that load is low 30% of the time;

• Server turnover (delivery, reallocation, decommissioning) can account for 5% idle time.

Ideally, a company would have a company-wide resource scheduling platform like Google's Borg: all businesses built on one unified platform, with workloads mix-scheduled from a shared resource pool. Given the historical reality that each BG (business group) operates separately, however, the Architecture Platform Department's elastic computing team collaborates with the operations platforms of each BG to collect the company's idle resources and, in the form of Docker containers, "mine gold" from the live network to meet computing needs.

After one year of construction, about 1.4 million cores have been harvested, supporting hundred-million-scale video transcoding, image compression, game AI for Tencent Go and King of Glory, the mobile browser, and more. However, many challenges remain.

First, the access threshold is high: a service must understand the diverse, elastic resources and adapt to them properly, which makes onboarding long-tail services difficult. Second, utilization is not under the platform's control: some stateful services run at low utilization and cannot be scaled in or out automatically. Third, small resource fragments, such as a leftover 1 core with 1 GB of memory, cannot be used. Finally, the platform cannot accurately monitor the running state and quality of a business, such as request latency. In short, the root cause of the predicament is that we only provide services at the resource level; using those resources well still depends on a great deal of work at the business level. Looking back at the history of cluster resource management platforms, from physical machines (bare metal as a service) to IaaS virtual machines to PaaS containers, each step tries to do more for the business, freeing more productivity and improving utilization. For the next step, can the resource management platform jump out of the resource-service layer entirely, so that a business only needs to care about its business logic and pay on demand?

The answer is yes, and its name is Serverless. Serverless does not mean there are no servers; it means users do not need to be aware of the servers' existence. In the broad sense, SaaS, BaaS, FaaS, and PaaS can all be called serverless, with platform generality improving from left to right. However, Serverless really became well known with the rise of FaaS (Function as a Service), represented by AWS Lambda, so in the narrow sense Serverless usually refers to FaaS. In this article, Serverless means Function as a Service.

Serverless is precisely the answer to our current problems. The access threshold is high; with Serverless, a business no longer needs to care about resources and can go online faster. Utilization is out of our control; with Serverless, the platform fully controls resource allocation, so utilization no longer depends on others. Resource fragments go unused; with Serverless, computation is cut into function-level pieces small enough to fit the fragments. Service quality is hard to monitor; with Serverless, we see every invocation and return of the business, so we know the latency and the result.

Therefore, we built a Serverless cloud function platform on top of the existing container platform, consisting mainly of:

• A function management module that manages function configuration, including metadata, code, permissions, trigger mode, etc.;

• A function invocation module that distributes calls to functions and handles load balancing, failure and disaster recovery, scaling, and related problems;

• A multi-language runtime environment that proxies the functions' network traffic to listen for requests, runs user function code, monitors the function process, collects function logs, etc.;

• Function trigger and related modules that automatically invoke cloud functions in response to events.
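To make the function management module concrete, here is a minimal sketch of the kind of metadata record it might keep per function. The field names and registry shape are illustrative assumptions, not the platform's actual schema:

```python
# Illustrative function metadata store; field names are assumptions,
# not the real platform schema.
FUNCTION_REGISTRY = {}

def register_function(name, code_uri, runtime="python3",
                      memory_mb=128, timeout_s=3, triggers=None):
    """Record the metadata the management module needs to deploy and invoke
    a function: code location, runtime, resource limits, and triggers."""
    FUNCTION_REGISTRY[name] = {
        "code_uri": code_uri,        # where the code package lives
        "runtime": runtime,          # language runtime to launch
        "memory_mb": memory_mb,      # user-configured; CPU and bandwidth are
                                     # derived by the platform (see below)
        "timeout_s": timeout_s,
        "triggers": triggers or [],  # e.g. file upload, timer, topic message
    }
    return FUNCTION_REGISTRY[name]
```

A caller would register a function once, after which the invocation module can look it up by name to dispatch requests.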

There are many considerations when building a new platform: how to make it easy enough for businesses to try; how to make it stable enough to retain them; how to save cost compared with the container platform so that businesses see real benefits; and how to keep development fast, safe, and sustainable. This article shares them one by one.

The key to ease of use is how little users have to do. In development, cloud functions take over everything outside business logic: network send and receive, load balancing, scaling in and out, failure recovery, and so on. The platform paints the dragon, and the business only needs to dot its eyes, adding the last touch before going online. In operations, the platform manages code packages, handles server failures, monitors service quality, and takes care of disaster recovery, load balancing, and scaling configuration. It even provides automatic invocation triggered by events such as file upload/delete, timers, and topic messages. Some microservice platforms do similar things today, but they still take the container as the unit, and the container's interior is a black box to the platform. Cloud functions participate in the invocation and return of every user request inside the container, so they obtain more business-level information, making resource scheduling and quality control more precise.
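The result of "the platform does everything else" is that the only code a business supplies is the function body itself. A minimal sketch, using the (event, context) signature that is the common FaaS convention; the actual SCF runtime interface may differ in detail:

```python
# The entire deliverable from the business side: pure logic, with no
# networking, load balancing, scaling, or failover code.
def handler(event, context):
    """event carries the trigger payload (uploaded file info, timer tick,
    topic message, ...); context carries platform-supplied metadata."""
    name = event.get("name", "world")
    return {"statusCode": 200, "body": f"hello {name}"}
```

Everything around this function, from packet receive to retry to log collection, is the platform's job.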

Compared with a traditional distributed system, a cloud function platform must also design the function invocation flow carefully to be stable enough. The simplest flow goes from the cloud API in, through the dispatching invoker, to the compute node. For synchronous invocation this is fine, because if a synchronous call fails the business can simply retry it. With asynchronous invocation, however, the result is not returned to the caller, so the user would never notice a lost request; all asynchronous calls therefore need to be persisted to avoid loss. Moreover, asynchronous invocations cannot be retried by the business itself, so if the platform's automatic retries fail, the calls must be persisted to make the problem traceable. Retry itself also requires careful design: since resources are allocated in real time when a request arrives, retries can multiply resource consumption in the background, and in some streaming-computing scenarios computation has strict ordering requirements, so a retry has to block the whole stream. The most common stability challenge is hot updates, and a Serverless cloud function platform supports them naturally, because it sees every request and return of the business and controls how requests are distributed. To hot-update, simply disable a node, wait for in-flight requests to finish, update, and then re-enable the node.
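The persist-then-dispatch idea for asynchronous calls can be sketched as follows. This is an illustrative model, with an in-memory dict standing in for the durable store (a database or message queue in reality) and a bounded retry count, since each retry allocates fresh resources:

```python
import uuid

DURABLE_LOG = {}   # stand-in for persistent storage of async invocations
MAX_RETRIES = 3    # retries allocate resources in real time, so keep bounded

def invoke_async(func, payload):
    """Persist the request BEFORE dispatch, so a lost request can still be
    found and replayed, and a permanently failing one can be traced."""
    call_id = str(uuid.uuid4())
    DURABLE_LOG[call_id] = {"payload": payload, "attempts": 0, "done": False}
    _dispatch(func, call_id)
    return call_id

def _dispatch(func, call_id):
    record = DURABLE_LOG[call_id]
    while record["attempts"] < MAX_RETRIES and not record["done"]:
        record["attempts"] += 1
        try:
            func(record["payload"])
            record["done"] = True
        except Exception:
            pass  # the failed attempt stays in the log for later tracing
```

If all retries fail, the record remains in the log with `done` still false, which is exactly the traceability the text describes.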

Compared with the container platform, the cloud function platform achieves lower cost by eliminating idle and wasted resources. In a traditional business architecture, a service module needs at least one instance, and at least two for disaster recovery. Because the cloud function platform allocates resources in real time at call time, the minimum instance count of a business module can drop to zero: no resources are reserved when there is no load, and resources are allocated on the fly when business requests arrive. When businesses apply for resources at peak, they traditionally specify CPU cores, memory, disk capacity, network bandwidth, and a series of other parameters, and in the vast majority of cases they over-apply. Cloud function platforms therefore typically let users configure only the memory size, because memory is an incompressible resource without which the program cannot run; compressible resources such as CPU and bandwidth are configured and dynamically adjusted by the platform according to the memory size and actual needs, avoiding the waste caused by over-application. Another special point of cloud functions is event-triggered execution, where a misconfigured trigger can create a futile loop that wastes resources: for example, a function configured to run on file upload that itself uploads a file, triggering itself in a cycle. The cloud function platform therefore monitors the call flow between functions to detect such loops and avoid wasting resources.
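One simple way to model the loop detection described above is to propagate a trigger-chain depth with each event and refuse invocations whose chain grows suspiciously long. The mechanism and cut-off value here are illustrative assumptions, not the platform's actual algorithm:

```python
MAX_CHAIN_DEPTH = 8  # illustrative cut-off for trigger chains

def triggered_invoke(func, event):
    """Invoke an event-triggered function, carrying a chain-depth counter.
    A chain deeper than the limit suggests a trigger loop, e.g. a function
    that writes to the very bucket whose uploads trigger it."""
    depth = event.get("chain_depth", 0)
    if depth >= MAX_CHAIN_DEPTH:
        raise RuntimeError("possible trigger loop, invocation refused")
    # propagate the depth so downstream triggers carry the chain length
    event = dict(event, chain_depth=depth + 1)
    return func(event)
```

A real platform would track chains across functions and machines (for example via a trace ID), but the cut-off principle is the same.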

Since a cloud function's resources are allocated and initialized in real time after the request arrives, and the wait users can tolerate is generally under 3 s, the cold start of the first request must be fast enough. Function initialization involves many steps: apply for resources and decide the placement, download the image, download the function code, start the container, initialize the function, execute it, and return the result. Downloading an image alone usually takes more than 3 s, so we used a great deal of preprocessing, caching, and parallelism to improve performance: for example, pre-distributing images to servers, using multi-level caches to reuse container resources (which raises a subtle question: can a container used by one user's function be handed over to another user?), parallelizing container startup and code download, and passing large function parameters via shared-memory pointers instead of memory copies. In the end, cold start was brought down to 200 ms and warm start to 5 ms, a first-class level in the industry.
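The parallelization idea is simple: container startup and code download are independent, so running them concurrently costs roughly the maximum of the two rather than their sum. A sketch with sleeps standing in for the real work (durations are arbitrary):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def start_container():
    time.sleep(0.3)           # stand-in for launching the container
    return "container-ready"

def download_code():
    time.sleep(0.3)           # stand-in for fetching the function package
    return "code-ready"

def cold_start():
    """Overlap the two independent steps: serialized they would cost their
    sum; overlapped, roughly the max of the two."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        c = pool.submit(start_container)
        d = pool.submit(download_code)
        return c.result(), d.result()
```

The same overlap principle applies to the other steps, such as pre-distributing images so the download step disappears entirely from the critical path.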

Because sharing is finer-grained, the cloud function platform inevitably shares the server kernel, which poses greater security challenges. On top of Docker isolation, we therefore take extra security measures, such as making only the /tmp directory writable in the function runtime environment and using seccomp and other kernel features to restrict port listening, port scanning, and other sensitive system calls. At the same time, we pay attention to compatibility: many programs run as root, and banning root would impose extra adaptation costs on users, so we use user namespace technology to map root inside the container to an ordinary user outside it to control permissions. Moreover, since we cannot audit user code, users can technically write code that tries anything, such as collecting runtime environment information or probing for the management servers' locations. For safety, unlike open-source Serverless platforms, we separate the function runtime environment from the management environment: the functions' network send/receive proxy and log collection live outside the container, and the runtime environment inside the container sees no management node IP addresses, ports, or platform logs.
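The "/tmp-only writable" policy can be expressed as a simple path check. To be clear, real enforcement happens in the kernel via read-only mounts and seccomp, not in Python; this sketch only illustrates the rule, including why paths must be canonicalized before checking (to block `..` escapes and symlink tricks):

```python
import os

WRITABLE_ROOT = "/tmp"  # the only directory left writable in the runtime

def check_write_allowed(path):
    """Return True only if path resolves inside the writable root.
    realpath() canonicalizes '..' components and symlinks first, so
    '/tmp/../etc/passwd' is correctly rejected."""
    root = os.path.realpath(WRITABLE_ROOT)
    real = os.path.realpath(path)
    return real == root or real.startswith(root + os.sep)
```

A user-space check like this is advisory at best; the kernel-level mount and seccomp configuration is what actually stops a hostile function.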

Because businesses rely on cloud functions for everything outside their core logic, the platform inevitably faces many demands. To support sustainable business development and keep up with business iteration speed, we made several optimizations. Over the past year, the request businesses raised most often was support for more programming languages and for installing various libraries in the runtime environment. To speed up iteration, we extracted the common parts of the various language runtimes and implemented them in C, since every high-level language can easily call into a C library directly; adding support for a new language can now be completed in one to two weeks. For updates, we deploy all runtime libraries on the host machines and mount the directory into the containers, so library updates do not require changing the runtime environment image.
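The mount-based update scheme means the runtime for a language is looked up on a shared directory at container start rather than baked into the image. A minimal sketch; the mount path and layout are assumptions for illustration:

```python
import os

RUNTIME_MOUNT = "/runtimes"  # host directory mounted into every container
                             # (path is illustrative)

def resolve_runtime(language, mount=RUNTIME_MOUNT):
    """Pick the runtime entry for a language from the mounted directory.
    Updating a library then means replacing files under the mount on the
    host, with no image rebuild and no redeploy of the containers."""
    path = os.path.join(mount, language)
    if not os.path.isdir(path):
        raise LookupError(f"no runtime installed for {language}")
    return path
```

Adding a new language becomes a matter of dropping a new directory onto the host, which is consistent with the one-to-two-week turnaround described above.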

To summarize the key design points: we improved usability by letting users do less; designed asynchronous calls and retries carefully to improve stability; eliminated idle resources and over-application to reduce cost; used caching and parallelism to improve startup speed; improved security with kernel-level technology and by isolating the management environment from the runtime environment; and sped up iteration by extracting common functionality and avoiding image rebuilds. Next, let's share some of our current user cases and lessons learned.

At present, the biggest internal user of the cloud function platform is feature extraction for game AI. Game AI is still at an exploratory stage across the industry: algorithms change rapidly, programs are updated frequently, and the computation is enormous. King of Glory, for example, needs features extracted from hundreds of millions of video files in a single day. Implemented on cloud functions, AI engineers only need to write two functions in Python: one extracts features from a video file and generates an HDF5 file; the other breaks up and randomizes the HDF5 files and pushes them to the training platform for training. When the algorithm changes, they simply submit new functions, so they can focus on algorithm research and no longer worry about server management, computation distribution, scaling, or disaster recovery. Besides game AI, the WeChat Mini Program developer tools have also begun to integrate cloud functions, currently in internal testing; you are welcome to try it.
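The shape of that two-function pipeline can be sketched as below. This is a hedged illustration: a real version would decode video and read/write HDF5 files (e.g. with h5py) and push to the actual training platform, while here the features are stand-ins and the push is stubbed out:

```python
import random

def extract_features(event, context=None):
    """Function 1: turn one video file into a feature record.
    Stand-in logic; the real function decodes the video and writes
    features to an HDF5 file."""
    video = event["video_uri"]
    return {"source": video, "features": [hash(video) % 100]}

def shuffle_and_push(records, context=None, seed=None):
    """Function 2: break up and randomize the feature records before
    handing them to the training platform (push stubbed out here)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled  # in reality: push to the training platform
```

Each video upload would trigger one `extract_features` invocation, so the hundreds of millions of files fan out across the platform with no scheduling code from the AI engineers.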

Reviewing the past year of construction, the biggest lessons are as follows:

• Not long ago, there was a security vulnerability in Tencent Cloud functions: inside a cloud function, by trying various Linux commands through bash reflection, a user could obtain the kubelet access address of the K8s cluster, and because K8s was not configured with authentication, users could control K8s directly and read data from other containers, a security risk. Fortunately, the company's security team found the problem in time and no serious consequences resulted. The root cause was our insufficient understanding of K8s's security mechanisms; the lesson is that every time we introduce an open-source component, we must understand it before opening the service.

• Since the cloud function platform is designed to run 24x7, every compute node upgrade must shield the node, wait for running functions to finish, upgrade, and then re-enable it. In the early days this was not automated, and upgrading the whole cluster took nearly a week, which slowed down version iteration.

At the same time, we think two things went well:

• Smooth upgrades were considered from the start of the design, so the platform went through more than 50 version changes in one year without interrupting business;

• We attach great importance to extracting common platform functionality, such as multi-language support: a new programming language can be added within one to two weeks, much faster than on other Serverless platforms.

In promoting the Serverless cloud function platform, we found that stateless services with large load fluctuations and little latency sensitivity are relatively easy to serve well. But for stateful services, the business has to save the state itself; for services under sustained high load, the advantage of on-demand resource usage cannot be realized; and for latency-sensitive applications, the extra layer of function dispatch in the middle adds roughly 5 ms of latency. Serverless is currently not suitable for these scenarios.

Still, the future of Serverless looks promising. Serverless has already become standard on the major public clouds. For a public cloud, Serverless is not only a new form of computing service but also acts as the glue of the whole cloud platform, making it convenient to package and promote other cloud services and turning the public cloud from a collection of scattered products into an organic whole, the user's public cloud backend. In the open-source community, Serverless platforms are also flourishing, with new ones appearing every so often that are worth learning from. Although Serverless's main battlefield is currently the data center, with the development of IoT, edge devices may become an even bigger battlefield for Serverless, because computing resource management and software distribution on edge devices are more complex and challenging, problems that Serverless is well suited to solve.

In martial arts, mastering and integrating every technique leads to the realm of "winning without moves." For us programmers who do cluster resource scheduling, pushing resource scheduling to the extreme, so that businesses are not even aware that servers exist, is our highest pursuit.
