Author: Kong Dehui (Xiaguan) | Source: Alibaba Cloud Native public account

Background

1. Serverless will be the default cloud programming paradigm of the next decade

As the Serverless concept becomes more widespread, developers are gradually moving from watching to trying, and more and more enterprise users are migrating their services to Serverless platforms. Inside Alibaba Group, core functionality of Taobao, Fliggy, Xianyu, AMAP, and Yuque has landed steadily on Serverless. Outside the group, enterprises across industries such as Sina Weibo, Century Lianhua, Shimo Docs, TP-Link, and Lanmo Cloud Class have unlocked different Serverless scenarios. Serverless is becoming the default programming paradigm of the cloud for the next decade.

For more examples, see the Function Compute customer cases.

Serverless reduces cost, improves efficiency, and removes the operations burden. The Function Compute-based Serverless solution saved Lanmo Cloud Class about 60% of its IT cost and Shimo Docs 58% of its server cost; it improved development efficiency for Code Long Technology, bringing functionality online within two weeks; and it smoothly supported Sina Weibo's load peaks and troughs, which differ by more than 5x, easily handling billions of requests every day.

2. Is observability a stumbling block for Serverless?

As their use of Serverless deepens, developers are finding it harder to locate problems in a Serverless architecture than in traditional applications. The main reasons are:

  • Distributed components: A Serverless application often glues together multiple cloud services, and a request flows through several cloud products. Once the end-to-end latency grows or performance falls short of expectations, locating the problem is complex: you have to troubleshoot step by step on each product's side.
  • Scheduling black box: The Serverless platform takes on request scheduling and resource allocation, and real-time elastic scaling inevitably brings cold starts. Resource scaling is not controlled by developers, yet cold starts affect end-to-end latency. Did this request hit a cold start? Where did the cold-start time go? Is there room for optimization? These are questions developers are eager to answer.
  • Black-box execution environment: Developers are used to running code on their own machines; when an exception occurs, they log in to check the CPU, memory, and I/O usage of the execution environment. With a Serverless application, the machine is not theirs: they cannot log in, they cannot see inside, and they are left in the dark.
  • Non-standard products: In Serverless scenarios, developers cannot control the execution environment, install probes, or use open source third-party monitoring platforms. Their troubleshooting habits have to change, and their traditional troubleshooting experience cannot be put to use.
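One of the scheduling questions above — whether a given invocation hit a cold start — can be approximated from inside the function itself. A minimal sketch (plain Python, no FaaS SDK assumed): because module initialization runs once per instance, the first invocation on an instance is the cold one.

```python
import time

# Module-level code runs once per function instance, i.e. at cold start.
_INSTANCE_STARTED_AT = time.time()
_is_first_invocation = True

def handler(event, context):
    """Report whether this invocation is the first one on this instance.

    On a FaaS platform, module initialization happens during cold start,
    so the first invocation on a fresh instance is the cold one.
    """
    global _is_first_invocation
    cold = _is_first_invocation
    _is_first_invocation = False
    return {
        "cold_start": cold,
        "instance_age_seconds": round(time.time() - _INSTANCE_STARTED_AT, 3),
    }
```

Calling the handler twice in the same process returns `cold_start: True` the first time and `False` afterwards; this heuristic only detects the cold start, not its internal breakdown, which is why platform-side tracing (discussed below) is still needed.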

Function Compute is Alibaba Cloud's Serverless product. Over the past year, the Function Compute team has put in a great deal of effort to better answer the questions above.

This article introduces the observability of Function Compute and its current state.

Serverless Observability

Observability is a measure of how well the internal states of a system can be inferred from its external outputs. — Wikipedia

In application development, observability helps us judge the internal health of a system. When the system runs smoothly, it helps us assess risk and anticipate possible problems; when the system goes wrong, it helps us quickly locate the problem and stop the loss in time.

A good observability system helps users discover problems, locate problems, and resolve them end-to-end as quickly as possible.

In Serverless, an o&M free platform system, observability is the developer’s eye. Without observability, how can high availability be discussed?

1. Observability 1.0

Figure 1 – Fundamentals of observability

Observability mainly consists of three pillars: logs, metrics, and tracing.

Like almost all FaaS products, Function Compute (FC) has supported viewing function logs and metrics since it was first commercialized.

  • Function logs

Users configure an SLS Project and Logstore on FC, and FC ships the logs that functions write to stdout to the user's Logstore. Users can then view function logs in the SLS console and use the power of SLS to analyze and aggregate them.
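In practice this means anything the function prints ends up queryable in SLS. A minimal sketch of a Python function whose logs flow to the Logstore (the event shape and `order_id` field are illustrative, not part of any FC API):

```python
import json
import logging

# In Function Compute, anything the function writes to stdout/stderr is
# shipped to the Logstore configured on the service. Using the logging
# module adds timestamps and severity to each line.
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    # FC delivers the event as bytes/str for raw invocations.
    payload = json.loads(event) if isinstance(event, (str, bytes)) else event
    logger.info("received event with %d top-level keys", len(payload))
    # This print also lands in the Logstore and can be aggregated with
    # SLS's query and analysis syntax.
    print("processing order:", payload.get("order_id"))
    return {"status": "ok"}
```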

  • Basic metrics

FC pushes metrics to CloudMonitor, providing basic metrics such as function invocation count, error count, function latency, and function memory usage.

Function logs and basic metrics are the application's stethoscope: simple as they are, they help users discover and locate problems.

Even for problems developers could not troubleshoot themselves, in the era of few users the team could offer hands-on support, combining backend logs to help users locate problems.

For details on using function logs and metrics, see Configuring and Viewing Function Logs and Monitoring Metrics.

2. Observability 2.0 – Cloud-native observability

As Serverless develops, more and more scenarios land on it, application scale keeps growing, and product architectures become more complex. The "application stethoscope" of observability 1.0 can no longer meet the monitoring demands of developers across industries, and the nearly black-box execution environment creates a strong sense of distance and distrust. Application developers want a sense of control: they want to know what happens to each request inside Function Compute, whether a long end-to-end latency was caused by a cold start, what the function instance's execution environment looks like, how to locate the problem the moment a request fails, and how to reuse the open source observability platforms they are already familiar with.

Faced with these demands, the team had a long and heated debate. Some argued we should support them; others argued they run, to some extent, against the nature of Serverless, which is to shield the underlying compute resources: users should not need to care about them. Besides, what would exposing these metrics accomplish? Even if a user sees a cold start, sees where the system spends its time, sees the CPU of the underlying instance, there is nothing the user can do about it; do such metrics really mean anything? The two camps argued back and forth, and I was a staunch opponent of keeping things hidden.

After the team moved to the EFC building, I found myself waiting every day for an elevator that never told me when it would arrive (you enter your destination floor and wait quietly by the assigned elevator, with no display of which floor it is currently on). The elevator tells us: just wait here, I will definitely come; which floor I am on right now is something you don't need to know, and knowing it wouldn't help you anyway; my scheduling is optimal, and you have to trust the professional elevator scheduling algorithm. But how can I trust it?

For developers, Function Compute is that elevator that never says when it will arrive. We tell developers: your requests will be served stably, your execution environment is certainly healthy, and when traffic grows we will scale out automatically; the monitoring metrics of the current instances, and when we scale, are things you don't need to know; our scheduling is definitely the best, and you have to trust the professional engineering team's scheduling algorithm. But how can developers trust us?

The observability of Serverless is therefore not only about helping developers troubleshoot problems; it is also about gradually lifting the veil of mystery from Serverless and winning developers' trust in it.

So we built Function Compute observability 2.0, and we want observability 2.0 to be the application's electrocardiogram.

Figure 2 – Observability status of function computations

  • To answer what happens to a request over its lifecycle in Function Compute, connect the upstream and downstream services of a distributed system, and embrace open source observability, we integrated OpenTracing to support tracing.
  • To expose system state and provide application-level monitoring, we integrated ARMS (Java) for built-in APM capabilities.
  • To speed up end-to-end problem location, we added request-level metrics (FCInsights) and launched a monitoring center as a one-stop solution for problem discovery and investigation.
  • To fit developers' existing habits, we embrace open source: we integrated OpenTracing and support Grafana dashboards, and third-party monitoring platforms can be connected to APM systems with almost zero code changes.
  • To fit traditional developers' observability experience and support probe installation, we extended the programming model to support function instance lifecycle hooks, making it possible to integrate third-party monitoring.

Figure 3 – Function computations compatible with open source observable capabilities

Rather than inventing its own FaaS-only observability experience, Function Compute is compatible with open source observability: it integrates with Jaeger, supports Grafana dashboards, and lets excellent third-party monitoring platforms such as New Relic connect with minimal changes. Function Compute is the first FaaS product to embrace open source, the container ecosystem, and cloud-native developers. A smoothly migrated observability experience in turn supports smoothly migrating applications between container and Serverless platforms.

1) Integrate OpenTracing to support tracing

FC integrates with the Tracing Analysis service, providing developers with a complete toolset: trace reconstruction, call statistics, trace topology analysis, and cold-start location. It helps developers quickly analyze and diagnose performance bottlenecks in distributed architectures.

FC tracing has the following features:

  • Embraces open source: fully compatible with the OpenTracing protocol, with no extra learning cost.
  • Active recording: reports the end-to-end time a request spends inside Function Compute.
  • Transparent scheduling: exposes code preparation time and instance startup time; FC is the first FaaS product to expose cold-start latency and its specific breakdown.
  • Connects upstream and downstream: a function can join the upstream application's trace through the span context, and pass the span context on to downstream services.
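The upstream/downstream connection relies on carrying the span context across process boundaries. In production you would use an OpenTracing-compatible client (e.g. the Jaeger client's `inject`/`extract`), but the propagation pattern itself can be sketched with the standard library alone, using Jaeger's single-header carrier format `uber-trace-id: {trace-id}:{span-id}:{parent-span-id}:{flags}`:

```python
import random

def extract_context(headers):
    """Parse an incoming uber-trace-id header into its four fields."""
    trace_id, span_id, parent_id, flags = headers["uber-trace-id"].split(":")
    return {"trace_id": trace_id, "span_id": span_id,
            "parent_id": parent_id, "flags": flags}

def inject_child_context(parent, headers):
    """Create a child span of `parent` and inject its context into the
    outgoing headers, so the downstream service joins the same trace."""
    child_span_id = format(random.getrandbits(64), "016x")
    headers["uber-trace-id"] = ":".join(
        [parent["trace_id"], child_span_id, parent["span_id"], parent["flags"]])
    return child_span_id
```

A function that extracts on the way in and injects on the way out keeps the trace ID stable end to end, which is exactly what lets the trace view stitch FC together with its upstream and downstream services.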

Figure 4 – Example trace

Figure 5 – Details of the full tracing capabilities

2) Integrate ARMS for built-in APM capabilities

FC integrates seamlessly with ARMS application monitoring: developers only need to add one environment variable to the function to enable APM monitoring. The ARMS probe monitors application performance without any code changes, providing application-level observability, including function instance CPU and memory metrics, Java virtual machine metrics, code profiling information, SQL queries, and more.

Figure 6 – ARMS example

3) Launch the monitoring center (Insights), a one-stop solution for problem discovery and investigation

FC supports request-level metrics: by adding metric logs, it effectively takes a snapshot of every request. From request-level metrics you can see each request's execution time, memory usage, exceptions, error types, cold-start status, and trace ID, and you can use them to stitch together all the other observability capabilities.
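Conceptually, each request produces one structured record. The sketch below shows what such a record might look like; the field names are illustrative, not Function Compute's actual schema. The point is that one JSON log line per request lets a monitoring center slice by latency, memory, errors, and cold starts, and join to traces via the trace ID:

```python
import json
import time

def emit_request_metrics(request_id, trace_id, start, end,
                         memory_mb, cold_start, error_type=None):
    """Emit one request-level metric record as a JSON log line.

    `start` and `end` are timestamps in seconds (e.g. from time.time()).
    Printing to stdout relies on the platform collecting stdout as logs.
    """
    record = {
        "requestId": request_id,
        "traceId": trace_id,          # joins this record to its trace
        "durationMs": round((end - start) * 1000, 2),
        "memoryUsageMB": memory_mb,
        "coldStart": cold_start,
        "errorType": error_type,      # None means the request succeeded
    }
    print(json.dumps(record))
    return record
```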

The monitoring center combines the capabilities of Metrics, Logs, and Tracing, so you can preview metrics, view logs, and analyze traces on a single page, aiming at one-stop problem discovery and investigation.

The monitoring center has the following characteristics:

  • Multi-dimensional: supports metrics at the Region, Service, Function, Qualifier, and Request dimensions, showing call counts and error distribution in each.
  • Multi-level: combines the capabilities of Metrics, Logs, and Tracing to monitor the application at every level.
  • Full-link: combines metrics, logs, and traces to discover, locate, and resolve problems in one place.

Figure 7 – Example monitoring center

4) Extend the programming model to integrate third-party monitoring

The lifecycle of a function instance is controlled entirely by the platform: users have no control over instance startup and reclamation, and are not even aware of instance pause and resume. This makes it particularly difficult to run background threads beyond the main thread on Function Compute, and monitoring probes are among the most important of those background threads.

FC extended the programming model by releasing RuntimeLifeCycle, which listens for function instance lifecycle events and lets the platform call back into user logic before an instance is paused or reclaimed. This feature makes it possible for FC to integrate third-party APM monitoring: users only need to flush collected metrics before the instance is paused, and clean up in-memory data before it is reclaimed, to see monitoring metrics in near real time on the APM platform.

Figure 8 – Tingyun APM example

Figure 9 – NewRelic APM example

3. Summary

The observability of Function Compute has gone through the journey from 1.0 to 2.0: from closed observability to open source observability, from platform-side observability to developer-side observability, from FaaS-only observability to cloud-native observability.

As the first FaaS product to embrace open source observability, the container ecosystem, and cloud-native developers, Function Compute is also better positioned to migrate developers' businesses smoothly.

Future plans

FC observability has taken a small step forward from a year ago, evolving from a black box into one lit by a faint candle, but there is still a long way to go toward the goal of "white-box Serverless applications". We want to accommodate developers' monitoring experience and support users in moving their business to Serverless smoothly and with confidence.

We will continue to do the following:

  • Improve the monitoring center: support alert configuration and early warning on abnormal metrics.
  • Provide instance-level metrics, so code problems can be located and the execution environment investigated on the spot.
  • Integrate with open source projects: integrate Prometheus and OpenTelemetry, and support configuring Grafana dashboards.
  • Some metrics are still hard to expose today, and some have not been exposed yet; we will expose them step by step.

We hope the observability of Function Compute becomes a light that shines on every Serverless application.

You are welcome to join the Cloud Native Serverless team (Function Compute, Serverless Workflow, Serverless App Engine) to build an industry-leading Serverless product system as a trinity of public cloud, the group, and the open source community. Positions are open; interested candidates can send their resumes to [email protected].