This chapter adds further function computing services to the platform by integrating other open source projects.

6.1 Design Objectives of Function Computing Services

The function computing service design for the Nomad-based edge computing platform covers four capabilities:

- Monitoring
- Alarm and automatic capacity expansion
- Load balancing
- Call chain tracing

These functions are built on the FaaS function runtime implementation described earlier, and developing them also pushed the FaaS function runtime design toward lower coupling and open interfaces. The overall approach of the function computing service solution is to integrate cloud native projects on top of the FaaS function runtime and expose the corresponding services.

6.2 Monitoring

The monitoring function is responsible for tracking the running status and load of the components in the edge computing platform's function computing system. When a component or the system load becomes abnormal, the alarm system promptly notifies the administrators of the function computing system based on the monitoring data, minimizing the loss caused by the failure. The collected monitoring data also serves as the decision basis for automatic scaling, pay-per-use billing, and similar features.

As shown in Figure 6.1, the monitoring system collects component health data and system load data and stores them in an appropriate database according to their nature and type. The third-party service module can interconnect with external services such as visual monitoring, elastic scaling, and alarm distribution.

Figure 6.1 Relationship between monitoring and other modules

In the implementation, monitoring uses Prometheus: as shown in Figure 6.2, Prometheus and StatsD container workloads are added and scheduled as system job types. Prometheus is an open source system monitoring and alerting toolkit originally built at SoundCloud and now an independent open source project. It implements a multi-dimensional data model, has been adopted by many companies and organizations, and has a very active developer and user community. Prometheus joined the CNCF in 2016 as the second hosted project after Kubernetes.

Figure 6.2 Design of the function computing monitoring feature

Prometheus collects metric data from the FaaS API gateway and FaaS function runtime components. As shown in Table 6.1, the FaaS API gateway component needs to expose the following indicators so that Prometheus can monitor the health and behavior of functions.

Table 6.1 Prometheus indicators collected by the function computing monitoring feature

Indicator Type Description
gateway_functions_seconds histogram Function call duration
gateway_function_invocation_total counter Number of function calls
gateway_service_count gauge Number of function replicas
http_request_duration_seconds histogram Time taken to serve an HTTP request
http_requests_total counter Total number of HTTP requests
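
For reference, the following is a minimal sketch of how the indicators in Table 6.1 could be registered with the Prometheus Go client (client_golang). The label names are illustrative assumptions, and gateway_service_count is modeled as a gauge because replica counts rise and fall.

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// gateway_functions_seconds: function call duration, per function.
	functionsSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "gateway_functions_seconds",
		Help: "Function call duration in seconds.",
	}, []string{"function_name"})

	// gateway_function_invocation_total: calls per function and status code.
	invocationTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "gateway_function_invocation_total",
		Help: "Total number of function invocations.",
	}, []string{"function_name", "code"})

	// gateway_service_count: current number of replicas of each function.
	serviceCount = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "gateway_service_count",
		Help: "Number of replicas of a function.",
	}, []string{"function_name"})
)

// Register wires the metrics into the default registry and exposes them
// on the standard /metrics endpoint that Prometheus scrapes.
func Register(mux *http.ServeMux) {
	prometheus.MustRegister(functionsSeconds, invocationTotal, serviceCount)
	mux.Handle("/metrics", promhttp.Handler())
}
```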

NATS provides the queue for asynchronous function execution: synchronous function calls go through the /function/ route, while asynchronous calls go through the /async-function route. The http_request_* indicators record the latency and request statistics of the /system/* routes, so that both the FaaS API gateway and the FaaS function runtime are covered by monitoring. A sketch of how such request metrics can be collected is shown below.
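
The http_request_* indicators can be collected with a small piece of HTTP middleware. This sketch again assumes client_golang; the route and method labels are illustrative.

```go
package metrics

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// http_request_duration_seconds: time taken to serve an HTTP request.
	requestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "http_request_duration_seconds",
		Help: "Time taken to serve an HTTP request.",
	}, []string{"route", "method"})

	// http_requests_total: total number of HTTP requests served.
	requestsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total number of HTTP requests.",
	}, []string{"route", "method"})
)

func init() { prometheus.MustRegister(requestDuration, requestsTotal) }

// InstrumentRoute wraps a handler so that the latency and count of each
// request on the given route (e.g. "/system/info") are recorded.
func InstrumentRoute(route string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		requestDuration.WithLabelValues(route, r.Method).
			Observe(time.Since(start).Seconds())
		requestsTotal.WithLabelValues(route, r.Method).Inc()
	})
}
```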

Monitoring visualization uses Grafana. Grafana is an open source analytics and monitoring solution and an open observability platform. It has a pluggable data source model and ships with support for many of the most popular time series databases; a built-in Prometheus data source has been included since version 2.5.0. Part of the customized Grafana dashboard for function monitoring is shown in Figure 6.3.

Figure 6.3 Part of the Grafana function monitoring dashboard

6.3 Alarm and Automatic Capacity Expansion

As shown in Figure 6.4, alarms and automatic scaling are triggered by configured rules. The rule engine is responsible for rule configuration and management, and the rule triggering service checks the thresholds of the managed rules; when a threshold is reached, the action defined by the rule is triggered. Alarm rules are defined with Prometheus' flexible PromQL. Alertmanager handles the alerts sent by client applications such as the Prometheus server, deduplicating and grouping them and routing them to the integrated receivers.

Figure 6.4 Alarm and automatic capacity expansion

The rule engine defines the alarm rules and automatic scaling rules for functions in the function computing system, generates alarms based on the monitoring data, and scales functions in and out based on load. In addition, the FaaS API gateway needs to provide a /system/alert route to process Alertmanager alarms and the resulting automatic scaling requests; a sketch of such a handler follows.
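
The following is a minimal sketch of the /system/alert route. It assumes the standard Alertmanager webhook payload; the function_name label and the ReplicaScaler interface are illustrative assumptions. Alertmanager would be pointed at this route through a webhook receiver in its configuration.

```go
package gateway

import (
	"encoding/json"
	"net/http"
)

// alertmanagerPayload models the fields of the Alertmanager webhook
// notification that the gateway needs for scaling decisions.
type alertmanagerPayload struct {
	Status string `json:"status"` // "firing" or "resolved"
	Alerts []struct {
		Status string            `json:"status"`
		Labels map[string]string `json:"labels"`
	} `json:"alerts"`
}

// ReplicaScaler abstracts the backend (here: Nomad) that changes the
// replica count of a function job. (Illustrative interface.)
type ReplicaScaler interface {
	Scale(functionName string, delta int) error
}

// AlertHandler processes Alertmanager notifications: a firing alert
// scales the named function out, a resolved alert scales it back in.
func AlertHandler(scaler ReplicaScaler) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var payload alertmanagerPayload
		if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		for _, alert := range payload.Alerts {
			fn := alert.Labels["function_name"] // label set by the alert rule
			if fn == "" {
				continue
			}
			delta := 1
			if alert.Status == "resolved" {
				delta = -1
			}
			if err := scaler.Scale(fn, delta); err != nil {
				http.Error(w, err.Error(), http.StatusInternalServerError)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}
```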

In the implementation, Prometheus integrates Alertmanager to interconnect with the rule engine module, and Alertmanager alarm rules are configured on top of the Prometheus monitoring indicators. As shown in Figure 6.5, container workloads are added for Alertmanager and NATS Streaming, with NATS Streaming serving as the asynchronous call queue. The automatic scaling rules should be further refined: different scaling policies should be abstracted, and QPS, CPU, and user-defined indicators should all be supported; one possible abstraction is sketched after Figure 6.5.

Figure 6.5 Design of the function computing alarm and automatic capacity expansion feature
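
One way to realize this abstraction is a small strategy interface, sketched below with a QPS-based policy. All names and the scaling formula are illustrative; CPU-based and user-defined policies would implement the same interface.

```go
package scaling

// MetricSample is one observation of the indicator a policy watches,
// e.g. queries per second or CPU utilization of a function.
type MetricSample struct {
	Function string
	Value    float64
}

// Policy turns a metric sample into a desired replica count, bounded
// by the limits configured for the function.
type Policy interface {
	DesiredReplicas(current int, sample MetricSample) int
}

// QPSPolicy scales proportionally to load: one replica per TargetQPS
// queries per second, clamped to the range [Min, Max].
type QPSPolicy struct {
	TargetQPS float64
	Min, Max  int
}

func (p QPSPolicy) DesiredReplicas(current int, s MetricSample) int {
	desired := int(s.Value/p.TargetQPS) + 1
	if desired < p.Min {
		desired = p.Min
	}
	if desired > p.Max {
		desired = p.Max
	}
	return desired
}
```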

Because the edge computing environment must consider energy saving, applications need to be dynamically started and stopped according to their actual load. The number of function replicas can be reduced to zero, which frees the resources of functions that are not being accessed and reduces power consumption at the edge. Scaling to zero is implemented in the faas-idler described in Section 4.3.2 or via Alertmanager. The current automatic scale-out rule fires when the function invocation rate exceeds 5 within 10 seconds, as shown in Figure 6.6; a sketch of the idler logic for scaling to zero follows the figure.

Figure 6.6 Automatic capacity expansion rule
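
The idler logic could look like the sketch below: it queries the invocation-rate indicator from Table 6.1 through the Prometheus HTTP API client and scales idle functions to zero. The five-minute idle window and the ReplicaScaler interface are assumptions.

```go
package idler

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// ReplicaScaler sets an absolute replica count for a function job; on
// this platform it would be backed by the Nomad API. (Illustrative.)
type ReplicaScaler interface {
	ScaleTo(functionName string, replicas int) error
}

// idleQuery returns a PromQL query for the function's recent
// invocation rate; a rate of zero means nobody is calling it.
func idleQuery(fn string) string {
	return fmt.Sprintf(
		`sum(rate(gateway_function_invocation_total{function_name=%q}[5m]))`, fn)
}

// ScaleDownIdle queries Prometheus for the function's invocation rate
// and scales the function to zero replicas when it is idle.
func ScaleDownIdle(ctx context.Context, promAddr, fn string, scaler ReplicaScaler) error {
	client, err := api.NewClient(api.Config{Address: promAddr})
	if err != nil {
		return err
	}
	result, _, err := promv1.NewAPI(client).Query(ctx, idleQuery(fn), time.Now())
	if err != nil {
		return err
	}
	vec, ok := result.(model.Vector)
	if ok && (len(vec) == 0 || vec[0].Value == 0) {
		return scaler.ScaleTo(fn, 0) // release the idle function's resources
	}
	return nil
}
```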

As shown in Figure 6.7, a rule has been triggered and a function job is being scaled out, as seen in Nomad's Web UI.

Figure 6.7 A function job undergoing capacity expansion

As shown in Figure 6.8, the capacity expansion process can be observed on the monitoring dashboard.

Figure 6.8 Monitoring replica capacity expansion

6.4 Load Balancing

As shown in Figure 6.9, without load balancing the calls are severely skewed across the expanded replicas.

This paper uses Fabio for load balancing. Fabio is an HTTP and TCP reverse proxy that configures itself using data from Consul, providing zero-configuration load balancing for applications managed by Consul.

Figure 6.9 Cumulative call counts of function instances without load balancing

Note that the instances that existed before capacity expansion have accumulated more calls than the new ones, so what should be observed is the change in instance call counts after capacity expansion, as shown in Figure 6.10. With Fabio load balancing in place, the load on the function computing system is evenly distributed across the function instances. Fabio is integrated so that the addresses discovered in Consul are registered without manual intervention; a registration sketch follows Figure 6.10.

Figure 6.10 Changes in function instance call counts after load balancing
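
For illustration, the sketch below registers a function instance in Consul with the urlprefix- tag that Fabio watches. In the actual platform, Nomad's service stanza can perform this registration automatically; the health check endpoint shown is an assumption.

```go
package lb

import (
	"fmt"

	consul "github.com/hashicorp/consul/api"
)

// RegisterFunction registers one function instance in Consul. The
// "urlprefix-" tag is the convention Fabio watches: every healthy
// instance carrying the same prefix is added to its routing table,
// so requests to /function/<name> are balanced across instances
// with no manual configuration.
func RegisterFunction(name, address string, port int) error {
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		return err
	}
	return client.Agent().ServiceRegister(&consul.AgentServiceRegistration{
		ID:      fmt.Sprintf("%s-%s-%d", name, address, port),
		Name:    name,
		Address: address,
		Port:    port,
		Tags:    []string{fmt.Sprintf("urlprefix-/function/%s", name)},
		Check: &consul.AgentServiceCheck{
			// Hypothetical health endpoint; only healthy instances
			// stay in Fabio's routing table.
			HTTP:     fmt.Sprintf("http://%s:%d/_/health", address, port),
			Interval: "10s",
		},
	})
}
```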

6.5 Call Chain Tracing

Building on the function workflow design and implementation in Section 4.3.4, most operational issues in distributed architectures ultimately come down to two aspects: networking and observability. A serverless application is a set of interleaved distributed functions, so network connectivity and debugging between them are an order of magnitude more complex than in a single monolithic application. Call chain tracing based on the function workflow can reduce the cost of troubleshooting.

This paper uses Jaeger for distributed tracing. A distributed tracing system generally involves three core steps: code instrumentation, data storage, and query and display. This paper mainly addresses the instrumentation part; data storage and query/display can reuse existing open source cloud native components. After the data is collected and stored, distributed tracing systems typically present the call chain as a sequence diagram with a timeline. Following the OpenTracing specification keeps the implementation compatible with the APIs of different distributed tracing systems and avoids vendor lock-in. Jaeger is Uber's open source end-to-end distributed tracing system for monitoring and troubleshooting transactions in complex distributed systems, and it is compatible with the OpenTracing API.

A chain of function calls can be viewed as a DAG composed of multiple function processes, so call chain tracing can be implemented by adding instrumentation to the function workflow described above. All nodes in a traced link context share one RequestId. Each node along the call chain carries a SpanID and a ParentSpanID, identifying the node currently processing the user request and the previous node, respectively. SpanIDs are generated with the UUID random number mechanism to guarantee uniqueness, each request having its own SpanID, and the ParentSpanID is set from the SpanID of the previous node. RequestId, SpanID, and ParentSpanID are transmitted over the HTTP protocol used by the RESTful interfaces: the client generates the RequestId and adds it to the request header as X-Faas-Workflow-ReqID. A propagation sketch follows.
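
The following sketch shows the header propagation, assuming the google/uuid package. Only the X-Faas-Workflow-ReqID header is fixed by the description above; the SpanID and ParentSpanID header names are assumptions.

```go
package workflow

import (
	"net/http"

	"github.com/google/uuid"
)

// Tracing headers carried across workflow nodes. Only the ReqID
// header name is fixed by the platform; the other two are assumed.
const (
	headerRequestID  = "X-Faas-Workflow-ReqID"
	headerSpanID     = "X-Faas-Workflow-SpanID"
	headerParentSpan = "X-Faas-Workflow-ParentSpanID"
)

// PropagateTrace prepares the outgoing request for the next node in
// the call chain: the RequestId is forwarded unchanged, the current
// node's SpanID becomes the next node's ParentSpanID, and a fresh
// UUID is generated as the next SpanID.
func PropagateTrace(in *http.Request, out *http.Request) {
	out.Header.Set(headerRequestID, in.Header.Get(headerRequestID))
	out.Header.Set(headerParentSpan, in.Header.Get(headerSpanID))
	out.Header.Set(headerSpanID, uuid.NewString())
}

// NewTrace is called by the client that starts a workflow: it creates
// the RequestId and the root span.
func NewTrace(req *http.Request) {
	req.Header.Set(headerRequestID, uuid.NewString())
	req.Header.Set(headerSpanID, uuid.NewString())
}
```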

The instrumentation, as shown in Figure 6.11, is implemented by adding OpenTracing interface properties to the OpenFaaSEventHandler and tracing through the EventHandler lifecycle. The traced entities include Request, Execution, Node, and Operation. The EventHandler has three trigger points: Start, End, and Failure; Execution has two additional triggers: Forward and Continuation. A minimal sketch of the pattern follows the figure.

Figure 6.11 The EventHandler implementation adds OpenTracing
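
The sketch below illustrates the pattern with the opentracing-go API. The TracedEventHandler type and its method set are illustrative and simplified compared with the actual OpenFaaSEventHandler: Start/End/Failure map onto span start, span finish, and an error tag.

```go
package tracing

import (
	opentracing "github.com/opentracing/opentracing-go"
)

// TracedEventHandler records the lifecycle of one traced entity
// (Request, Execution, Node or Operation) as an OpenTracing span.
type TracedEventHandler struct {
	spans map[string]opentracing.Span // open spans keyed by entity id
}

func NewTracedEventHandler() *TracedEventHandler {
	return &TracedEventHandler{spans: make(map[string]opentracing.Span)}
}

// Start opens a span for the entity; a parent span context links node
// and operation spans into the request's trace.
func (h *TracedEventHandler) Start(entity, id string, parent opentracing.SpanContext) {
	opts := []opentracing.StartSpanOption{}
	if parent != nil {
		opts = append(opts, opentracing.ChildOf(parent))
	}
	span := opentracing.GlobalTracer().StartSpan(entity, opts...)
	span.SetTag("faas.id", id)
	h.spans[id] = span
}

// End finishes the entity's span, reporting it to Jaeger.
func (h *TracedEventHandler) End(id string) {
	if span, ok := h.spans[id]; ok {
		span.Finish()
		delete(h.spans, id)
	}
}

// Failure marks the span as failed before finishing it.
func (h *TracedEventHandler) Failure(id string, err error) {
	if span, ok := h.spans[id]; ok {
		span.SetTag("error", true)
		span.SetTag("message", err.Error())
		span.Finish()
		delete(h.spans, id)
	}
}
```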

A RequestId field is added to the faas-workflow request body to uniquely identify the request and to store the state of the upstream request. For state storage, this paper uses Consul's key-value store and abstracts a StateStore interface so that other state storage systems can also be used, as sketched below.
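
A sketch of the StateStore abstraction over Consul's KV API follows; the exact method set is illustrative, and the thesis's actual interface may differ.

```go
package state

import (
	consul "github.com/hashicorp/consul/api"
)

// StateStore abstracts where upstream request state is kept, so the
// Consul backend can be swapped for another key-value store.
type StateStore interface {
	Set(key string, value []byte) error
	Get(key string) ([]byte, error)
}

// ConsulStateStore implements StateStore on Consul's KV API.
type ConsulStateStore struct {
	kv *consul.KV
}

func NewConsulStateStore() (*ConsulStateStore, error) {
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		return nil, err
	}
	return &ConsulStateStore{kv: client.KV()}, nil
}

func (s *ConsulStateStore) Set(key string, value []byte) error {
	_, err := s.kv.Put(&consul.KVPair{Key: key, Value: value}, nil)
	return err
}

func (s *ConsulStateStore) Get(key string) ([]byte, error) {
	pair, _, err := s.kv.Get(key, nil)
	if err != nil || pair == nil {
		return nil, err
	}
	return pair.Value, nil
}
```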

As shown in Figure 6.12, the current tracing UI is still limited: some personalized analysis requirements, such as comparisons and latency distribution statistics, cannot be met quickly. More functions can later be customized to meet the needs of function computing call chain tracing.

Figure 6.12 Function call chain tracing