System High Availability: Health Checks, Health Metrics, and Related Matters

One, foreword

As living standards keep rising, people pay more and more attention to their health. Many have had a physical examination, companies commonly offer an annual one as a benefit, and the health check has become a household term.

With the rapid development of the Internet, competition among similar, homogeneous products keeps intensifying, and user experience has become a key differentiator. Besides product design, technology strongly affects user experience, mainly through service availability and response speed. Improving both requires appropriate means, and health checks are an essential prerequisite for keeping services available and responsive.

What are the items, indicators, and methods of a system health check? This article walks through them.

Two, what is a health check

A physical examination uses medical means and methods to examine a person, understand their state of health, and detect early signs of disease and health risks. A system health check is the process of using technical means to check whether a series of objects, such as networks, hosts, applications, and services, are healthy or available.

Three, why do you need to do a health check

Internet products set a high bar for user experience, yet technical problems such as slow responses or service outages often degrade it. They can interrupt business, hurt revenue, and do serious damage to the company’s brand and reputation.

Many factors can make a service unavailable or slow: server hardware fails or an optical fiber is cut; excessive requests drive the database CPU load and disk I/O too high; or a colleague has left a hidden bug, so that a newly released feature runs out of memory (OOM) the first time it runs.

What can we do to ensure high availability? Some would say: make the system’s nodes redundant and eliminate single points of failure, and that is enough. That is right; eliminating single points is a common way to make a system highly available. But it has an important prerequisite: the problem node must first be found, so that it can be removed or its traffic switched to a healthy node.

“Finding the problem node” is exactly what the system health check is for.

Four, how to do a health check

Before discussing how to run a health check, we first need to know what is actually being checked. The object can be a network connection, a small functional component, a process, a service cluster, or an entire data-center unit. So to achieve “high availability”, first work out at which level it is needed and which objects may be single points of failure, and define the “object” clearly.

So, how to do a health check? There are usually two ways: active and passive.

4.1 Active Mode

The checker, acting as the active party, periodically sends health check requests. The content or format of the request is usually designed specifically for this purpose, and the checked object performs a simple self-check and then returns a response. Here is an example:

check interval=3000 rise=2 fall=5 timeout=1000 type=http;
check_http_send "HEAD /check.do HTTP/1.0\r\n\r\n";
check_http_expect_alive http_2xx http_3xx;

A check is sent every 3000 ms with a timeout of 1000 ms. If the number of consecutive failures reaches fall=5, the server is considered down. If the number of consecutive successes reaches rise=2, the server is considered up. A response counts as healthy only if its status code is 2xx or 3xx.
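The rise/fall rule above can be sketched as a small state machine. This is only an illustration of the counting logic, not the code of any Nginx module; the class name RiseFallTracker and its methods are hypothetical, and the initial state is left to the caller.

```java
// Hypothetical sketch: derives a backend's up/down state from active probe results.
final class RiseFallTracker {
    private final int rise;   // consecutive successes needed to mark the server up
    private final int fall;   // consecutive failures needed to mark the server down
    private int successes;    // current run of consecutive successes
    private int failures;     // current run of consecutive failures
    private boolean up;

    RiseFallTracker(int rise, int fall, boolean initiallyUp) {
        this.rise = rise;
        this.fall = fall;
        this.up = initiallyUp;
    }

    /** Records one probe result and returns the resulting state (true = up). */
    boolean record(boolean probeSucceeded) {
        if (probeSucceeded) {
            successes++;
            failures = 0;               // a success breaks the failure run
            if (!up && successes >= rise) {
                up = true;
            }
        } else {
            failures++;
            successes = 0;              // a failure breaks the success run
            if (up && failures >= fall) {
                up = false;
            }
        }
        return up;
    }

    boolean isUp() {
        return up;
    }
}
```

With rise=2 and fall=5, a healthy server survives four failed probes in a row and is marked down on the fifth; it then needs two consecutive successes to come back. Requiring several results in a row keeps a single lost packet from flapping the server's state.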

4.2 Passive Mode

A passive health check does not use dedicated health check requests. Instead, the state of normal connections or the responses to real service requests serve as the indicators of the checked object’s health. For example, the passive health check configuration of the official open source version of Nginx:

server 127.0.0.1:8080       max_fails=3 fail_timeout=30s;

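Nginx bases this check on its real attempts to communicate with the backend. If three attempts fail within 30 seconds, the backend web service is considered unavailable, and it stays marked unavailable for the same 30 seconds before Nginx tries it again.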
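The idea of max_fails and fail_timeout can be sketched as a failure counter over a time window. This is a simplified sliding-window model under my own assumptions, not Nginx's actual implementation; the class PassiveFailureTracker and its methods are hypothetical, and timestamps are passed in explicitly so the logic stays deterministic.

```java
import java.util.ArrayDeque;

// Hypothetical sketch: marks a backend unavailable after maxFails failed
// requests inside one failTimeout window, then keeps it out for failTimeout.
final class PassiveFailureTracker {
    private final int maxFails;
    private final long failTimeoutMillis;
    private final ArrayDeque<Long> recentFailures = new ArrayDeque<>();
    private long unavailableUntil = Long.MIN_VALUE;

    PassiveFailureTracker(int maxFails, long failTimeoutMillis) {
        this.maxFails = maxFails;
        this.failTimeoutMillis = failTimeoutMillis;
    }

    /** Called when a real request to the backend failed at time nowMillis. */
    void onRequestFailed(long nowMillis) {
        // Forget failures that have fallen out of the window.
        while (!recentFailures.isEmpty()
                && nowMillis - recentFailures.peekFirst() >= failTimeoutMillis) {
            recentFailures.removeFirst();
        }
        recentFailures.addLast(nowMillis);
        if (recentFailures.size() >= maxFails) {
            unavailableUntil = nowMillis + failTimeoutMillis;
            recentFailures.clear();
        }
    }

    /** True if the backend may receive traffic at time nowMillis. */
    boolean isAvailable(long nowMillis) {
        return nowMillis >= unavailableUntil;
    }
}
```

Because no extra probe traffic is generated, the cost of a passive check is that some real requests must fail before the node is taken out of rotation.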

4.3 Eliminating single points

As mentioned above, achieving high availability requires eliminating single points of failure. The simplest and most direct solution is to add a standby service node. When a periodic heartbeat health check finds that the active node is down, the standby node takes over its work and clients switch their request traffic to the standby node.

The active and standby nodes check each other’s health through a dedicated heartbeat. Because of a network partition or a similar fault, they may fail to receive each other’s heartbeats. The standby node then concludes that the active node is down, and the active node concludes that the standby node is down, although both are in fact running normally and clients can reach both. The result is two active nodes at once (“dual master”), a phenomenon known in the industry as split-brain.

Split-brain can lead to inconsistent data and compromise the correctness of the service. Introducing a third party for arbitration effectively prevents it.

4.4 Third-party Arbitration

Since the active and standby nodes cannot confirm each other’s liveness, a third-party arbitration node makes the decision when they disagree: it decides which one is the active node. The arbitration node is generally implemented with a highly available solution such as ZooKeeper.
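One common way to arbitrate is a lease: a node is the active node only while it holds a lease that it must keep renewing with the arbiter. The sketch below is an in-memory illustration of that idea under my own assumptions; it is not the ZooKeeper API, and the class LeaseArbiter and its method are hypothetical. A real arbiter would itself have to be replicated so that it does not become the new single point of failure.

```java
// Hypothetical sketch: an arbiter that grants the "active" role as a
// time-limited lease, so that at most one node is active at any moment.
final class LeaseArbiter {
    private final long leaseMillis;
    private String holder;          // node currently holding the lease, or null
    private long leaseExpiresAt;    // time at which the current lease lapses

    LeaseArbiter(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /**
     * A node calls this periodically. Returns true if nodeId is the active
     * node after the call, false if another node still holds a valid lease.
     */
    synchronized boolean tryAcquireOrRenew(String nodeId, long nowMillis) {
        boolean leaseFree = holder == null || nowMillis >= leaseExpiresAt;
        if (leaseFree || holder.equals(nodeId)) {
            holder = nodeId;
            leaseExpiresAt = nowMillis + leaseMillis;
            return true;
        }
        return false;
    }
}
```

If the two nodes lose sight of each other but can both still reach the arbiter, only the lease holder stays active. A node that cannot reach the arbiter must stop acting as active once its lease has expired, which is what prevents the dual-active state.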

Five, examples of health checks

5.1 Network Devices

Keepalived is a high-availability service for clusters that, like Heartbeat, is used to prevent single points of failure. It is generally not deployed on its own but together with load-balancing technologies such as LVS, HAProxy, and Nginx to make the cluster highly available.

Its health checks cover two aspects. The first is the health check between Keepalived instances, carried out through VRRP heartbeat messages.

The second is the health check that Keepalived runs against the local load-balancing component, configured as follows:

vrrp_script check_nginx_running {
    script "/usr/local/bin/check_running"   # script that performs the check
    interval 10                             # seconds between script runs
    weight -10                              # adjustment applied to the VRRP priority based on the script result
}

Here the health check of the application is implemented with a custom script.

Keepalived instances check each other’s health through VRRP. If the primary server goes down, the standby server is elected as the new primary through VRRP and takes over the virtual IP address from the old primary, which achieves high availability.

VRRP packets are encapsulated directly in IP packets and therefore support all kinds of upper-layer protocols. Network devices such as switches, routers, and firewalls usually use VRRP to implement active/standby HA switchover.

When a network device is faulty, VRRP elects a new network device to take over the data traffic, ensuring reliable network communication.

5.2 Network Connection

When a mobile device reaches the Internet through NAT, the push feature of a mobile app needs to keep a long-lived connection to the server. However, most mobile network operators remove a connection’s entry from the NAT table when no data has been exchanged for a while, which breaks the connection. To keep the connection “healthy” and usable, the app and the server can exchange Ping/Pong heartbeat messages at regular intervals once the connection is established.

The above is a connection health check at the application layer. The operating system also supports a connection health check at the network level, namely TCP Keepalive. After a connection has been idle for a period of time, TCP Keepalive sends an empty probe packet, which keeps the connection from being dropped by intermediate network devices such as firewalls. Linux configures Keepalive with the following three parameters: the idle time before the first probe, the interval between probes (both in seconds), and the number of unanswered probes after which the connection is considered dead:

net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
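A short calculation shows what these three values mean together. The sketch below only does the arithmetic; the class and method names are hypothetical, and the result is the approximate worst case for noticing a peer that has silently disappeared on an otherwise idle connection.

```java
// Hypothetical sketch: how long an idle connection can sit on a dead peer
// before TCP Keepalive gives up, given the three Linux parameters.
final class KeepaliveMath {
    /**
     * @param keepaliveTime   idle seconds before the first probe (tcp_keepalive_time)
     * @param keepaliveIntvl  seconds between probes (tcp_keepalive_intvl)
     * @param keepaliveProbes unanswered probes before giving up (tcp_keepalive_probes)
     */
    static long secondsUntilDeadPeerDetected(long keepaliveTime,
                                             long keepaliveIntvl,
                                             int keepaliveProbes) {
        return keepaliveTime + keepaliveIntvl * keepaliveProbes;
    }

    public static void main(String[] args) {
        long seconds = secondsUntilDeadPeerDetected(7200, 75, 9);
        // 7200 + 75 * 9 = 7875 seconds, i.e. 2 h 11 min 15 s
        System.out.println(seconds + " s");
    }
}
```

With the values shown, detection takes more than two hours, which is far longer than the NAT timeouts of typical mobile networks. This is one reason applications often add their own, much more frequent, Ping/Pong heartbeat.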

5.3 Hosts and Processes

Reachability between hosts can be checked with the ping command, which uses the ICMP protocol. It tests network connectivity along the entire path from the client to the target host, and is usually used to check by hand whether a host is up and whether the network is connected.

ICMP is a network-layer protocol and knows nothing about individual processes, so you cannot ping a process to find out whether it exists. A process does, however, listen on ports and appear in the process table, so you can telnet to its port or use the ps command to check whether it is alive. If a process is killed or exits abnormally, for example because of insufficient memory, a scheduled cron script can detect this and restart it automatically. This approach is very effective at improving the availability of applications in legacy projects that can only be deployed as a single instance.
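The "telnet to the port" check can also be done programmatically. The following is a minimal sketch using only the Java standard library; the class name PortProbe and its method are hypothetical. Note that it only shows that something accepts connections on the port, not that the process behind it is working correctly.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sketch: checks whether a process is listening on host:port
// by attempting a TCP connection with a timeout.
final class PortProbe {
    static boolean isPortOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;    // the TCP handshake completed
        } catch (IOException e) {
            return false;   // refused, unreachable, or timed out
        }
    }
}
```

A scheduled job could call such a probe and restart the process when it returns false, which is the same pattern as the cron script described above.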

5.4 Middleware – RocketMQ

NameServer is the routing center of RocketMQ. It maintains the service status and routing information of the Producer, Broker, and Consumer clusters. When a new Consumer joins the cluster, it reports its own information and also retrieves each Broker’s address, Topics, queues, and other information, so that it knows on which Brokers and queues the messages of the Topic it consumes are stored.

Multiple NameServers can be deployed, and they are independent of one another. When the Producer, Broker, and Consumer services start, they must be given the addresses of multiple NameServers. Service information is then registered with all of the specified NameServers at the same time, which achieves high availability.

Each Broker node maintains a TCP long connection with all NameServer nodes and sends heartbeat messages to NameServer every 30 seconds to tell NameServer that it is alive. Each NameServer checks the last heartbeat of each Broker every 10 seconds. If a Broker does not send heartbeat messages for more than 120s, it considers that the Broker is down, closes the corresponding network connection channel, and removes it from the routing information.
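The heartbeat bookkeeping described above can be sketched as a table of last-heartbeat timestamps plus a periodic scan. This is an illustration of the mechanism only, not RocketMQ's source code; the class BrokerLivenessTable and its methods are hypothetical, and time is passed in explicitly to keep the example deterministic.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch: a NameServer-like table that expires brokers whose
// last heartbeat is older than 120 seconds.
final class BrokerLivenessTable {
    static final long EXPIRY_MILLIS = 120_000L;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    /** Called whenever a broker's heartbeat arrives (every 30 s per broker). */
    synchronized void onHeartbeat(String brokerAddr, long nowMillis) {
        lastHeartbeat.put(brokerAddr, nowMillis);
    }

    /** Meant to run every 10 s; removes expired brokers and returns how many. */
    synchronized int scanNotActiveBrokers(long nowMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > EXPIRY_MILLIS) {
                it.remove();    // a real server would also close the channel here
                removed++;
            }
        }
        return removed;
    }

    /** True while the broker is still part of the routing information. */
    synchronized boolean isRoutable(String brokerAddr) {
        return lastHeartbeat.containsKey(brokerAddr);
    }
}
```

With a 30-second heartbeat and a 120-second expiry, a broker has to miss several heartbeats in a row before it is removed, so a single delayed heartbeat does not take a healthy broker out of the routing table.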

5.5 Application Layer – Spring Boot Actuator

A service instance or process can report that it is alive to other services through periodic heartbeat packets, but a heartbeat alone does not reflect its health. For example, when disk space runs out the service can no longer write data, yet it still answers heartbeats. The service depends on Redis and Redis has become unreachable, yet it still answers heartbeats. Some functions of the service depend on a distributed storage service that is unavailable, yet it still answers heartbeats. Many aspects therefore have to be considered to decide whether a service instance is alive and healthy. Spring Boot Actuator handles this well: it can reflect the health of the whole service, including the subsystems it depends on.

Spring Boot Actuator is a sub-project of Spring Boot that provides endpoints for external applications to access and interact with. The Actuator includes many features, such as health checks, auditing, and metrics collection, which help us monitor and manage Spring Boot applications. Health is one of these endpoints. It provides basic health information about a Spring Boot application and allows other cloud services or K8s to probe the application’s health periodically and respond to abnormal conditions in time.

If a microservice application uses resources such as MySQL, Amazon S3, Elasticsearch, and DynamoDB, its health check should include the health of all of these subsystems.

The health check of the Actuator is implemented through the HealthIndicator interface, which has a single health() method that returns a Health object.

@FunctionalInterface
public interface HealthIndicator {

    /**
     * Return an indication of health.
     * @return the health
     */
    Health health();

}

The Health object has two fields: status and details. By default, status has four possible values: UNKNOWN, UP, DOWN, and OUT_OF_SERVICE. Users can add their own entries to details.

@JsonInclude(Include.NON_EMPTY)
public final class Health extends HealthComponent {

    private final Status status;

    private final Map<String, Object> details;

    // ...
}

The Actuator has many commonly used HealthIndicators built in.

You can also implement your own as required, for example:

@Override
public Health health() {
    int errorCode = check(); // perform some specific health check
    if (errorCode != 0) {
        return Health.down().withDetail("Error Code", errorCode).build();
    }
    return Health.up().build();
}

By default the health endpoint is enabled and exposed. You can query the application’s health status at http://localhost:8080/actuator/health, which returns {"status": "UP"}. This is an aggregated summary. Detailed health information can be turned on with the configuration item management.endpoint.health.show-details=always, after which the response also contains the details of each individual health check.

The aggregated health status is computed by HealthAggregator. The aggregation algorithm is as follows: the health statuses of all subsystems are sorted in the order DOWN, OUT_OF_SERVICE, UP, UNKNOWN, and the first status in the sorted list becomes the overall status.

For example, if ehCache is UP, MySQL is UNKNOWN, and diskSpace is OUT_OF_SERVICE, the sorted order is OUT_OF_SERVICE, UP, UNKNOWN. The first one is OUT_OF_SERVICE, so the service as a whole is reported as unavailable.
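The ordering rule can be sketched in a few lines of plain Java. The types below are hypothetical stand-ins written for this example and are not Spring Boot's own Status or HealthAggregator classes; returning UNKNOWN for an empty input is likewise an assumption of the sketch.

```java
import java.util.Collection;
import java.util.Comparator;

// Hypothetical stand-in for the status values, declared in priority order:
// the earlier a constant appears, the more it dominates the aggregate.
enum SubsystemStatus { DOWN, OUT_OF_SERVICE, UP, UNKNOWN }

final class SimpleStatusAggregator {
    /** Returns the highest-priority status among the subsystems. */
    static SubsystemStatus aggregate(Collection<SubsystemStatus> statuses) {
        // Enums compare by declaration order, so min() picks the first
        // status in the order DOWN, OUT_OF_SERVICE, UP, UNKNOWN.
        return statuses.stream()
                .min(Comparator.naturalOrder())
                .orElse(SubsystemStatus.UNKNOWN);   // assumption: no subsystems means UNKNOWN
    }
}
```

Applied to the example above (UP, UNKNOWN, OUT_OF_SERVICE), the aggregate is OUT_OF_SERVICE. Taking the most severe status means that one unhealthy dependency is enough to flag the whole instance.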

Six, summary

High availability is a complex engineering problem made up of a series of sub-problems; health checks and health metrics are just one of them. To keep services and systems running continuously, every node on the request path must be highly available so that no single point of failure remains.

Health checks play an important role: they detect unhealthy or faulty nodes, trigger alarms, and enable fail-fast and failover, so that an unhealthy node does not set off an avalanche effect.

Author: Vivo Internet Server Team - Chen Jianbo