Recently our business has been growing very rapidly, which is a big challenge for our backend. The performance bottlenecks in the core business interfaces were not a single problem but several issues stacked together: as soon as we shipped a fix for one, we discovered the next bottleneck. Because of my lack of experience, I could not locate and solve them all at once; and out of stubborn pride on behalf of the whole backend team, I did not want to ride out the traffic peak simply by scaling out a large number of instances. As a result, problems of varying severity appeared in production for several consecutive nights, which certainly had an impact on our business growth. This is where I was immature, and where I have been reflecting. This series of articles mainly records the general technical optimizations of our backend microservice system; the business-process and cache optimizations made for this business growth are specific to our own business and will not be covered here. The series is divided into the following parts:
- Improving the client-side load balancing algorithm
- Developing a plug-in to filter the exception stacks in log output
- Improving the asynchronous log waiting strategy for x86 cloud environments
- Adding monitoring of HTTP request waiting queues for synchronous microservices, and, for cloud deployments, carefully handling slow request responses caused by reaching the instance network traffic limit
- Adding necessary intrusive monitoring for critical system services
Added monitoring of HTTP request waiting queues for synchronous microservices
Problems with request timeouts for synchronous microservices
Compared with asynchronous microservices based on Spring-WebFlux, synchronous microservices based on Spring-WebMVC do not handle client request timeouts well. When a client request times out, the client immediately receives a timeout exception, but in a Spring-WebMVC based synchronous microservice the server-side task that was invoked is not cancelled, whereas in a Spring-WebFlux based asynchronous microservice it is. Currently there is no good way to cancel these already-timed-out tasks in a synchronous environment.
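To make the difference concrete, here is a minimal sketch of the client side, assuming a hypothetical /slow-endpoint and a Reactor `WebClient` caller (neither is from our actual code). The client gives up after one second; whether the server-side work is then cancelled depends on whether the server is reactive (Spring-WebFlux) or servlet-based (Spring-WebMVC), as described above.

```java
import java.time.Duration;

import org.springframework.web.reactive.function.client.WebClient;

public class TimeoutDemo {
    public static void main(String[] args) {
        WebClient client = WebClient.create("http://localhost:8080");

        // Client-side timeout of 1 second on a hypothetical slow endpoint.
        // Against a Spring-WebFlux server the cancellation propagates and the reactive
        // chain on the server is cancelled; against a Spring-WebMVC server the servlet
        // thread keeps processing the request even though the client has already given up.
        String body = client.get()
                .uri("/slow-endpoint")
                .retrieve()
                .bodyToMono(String.class)
                .timeout(Duration.ofSeconds(1))
                .onErrorReturn("client timed out")
                .block();

        System.out.println(body);
    }
}
```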
For our Spring-WebMVC based synchronous microservices, the HTTP container is Undertow. In a Spring Boot environment, we can configure the size of the thread pool that handles HTTP requests:
```yaml
server:
  undertow:
    # The buffers below are used for the I/O of server connections.
    # If a ByteBuffer were requested every time one is needed, heap buffers would have to go through
    # the JVM memory allocation process (TLAB -> heap) and direct buffers would require system calls,
    # which is inefficient. Therefore a memory pool is generally introduced; here it is the BufferPool.
    # Currently there is only one implementation in Undertow, DefaultByteBufferPool; the other implementations are not usable yet.
    # This DefaultByteBufferPool is very simple compared with Netty's ByteBufArena, similar to the JVM TLAB mechanism.
    # It is best to use the same value as your system's TCP socket buffer configuration:
    # /proc/sys/net/ipv4/tcp_rmem (for reading)
    # /proc/sys/net/ipv4/tcp_wmem (for writing)
    # When the memory is greater than 128 MB, the default bufferSize is 16 KB minus 20 bytes, which are used for the protocol header.
    buffer-size: 16364
    # Whether to allocate direct memory (off-heap memory allocated directly through NIO). It is enabled here,
    # so the Java startup parameters need to configure the direct memory size to reduce unnecessary GC.
    # When the memory is greater than 128 MB, direct buffers are used by default.
    direct-buffers: true
    threads:
      # io: the I/O threads perform non-blocking tasks and are each responsible for multiple connections;
      # the default is one reader thread and one writer thread per CPU core.
      # worker: its value depends on the blocking factor of the tasks the system executes;
      # the default is the number of I/O threads * 8.
      worker: 128
```
Behind this worker thread pool is the JBoss thread pool `org.jboss.threads.EnhancedQueueExecutor`. Spring Boot does not provide a configuration property to modify this thread pool's queue size; the default queue size is `Integer.MAX_VALUE`.
We need to monitor the queue size of this thread pool and take action based on it (a sketch of such a check follows this list):
- When the queue keeps growing, it means the processing rate cannot keep up with the request arrival rate, and an alarm is required.
- When the queue builds up to a certain size, the instance needs to be temporarily removed from the registry and capacity needs to be expanded.
- When the queue has been fully consumed, the instance goes back online.
- If the queue still has not been consumed after a certain period of time, the instance is restarted.
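As mentioned above, here is a minimal sketch of such a periodic check. It assumes Spring Cloud's `ServiceRegistry`/`Registration` abstractions are on the classpath and reads the queue size from the same JMX attribute used later in this article; the threshold, class name, and bean wiring are illustrative rather than our production code.

```java
import java.lang.management.ManagementFactory;

import javax.management.ObjectName;

import org.springframework.cloud.client.serviceregistry.Registration;
import org.springframework.cloud.client.serviceregistry.ServiceRegistry;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import lombok.extern.log4j.Log4j2;

@Log4j2
@Component
public class HttpQueueGuard {
    // Hypothetical threshold; tune it according to your own capacity tests
    private static final int OFFLINE_THRESHOLD = 1000;

    private final ServiceRegistry<Registration> serviceRegistry;
    private final Registration registration;
    private volatile boolean offline = false;

    public HttpQueueGuard(ServiceRegistry<Registration> serviceRegistry, Registration registration) {
        this.serviceRegistry = serviceRegistry;
        this.registration = registration;
    }

    @Scheduled(fixedDelay = 5000)
    public void check() {
        try {
            // Same MBean attribute as the Prometheus gauge shown later in this article
            int queueSize = (Integer) ManagementFactory.getPlatformMBeanServer()
                    .getAttribute(new ObjectName("org.xnio:type=Xnio,provider=\"nio\",worker=\"XNIO-2\""), "WorkerQueueSize");
            if (!offline && queueSize > OFFLINE_THRESHOLD) {
                log.warn("http request queue size {} is too large, temporarily deregistering this instance", queueSize);
                serviceRegistry.deregister(registration);
                offline = true;
            } else if (offline && queueSize == 0) {
                log.info("http request queue drained, registering this instance again");
                serviceRegistry.register(registration);
                offline = false;
            }
        } catch (Exception e) {
            log.error("failed to check http request queue size", e);
        }
    }
}
```

Scheduling requires `@EnableScheduling` somewhere in the application; the alarm and restart actions in the list would hook into the same kind of check.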
Adding monitoring of HTTP request waiting queues for synchronous microservices
Fortunately, `org.jboss.threads.EnhancedQueueExecutor` itself exposes the metrics of the HTTP servlet request thread pool through JMX:
In our project, we use two types of monitoring:
- Prometheus + Grafana microservice metrics monitoring, which is used for alerting and quickly locating problem sources
- JFR monitoring, which is used to locate problems on a single instance in detail
For monitoring the HTTP request waiting queue, we should expose the metric through the Prometheus endpoint so that Grafana can collect it, and refine the alerting and response operations.
The code that exposes the Prometheus interface metrics is:
```java
import java.lang.management.ManagementFactory;

import javax.management.ObjectName;

import org.springframework.beans.factory.ObjectProvider;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.autoconfigure.metrics.export.ConditionalOnEnabledMetricsExport;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.event.ContextRefreshedEvent;
import org.springframework.context.event.EventListener;

import io.micrometer.core.instrument.Gauge;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import lombok.extern.log4j.Log4j2;

@Log4j2
@Configuration(proxyBeanMethods = false)
// Load only when Prometheus is introduced and the actuator exposes the Prometheus endpoint
@ConditionalOnEnabledMetricsExport("prometheus")
public class UndertowXNIOConfiguration {
    @Autowired
    private ObjectProvider<PrometheusMeterRegistry> meterRegistry;
    // Initialize only once
    private volatile boolean isInitialized = false;

    // Registration needs to happen after the ApplicationContext is refreshed.
    // The logging configuration is initialized before the ApplicationContext is loaded,
    // but loading the Prometheus-related beans is complicated, so registration is done after the whole context is refreshed.
    // The ApplicationContext may be refreshed several times, for example via /actuator/refresh.
    // For simplicity, a simple isInitialized flag is used to determine whether this is the first initialization,
    // ensuring the gauge is registered only once.
    @EventListener(ContextRefreshedEvent.class)
    public synchronized void init() {
        if (!isInitialized) {
            Gauge.builder("http_servlet_queue_size", () -> {
                try {
                    return (Integer) ManagementFactory.getPlatformMBeanServer()
                            .getAttribute(new ObjectName("org.xnio:type=Xnio,provider=\"nio\",worker=\"XNIO-2\""), "WorkerQueueSize");
                } catch (Exception e) {
                    log.error("get http_servlet_queue_size error", e);
                }
                return -1;
            }).register(meterRegistry.getIfAvailable());
            isInitialized = true;
        }
    }
}
```
Then, call /actuator/prometheus and you can see the corresponding metric:
```
# HELP http_servlet_queue_size
# TYPE http_servlet_queue_size gauge
http_servlet_queue_size 0.0
```
When the queue starts to build up, we can quickly raise an alarm and see it intuitively in the Grafana dashboard:
For public cloud deployment, pay attention to monitoring network restrictions
Current public clouds virtualize physical machine resources and network card resources. For example, AWS implements network resource virtualization through the Elastic Network Adapter (ENA). It monitors and limits the following metrics:
- Bandwidth: each virtual machine instance (each EC2 instance in AWS) has a maximum bandwidth for outbound traffic and a maximum bandwidth for inbound traffic. It uses a network I/O credit mechanism to allocate network bandwidth based on average bandwidth usage, with the net effect that the rated bandwidth can be exceeded for a short time, but not continuously.
- Packets per second (PPS): each virtual machine instance (each EC2 instance in AWS) has a limit on PPS.
- Number of connections: the number of connections that can be established is limited.
- Link-local service access traffic: generally in the public cloud, each virtual machine instance (each EC2 instance in AWS) has traffic limits for accessing DNS, the metadata server, and so on.
Meanwhile, mature public clouds generally provide users with dashboards and analysis for these metrics. For example, AWS CloudWatch provides monitoring of the following metrics:
During a traffic surge, we found through JFR a performance bottleneck in accessing Redis, but Redis's own monitoring showed that it had not hit a performance bottleneck. In that case we needed to check whether the problem was caused by the instance's network traffic limits, and indeed, during the time window when things went wrong, we found a large increase in NetworkBandwidthOutAllowanceExceeded events:
For this kind of problem, you need to weigh vertical scaling (upgrading the instance configuration) against horizontal scaling (load balancing across more instances), or reduce the network traffic (by adding compression, etc.).
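For the "reduce network traffic" option, one low-effort approach is to compress large payloads before they leave the instance. Below is a minimal sketch that GZIP-compresses a value before writing it to Redis with Spring Data's `RedisTemplate`; the class name, key handling, and template wiring are illustrative, and it assumes a byte[]-friendly value serializer is configured. The reader must of course decompress symmetrically.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

import org.springframework.data.redis.core.RedisTemplate;

public class CompressedRedisWriter {
    private final RedisTemplate<String, byte[]> redisTemplate;

    public CompressedRedisWriter(RedisTemplate<String, byte[]> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    public void setCompressed(String key, String value) throws IOException {
        // GZIP-compress the payload to reduce the instance's outbound network traffic
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(value.getBytes(StandardCharsets.UTF_8));
        }
        // Store the compressed bytes; readers must decompress with GZIPInputStream
        redisTemplate.opsForValue().set(key, bos.toByteArray());
    }
}
```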