Zheng Jimin joined the Domestic Hotel Quotation Center team in August 2019 and is mainly responsible for quotation-related system development and architecture optimization. He has a strong interest in high concurrency and high availability, with experience building highly available distributed systems handling tens of millions of orders per day. He enjoys studying algorithms, twice advanced to the ACM-ICPC Asia regional contest, and won first prize in Qunar's first Hackathon competition.
Background
In the previous article, "Domestic Hotel Stability Governance Practice: Inter-system Dependency Governance", we described a dedicated governance effort on dependencies between systems, covering general rate limiting, caching, Dubbo, HTTP, DB, MQ, and so on. But governing the dependencies between systems is not enough; the resources inside each system also need to be analyzed and governed.
This article focuses on governing resources inside the system: the use of degradation, circuit breaking, isolation, and converting synchronous processing to asynchronous to control the process flow (rate limiting was covered in the previous article and is not repeated here), as well as the governance of core resources such as thread pools.
Governance measures and solutions
Degradation and circuit breaking
This mainly addresses situations where an external interface or resource fails: plans are prepared in advance so that the main process is not interrupted by such exceptions. For example, if application interface P1 invokes application interface P3, we must ensure that any failure of P3 does not affect P1's core process.
1) For external calls involved in core interfaces, investigate and implement degradation, giving priority to lossless degradation, and prepare lossy degradation as a fallback for core calls.
2) Apply circuit breaking to external interfaces in non-core scenarios (mainly based on failure rate and response time). Pay attention to what happens after the circuit opens: define the default return value or the thrown exception in advance.
3) For circuit breakers, it is recommended to prepare alternative interfaces, alternative resources, or default return values in advance for when the circuit opens, and to allow the circuit-breaking thresholds to be adjusted dynamically (a simple sketch follows this list).
4) Convention (not mandatory): new requirements and refactorings should be able to degrade back to the pre-change behavior, and new features should be able to be switched off, which greatly reduces release risk.
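As a rough illustration of points 2) and 3), here is a minimal, self-contained circuit-breaker sketch in Java. The class, the thresholds, and the trip condition (consecutive failures rather than failure rate, for brevity) are assumptions made for illustration; in practice a mature library or an in-house framework would normally be used instead.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

public class SimpleCircuitBreaker {
    // Illustrative thresholds; in a real system these would be dynamically configurable.
    private static final int FAILURE_THRESHOLD = 5;   // open after 5 consecutive failures
    private static final long OPEN_MILLIS = 10_000;   // stay open for 10 s before retrying

    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private final AtomicLong openedAt = new AtomicLong(0);

    public <T> T call(Supplier<T> remoteCall, T fallback) {
        // While the circuit is open, return the predefined default value immediately.
        if (isOpen()) {
            return fallback;
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures.set(0);                   // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            if (consecutiveFailures.incrementAndGet() >= FAILURE_THRESHOLD) {
                openedAt.set(System.currentTimeMillis()); // trip the breaker
            }
            return fallback;                              // degrade instead of propagating
        }
    }

    private boolean isOpen() {
        long ts = openedAt.get();
        if (ts == 0) {
            return false;
        }
        if (System.currentTimeMillis() - ts > OPEN_MILLIS) {
            openedAt.set(0);                              // half-open: allow a trial call
            consecutiveFailures.set(0);
            return false;
        }
        return true;
    }
}
```

A call site would then look like `breaker.call(() -> priceClient.query(hotelId), DEFAULT_PRICE)`, where `priceClient` and `DEFAULT_PRICE` are placeholders for a real remote client and a predefined degraded value.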
Isolation
This mainly addresses situations where the failure of some resources within a shared pool affects the other resources sharing it.
1) Thread pool isolation: pay particular attention to Dubbo thread pools, custom business thread pools, and the thread pool used by parallel streams in JDK 8 (parallelStream runs on the shared common ForkJoinPool by default; see the sketch after this list).
2) Data storage isolation: mainly the separation of core data from non-core data; core data can additionally be sharded.
3) Isolation of core interfaces from non-core interfaces: depending on business needs, this can be achieved with different applications, different groups, different thread pools, different clients, and so on.
4) Isolation can be considered wherever there is a distinction between core and non-core.
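Here is a small sketch of thread pool isolation (pool names and sizes are assumptions): core and non-core work get their own executors, and parallel-stream work is submitted into a private ForkJoinPool so it does not compete with other users of the shared JDK common pool.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

public class IsolationSketch {
    // Separate executors so that non-core work can never starve the core flow
    // (names and sizes are illustrative only).
    static final ExecutorService CORE_POOL = Executors.newFixedThreadPool(32);
    static final ExecutorService NON_CORE_POOL = Executors.newFixedThreadPool(8);

    // Run a parallel stream inside a private ForkJoinPool instead of the shared
    // JDK common pool, so slow tasks here cannot block other parallelStream users.
    static long sumInPrivatePool(List<Integer> values) throws Exception {
        ForkJoinPool privatePool = new ForkJoinPool(4);
        try {
            return privatePool.submit(
                    () -> values.parallelStream().mapToLong(Integer::longValue).sum()
            ).get();
        } finally {
            privatePool.shutdown();
        }
    }
}
```

Submitting the stream from inside a private ForkJoinPool is a commonly used (though not formally documented) way to keep parallelStream work off the common pool.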
Synchronous to asynchronous, serial to parallel
This mainly addresses scenarios where the entire process is blocked when a synchronous operation goes wrong.
1) The main process is generally handled synchronously, while secondary processes should be considered for asynchronous handling. Secondary processes must have proper exception handling and fallbacks so that their problems do not affect the main process.
2) Calls involved in core interfaces should be made in parallel where possible, to reduce the interface response time (see the sketch after this list).
3) For Dubbo, HTTP, and other interfaces, asynchronous invocation and processing should also be considered.
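A minimal sketch of points 1) and 2) using CompletableFuture is shown below; the pool size, the `fetchPrice`/`fetchReviews` calls, and the 800 ms timeout are illustrative assumptions, standing in for real Dubbo/HTTP invocations.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelCallSketch {
    // Dedicated pool for downstream calls (size is illustrative).
    private static final ExecutorService CALL_POOL = Executors.newFixedThreadPool(16);

    // Stand-ins for real downstream calls; in practice these would be Dubbo/HTTP invocations.
    static String fetchPrice(long hotelId)   { return "price-" + hotelId; }
    static String fetchReviews(long hotelId) { return "reviews-" + hotelId; }

    // Core data (price) and secondary data (reviews) are fetched in parallel;
    // the secondary call has its own fallback so its failure never breaks the main flow.
    static String buildResponse(long hotelId) throws Exception {
        CompletableFuture<String> price = CompletableFuture
                .supplyAsync(() -> fetchPrice(hotelId), CALL_POOL);
        CompletableFuture<String> reviews = CompletableFuture
                .supplyAsync(() -> fetchReviews(hotelId), CALL_POOL)
                .exceptionally(ex -> "reviews-unavailable");  // fallback for the secondary flow
        return price.thenCombine(reviews, (p, r) -> p + "|" + r)
                    .get(800, TimeUnit.MILLISECONDS);         // overall timeout guards the interface
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildResponse(42L));
        CALL_POOL.shutdown();
    }
}
```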
Thread pool governance
Thread pool resources in the system are very valuable and need to be managed and governed.
1) All custom thread pools should have monitoring (number of active threads, number of queued tasks, number of completed tasks, etc.). Monitoring a pool's resource usage makes it easier to manage resources proactively and to assess capacity more accurately for new requirements.
2) Core thread pools should allow their key parameters (core thread count, maximum thread count, work queue length, etc.) to be adjusted dynamically. When a pool is running out of threads or consuming too many resources, it can then be adjusted online without a release, greatly reducing the impact time (a sketch follows this list).
3) It is recommended to use separate thread pools for core services, following the isolation policy described above.
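Below is a sketch (with assumed pool parameters) of the monitoring and dynamic adjustment described above, based on the standard `ThreadPoolExecutor` API. Note that the length of a plain `LinkedBlockingQueue` cannot be changed at runtime; a resizable queue needs a custom implementation.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThreadPoolGovernanceSketch {
    // Illustrative core-service pool; real parameters would come from configuration.
    static final ThreadPoolExecutor CORE_POOL = new ThreadPoolExecutor(
            16, 32, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(1000));

    // Report the metrics mentioned above; a real setup would push these to the
    // company monitoring system on a schedule instead of printing them.
    static void reportMetrics() {
        System.out.printf("active=%d queued=%d completed=%d%n",
                CORE_POOL.getActiveCount(),
                CORE_POOL.getQueue().size(),
                CORE_POOL.getCompletedTaskCount());
    }

    // Core and maximum pool sizes can be changed at runtime, e.g. driven by a
    // dynamic-configuration listener, so capacity can be adjusted without a release.
    static void resize(int core, int max) {
        if (max >= CORE_POOL.getMaximumPoolSize()) {
            CORE_POOL.setMaximumPoolSize(max); // grow the ceiling before the core size
            CORE_POOL.setCorePoolSize(core);
        } else {
            CORE_POOL.setCorePoolSize(core);   // shrink the core size before the ceiling
            CORE_POOL.setMaximumPoolSize(max);
        }
    }
}
```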
JVM and hardware metrics governance
1) Full GC count monitoring: set a reasonable per-machine alarm, for example, no more than 2 Full GCs within 5 minutes (a sketch of reading GC counters via JMX follows this list).
2) Young GC count monitoring: set a reasonable per-machine alarm, for example, no more than 10 YGCs within 1 minute.
3) GC pause time monitoring: for example, a single YGC should not exceed 0.7 s; individual applications can set thresholds that better match their own profile.
4) The number of active connections of a Tomcat application on an ordinary virtual machine should not exceed 300; individual applications can set thresholds that better match their own profile.
5) The CPU usage of I/O-intensive applications should not exceed 60%; individual applications can set thresholds that better match their own profile.
6) Monitor blocked threads, configure alarms, and print stack traces, to prevent certain threads from holding resources and degrading the application's ability to serve external requests.
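As a hedged sketch of how the GC counters behind points 1) to 3) can be read in-process, the snippet below uses the standard JMX `GarbageCollectorMXBean`; the sampling and alarming logic described in the comments is an assumption, not the monitoring system actually used here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcMetricsSketch {
    // Print cumulative GC counts and times via JMX. An agent would sample these
    // periodically, diff consecutive samples, and raise an alarm when, for example,
    // the Full GC count grows by 2 or more within a 5-minute window.
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, totalTimeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```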
Daily operations
1) Alarm management (long-term): control the number of daily alarms per application and keep tuning the alarm thresholds, so that as far as possible every alarm indicates a real problem that requires manual handling.
2) Exception management (long-term): keep tracking the top-5 alarms and occasional RuntimeExceptions of core applications, and continuously optimize and fix them.
3) Service inspection: inspect services every day after peak hours and after each release, identify the cause of any abnormal metric, and optimize it as soon as possible.
4) Automatic fault location: we are currently working on automatic fault location based on application monitoring alarms, exceptions, and trace link analysis, and some progress has been made. Once the cause of a fault can be located automatically and combined with the remediation measures we have prepared, the impact of recurring faults of the same type can be reduced further and further.
5) Stress test the core system regularly, and evaluate scaling machines up or down according to actual traffic. Within Qunar, stress testing by adjusting traffic weights can be considered for this.
Conclusion
This round of stability governance has come to an end. Here is a review of it from the perspectives of daily operations and governance measures:
In terms of daily operations and maintenance: deploy core applications across machine rooms and build in high availability and redundancy for related components and services in advance; inspect services after releases and after peak hours; verify the robustness of applications and the availability of related tooling through regular stress tests and fault drills; and introduce AIOps to locate fault causes automatically.
In terms of governance measures: manage core resources through degradation, circuit breaking, rate limiting, isolation, multiple channels, multiple copies, and so on.
It is hard to keep an application from ever failing, but there are many ways to improve its availability, and using them properly is especially important for stability. At the same time, a single round of stability governance is not enough: every member of the team needs to keep improving their awareness and skills in this area and apply these governance items and points of attention in their daily work.
Finally, I hope this practice can serve as a useful reference for more people.