This article records my experience integrating Hystrix. It was originally written for my company's internal wiki, where some passages simply quoted colleagues' wiki pages to avoid repetition. Those passages cannot be quoted in a public post, so parts of this article may be incomplete; I will fill them in later.
1. Background
Currently, non-core operations, such as saving operation logs after an inventory increase or decrease or sending asynchronous messages (specific business processes), can make the interface time out when the MQ service misbehaves. It is therefore worth introducing service degradation and service isolation for these non-core operations.
2. Introduction to Hystrix
Official documentation: [https://github.com/Netflix/Hystrix/wiki]
Hystrix is Netflix's open-source fault-tolerance framework. It addresses the problem of a failing external dependency dragging down the business system, or even triggering an avalanche of failures.
2.1 Why is Hystrix needed?
In medium and large distributed systems, a service usually has many dependencies (HTTP, Hessian, Netty, Dubbo, etc.). Under highly concurrent access, the stability of these dependencies has a large impact on the system, yet they suffer from many uncontrollable problems: slow network connections, busy resources, temporary unavailability, services going offline, and so on.
When a dependency blocks, most of the server's thread pool ends up blocked as well, which affects the stability of the entire online service. An application with a complex distributed architecture and many dependencies will inevitably see one of them fail at some point. If a heavily used dependency fails and is not isolated, the current application itself is at risk of being dragged down.
For example, consider a system that depends on 30 SOA services, each with 99.99% availability. The combined availability is 99.99%^30 ≈ 99.7%, and the remaining 0.3% means about 3,000,000 failures per 1 billion requests, or roughly 2 hours of service instability per month. As the number of dependencies grows, the combined availability decays exponentially.
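The arithmetic behind this example can be checked with a few lines of Java. This is only a sketch: the class and method names are mine, and the request volume of 1 billion and the 30-day month are illustrative assumptions.

```java
/** Worked arithmetic for the availability example; all names are illustrative. */
class AvailabilityEstimate {

    /** Combined availability when every request needs all n dependencies to succeed. */
    static double combinedAvailability(double perDependency, int n) {
        return Math.pow(perDependency, n);
    }

    /** Expected number of failed requests out of the given total. */
    static double expectedFailures(double availability, long requests) {
        return (1.0 - availability) * requests;
    }

    /** Expected hours of instability in a month with the given number of days. */
    static double monthlyDowntimeHours(double availability, int daysInMonth) {
        return (1.0 - availability) * daysInMonth * 24;
    }

    public static void main(String[] args) {
        double a = combinedAvailability(0.9999, 30);
        System.out.printf("combined availability: %.2f%%%n", a * 100);
        // 1 billion requests is an assumed, illustrative volume.
        System.out.printf("failures per 1e9 requests: %.0f%n", expectedFailures(a, 1_000_000_000L));
        System.out.printf("downtime per 30-day month: %.2f h%n", monthlyDowntimeHours(a, 30));
    }
}
```

Running it gives a combined availability of roughly 99.7%, about 3 million failures per billion requests, and a little over 2 hours of instability per month.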
Solution: Isolate dependencies.
2.2 Hystrix design philosophy
To use Hystrix well, you first need to understand its core design concept: Hystrix is built on the command pattern.
In the command pattern, a Command is an intermediate layer between the Invoker and the Receiver, and it encapsulates the Receiver. So how does Hystrix fit into this pattern?
An API can act as either the Invoker or the Receiver. By extending the Hystrix core class HystrixCommand, you wrap those APIs (for example remote interface calls, database queries, and other operations that can introduce latency) and thereby give them resilience protection.
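To make the roles concrete, here is a minimal sketch of the command pattern using only the JDK. SimpleCommand is a hypothetical stand-in that I wrote for illustration, not Hystrix's HystrixCommand; it only mirrors the idea of a run() method that wraps the Receiver and a getFallback() method that supplies a degraded result. UserService and GetUserNameCommand are likewise made-up names.

```java
/** Hypothetical stand-in for a command base class; this is NOT Hystrix's HystrixCommand. */
abstract class SimpleCommand<T> {

    /** The wrapped call to the Receiver (for example a remote interface or a database). */
    protected abstract T run() throws Exception;

    /** Degraded result used when run() fails. */
    protected abstract T getFallback();

    /** The Invoker only calls execute(); it never talks to the Receiver directly. */
    public final T execute() {
        try {
            return run();
        } catch (Exception e) {
            return getFallback();
        }
    }
}

/** Hypothetical Receiver: a remote user service that may fail. */
interface UserService {
    String findName(long id) throws Exception;
}

/** A command that encapsulates one call to the Receiver. */
class GetUserNameCommand extends SimpleCommand<String> {
    private final UserService service;
    private final long id;

    GetUserNameCommand(UserService service, long id) {
        this.service = service;
        this.id = id;
    }

    @Override
    protected String run() throws Exception {
        return service.findName(id);
    }

    @Override
    protected String getFallback() {
        return "unknown";
    }
}
```

The caller writes `new GetUserNameCommand(service, 7).execute()` and gets either the real name or the fallback value, without knowing which protection logic sits in between.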
2.3 How does Hystrix isolate dependencies?
- 1: Hystrix uses the command pattern: a HystrixCommand wraps the logic of each dependency call, and each command executes in a separate thread or under a semaphore permit.
- 2: The timeout of a dependency call is configurable and is generally set slightly above the 99.5th percentile latency. When a call times out, Hystrix returns immediately or executes the fallback logic.
- 3: Each dependency gets its own small thread pool (or semaphore). If the pool is full, the call is rejected immediately; by default there is no queuing, which shortens the time needed to detect a failure.
- 4: A dependency call ends in one of five outcomes: success, failure (an exception is thrown), timeout, thread pool rejection, or short circuit. The fallback logic runs whenever the request fails (exception, rejection, timeout, or short circuit).
- 5: A circuit breaker component is provided, which can trip automatically or be forced manually. It stops calls to the dependency for a period of time (5 seconds by default). The default error-rate threshold is 50%; once it is reached, the breaker trips automatically.
- 6: Near-real-time statistics and monitoring of dependencies are provided.
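Points 2 and 3 above (a timeout per call, plus a small pool that rejects instead of queuing) can be sketched with the JDK's own executor classes. This is my simplified illustration, not how Hystrix is implemented internally; the class name IsolatedDependency and its parameters are assumptions.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustrative thread pool isolation for one dependency; not Hystrix code. */
class IsolatedDependency {
    private final ThreadPoolExecutor pool;
    private final long timeoutMillis;

    IsolatedDependency(int poolSize, long timeoutMillis) {
        // A SynchronousQueue holds no tasks: a task is handed to a free thread or rejected.
        this.pool = new ThreadPoolExecutor(poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<Runnable>());
        this.timeoutMillis = timeoutMillis;
    }

    /** Runs the dependency call on the pool and returns the fallback on any failure. */
    <T> T call(Callable<T> dependency, T fallback) {
        Future<T> future;
        try {
            future = pool.submit(dependency);
        } catch (RejectedExecutionException e) {
            return fallback;                 // pool saturated: fail fast, no queuing
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);             // interrupt the worker thread
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                 // the dependency threw an exception
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

The request thread never waits longer than the timeout, and a saturated pool costs the caller almost nothing, which is the point of giving each dependency its own small pool.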
2.4 Hystrix process flow
Process description:
- 1: Create a new HystrixCommand for each call, encapsulating the dependency call in the run() method.
- 2: Call execute() or queue() for synchronous or asynchronous execution.
- 3: Check whether the circuit breaker is open. If it is open, go to step 8 to degrade; if it is closed, go to step 4.
- 4: Check whether the thread pool/queue/semaphore is full. If it is full, go to step 8; otherwise go to step 5.
- 5: Call the run() method of the HystrixCommand.
- 6: Check whether the call succeeded.
- 6a: Return the result of the successful call.
- 6b: The call failed; go to step 8.
- 7: Calculate the state of the circuit breaker.
- 8: Run the getFallback() degradation logic. The following four conditions trigger a getFallback() call: (1) run() throws an exception other than HystrixBadRequestException; (2) the run() call times out; (3) the circuit breaker is open and intercepts the call; (4) the thread pool/queue/semaphore is full.
- 8c: The degradation logic itself fails.
- 9: Return the result of the successful execution.
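The control flow of these steps can be outlined in a few lines. The sketch below is my own simplification under stated assumptions: a plain boolean stands in for the circuit breaker, a Semaphore stands in for the thread pool/queue/semaphore capacity check, and timeouts and the HystrixBadRequestException special case are left out. None of the names come from Hystrix.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical outline of the execution flow; names do not come from Hystrix. */
class CommandFlow<T> {
    enum Outcome { SUCCESS, SHORT_CIRCUITED, REJECTED, FAILED }

    private final Semaphore capacity;       // stands in for the pool/queue/semaphore check
    private volatile boolean circuitOpen;   // stands in for the circuit breaker state
    Outcome lastOutcome;                    // what a real implementation would report in step 7

    CommandFlow(int maxConcurrent) {
        this.capacity = new Semaphore(maxConcurrent);
    }

    void setCircuitOpen(boolean open) {
        this.circuitOpen = open;
    }

    T execute(Callable<T> run, Supplier<T> fallback) {
        if (circuitOpen) {                           // step 3: breaker open
            lastOutcome = Outcome.SHORT_CIRCUITED;
            return fallback.get();                   // step 8
        }
        if (!capacity.tryAcquire()) {                // step 4: no capacity left
            lastOutcome = Outcome.REJECTED;
            return fallback.get();                   // step 8
        }
        try {
            T result = run.call();                   // step 5
            lastOutcome = Outcome.SUCCESS;           // step 6a
            return result;                           // step 9
        } catch (Exception e) {
            lastOutcome = Outcome.FAILED;            // step 6b
            return fallback.get();                   // step 8
        } finally {
            capacity.release();
        }
    }
}
```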
2.5 Circuit Breaker
By default, each circuit breaker maintains 10 buckets, one per second. Each bucket records the counts of successes, failures, timeouts, and rejections.
By default, the circuit opens when at least 20 requests have been made within the 10-second window and 50% or more of them failed.
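The bucket mechanism can be illustrated with a small rolling-window breaker. This is a simplified sketch of the idea, not Hystrix's implementation: it tracks only successes and failures, the constants mirror the defaults mentioned in this article (10 one-second buckets, 20 requests, 50%, and a 5-second sleep window), and the clock is injected so the behaviour can be tested without waiting.

```java
import java.util.Arrays;
import java.util.function.LongSupplier;

/** Simplified rolling-window circuit breaker; an illustration, not Hystrix's implementation. */
class RollingCircuitBreaker {
    static final int NUM_BUCKETS = 10;
    static final long BUCKET_MILLIS = 1_000;
    static final int REQUEST_VOLUME_THRESHOLD = 20;
    static final int ERROR_THRESHOLD_PERCENTAGE = 50;
    static final long SLEEP_WINDOW_MILLIS = 5_000;

    private final LongSupplier clock;                       // current time in milliseconds
    private final long[] bucketStart = new long[NUM_BUCKETS];
    private final int[] successes = new int[NUM_BUCKETS];
    private final int[] failures = new int[NUM_BUCKETS];
    private boolean open;
    private long openedOrLastTestedAt;

    RollingCircuitBreaker(LongSupplier clock) {
        this.clock = clock;
        Arrays.fill(bucketStart, -1L);                      // -1 marks an unused bucket
    }

    /** Returns true if the call may proceed; lets one trial call through per sleep window. */
    synchronized boolean allowRequest() {
        if (!open) {
            return true;
        }
        long now = clock.getAsLong();
        if (now - openedOrLastTestedAt >= SLEEP_WINDOW_MILLIS) {
            openedOrLastTestedAt = now;
            return true;
        }
        return false;
    }

    synchronized void markSuccess() {
        if (open) {                                         // trial call succeeded: close and reset
            open = false;
            Arrays.fill(bucketStart, -1L);
            return;
        }
        successes[currentBucket()]++;
    }

    synchronized void markFailure() {
        failures[currentBucket()]++;
        if (open) {
            return;
        }
        long now = clock.getAsLong();
        int ok = 0;
        int bad = 0;
        for (int i = 0; i < NUM_BUCKETS; i++) {
            // Count only buckets that still fall inside the rolling window.
            if (bucketStart[i] >= 0 && now - bucketStart[i] < NUM_BUCKETS * BUCKET_MILLIS) {
                ok += successes[i];
                bad += failures[i];
            }
        }
        int total = ok + bad;
        if (total >= REQUEST_VOLUME_THRESHOLD && bad * 100 >= ERROR_THRESHOLD_PERCENTAGE * total) {
            open = true;
            openedOrLastTestedAt = now;
        }
    }

    synchronized boolean isOpen() {
        return open;
    }

    /** Finds the bucket for the current second, recycling it if it holds stale data. */
    private int currentBucket() {
        long now = clock.getAsLong();
        long start = now - now % BUCKET_MILLIS;
        int i = (int) ((now / BUCKET_MILLIS) % NUM_BUCKETS);
        if (bucketStart[i] != start) {
            bucketStart[i] = start;
            successes[i] = 0;
            failures[i] = 0;
        }
        return i;
    }
}
```

Because old buckets are recycled as time advances, failures older than 10 seconds stop counting, so a dependency that recovers does not keep the breaker tripping on stale data.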
2.6 Hystrix isolation analysis
Hystrix uses thread isolation or semaphore isolation to limit the concurrency of dependencies and to keep blocking from spreading.
- (1) Thread isolation: the thread that executes the dependency code is separated from the request thread (for example, a Jetty thread), so the request thread is free to decide when to stop waiting (an asynchronous process). Concurrency is controlled by the size of the thread pool; when the pool is saturated, requests can be rejected early, which keeps the dependency's problems from spreading. Avoid making the thread pool too large, otherwise a large number of blocked threads may slow down the server.
- (2) Advantages and disadvantages of thread isolation
- Advantages of thread isolation:
- [1]: Threads completely isolate third-party code, and request threads can be returned quickly.
- [2]: When a failed dependency becomes available again, the thread pool clears up and is usable immediately, instead of requiring a long recovery.
- [3]: Asynchronous calls can be fully simulated, which makes asynchronous programming easier.
- Disadvantages of thread isolation:
- [1]: The main disadvantage of thread pools is added CPU overhead, because executing each command involves queuing (by default a SynchronousQueue is used to avoid queuing), scheduling, and context switching.
- [2]: Code that depends on thread state, such as ThreadLocal, becomes more complex because that state has to be passed along and cleaned up manually.
- NOTE: Netflix considers the overhead of thread isolation small enough that it has no significant cost or performance impact.
- Netflix's internal API handles 10 billion HystrixCommand executions per day using thread isolation, with roughly 40+ thread pools per application and roughly 5-20 threads per pool.
- (3) Semaphore isolation
- Semaphore isolation can also limit concurrent access and keep blocking from spreading. The main difference from thread isolation is that the dependency code still runs on the request thread (which must first acquire a semaphore permit).
- If the client is trusted and returns quickly, you can use semaphore isolation instead of thread isolation to reduce overhead.
- The semaphore size can be adjusted dynamically, but the thread pool size cannot.
The difference between thread isolation and semaphore isolation comes down to where the dependency code runs: on a dedicated pool thread, or on the request thread itself.
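A short JDK-only sketch makes the contrast visible by reporting which thread actually runs the dependency code. The class names SemaphoreIsolation and IsolationComparison are mine, and the code is an illustration of the two strategies rather than Hystrix's own implementation.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

/** Illustrative semaphore isolation: limits concurrency but stays on the caller's thread. */
class SemaphoreIsolation {
    private final Semaphore permits;

    SemaphoreIsolation(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    <T> T call(Callable<T> dependency, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback;              // no permit available: reject without blocking
        }
        try {
            return dependency.call();     // runs on the request thread itself
        } catch (Exception e) {
            return fallback;
        } finally {
            permits.release();
        }
    }
}

/** Reports which thread executes the dependency under each strategy. */
class IsolationComparison {
    static String threadUsedBySemaphoreIsolation() {
        Callable<String> whoAmI = () -> Thread.currentThread().getName();
        return new SemaphoreIsolation(10).call(whoAmI, "rejected");
    }

    static String threadUsedByThreadIsolation() {
        Callable<String> whoAmI = () -> Thread.currentThread().getName();
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            return pool.submit(whoAmI).get();   // runs on a pool thread, not the caller
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Since the semaphore variant never leaves the request thread, it cannot cut off a call that hangs; it can only refuse new calls once all permits are taken. That is why it is suited to trusted, fast dependencies.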
3. Integration
This article focuses on integrating Hystrix into a Thrift-based service project.
3.1 Adding Hystrix dependencies
About versions: different versions pull in different compile dependencies, so adjust the version to your own situation as you go. For the specific dependencies, see mvnrepository.com/artifact/co…
    <hystrix-version>1.4.22</hystrix-version>

    <dependency>
        <groupId>com.netflix.hystrix</groupId>
        <artifactId>hystrix-core</artifactId>
        <version>${hystrix-version}</version>
    </dependency>
    <dependency>
        <groupId>com.netflix.hystrix</groupId>
        <artifactId>hystrix-metrics-event-stream</artifactId>
        <version>${hystrix-version}</version>
    </dependency>
    <dependency>
        <groupId>com.netflix.hystrix</groupId>
        <artifactId>hystrix-javanica</artifactId>
        <version>${hystrix-version}</version>
    </dependency>
    <dependency>
        <groupId>com.netflix.hystrix</groupId>
        <artifactId>hystrix-servo-metrics-publisher</artifactId>
        <version>${hystrix-version}</version>
    </dependency>
    <dependency>
        <groupId>com.meituan.service.us</groupId>
        <artifactId>hystrix-collector</artifactId>
        <version>1.0-SNAPSHOT</version>
    </dependency>
3.2 Introducing the Hystrix aspect
In the application-context.xml file:
    <aop:aspectj-autoproxy/>
    <bean id="hystrixAspect" class="com.netflix.hystrix.contrib.javanica.aop.aspectj.HystrixCommandAspect"></bean>
    <context:component-scan base-package="com.***.***"/>
    <context:annotation-config/>
Note:
- 1) The two lines that configure hystrixAspect must be in the same file as context:component-scan.
- 2) Some of the jars that Hystrix depends on may cause version conflicts that need to be resolved, for example Guava (version 15.0).
3.3 Statistics
You need to register the plugins; the statistics are then obtained directly from them.
Add an initialization bean:
    import com.meituan.service.us.collector.notifier.CustomEventNotifier;
    import com.netflix.hystrix.contrib.servopublisher.HystrixServoMetricsPublisher;
    import com.netflix.hystrix.strategy.HystrixPlugins;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    /**
     * Registers the Hystrix plugins used to collect statistics.
     * Created by gaoguangchao on 16/7/1.
     */
    public class HystrixMetricsInitializingBean {
        private static final Logger LOGGER = LoggerFactory.getLogger(HystrixMetricsInitializingBean.class);

        /** Invoked by Spring through init-method="init". */
        public void init() throws Exception {
            LOGGER.info("HystrixMetrics starting...");
            // Receive a callback for each Hystrix event.
            HystrixPlugins.getInstance().registerEventNotifier(CustomEventNotifier.getInstance());
            // Publish Hystrix metrics through Servo.
            HystrixPlugins.getInstance().registerMetricsPublisher(HystrixServoMetricsPublisher.getInstance());
        }
    }
In the application-context.xml file:
    <bean id="hystrixMetricsInitializingBean" class="com.***.HystrixMetricsInitializingBean" init-method="init"/>
3.4 Adding annotations
This article uses synchronous execution, so both the annotations and the method implementations are synchronous. If you need asynchronous or reactive execution, refer to the official annotation documentation: [github.com/Netflix/Hys…]
    @HystrixCommand(groupKey = "productStockOpLog", commandKey = "addProductStockOpLog", fallbackMethod = "addProductStockOpLogFallback",
            commandProperties = {
                    // Timeout in milliseconds; when it is exceeded, the fallback is invoked.
                    @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "400"),
                    // Minimum number of requests before the breaker is evaluated (default 20). The breaker
                    // decides whether to trip only once this many requests were handled in one statistics window.
                    @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "10"),
                    // Error percentage at which the breaker trips (default 50, meaning that 50% of the
                    // requests in one statistics window failed).
                    @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "10")
            })
    public void addProductStockOpLog(Long sku_id, Object old_value, Object new_value) throws Exception {
        if (new_value != null && !new_value.equals(old_value)) {
            doAddOpLog(null, null, sku_id, null, ProductOpType.PRODUCT_STOCK,
                    old_value != null ? String.valueOf(old_value) : null, String.valueOf(new_value), 0, "C end", null);
        }
    }

    public void addProductStockOpLogFallback(Long sku_id, Object old_value, Object new_value) throws Exception {
        logger.warn("Failed to send commodity inventory change message, Fallback,skuId:{},oldValue:{},newValue:{}", sku_id, old_value, new_value);
    }
Example:
    @HystrixCommand(groupKey = "UserGroup", commandKey = "GetUserByIdCommand",
            commandProperties = {
                    // Timeout in milliseconds; when it is exceeded, the fallback is invoked.
                    @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "100"),
                    // Minimum number of requests before the breaker is evaluated (default 20).
                    @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "10"),
                    // Error percentage at which the breaker trips (default 50).
                    @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "10")
            },
            threadPoolProperties = {
                    @HystrixProperty(name = "coreSize", value = "30"),
                    @HystrixProperty(name = "maxQueueSize", value = "101"),
                    @HystrixProperty(name = "keepAliveTimeMinutes", value = "2"),
                    @HystrixProperty(name = "queueSizeRejectionThreshold", value = "15"),
                    @HystrixProperty(name = "metrics.rollingStats.numBuckets", value = "12"),
                    @HystrixProperty(name = "metrics.rollingStats.timeInMilliseconds", value = "1440")
            })
Note: the method annotated with @HystrixCommand must be public; the fallback method may be private. Both must have the same return type and parameters.
The Hystrix-annotated method needs to live in a service bean. Calling it from another method of the same class bypasses the aspect, so the call is neither protected nor monitored.
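The reason for the second note is that the aspect is applied through a proxy around the bean, and a call made through `this` never passes through that proxy. The sketch below reproduces the effect with a plain JDK dynamic proxy. It is only an analogy for how a proxy-based aspect behaves; the interface, class, and method names are hypothetical, and the counter stands in for whatever the aspect would do.

```java
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical service interface for the demonstration. */
interface OpLogService {
    void addOpLog();      // imagine this method carries the Hystrix annotation
    void updateStock();   // calls addOpLog() internally
}

class OpLogServiceImpl implements OpLogService {
    @Override
    public void addOpLog() {
        // the protected operation would go here
    }

    @Override
    public void updateStock() {
        this.addOpLog();  // self-invocation: goes straight to the target, not through the proxy
    }
}

class ProxyDemo {
    /** Wraps the target in a proxy that counts the addOpLog() calls the "aspect" gets to see. */
    static OpLogService proxied(OpLogService target, AtomicInteger intercepted) {
        return (OpLogService) Proxy.newProxyInstance(
                OpLogService.class.getClassLoader(),
                new Class<?>[] { OpLogService.class },
                (proxy, method, args) -> {
                    if (method.getName().equals("addOpLog")) {
                        intercepted.incrementAndGet();
                    }
                    return method.invoke(target, args);
                });
    }
}
```

Calling `updateStock()` on the proxy leaves the counter at zero even though `addOpLog()` ran, while calling `addOpLog()` on the proxy directly increments it. Moving the annotated method into a separate bean and calling it through the injected reference avoids the problem.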
3.5 Parameter Configuration
Parameter | Value | Note |
---|---|---|
groupKey | productStockOpLog | Group identifier; commands in one group share a thread pool |
commandKey | addProductStockOpLog | Command identifier |
fallbackMethod | addProductStockOpLogFallback | Fallback method; it must have the same return type and parameters as the command method |
Timeout | 400ms | Execution policy. In THREAD mode the thread can be interrupted once the timeout is reached. For most circuits, you should try to set their timeout values close to the 99.5th percentile of a normal healthy system so they will cut off bad requests and not let them take up system resources or affect user behavior. |
Minimum number of requests in the statistics window (10s) | 10 | Circuit breaker policy |
Time the breaker stays open before a trial request is allowed | 5s | Circuit breaker policy; default value |
Circuit breaker error threshold | 10% | Circuit breaker policy: the breaker trips when 10% of the requests in a statistics window fail |
Thread pool coreSize | 10 | Default value (recommended). In the current project the isolated dependency is sending an MQ message: the TP99 of the send method is under 1ms, and the peak single-machine QPS over the last 2 weeks was around 18. After gray-release verification (that day's QPS > weekend QPS), ActiveThreads <= 2 and rejected = 0. A thread pool of size 10 is sufficient for 2000 QPS, provided MQ send latency stays normal (which it does in the actual project; I will not go into details here). |
Thread pool maxQueueSize | -1 | The thread pool queue is a SynchronousQueue |
4. Parameter Description
For other parameters, see github.com/Netflix/Hys…
Category | Parameter | Description | Default | Note |
---|---|---|---|---|
Basic parameters | groupKey | The group the command belongs to; one group shares one thread pool | getClass().getSimpleName() | |
Basic parameters | commandKey | Name of the currently executing method | | |
Execution (controls the execution strategy of HystrixCommand.run()) | execution.isolation.strategy | Isolation strategy: THREAD or SEMAPHORE | THREAD | SEMAPHORE can be used when you only want to limit concurrency, when the outer method is already thread-isolated, or when calling local methods or methods that are very reliable and very fast (such as Medis) |
Execution | execution.isolation.thread.timeoutInMilliseconds | Timeout | 1000ms | In THREAD mode the thread can be interrupted when the timeout is reached. In SEMAPHORE mode the call runs to completion and only then is it checked against the timeout. Guideline for choosing the timeout: with retries, TP99 latency + average latency; without retries, TP99.5 latency |
Execution | execution.timeout.enabled | Whether the timeout is enabled | true | |
Execution | execution.isolation.thread.interruptOnTimeout | Whether to interrupt the thread on timeout | true | Applies in THREAD mode |
Execution | execution.isolation.semaphore.maxConcurrentRequests | Maximum concurrency allowed by the semaphore | 10 | Applies in SEMAPHORE mode |
Fallback (policy applied when a fallback runs) | fallback.isolation.semaphore.maxConcurrentRequests | Maximum concurrency of fallbacks | 10 | |
Fallback | fallback.enabled | Whether the fallback is enabled | true | |
Circuit Breaker | circuitBreaker.enabled | Whether the circuit breaker is enabled | true | |
Circuit Breaker | circuitBreaker.requestVolumeThreshold | Minimum number of requests in a statistics window (10s) before the breaker can trip | 20 | |
Circuit Breaker | circuitBreaker.sleepWindowInMilliseconds | How long the breaker stays open before a trial request is allowed | 5000ms | |
Circuit Breaker | circuitBreaker.errorThresholdPercentage | Failure percentage at which the breaker trips | 50 | Tune mainly according to how important the dependency is |
Circuit Breaker | circuitBreaker.forceOpen | Whether to force the breaker open | false | |
Circuit Breaker | circuitBreaker.forceClosed | Whether to force the breaker closed | false | For a strong dependency, this should be set to true |
Metrics (statistics collected during HystrixCommand execution) | metrics.rollingStats.timeInMilliseconds | Length of the rolling statistics window in milliseconds; used for monitoring and by the circuit breaker | 10000 | The rolling window is divided into buckets and rolls bucket by bucket. For example, with this property set to 10s (10000) and 10 buckets, each bucket covers 1s |
Metrics | metrics.rollingStats.numBuckets | Number of buckets in the rolling statistics window | 10 | metrics.rollingStats.timeInMilliseconds must be evenly divisible by this value |
Metrics | metrics.rollingPercentile.enabled | Whether execution times are tracked and percentiles (50%, 90%, etc.) are calculated | true | |
Metrics | metrics.rollingPercentile.timeInMilliseconds | Length of the rolling window in which execution times are kept for percentile calculation | 60000ms | |
Metrics | metrics.rollingPercentile.numBuckets | Number of buckets in the rollingPercentile window | 6 | metrics.rollingPercentile.timeInMilliseconds must be evenly divisible by this value |
Metrics | metrics.rollingPercentile.bucketSize | Maximum number of execution times kept per bucket | 100 | If set to 100 and a bucket sees 500 requests, only the most recent 100 are counted |
Metrics | metrics.healthSnapshot.intervalInMilliseconds | Sampling interval | 500 | |
Request Context (HystrixRequestContext properties used by HystrixCommand) | requestCache.enabled | Whether requests are cached within the request scope | true | |
Request Context | requestLog.enabled | Whether HystrixCommand executions and events are logged to HystrixRequestLog | true | |
ThreadPool Properties (properties of the thread pool used by HystrixCommand) | coreSize | Core size of the thread pool, i.e. the maximum number of concurrent executions | 10 | Guideline: coreSize = requests per second at peak when healthy × 99th percentile latency in seconds + some breathing room. The default of 10 threads is recommended in most cases |
ThreadPool Properties | maxQueueSize | Maximum queue length; sets the maximum size of the BlockingQueue | -1 | With a positive value, the queue changes from SynchronousQueue to LinkedBlockingQueue |
ThreadPool Properties | queueSizeRejectionThreshold | Queue size at which requests are rejected | 5 | Has no effect when maxQueueSize = -1. It exists because maxQueueSize cannot be changed at runtime, whereas this threshold can be changed dynamically to adjust the allowed queue length |
ThreadPool Properties | keepAliveTimeMinutes | Keep-alive time | 1 minute | Usually not relevant, because by default corePoolSize equals maxPoolSize |
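The coreSize guideline in the table is easy to turn into a small calculation. The helper below is a sketch with names of my own choosing; latency is passed in milliseconds to keep the arithmetic exact, and the amount of breathing room is a judgment call rather than something Hystrix computes for you.

```java
/** Illustrative helper for the coreSize guideline; not part of Hystrix. */
class ThreadPoolSizing {

    /**
     * coreSize = peak requests per second when healthy x 99th percentile latency in seconds
     *            + breathing room
     */
    static int coreSize(int peakRequestsPerSecond, int p99LatencyMillis, int breathingRoom) {
        int threadsBusyAtPeak = (int) Math.ceil(peakRequestsPerSecond * p99LatencyMillis / 1000.0);
        return threadsBusyAtPeak + breathingRoom;
    }

    public static void main(String[] args) {
        // Illustrative numbers: 30 requests/s at a 200 ms 99th percentile keeps 6 threads busy.
        System.out.println(coreSize(30, 200, 4));
        // The MQ case from this article: about 18 QPS at roughly 1 ms keeps at most 1 thread busy.
        System.out.println(coreSize(18, 1, 0));
    }
}
```

For the MQ case the formula says a single thread is busy at peak, which is consistent with the observed ActiveThreads <= 2 and explains why the default of 10 leaves plenty of headroom.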
5. Performance test
5.1 Test Results
After removing the first outlier from the cold state, the average Hystrix overhead across test scenarios 1-10 is shown in the figure above. The following conclusions can be drawn:
- The extra time for a single HystrixCommand is basically stable at around 0.3ms, regardless of the thread pool size or the number of clients.
- The extra Hystrix time depends on the number of HystrixCommands executed: it grows as the number of commands grows, but the increment is small.
- When the app has just started, the first request takes 300+ms and subsequent requests take less than 1ms. During the first short period, the overhead is slightly larger than in the hot state, but the total is still under 1ms.
About the author:
Gao Guangchao: with years of front-line experience in Internet R&D and architecture design, he specializes in designing and implementing highly available, high-performance Internet architectures. He currently works at Meituan.com, where he is responsible for core business R&D.
This article was first published on Gao Guangchao's Jianshu blog. Please credit the source when reprinting!