At least one latency problem can be solved
2017-11-19
We have a latency-sensitive module that needs to fetch a timestamp from another machine over the network. A distributed transaction has to perform this operation twice, so if fetching the timestamp is slow, the latency of the whole transaction goes up. In theory, a round trip between machines in the same data center should take less than 0.5ms, most simple read requests should finish within 1ms, and 80% of requests are expected to finish within 4ms. While investigating a problem for a customer, we saw delays of more than 30ms, and we could reproduce them by running sysbench OLTP in an intranet environment.
In OpenTracing we could see that this step did indeed have a large delay, and a large number of slow logs were printed, which affected the overall transaction completion time. The first task was to determine where the slowness came from: was it a network problem or a runtime problem?
One colleague observed that when he ran the benchmark of this module with an extra 1000 goroutine workers, each doing nothing but ticking once per second, the latency was much higher than when running the benchmark alone. That pointed to a runtime problem.
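A minimal sketch of that experiment might look like the following. The request function, worker count, and timings are stand-ins, but the shape is the same: a benchmark plus 1000 goroutines that merely tick once per second.

```go
package bench

import (
	"testing"
	"time"
)

// doRequest stands in for one call to the timestamp service (hypothetical).
func doRequest() {
	time.Sleep(100 * time.Microsecond)
}

// BenchmarkWithIdleWorkers starts 1000 goroutines that do nothing but tick
// once per second, then measures request latency alongside them.
func BenchmarkWithIdleWorkers(b *testing.B) {
	stop := make(chan struct{})
	for i := 0; i < 1000; i++ {
		go func() {
			t := time.NewTicker(time.Second)
			defer t.Stop()
			for {
				select {
				case <-t.C: // idle tick, no real work
				case <-stop:
					return
				}
			}
		}()
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		doRequest()
	}
	b.StopTimer()
	close(stop)
}
```

Running this next to a plain benchmark without the idle workers is enough to see the gap.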
Another colleague ran the test in isolation and watched the network, and found that retransmissions had a noticeable effect on the results. He then moved the client and server onto the same machine to rule out network interference, and found that the two processes affected each other; after pinning the server and client to different cores, the server's processing time became fairly stable, while the client still saw high latency. The conclusion at this point: neither the runtime nor the network guarantees stable latency.
But can the runtime really account for delays in the tens of milliseconds? That didn't seem to make sense. In my mind everything there was on the order of microseconds; even back in Go 1.0, the stop-the-world GC wasn't that bad, and by 1.9 the GC has been optimized to the point where it barely stops the world at all. So I used `go tool trace` to keep digging, and what I saw surprised me.
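For reference, such a trace is collected with the standard `runtime/trace` package; a minimal sketch, where `runWorkload` is just a placeholder for the code under investigation:

```go
package main

import (
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Record scheduler, GC, and syscall events while the workload runs.
	if err := trace.Start(f); err != nil {
		panic(err)
	}
	defer trace.Stop()

	runWorkload()
}

// runWorkload is a placeholder for the code being investigated.
func runWorkload() {
	time.Sleep(time.Second)
}
```

Opening the output with `go tool trace trace.out` gives the scheduler timeline shown in the graph below.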
In this graph, the red arrow points to where an incoming network message unblocks a goroutine's Read. Notice that it took 4.368ms from the moment the network message became readable to the moment the goroutine reading the network was scheduled again! I even found more extreme cases where it took 19ms from the message becoming readable to the goroutine actually waking up. For performance reasons, the service is implemented with batching: requests are forwarded through a channel to a single goroutine, and that goroutine sends them out in batches. Obviously that goroutine is critical, because every other goroutine depends on it, so a millisecond-level scheduling delay shows up directly in the overall service latency.
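The batching in question is roughly the common channel fan-in pattern. The sketch below is an assumption about the shape of the code, not the actual implementation; the type, package, and function names are made up.

```go
package tso

// request represents one pending timestamp request; the fields are hypothetical.
type request struct {
	done chan struct{} // closed once the batch containing this request returns
}

var reqCh = make(chan *request, 1024)

// batchSender is the single critical goroutine: it drains reqCh and issues
// everything queued so far as one batched RPC. If the scheduler is slow to
// wake it up, every caller waiting on done is delayed by that amount.
func batchSender() {
	for {
		first := <-reqCh // block until at least one request arrives
		batch := []*request{first}
	drain:
		for len(batch) < 128 {
			select {
			case r := <-reqCh:
				batch = append(batch, r)
			default:
				break drain // nothing else queued right now
			}
		}
		sendBatch(batch) // hypothetical RPC carrying the whole batch
		for _, r := range batch {
			close(r.done)
		}
	}
}

// sendBatch stands in for the real batched RPC.
func sendBatch(batch []*request) {}
```

The important property is that a single goroutine sits on the hot path for every request, so its wake-up latency is everyone's latency.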
So when does a goroutine get scheduled? A goroutine is a coroutine: once it gets to execute, it keeps executing until it blocks, at which point it gives up the CPU, for example when it hits a lock, reads from a channel, or waits on I/O. Once a goroutine has been switched out, it is put back on a run queue when its condition is satisfied and waits to run again. But exactly when it runs again is uncertain: it depends on the length of the run queue, the execution time of the tasks ahead of it, the load at that moment, and many other factors.
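To make those blocking points concrete, these are the kinds of operations the paragraph refers to. This is an illustrative fragment of my own, not code from the service:

```go
package sched

import (
	"net"
	"sync"
)

// yieldPoints shows typical operations at which a goroutine blocks and gives
// up its CPU. When the condition is later satisfied, the goroutine is merely
// put back on a run queue; when it actually runs again depends on the queue
// length and the load at that moment.
func yieldPoints(mu *sync.Mutex, ch chan int, conn net.Conn, buf []byte) (int, error) {
	mu.Lock() // blocks if the mutex is held by another goroutine
	defer mu.Unlock()

	<-ch // blocks until another goroutine sends on the channel

	return conn.Read(buf) // blocks until the connection is readable
}
```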
The problem here is not GC but scheduling. The latency problem ultimately comes down to Go's scheduling design, mainly its fair scheduling of coroutines:
- No preemption
- No concept of priority
Because there is no preemption, suppose a network message arrives, but at that moment every CPU already has a goroutine running and none of them can be kicked off; the goroutine reading the network then has no chance to wake up.
Because there is no concept of priority, suppose some goroutine finally blocks and gives up a CPU: which goroutine runs next depends entirely on the mood of the scheduler, and if the goroutine reading the network is unlucky, it still doesn't wake up.
And as long as a running goroutine makes no function calls, there is no opportunity to trigger scheduling at all, so it never gives up the CPU.
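Around Go 1.9, when this was written, preemption points existed only at function calls, so a loop like the following (my own toy example, not from the service) can hold a CPU indefinitely; asynchronous preemption only arrived later, in Go 1.14:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // a single P makes the effect easy to see

	go func() {
		// No function calls in the loop body, so on Go ~1.9 there is no
		// preemption point here and this goroutine never yields the CPU.
		for {
		}
	}()

	time.Sleep(100 * time.Millisecond)
	// On Go 1.9 this line is never reached: the spinning goroutine owns the
	// only P. Go 1.14 added asynchronous preemption, so there it does print.
	fmt.Println("main goroutine scheduled again")
}
```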
Go is famous for being able to spin up thousands of goroutines, but there is a cost: the more goroutines there are being scheduled "fairly", the more likely it is that the wake-up of one important goroutine is delayed, which hurts overall latency.
Looking back at my colleague's earlier test, this explains why the idle workers affected latency: since every goroutine has an equal chance of being scheduled, the more unrelated goroutines there are, the lower the probability that the goroutine doing the real work gets scheduled promptly, and latency goes up.
Go's garbage collector no longer stops the world, but it can still affect latency: the GC can interrupt a goroutine to take its CPU, and when that goroutine gets scheduled back in is, again, uncertain.
With so many factors affecting scheduling, latency across the runtime becomes hard to control. Under light load you may not notice anything, but under heavy load it gets much worse.