When you run a long, peak-load stress test against a service, it is quite common for nothing to go wrong for several hours and then for the service to suddenly collapse.
Checking the monitoring shows that the goroutine count shoots up at the moment of the crash. So we can use the problem-diagnosis tool I built earlier and configure the following strategy: if the goroutine count suddenly starts to climb, dump the goroutine stacks as text. If you are not familiar with this tool, see my previous posts, unattended automatic dump and unattended automatic dump-2.
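To make the strategy concrete, here is a minimal sketch of the dump-on-spike idea, assuming nothing about the diagnosis tool's actual API: poll runtime.NumGoroutine() and write a text goroutine dump whenever the count jumps sharply. The function name watchGoroutines and the 1.5x threshold are made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// watchGoroutines polls the goroutine count and writes a text dump when it
// grows by more than `ratio` within one polling interval.
func watchGoroutines(ratio float64, interval time.Duration) {
	prev := runtime.NumGoroutine()
	for range time.Tick(interval) {
		cur := runtime.NumGoroutine()
		if prev > 0 && float64(cur) > float64(prev)*ratio {
			f, err := os.Create(fmt.Sprintf("goroutine-%d.txt", time.Now().Unix()))
			if err == nil {
				// debug=1 aggregates goroutines that share the same stack,
				// which is the format of the dump shown below.
				pprof.Lookup("goroutine").WriteTo(f, 1)
				f.Close()
			}
		}
		prev = cur
	}
}

func main() {
	go watchGoroutines(1.5, 10*time.Second) // e.g. dump on a 50% jump within 10s
	select {}                               // stand-in for the real service's main loop
}
```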
After integrating this and running the stress test again, we find that a large number of goroutines are always stuck waiting on a lock:
```
10760 @ 0x42f81f 0x4401d9 0x4401af 0x43ff4d 0x474df9 0x95709a 0x952d0d 0x17466fc 0x17462c4 0x174c8a8 0x174c88a 0x174e308 0x1755b78 0x1749432 0x17588e8 0xf90a54 0xc9670d 0x45d061
#	0x43ff4c	sync.runtime_SemacquireMutex+0x3c	/Users/xargin/sdk/go1.12.17/src/runtime/sema.go:71
#	0x474df8	sync.(*Mutex).Lock+0x108	/Users/xargin/sdk/go1.12.17/src/sync/mutex.go:134
#	0x957099	github.com/rcrowley/go-metrics.(*UniformSample).Update+0x39	/Users/xargin/mosn/vendor/github.com/rcrowley/go-metrics/sample.go:508
#	0x952d0c	github.com/rcrowley/go-metrics.(*StandardHistogram).Update+0x3c	/Users/xargin/mosn/vendor/github.com/rcrowley/go-metrics/histogram.go:199
..... a pile of business-code frames .....

1 @ 0x42f81f 0x4401d9 0x4401af 0x43ff4d 0x474df9 0x504a3d 0x5040ba 0x5040ac 0x957187 0x957172 0x952d0d 0x17465d0 0x17462c4 0x174c8a8 0x174c88a 0x174e308 0x1755b78 0x1749432 0x17588e8 0xf90a54 0xc9670d 0x45d061
#	0x43ff4c	sync.runtime_SemacquireMutex+0x3c	/Users/xargin/sdk/go1.12.17/src/runtime/sema.go:71
#	0x474df8	sync.(*Mutex).Lock+0x108	/Users/xargin/sdk/go1.12.17/src/sync/mutex.go:134
#	0x504a3c	math/rand.(*lockedSource).Int63+0x2c	/Users/xargin/sdk/go1.12.17/src/math/rand/rand.go:380
#	0x5040b9	math/rand.(*Rand).Int63+0x69	/Users/xargin/sdk/go1.12.17/src/math/rand/rand.go:85
#	0x5040ab	math/rand.(*Rand).Int63n+0x5b	/Users/xargin/sdk/go1.12.17/src/math/rand/rand.go:117
#	0x957186	math/rand.Int63n+0x126	/Users/xargin/sdk/go1.12.17/src/math/rand/rand.go:319
#	0x957171	github.com/rcrowley/go-metrics.(*UniformSample).Update+0x111	/Users/xargin/mosn/vendor/github.com/rcrowley/go-metrics/sample.go:514
#	0x952d0c	github.com/rcrowley/go-metrics.(*StandardHistogram).Update+0x3c	/Users/xargin/mosn/vendor/github.com/rcrowley/go-metrics/histogram.go:199
..... a pile of business-code frames .....
```
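Both stacks funnel into go-metrics' (*UniformSample).Update. Here is a simplified sketch of that pattern, based only on what the frames above show rather than on the actual library source: every Update takes the sample's own mutex, and inside that critical section it calls the package-level math/rand functions, which serialize on a second, global lock.

```go
package metricsketch

import (
	"math/rand"
	"sync"
)

type uniformSample struct {
	mu            sync.Mutex
	count         int64
	reservoirSize int
	values        []int64
}

func (s *uniformSample) Update(v int64) {
	s.mu.Lock()
	defer s.mu.Unlock() // every histogram update from every goroutine funnels through this mutex
	s.count++
	if len(s.values) < s.reservoirSize {
		s.values = append(s.values, v)
		return
	}
	// rand.Int63n uses the global, lock-protected source, so a second lock
	// is taken while the sample mutex is still held.
	if r := rand.Int63n(s.count); r < int64(len(s.values)) {
		s.values[r] = v
	}
}
```

So every histogram update made by every request-handling goroutine contends on the same pair of locks.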
The cause is not hard to understand: during a peak-load stress test the CPU utilization is above 90%, and at that point lock contention can bring the whole service down directly (the goroutine count climbs until the process OOMs, or latency grows so much that the service is effectively unavailable).
But the question is why the collapse happened only after a while, rather than right at the start. The load sent by the test platform barely changed during that period, and there were no TCP retransmissions or anything of that sort.
Recently, in a group chat, a former colleague dug up an article I wrote some time ago: a short story about RLock. That post was fairly rough, and it was later picked apart in another short story about RLock, where some readers pointed out problems in the comments. Here is the improved demo:
```go
package main

import (
	"fmt"
	"os"
	"runtime/trace"
	"sync"
	"sync/atomic"
	"time"
)

var mlock sync.RWMutex
var wg sync.WaitGroup

var a int64 // number of times acquiring the read lock took more than 100ms

func main() {
	trace.Start(os.Stderr)
	defer trace.Stop()
	wg.Add(100)

	for i := 0; i < 100; i++ {
		go gets()
	}

	wg.Wait()
	if a > 0 {
		fmt.Println("here", a)
	}
}

func gets() {
	for i := 0; i < 100000; i++ {
		get(i)
	}
	wg.Done()
}

func get(i int) {
	beginTime := time.Now()
	mlock.RLock()
	tmp1 := time.Since(beginTime).Nanoseconds() / 1000000
	if tmp1 > 100 { // took more than 100ms to acquire the read lock
		atomic.AddInt64(&a, 1)
	}
	mlock.RUnlock()
}
```
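Since the demo writes the trace to standard error, you can collect and inspect it with something like `go run main.go 2> trace.out` followed by `go tool trace trace.out` (assuming the file is saved as main.go); the per-goroutine timelines there are where the preemption described next shows up.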
Once the console has output, open the trace: it shows that after RLock was acquired successfully, the goroutine was cooperatively preempted right at the time.Since call. The preemption point here is the stack-growth check in the function prologue, not a syscall.
When CPU usage is high, there is a real probability that a goroutine is preempted while holding the RLock, and that is devastating for the service.
Authors of parallel programs have known for decades that performance can suffer badly if a thread is preempted while holding a lock; this is sometimes referred to as inopportune preemption.
In other words, if a goroutine is preempted while it holds the lock, every other goroutine waiting for that lock has to wait until the holder is rescheduled from the global run queue and finishes its critical section.
With a write lock, things would be even more interesting. And because scheduling itself is somewhat random, when you actually run into this problem it is genuinely hard to track down.