Article | Sanzang (Tang Zhijian)

When we receive an online service alert, how do we handle it correctly? How do we locate the root cause of an unfamiliar performance problem? Online troubleshooting is never easy, but a proven methodology and tool chain can help us pinpoint problems quickly. What follows is a share based on our internal practice.

1. Alerts and investigation

An alert is an objective fact; even a false positive points to something unexpected. We cannot enumerate every possible problem, but we can establish standard procedures and SOP manuals that break problems into smaller pieces, reduce the difficulty, and make them easier to locate.

1.1 Process

  • Don’t panic. “He who can hold thunder in his chest while keeping a face as calm as a still lake is fit to be a general.” Panic under pressure leads to mistakes, not to the right clues.
  • Announce in the incident group that you have started to step in, so others know the problem is being handled and can trust you to have their backs.
  • Check the day’s changes first: roughly 80% of incidents are caused by same-day changes.
  • In an incident, stopping the loss is the first priority; even if the scene is disturbed, we can still rely on clues afterwards. If you cannot degrade, restart; if restarting does not help, roll back.
  • Establish separate SOPs for everything from resource utilization to latency. When a problem hits, looking at the right dashboards is enough to solve 90% of cases. For the rest, call teammates and leads for support; a voice call is the most effective way to communicate.

1.2 Establish an SOP manual

A good SOP builds the fighting strength of the whole organization; online firefighting is not a one-person battle. Fortunately, we have a well-established set of SOPs, so even a newcomer can quickly get onto the battlefield and join an online investigation. Our internal documents include:

  • Service invocation exception troubleshooting SOP
  • Response latency increase troubleshooting SOP
  • Circuit-breaking (fusing) troubleshooting SOP
  • MySQL RT (response time) increase troubleshooting SOP
  • Redis RT increase troubleshooting SOP
  • ES RT and error-rate increase troubleshooting SOP
  • Abnormal goroutine growth troubleshooting SOP
  • Instance CPU or memory troubleshooting SOP
  • Traffic increase troubleshooting SOP
  • Common service problem troubleshooting

These documents are internal, so here is a summary of what a handling manual should do:

  • Include links to the relevant tools, the owner of each service, and the people responsible for the infrastructure, so nobody has to search or ask around.
  • Build dashboards with Grafana or similar tools. A good dashboard lets you pinpoint the problematic API at a glance and shows P99/P95/P90 latency, QPS, traffic, and other metrics. The error rate is the key metric: latency is only a symptom, while errors bring us closer to the truth.

  • Check the health of the service itself. Is there a resource bottleneck (CPU maxed out)? Has the goroutine count spiked? Has everything blown up at once?

In that case, a resource or connection is usually not being released; if nothing else works, restart to stop the bleeding.

  • For infrastructure latency (Redis, MySQL, etc.), check the current number of connections, slow requests, hardware resources, and so on. USE (Utilization, Saturation, Errors) is a good entry point: for every resource, check its utilization, saturation, and errors.
  • If only some endpoints are slow, rate-limit them directly to prevent an avalanche (a sketch follows below), then grab a request and inspect its trace to see where the time goes along the call chain.
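
The rate-limiting idea above can be sketched in Go with golang.org/x/time/rate; this is a minimal illustration under our own assumptions, not our internal middleware, and the limits, path, and handler names are placeholders:

```go
package main

import (
	"net/http"

	"golang.org/x/time/rate"
)

// limiter allows ~100 requests per second with a burst of 20.
// The numbers are placeholders; tune them per endpoint.
var limiter = rate.NewLimiter(rate.Limit(100), 20)

// withRateLimit sheds excess load before a slow handler can drag the
// whole service into an avalanche.
func withRateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.Handle("/slow-api", withRateLimit(slow))
	http.ListenAndServe(":8080", nil)
}
```

Rejecting excess requests immediately keeps them from queuing up behind the slow endpoint and starving everything else.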

With a qualified handling manual in hand, even a novice can locate the problem area and resolve a good share of issues.

2. Troubleshooting performance

Bugs in business logic can eventually be traced to a root cause through logs and monitoring. It is the elusive performance problems that drive us bald. Here is how we troubleshoot them:

2.1 Toolkit

2.1.1 pprof

This is Go’s most common and useful performance tool; if you haven’t used it yet, the official tutorial and the pprof articles listed in the references are highly recommended.

During the sampling window, the CPU profiler registers a timed hook with the program (via mechanisms such as the SIGPROF signal); each time the hook fires, it captures the stack trace of the running business code. The hook fires at a fixed frequency, 100Hz by default in Go (adjustable), meaning roughly one call-stack sample is collected per 10ms of consumed CPU time. When the window ends, all samples are aggregated to count how many times each function was observed, and each function’s share is computed relative to the total number of samples.
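
A minimal sketch of that window-based sampling using the standard runtime/pprof package (the file name, duration, and the busyWork placeholder are our own assumptions):

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// StartCPUProfile enables sampling at the default 100 Hz:
	// roughly one call-stack sample per 10ms of CPU time.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}

	busyWork(10 * time.Second) // the time window being profiled

	// StopCPUProfile ends the window and flushes the aggregated samples.
	pprof.StopCPUProfile()
}

// busyWork stands in for real business code.
func busyWork(d time.Duration) {
	deadline := time.Now().Add(d)
	for time.Now().Before(deadline) {
	}
}
```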

Heap profiling can be used to locate and troubleshoot memory leaks.

Here is a brief introduction:

  • pprof supports two modes that produce the same kind of data; the second can be used for profiling at any time and is more convenient in practice.
  • runtime/pprof: for one-off programs and background jobs, embedded in the code; the profile is written when sampling finishes.
  • net/http/pprof: exposes HTTP handlers so profiles can be generated on demand (a sketch follows this list).
  • Both CPU and memory profiling are supported.
  • Benchmarks also support profiling: go test -bench=. -cpuprofile=cpu.prof
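
A minimal sketch of the net/http/pprof mode, assuming the service can expose a separate debug port (localhost:6060 is just the usual convention, not a requirement):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	// Serve the pprof handlers on a dedicated debug port so they are
	// not exposed alongside business traffic.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... business handlers and the real server would live here ...
	select {}
}
```

Profiles can then be pulled on demand, for example `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30` for CPU or `.../debug/pprof/heap` for memory.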

Here is a CPU profile result (the example comes from the article “The Go pprof you didn’t know”). What should we look at in all this data?

It is recommended to look at cum first (the cost of the current function plus the functions it calls) and then at flat (the cost of the current function alone). The reason to start with cum is that a high flat value often belongs to functions called many times, most of them system functions, while cum shows the whole picture, and our problem code usually shows up there. Of course, this is not absolute.

When we find the suspicious function, we can use the list command to expand it and see exactly which lines the time is spent on.

The web command opens a browser view where we can switch to the flame graph, everyone’s favorite, to see the call stack and its cost visually. This is one of my favorite interview traps: does a darker color mean a bigger problem?

Pprof can analyze performance problems in most applications.

2.1.2 trace

When the bottleneck is in the runtime itself, for example goroutine scheduling delays or overly long GC STW pauses, we can trace the runtime details. curl host/debug/pprof/trace?seconds=10 > trace.out captures ten seconds of data, which we then open with go tool trace.
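
For a program that does not expose the HTTP endpoint, the same data can be captured in code with the standard runtime/trace package; a minimal sketch (the file name is arbitrary):

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// trace.Start records scheduling, GC, and STW events until Stop is called.
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// ... run the workload being investigated ...
}
```

The output is inspected the same way, with go tool trace trace.out.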

We can use View Trace to see how the program was running during that period. Pick any segment of the timeline and zoom and pan with the W/S/A/D keys; there we can see GC time, the effect of STW, function call stacks, and goroutine scheduling.

For example, it took 4.368 ms from the moment a goroutine was woken up until it received the data (example from a PingCAP share).

2.1.3 Goroutine visualization

In addition, we can render the runtime relationships between goroutines with divan/gotrace, which produces a very interesting visualization.

2.1.4 perf

There are times when pprof itself fails, for example when the application hangs, or when the scheduler is saturated (preemptive scheduling has since addressed this). In such cases perf top shows the symbols consuming the most time (Go binaries carry the symbol table by default, so nothing needs to be injected manually).

2.1.5 Swiss Army Knife

Brendan Gregg has drawn up a map of performance tools that amounts to a Swiss Army knife. When we suspect an OS-level problem, we can pick the corresponding tool from the picture; of course, the most effective move is still to call the ops colleagues for support.

2.2 How to Optimize

Performance problems tend to have tangled origins: they may appear recently even though you changed nothing, they may be intermittent, and they may be tied to specific machines. Always benchmark: every optimization needs a baseline to compare against, and numbers are the most intuitive evidence. The application layer and the system layer behave very differently and should be reasoned about separately.
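
A baseline can be as simple as a standard testing benchmark kept next to the code in a _test.go file; the payload type and JSON workload below are just placeholders for whatever you are optimizing:

```go
package example

import (
	"encoding/json"
	"testing"
)

type payload struct {
	ID   int    `json:"id"`
	Name string `json:"name"`
}

// BenchmarkMarshal gives us a number to compare before and after any change.
func BenchmarkMarshal(b *testing.B) {
	p := payload{ID: 1, Name: "sanzang"}
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		if _, err := json.Marshal(p); err != nil {
			b.Fatal(err)
		}
	}
}
```

Run it with `go test -bench=. -benchmem`, record the numbers before touching anything, and compare runs afterwards (benchstat is handy for that).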

2.2.1 application layer

Application-layer optimization should be the first thing to consider and where most attention goes; many performance problems turn out to be badly designed business logic. After that, try some general optimizations, including (a sync.Pool sketch follows the list):

  • Introduce sync.Pool to pool and reuse resources
  • Narrow lock scope and reduce contention
  • Swap the JSON library; memory allocation is always a performance killer
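
As a hedged illustration of the first item, a minimal sync.Pool sketch that reuses buffers instead of allocating one per request (the buffer type and render function are only examples):

```go
package example

import (
	"bytes"
	"sync"
)

// bufPool reuses bytes.Buffer values instead of allocating one per request.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func render(body []byte) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold data from a previous use
	defer bufPool.Put(buf)

	buf.WriteString("prefix:")
	buf.Write(body)
	return buf.String()
}
```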

The fasthttp best practices are worth studying: performance does not come from a single trick but from attention to detail everywhere.

2.2.2 system layer

If that still doesn’t solve the problem, congratulations: you have hit a genuinely interesting one. Don’t rush to patch it yourself; try upgrading Go to the latest version first, and the problem will most likely go away. The long list of optimizations and fixes in every release exists to solve exactly the problems you and I run into, and you will often find similar issues already reported. Reading those optimization MRs teaches a lot, for example the TLS optimizations. Upgrading the version works wonders, and so does upgrading the hardware (infrastructure folks, please ignore this part).

There is no silver bullet for system-level optimization, only trade-offs. Whether to disable swap or NUMA often depends on the scenario; for details, see the Red Hat tuning guide.

3. Evolution

3.1 Continuous Profiling

Profiling is usually done only after we hit a performance problem, but in many cases the scene of the incident cannot be preserved, or the cause and the symptom are separated in time, which makes things much harder. Industry leaders have therefore proposed Continuous Profiling: just like CI/CD, keep sampling continuously. Google, as usual, has been at the forefront of this research (Google/pubs/pub365…).

With a cron job, pprof can be run periodically and the data archived, then analyzed at any time through a web interface; profiles from different time periods can even be diffed to find latent problems. Conprof is a great open-source alternative.
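
A minimal sketch of that cron-style collector in Go, assuming the target service already exposes net/http/pprof (the address, interval, and file naming are placeholders, not our production setup):

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

// fetchProfile pulls a 30-second CPU profile from a pprof endpoint and
// archives it with a timestamp so profiles can be compared later.
func fetchProfile(base string) error {
	resp, err := http.Get(base + "/debug/pprof/profile?seconds=30")
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	name := fmt.Sprintf("cpu-%s.prof", time.Now().Format("20060102-150405"))
	f, err := os.Create(name)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	for range time.Tick(10 * time.Minute) {
		if err := fetchProfile("http://localhost:6060"); err != nil {
			log.Println("profile fetch failed:", err)
		}
	}
}
```

Archived profiles from different periods can then be compared, for example with pprof’s diff_base option, to spot slowly growing hot spots.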

3.2 eBPF + Go

eBPF is the hottest dynamic tracing technology of the moment: non-intrusive, with no instrumentation required. BPF (Berkeley Packet Filter) started out as a packet-capture mechanism, and tcpdump and Wireshark are built on it; the extended version, eBPF, can hook into the kernel itself, which exposes probes for tracing the whole system. With eBPF we can see what is happening at the system level when pprof falls short.

For example, the funclatency tool from bcc can measure the time and latency distribution of a function, or trace its call stack:

```go
package main

import "fmt"

func main() {
	fmt.Println("Hello, BPF!")
}
```

```
# funclatency 'go:fmt.Println'
Tracing 1 functions for "go:fmt.Println"... Hit Ctrl-C to end.
^C
Function = fmt.Println [3041]
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 27       |****************************************|
     16384 -> 32767      : 3        |****                                    |
Detaching...
```

(Sample from https://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html)

4. To summarize

We will keep running into strange problems and new challenges. You can’t control when a problem breaks out, but you can use tools like golangci-lint and code review to catch potential problems early and reduce the frequency and blast radius of incidents. Most problems are either so simple nobody thought of them or so complex nobody can find them. Keeping code simple and following the KISS principle is a timeless idea.

Fixing the problem is not the end; it is only the beginning. Behind every serious accident there are 29 minor accidents, 300 near misses, and 1,000 potential hazards. A careful postmortem and the improvements that follow are where an incident delivers its greatest value.

Finally, if you are really interested in performance tuning, you should not miss this classic book on performance.

References

  • pprof – The Go Programming Language
  • Visualizing in Go · divan’s blog
  • QQ Music’s Go pprof in practice
  • Heap Profiling: locating and troubleshooting memory leaks
  • Graphite documentation
  • WebSocket million-long-connection technology practice
  • Diagnosing and locating online problems in large systems
  • Performance optimization for Go applications
  • Go runtime related problems in TiDB production environments
  • https://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html