Author: "Protect Our Li Yuanfang". For commercial reprint, please contact the author for authorization; for non-commercial reprint, please credit the source.

Fault symptoms

An online event-tracking (analytics reporting) machine occasionally triggered an alarm for excessive memory usage. After SSHing in and running top, I found that most of the memory was occupied by the tracking service. It had already been restarted a few times, but after a while the high-memory alarms came back. Here are the alarm details:

[P1][PROBLEM][ali-e-xxX-service03][][][all(#3) mem.memfree.percent 4.19575<5][O3 >2019-10-28 10:20:00]

Initial inference

The tracking service mainly receives compressed requests from clients, decompresses the request data, and delivers it to Kafka; the logic is fairly simple. The initial suspicion was a memory leak or goroutine leak caused by some resource not being released.
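As a rough mental model (this is not the actual service code; the handler name, port, and the stubbed-out produce function are made up for illustration), such a service boils down to something like this:

package main

import (
	"compress/gzip"
	"io/ioutil"
	"log"
	"net/http"
)

// produce stands in for the real Kafka delivery, which is elided here.
func produce(payload []byte) error { return nil }

// reportHandler decompresses the gzip-compressed request body and
// hands the raw payload off to Kafka.
func reportHandler(w http.ResponseWriter, r *http.Request) {
	gz, err := gzip.NewReader(r.Body)
	if err != nil {
		http.Error(w, "bad gzip body", http.StatusBadRequest)
		return
	}
	defer gz.Close()

	data, err := ioutil.ReadAll(gz)
	if err != nil {
		http.Error(w, "read error", http.StatusBadRequest)
		return
	}
	if err := produce(data); err != nil {
		http.Error(w, "delivery error", http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/report", reportHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}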

Troubleshooting

Since the code is not maintained by our team, we first asked the relevant department for access to it. After reading the source, we found that all resource releases were handled with defer, and there were no obviously unreleased resources.

1. Modify the online configuration and enable the trace port

Anyone familiar with Go knows that it ships with a very convenient performance diagnosis tool, go tool trace. It is a heavyweight in the Go toolbox: it reveals runtime events, memory usage, and so on, and it is one of the most useful tools in the Go ecosystem for diagnosing latency, parallelism, and contention problems.

Because our services are all built on the company's base library, which wraps Go's trace support, all we had to do was modify the online config file so that trace information is served on the port specified there, and then analyze it with the Go tooling.

I then ran the Go tooling locally and found that I could not connect. On closer inspection, the trace port was not bound to 0.0.0.0 [dog head]. So I used a proxy tool to reverse-proxy port 9900 to port 9999, and then used the go tool to produce a flame graph of memory usage in the normal state (flame-graph screenshot omitted).
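For reference, stripped of the company wrapper, exposing the standard debug endpoints amounts to roughly the sketch below; the port number is illustrative, and the point is the 0.0.0.0 binding:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof/trace endpoints on a dedicated port.
	// Binding to 0.0.0.0 (instead of 127.0.0.1) makes them reachable remotely.
	go func() {
		log.Println(http.ListenAndServe("0.0.0.0:9999", nil))
	}()
	select {} // stand-in for the real service
}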

2. Wait for the fault to occur

Waiting is a long process; if you're unlucky it can take days or even months. (Tens of thousands of words of waiting omitted here x#@#@$@!%?@^!)

3. What’s supposed to come will come

A few days later I received another server alert, earlier than the previous ones because QPS had increased. Immediately after receiving the alarm, I captured a heap profile of the top memory consumers:

go tool pprof -alloc_space http://{addr}/debug/pprof/heap
Showing top 20 nodes out of 55
      flat  flat%   sum%        cum   cum%
    2.76GB 59.76% 59.76%     2.76GB 59.76%  compress/flate.NewReader
    0.45GB  9.73% 69.49%     0.45GB  9.73%  net/http.newBufioWriterSize
    0.29GB  6.33% 75.82%     3.05GB 66.09%  net/http.newBufioReader
    0.13GB  2.85% 84.01%     0.13GB  2.85%  runtime.rawstringtmp
    0.11GB  2.31% 86.32%     0.11GB  2.31%  bytes.makeSlice
    0.10GB  2.26% 88.58%     0.10GB  2.26%  runtime.hashGrow

Each line represents one function. The first two columns are the memory allocated in the function itself (flat) and its percentage of the total; the third column (sum%) is the running total of those percentages. The fourth and fifth columns (cum, cum%) are the memory attributed to the function plus everything it calls, so they should be greater than or equal to the first two columns. The last column is the function name. If the application has a performance problem, these numbers tell us where memory is being spent; pprof's CPU-time analysis works the same way.

pprof can not only show the biggest memory consumers (top), it can also list a function's source with per-line sample data (list) or its assembly with sample data (disasm), and it can render output in various formats such as svg, gv, callgrind, png, and gif.

You can see that most of the memory is occupied by these readers.

4. Reread the code with these questions in mind

The preliminary analysis above points at NewReader. Package flate implements the DEFLATE compressed data format described in RFC 1951, and package gzip implements reading and writing of gzip files, which are built on DEFLATE. With that in mind, we went back to the relevant source; here are the key definitions:

var Gzip GzipPool
func GetReader(src io.Reader) (*gzip.Reader, error) {
	return Gzip.GetReader(src)
}

func PutReader(reader *gzip.Reader) {
	Gzip.PutReader(reader)
}

// GzipPool manages a pool of gzip.Reader and gzip.Writer.
// The pool uses sync.Pool internally.
type GzipPool struct {
	readers sync.Pool
	writers sync.Pool
}
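The method bodies are omitted above, but a reader pool built on sync.Pool typically looks roughly like the following sketch (not the base library's exact code; it assumes the compress/gzip and io imports):

// GetReader fetches a *gzip.Reader from the pool (or allocates one)
// and resets it to read from src.
func (p *GzipPool) GetReader(src io.Reader) (*gzip.Reader, error) {
	if r, ok := p.readers.Get().(*gzip.Reader); ok {
		if err := r.Reset(src); err != nil {
			return nil, err
		}
		return r, nil
	}
	return gzip.NewReader(src)
}

// PutReader returns a reader to the pool so it can be reused
// instead of being garbage collected right away.
func (p *GzipPool) PutReader(reader *gzip.Reader) {
	p.readers.Put(reader)
}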

sync.Pool exists to reduce allocations and take pressure off the GC. To verify that the memory really was being held in the pool, we used go tool pprof's web view to look at where the pooled memory was retained, and sure enough, most of it was in the pool. (This was a while ago, so no screenshot this time…)

5. What is sync.Pool?

The sync package provides Go's basic concurrency primitives. According to the official documentation:

Pool’s purpose is to cache allocated but unused items for later reuse, relieving pressure on the garbage collector. That is, it makes it easy to build efficient, thread-safe free lists. However, it is not suitable for all free lists.

Go is commonly used to build high-concurrency services, but its garbage collector can affect application performance. To reduce GC work, Go provides an object-reuse mechanism: the sync.Pool object pool. sync.Pool is scalable and concurrency-safe; its size is limited only by available memory, and it can be thought of as a container for reusable values. Its purpose is to hold objects that have been allocated but are not currently in use, so that when one is needed it can be fetched straight from the pool instead of being allocated again. At this point, anyone familiar with the GC has probably guessed the answer: it is likely related to GC.
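For anyone who has not used it, the API is just Get and Put. A minimal sketch with a bytes.Buffer pool (the key property for this incident is in the final comment: whatever sits in the pool may simply be dropped at the next GC cycle):

package main

import (
	"bytes"
	"fmt"
	"sync"
)

var bufPool = sync.Pool{
	// New is called when Get finds the pool empty.
	New: func() interface{} { return new(bytes.Buffer) },
}

func main() {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // always reset pooled objects before reuse
	buf.WriteString("hello")
	fmt.Println(buf.String())
	bufPool.Put(buf) // return it; it may be reused, or discarded at the next GC
}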

So, when is Go's GC actually triggered?

As of Go 1.13, the GC is a concurrent tri-color mark-and-sweep collector with write barriers and mutator assists. There are two main trigger conditions:

  1. The allocated memory exceeds a threshold.
  2. A fixed time interval has elapsed.

The threshold is controlled by a variable called gcpercent (exposed as the GOGC environment variable): a collection is triggered when newly allocated memory, as a fraction of the memory in use after the previous collection, exceeds gcpercent.

For example, if 5 MB of memory is in use after a collection, the next collection is triggered when allocations bring the heap to 10 MB. In other words, the more memory in use, the more allocation it takes to trigger the next collection. And what if the threshold is never reached? In that case GC is triggered on a timer: if no collection has run for a fixed interval (2 minutes by default), the runtime forces one to make sure resources are eventually reclaimed.
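To see these numbers for a running process, runtime.ReadMemStats exposes the current heap size, the heap goal for the next collection, and the GC count, and debug.SetGCPercent adjusts the threshold (the same knob as the GOGC environment variable). A quick sketch:

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	// Equivalent to GOGC=100 (the default): the next GC runs when the heap
	// has grown 100% over the live heap left after the previous GC.
	debug.SetGCPercent(100)

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapAlloc=%d KB, NextGC=%d KB, NumGC=%d\n",
		m.HeapAlloc/1024, m.NextGC/1024, m.NumGC)
}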

Therefore, as memory usage grows, threshold-based collections become less and less frequent and the interval approaches the 2-minute forced GC. Between collections, objects sitting in the pool are not reclaimed, so memory usage keeps creeping up.

Actively trigger GC for validation

We modified the code to set an rlimit and start a goroutine that actively calls GC every 30 seconds, then tested it after deployment. Observed online interface latency showed no significant change, the system ran normally, and the high-memory alarm has not been triggered since.
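The periodic-GC part of the change amounts to something like the sketch below (the 30-second interval matches the one above; the rlimit adjustment is elided, and startPeriodicGC is a made-up helper name):

package main

import (
	"runtime"
	"time"
)

// startPeriodicGC forces a collection every interval so that objects
// parked in sync.Pool get released even when the allocation threshold
// is never reached.
func startPeriodicGC(interval time.Duration) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for range ticker.C {
			runtime.GC()
		}
	}()
}

func main() {
	startPeriodicGC(30 * time.Second)
	// ... start the actual service here ...
	select {}
}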

Summary

At first I did not pay much attention to the occasional memory alarms and trusted to luck, but problems are objective: they need to be found promptly and summarized regularly so you keep improving. Try to avoid the "just restart it" reflex, lest you become an SRB (Service ReBoot Boy 😁).

