Problems arise

An alarm goes off!

Some instances of the bytedance.xiaoming service were using too much memory, up to 80%. The service had not shipped a new release for a long time, so problems introduced by newly launched code could be ruled out.

When the problem was found, the instances were migrated. Except for one instance kept for troubleshooting, all other instances were migrated. Right after the migration, the memory of the new instances was low, but as time went by their memory also grew slowly, showing the same leak pattern.

Locating the problem

Hypothesis 1: A Goroutine leak is suspected

Troubleshooting process

A common cause of memory leaks is having too many goroutines, so I first suspected a goroutine problem. Looking at the goroutine count, everything was normal: the total was low and did not keep increasing. (I forgot to take a screenshot at the time; the picture was added later, but the goroutine count never changed.)
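For reference, a minimal sketch of how the goroutine count can be watched on a running instance (the address and interval here are illustrative, not our actual monitoring setup):

package main

import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* handlers, including /debug/pprof/goroutine
        "runtime"
        "time"
)

func main() {
        // Expose pprof so profiles can be pulled from the running instance.
        go func() {
                log.Println(http.ListenAndServe("localhost:6060", nil))
        }()

        // Periodically log the goroutine count; a steadily growing number hints at a goroutine leak.
        go func() {
                for range time.Tick(30 * time.Second) {
                        log.Printf("goroutines: %d", runtime.NumGoroutine())
                }
        }()

        select {} // stand-in for the real service
}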

Result

There is no goroutine leak.

Hypothesis 2: A memory leak is suspected

Troubleshooting process

pprof was used to collect live memory profiles and compare the memory usage of a problem instance with that of a normal instance:

Problem instance:

Normal instance:

Taking a closer look at the graph of the problem instance:

metrics.flushClients() uses the most memory.
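For context, a rough sketch of how a heap profile like the ones above can be captured from a running process (assuming the service does not already expose net/http/pprof; the file name is arbitrary):

package main

import (
        "os"
        "runtime/pprof"
)

// dumpHeapProfile writes the current heap profile to a file that can later be
// inspected with `go tool pprof heap.out` and compared across instances.
func dumpHeapProfile(path string) error {
        f, err := os.Create(path)
        if err != nil {
                return err
        }
        defer f.Close()
        return pprof.Lookup("heap").WriteTo(f, 0)
}

func main() {
        _ = dumpHeapProfile("heap.out")
}

The relevant code from the metrics library is shown below.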



func (c *tagCache) Set(key []byte, tt *cachedTags) {
        if atomic.AddUint64(&c.setn, 1)&0x3fff == 0 {
                // every 0x3fff times call, we clear the map for memory leak issue
                // there is no reason to have so many tags
                // FIXME: sync.Map don't have Len method and `setn` may not equal to the len in concurrency env
                samples := make([]interface{}, 0, 3)
                c.m.Range(func(key interface{}, value interface{}) bool {
                        c.m.Delete(key)
                        if len(samples) < cap(samples) {
                                samples = append(samples, key)
                        }
                        return true
                }) // clear map
                logfunc("[ERROR] gopkg/metrics: too many tags. samples: %v", samples)
        }
        c.m.Store(string(key), tt)
}

To avoid memory leaks, the code counts Set calls and periodically clears all keys stored in the sync.Map. In theory this shouldn't be a problem.

Result

There are no code bugs causing memory leaks.

Hypothesis 3: RSS is suspected to be the problem

Troubleshooting process

One thing I noticed in pprof was that metrics only used 72MB in total, and the whole heap was only 170+MB. Our instances have a 2GB memory quota, and 80% usage means roughly 1.6GB of RSS. The two numbers diverge enormously (how this discrepancy was handled is described below), and 170+MB of heap should not trigger an 80% memory usage alarm. So the guess was that freed memory was not being returned to the OS in time.

After a search, we found this amazing thing:

Before Go 1.12, the Go runtime used MADV_DONTNEED on Linux when freeing memory back to the kernel. This is less efficient, but it makes the resident set size (RSS) drop quickly. Go 1.12 optimized this: the runtime now uses the more efficient MADV_FREE instead of MADV_DONTNEED to free memory. A detailed introduction can be found here:

Go-review.googlesource.com/c/go/+/1353…

Update on Go1.12:

From Go 1.12 to Go 1.15, the runtime optimizes this GC policy: when the Linux kernel supports it (version 4.5 or later), it defaults to a more "aggressive" policy that reuses memory more efficiently and with lower latency. The downside is that RSS does not drop immediately; the drop is deferred until the system comes under memory pressure.

Our Go version is 1.15 and our kernel version is 4.14, which matches this case exactly!

Result

When the Go version and the system kernel version match this runtime GC policy, RSS does not drop even after the heap has been reclaimed.
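One way to confirm this from inside the process (a sketch; the field names come from runtime.MemStats, while RSS itself has to be read from the OS, e.g. /proc/self/status) is to watch how much heap the runtime has already released versus how much it still holds:

package main

import (
        "fmt"
        "runtime"
        "time"
)

func main() {
        for range time.Tick(10 * time.Second) {
                var m runtime.MemStats
                runtime.ReadMemStats(&m)
                // HeapInuse: spans currently in use.
                // HeapIdle - HeapReleased: memory held by the runtime but not yet returned to the OS.
                // HeapReleased: memory already returned to the OS; with MADV_FREE the kernel may
                // still count these pages in RSS until it needs them back.
                fmt.Printf("inuse=%dMB idle=%dMB released=%dMB\n",
                        m.HeapInuse>>20, m.HeapIdle>>20, m.HeapReleased>>20)
        }
}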

Problem solving

The solution

There are two solutions:

  1. Specify the environment variable GODEBUG=madvdontneed=1

This forces the runtime to keep using MADV_DONTNEED (see: github.com/golang/go/i…). However, madvise with MADV_DONTNEED triggers TLB shootdowns and more page faults, so latency-sensitive services are likely to be affected more. This environment variable therefore needs to be used with care!

  2. Upgrade the Go version to 1.16 or later

See the Go 1.16 release notes: this lazy strategy was dropped in favor of releasing memory promptly, rather than waiting until memory pressure builds up. Apparently the Go team also concluded that releasing memory promptly is preferable and more appropriate in most cases.

Note: memory can also be returned manually by calling debug.FreeOSMemory(), but it comes at a cost.

Also, FreeOSMemory does not work in go1.13 (github.com/golang/go/i…), so caution is recommended.
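For completeness, a minimal sketch of the manual approach (not what we shipped): FreeOSMemory forces a garbage collection and tries to return as much memory to the OS as possible, which is exactly why it is costly if called too often.

package main

import (
        "runtime/debug"
        "time"
)

func main() {
        // Periodically force the runtime to return freed memory to the OS.
        // Each call triggers a full GC, so keep the interval generous.
        go func() {
                for range time.Tick(5 * time.Minute) {
                        debug.FreeOSMemory()
                }
        }()

        select {} // stand-in for the real service
}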

Outcome

We chose option two. After upgrading to Go 1.16, the instances no longer showed sustained, rapid memory growth.

Looking at the instance again with pprof, we found that the per-function memory usage had also changed: the share taken by metrics had dropped. The fix seems to be paying off.

Other pitfalls encountered

Another possible memory leak was discovered during the troubleshooting (our service did not actually hit it): when mesh is not enabled, the service discovery component of KITC carries a risk of memory leaks.

As can be seen from the figure, cache.(*Asynccache).refresher occupies a large amount of memory, and its usage keeps growing as the service handles more requests.

The natural suspicion is that a new kiteclient is being built repeatedly. A code sweep found no duplicate client construction. Reading the kitc source, however, shows that during service discovery kitc maintains a cache pool, asyncache, to store instances. This cache pool is refreshed every 3 seconds; each refresh calls fetch, and fetch performs service discovery. Service discovery keeps creating new instance objects from each instance's host, port, and tags (which change with the environment env) and stores them into the asyncache pool. These instances are never cleaned up, so their memory is never released. This is the cause of the memory leak.
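The shape of the problem, reduced to a hypothetical sketch (the names asyncCache, Instance, and fetch below are illustrative, not kitc's actual code):

package main

import (
        "fmt"
        "sync"
        "time"
)

type Instance struct {
        Host string
        Port int
        Tags map[string]string
}

type asyncCache struct {
        data sync.Map // key -> *Instance; entries are never evicted
}

// refresher runs every 3 seconds and stores the instances returned by service discovery.
// Because the cache key includes tags that change with the environment, new entries keep
// being added while stale ones are never deleted, so memory only grows.
func (c *asyncCache) refresher(fetch func() []*Instance) {
        for range time.Tick(3 * time.Second) {
                for _, ins := range fetch() {
                        key := fmt.Sprintf("%s:%d:%v", ins.Host, ins.Port, ins.Tags)
                        c.data.Store(key, ins)
                }
        }
}

func main() {
        c := &asyncCache{}
        go c.refresher(func() []*Instance { return nil /* placeholder for real discovery */ })
        select {}
}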

The solution

The project was started very early, so the framework it used was old; the issue can be resolved by upgrading to the latest framework.

Summary and reflections

Let’s first define what a memory leak is:

A memory leak occurs when a program, for whatever reason, fails to release dynamically allocated heap memory that is no longer in use. It wastes system memory, slows the program down, and can even crash the system.

Common scenarios

In Go, the common memory leak scenarios are as follows:

1. Goroutine leaks

(1) Too many goroutines created

Problem Overview:

Goroutines are created faster than they exit, so the number of goroutines keeps growing.

Scenario Example:

A client is created for each request. When the volume of requests is large, too many clients are created and cannot be released.
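A hedged sketch of this anti-pattern, using net/http as a stand-in for whatever RPC client the service builds (the handler names and downstream URL are made up):

package main

import (
        "net/http"
        "time"
)

// Anti-pattern: every request builds its own client with its own transport, so each one
// keeps its own idle-connection pool alive; under heavy traffic these pile up faster
// than they are released.
func handleBad(w http.ResponseWriter, r *http.Request) {
        client := &http.Client{Transport: &http.Transport{}, Timeout: 3 * time.Second}
        resp, err := client.Get("http://downstream.example.com") // hypothetical downstream
        if err == nil {
                resp.Body.Close()
        }
}

// Preferred: one long-lived client shared by all requests.
var sharedClient = &http.Client{Timeout: 3 * time.Second}

func handleGood(w http.ResponseWriter, r *http.Request) {
        resp, err := sharedClient.Get("http://downstream.example.com")
        if err == nil {
                resp.Body.Close()
        }
}

func main() {
        http.HandleFunc("/bad", handleBad)
        http.HandleFunc("/good", handleGood)
        http.ListenAndServe(":8080", nil)
}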

(2) Goroutine blocking

① I/O problems

Problem Overview:

No timeout is set on the I/O connection, so the goroutine keeps waiting.

Scenario Example:

If no timeout is set, the code blocks forever.
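A generic sketch (using net/http as the I/O client; the URL is only an example): without a timeout, a hung connection can pin the goroutine forever, while a timeout bounds the wait.

package main

import (
        "fmt"
        "net/http"
        "time"
)

func main() {
        // Without Timeout, a peer that never responds can block this goroutine indefinitely:
        // bad := &http.Client{}

        // With Timeout, the call returns an error after at most 5 seconds.
        good := &http.Client{Timeout: 5 * time.Second}
        resp, err := good.Get("https://www.baidu.com")
        if err != nil {
                fmt.Println("request failed:", err)
                return
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
}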

② The mutex is not released

Problem Overview:

The Goroutine cannot obtain the lock resource, causing the Goroutine to block.

Scenario Example:

Suppose there is a shared variable. goroutineA locks it but never releases the lock, so goroutineB, goroutineC, …, goroutineN can never acquire it, and all of those goroutines block.
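A minimal sketch of this situation: goroutineA takes the lock and, because of a bug, never releases it, so every later goroutine blocks on Lock(); writing `defer mu.Unlock()` right after `mu.Lock()` avoids this whole class of bug.

package main

import (
        "fmt"
        "sync"
        "time"
)

var (
        mu     sync.Mutex
        shared int
)

func main() {
        // goroutineA locks the shared variable and, simulating a bug, never unlocks it.
        go func() {
                mu.Lock()
                shared++
                // missing mu.Unlock() -- `defer mu.Unlock()` here would prevent the leak
        }()

        time.Sleep(100 * time.Millisecond)

        // goroutineB (and C, ..., N) block here forever waiting for the lock.
        go func() {
                mu.Lock()
                defer mu.Unlock()
                fmt.Println("never printed:", shared)
        }()

        time.Sleep(time.Second)
}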

③ Improper use of WaitGroup

Problem Overview:

The WaitGroup's Add, Done, and Wait counts do not match, so Wait waits forever.

Scenario Example:

A WaitGroup can be thought of as a goroutine manager. It needs to know how many goroutines are working for it, and each one has to report when it is done; otherwise it waits until all of them have finished. After we call Add, the program waits until it has received the corresponding number of Done() signals. Suppose we call Add(2) but Done() only once: one task is still outstanding, so the WaitGroup keeps waiting. See the WaitGroup section of the goroutine exit mechanism for details.
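A minimal sketch of the mismatch: Add(2) but only one Done(), so Wait() never returns. (Run standalone, the Go runtime reports a deadlock; inside a real service with other live goroutines, the waiting goroutine simply leaks.)

package main

import "sync"

func main() {
        var wg sync.WaitGroup
        wg.Add(2) // we promise two tasks...

        go func() {
                defer wg.Done()
                // task 1 does its work
        }()
        // ...but the second task never runs (or never calls Done),
        // so the counter never reaches zero.

        wg.Wait() // waits forever
}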

2. select blocking

Problem Overview:

select is used but the cases do not cover every channel; when no case is ready (and there is no default), the goroutine blocks.

Scenario Example:

Blocking occurs when the select cases do not cover all channels and there is no default branch. Example code:

package main

import "fmt"

// Getdata is assumed to fetch the URL and send a result on ch (stub for illustration).
func Getdata(url string, ch chan int) {
        ch <- len(url)
}

func main() {
        ch1 := make(chan int)
        ch2 := make(chan int)
        ch3 := make(chan int)
        go Getdata("https://www.baidu.com", ch1)
        go Getdata("https://www.baidu.com", ch2)
        go Getdata("https://www.baidu.com", ch3)
        // ch3 has no corresponding case and there is no default,
        // so the goroutine writing to ch3 blocks forever.
        select {
        case v := <-ch1:
                fmt.Println(v)
        case v := <-ch2:
                fmt.Println(v)
        }
}

3. Channel blocking

Problem Overview:

  • Write blocking
    • On an unbuffered channel, a write usually blocks because there is no reader
    • On a buffered channel, a write blocks because the buffer is full
  • Read blocking
    • A read expects data from a channel, but no goroutine ever writes to it

Scenario Example:

Any of these three bugs can cause channel blocking. Here are a few real production cases of channel blocking (a self-contained sketch of the three blocking modes follows the list):

  • Lark_cipher library machine faults summary
  • Cipher Goroutine leakage analysis
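In addition to those internal cases, here is a self-contained sketch of the three blocking modes listed above; each of the spawned goroutines below stays blocked (and therefore leaks) forever:

package main

import (
        "fmt"
        "time"
)

func main() {
        // 1) Unbuffered channel: the write blocks because nothing ever reads.
        unbuffered := make(chan int)
        go func() {
                unbuffered <- 1 // blocked forever
        }()

        // 2) Buffered channel: once the buffer is full, the next write blocks.
        buffered := make(chan int, 1)
        buffered <- 1
        go func() {
                buffered <- 2 // buffer full, no reader: blocked forever
        }()

        // 3) Read side: waiting for data on a channel nobody writes to.
        silent := make(chan int)
        go func() {
                fmt.Println(<-silent) // blocked forever
        }()

        time.Sleep(time.Second) // the three goroutines above are still blocked here
}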

4. Incorrect use of timers

(1) Improper use of time.After()

Problem Overview:

By default, time.After() is leaky: every call to time.After(duration x) creates a new timer (NewTimer), and the newly created timer is not garbage-collected until duration x has elapsed; only after that can it be collected.

Over time, especially when duration x is large, this adds up to a memory leak.

Scenario Example:

package main

import "time"

func main() {
        ch := make(chan string, 100)
        go func() {
                for {
                        ch <- "continue"
                }
        }()
        for {
                select {
                case <-ch:
                // Each iteration calls time.After again, creating a new timer that
                // is not garbage-collected until its 3 minutes have elapsed.
                case <-time.After(time.Minute * 3):
                }
        }
}

(2) time.Ticker is not stopped

Problem Overview:

When using time.Ticker, you must call its Stop method manually; otherwise, it leaks permanently.

Scenario Example:

package main

import (
        "fmt"
        "time"
)

func main() {
        ticker := time.NewTicker(5 * time.Second)
        go func(ticker *time.Ticker) {
                for range ticker.C {
                        fmt.Println("Ticker1....")
                }
                fmt.Println("Ticker1 Stop")
        }(ticker)
        time.Sleep(20 * time.Second)
        // ticker.Stop() // never called, so the ticker is never released
}

Suggestion: always initialize the ticker outside the for loop and stop it manually when the loop ends.
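Following that suggestion, a fixed version of the sketch above creates the ticker once, stops it with defer, and exits the loop cleanly:

package main

import (
        "fmt"
        "time"
)

func main() {
        ticker := time.NewTicker(5 * time.Second) // created once, outside the loop
        defer ticker.Stop()                       // released when we are done with it

        deadline := time.After(20 * time.Second)
        for {
                select {
                case <-ticker.C:
                        fmt.Println("Ticker1....")
                case <-deadline:
                        fmt.Println("Ticker1 Stop")
                        return
                }
        }
}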

5. Slice causes memory leaks

Problem Overview:

  1. Two slices share the same underlying array; one of them is a global variable, so the array can never be GC'd.
  2. A slice is appended to repeatedly but never cleaned up.

Scenario Example:

  1. Going straight to the code: with this pattern, the array underlying b is never GC'd (a sketch of the usual fix follows after this list).
package main

var a []int

// test re-slices b into the global a, so a keeps the whole
// underlying array of b alive and it can never be GC'd.
func test(b []int) {
        a = b[:3]
}
  2. The kitc service discovery code mentioned in "Other pitfalls encountered" is an example of this problem.
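For the first case, the usual fix (a sketch) is to copy the few elements that are needed into a fresh slice, so the global no longer keeps b's whole underlying array alive:

package main

var a []int

// test copies the three elements it needs instead of re-slicing b,
// so b's backing array can be garbage-collected normally.
func test(b []int) {
        a = make([]int, 3)
        copy(a, b[:3])
}

func main() {
        test(make([]int, 1<<20))
}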

Summary of troubleshooting ideas

If you run into a Golang memory leak in the future, you can follow these steps to troubleshoot it:

  1. Observe the server instance and check memory usage to confirm whether there is a memory leak.
  • You can view it directly in the "Instance List" on the TCE platform.
  • It can also be viewed under "Runtime Monitoring" on the MS platform.
  2. Check for goroutine problems.
  • Use the monitoring mentioned in step 1 to watch the goroutine count, or use pprof sampling to determine whether the goroutine count is growing abnormally.
  3. Pin down the code problem.
  • Use pprof to locate the offending lines of code via function names, the graph view, the source view, and so on.
  • Check whether the call chain has any of the problems from the scenarios above, such as select blocking, channel blocking, or improper use of slices. Prefer to suspect your own code logic first, then consider whether the framework is behaving unreasonably.
  4. Fix the problem, observe it in the test environment, and observe it online after the fix passes testing.

Recommended troubleshooting tools

  • pprof: Go's built-in performance profiling tool; it provides CPU, heap, goroutine, and other profiles, which can be consumed as generated reports, through a web UI, or in an interactive terminal
  • Nemo: a wrapper around pprof that samples a single process
  • ByteDog: provides more metrics on top of pprof and samples the whole container/physical machine
  • Lidar: categorizes and presents ByteDog's sampling results (currently the platform's preferred tool, compared with Nemo)
  • Smart OnCall gadget: a troubleshooting helper built by the Kite team that is very easy to use; just send the podName to the bot group

Join us

Feishu is ByteDance's advanced enterprise collaboration and management platform, helping organizations upgrade across the three dimensions of goals, information, and people. It integrates instant messaging, calendar, audio and video conferencing, documents, cloud drive, and other office collaboration tools in one place, making work more efficient and more pleasant for organizations and individuals. Feishu currently serves advanced enterprises in many fields, including the Internet, information technology, manufacturing, construction and real estate, education, and media. We are the Lark Core Services team at Feishu, responsible for Feishu's core IM capabilities, including messaging, groups, user profiles, open capabilities, and more. We look forward to your joining us!

Recruitment link:

job.toutiao.com/s/Ne1ovPK

School Recruitment Link (Summer Internship)

jobs.toutiao.com/s/NJ3oxsp