Premature optimization is the root of all evil.

preface

This post is based on a talk from GopherCon 2019. First, a Go program that counts the words in a text file is compared with wc, and its performance is optimized step by step by profiling CPU and memory usage, touching on Go's concurrency model and escape analysis along the way. Then a program that draws the Mandelbrot fractal is used to discuss execution tracing of concurrent programs and the limits of the speedup that concurrency can bring.

pprof & trace

Go’s built-in ecosystem provides a number of APIs and tools for diagnosing logic and performance problems in applications. They can be roughly divided into the following categories:

  • Profiling: Profiling tools such as pprof are used to analyze complexity and overhead in a program, such as memory usage and function call frequency, and identify the most expensive parts of the program.
  • Tracing: Tracing is used to analyze latency across the entire flow of a call or user request. It works across processes and shows the time spent in each component of the whole system.
  • Debugging: Debugging lets you pause a program, inspect its state, and step through its execution flow.
  • Runtime Statistics and Events: Collecting and analyzing runtime statistics and events provides a high-level overview of program health; viewed from that altitude, they help us judge throughput, utilization, and changes in performance.

More information about pprof and Trace can be found at golang.org/doc/diagnos…
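
As a tiny illustration of the last category, the standard runtime package exposes statistics directly. A minimal sketch (just an illustration, not part of the talk's code):

package main

import (
	"fmt"
	"runtime"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms) // snapshot of allocator and GC counters
	fmt.Printf("heap in use: %d bytes, GC cycles: %d\n", ms.HeapAlloc, ms.NumGC)
}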

Analyze CPU or memory usage using pprof

Here is a text word-count program with some code smells. We will analyze its problems step by step with pprof.

package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"unicode"
)

func readbyte(r io.Reader) (rune, error) {
	var buf [1]byte
	_, err := r.Read(buf[:])
	return rune(buf[0]), err
}

func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		log.Fatalf("could not open file %q: %v", os.Args[1], err)
	}

	words := 0
	inword := false
	for {
		r, err := readbyte(f)
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatalf("could not read file %q: %v", os.Args[1], err)
		}
		if unicode.IsSpace(r) && inword {
			words++
			inword = false
		}
		inword = unicode.IsLetter(r)
	}
	fmt.Printf("%q: %d words\n", os.Args[1], words)
}

Here is a 1.2 MB text file to test the program with.

$ ls -lh moby.txt
-rw-r--r--  1 f1renze  staff   1.2M Jan 19 16:32 moby.txt

Let’s run the program and see. It took over two seconds to read the file and count the words, which does not look good.

$ time go run main.go moby.txt
"moby.txt": 181275 words
        2.13 real         1.41 user         1.81 sys
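For comparison, the preface mentions that the talk benchmarks this against wc. You can try the same on your machine (output omitted here since timings vary, but wc -w finishes far faster than our two seconds):

$ time wc -w moby.txt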

To profile the CPU, add the following code snippet to generate the profile file:

func main() {
	// add the following code (import "runtime/pprof")
	cpuProfile, _ := os.Create("cpu_profile")
	pprof.StartCPUProfile(cpuProfile)
	defer pprof.StopCPUProfile()

	...

Re-execute the program and start pprof on the web (install Graphviz first):

go tool pprof -http=:8081 cpu_profile

The default view is a detailed call graph annotated with the time spent in each function. There are three main branches here.
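
If you prefer the terminal over the web UI, the same data is available interactively; for example, pprof's top command lists the functions with the most CPU samples:

$ go tool pprof cpu_profile
(pprof) top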

Syscall takes up 0.93 seconds of CPU time. Why is so much time spent on syscalls? Reading the original code with this question in mind, we see that in the for loop the program keeps calling the readbyte function to read the file:

	for {
		r, err := readbyte(f)
		if err == io.EOF {
			break
		}
		...

In this method, only one byte is read at a time:

func readbyte(r io.Reader) (rune, error) {
	var buf [1]byte
	_, err := r.Read(buf[:])
	return rune(buf[0]), err
}

That’s the problem!

It is the frequent unbuffered reads that cause all that time in syscalls, and they cause another problem as well. Looking at the left side of the call stack, runtime.pthread_cond_signal and runtime.pthread_cond_wait took 0.58 seconds and 1 second respectively.

Goroutines are scheduled by Go's built-in scheduler, the GMP model: each P is assigned an M (a real OS thread) to execute Gs, and when the M executing a G becomes blocked in a synchronous system call (such as file I/O), the P is handed off to a new M (a newly created OS thread, or one taken from the thread cache). The diagram below:

M1 is executing a G that enters a system call; the scheduler detaches M1 from the P (G1 stays attached to M1), assigns M2 to the P, and then picks G2 from the P's Local Run Queue to execute (at which point a context switch happens on M2).

When G1's blocking system call completes, G1 is moved back to the P's Local Run Queue, and M1 goes into the thread cache.

So the real cause is this: the overhead of many system calls, plus Ms getting stuck in those calls and forcing P handoffs, makes the program perform very poorly.
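
If you want to watch this handoff from the outside, the runtime can print scheduler state periodically via the GODEBUG schedtrace option (the value is an interval in milliseconds). Each line reports gomaxprocs, idle Ps, and the thread count, so a climbing threads value during the unbuffered run hints at new Ms replacing ones stuck in syscalls:

$ GODEBUG=schedtrace=1000 go run main.go moby.txt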

After finding out why, let’s optimize the code:

words := 0
inword := false
// wrap the file in a buffered io.Reader (import "bufio")
b := bufio.NewReader(f)
for {
	// read from the buffer instead of straight from the file
	r, err := readbyte(b)
	if err == io.EOF {
		break
	}
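
As an aside, bufio.Reader already has ReadByte and ReadRune methods, so the helper could arguably be dropped entirely; a sketch (note ReadRune also decodes multi-byte UTF-8 correctly, which the rune(buf[0]) cast in readbyte does not):

b := bufio.NewReader(f)
for {
	r, _, err := b.ReadRune() // returns the rune, its size in bytes, and an error
	if err == io.EOF {
		break
	}
	...
}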

View the running time, which is a fraction of the previous execution time:

$ time go run main.go moby.txt
"moby.txt": 181275 words
        0.63 real         0.29 user         0.25 sys

Looking at the CPU profile again, there is only one branch left, and since the buffered reader reduces the program to a relative handful of system calls, the time is almost negligible:

Clearly the CPU profile has little more to tell us, so let's profile the program's memory usage instead.

To demonstrate, we change the sampling rate to 1 here so that the profiler collects information on every allocation. (This is generally not recommended, as it slows the application down.)

func main() {
	// add the following code (import "runtime" and "runtime/pprof")
	memProfile, _ := os.Create("mem_profile")
	runtime.MemProfileRate = 1

	defer func() {
		pprof.WriteHeapProfile(memProfile)
		memProfile.Close()
	}()
	...
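Then open the heap profile the same way as the CPU profile:

go tool pprof -http=:8081 mem_profile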

Viewing the call stack and the memory allocation breakdown:

readbyte allocates 1.2 MB of memory in total, the same size as the moby.txt file.

Comparing the main function with the readbyte function, you can see that the reader returned by bufio takes up only 4 KB of memory, while the buf array in readbyte accounts for 1.2 MB!

Since the buf array is declared inside a function, in theory it should be allocated on the function's stack and freed when the function returns. The profile says otherwise, however, so let's look at the compiler's escape analysis log:

$ go build -gcflags=-m main.go
# command-line-arguments
./main.go:15:6: moved to heap: buf
./main.go:43:21: moved to heap: buf
...

Omitting the irrelevant output, we can see that the compiler moved buf, a local variable, to the heap. Whether a variable goes on the heap or the stack is decided by the Go compiler, and because buf[:] is passed through the io.Reader interface, the compiler cannot prove that the slice does not outlive the call. So the buf declared on every entry into readbyte is allocated on the heap, and the allocations add up to the size of the file itself.

We change buf to a package-level variable and profile again, and the problem is solved:
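
For reference, a minimal sketch of that change; this is my reconstruction, since the text only describes the fix, and note the trade-off that a shared package-level buffer makes readbyte unsafe for concurrent callers:

// package-level buffer: allocated once, so nothing escapes per call
var buf [1]byte

func readbyte(r io.Reader) (rune, error) {
	_, err := r.Read(buf[:])
	return rune(buf[0]), err
}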

Analyze Goroutine execution using Trace

pprof is powerful, but Go programs often involve so much concurrency that a CPU or memory profile may not yield much useful information. For that, Go ships a super useful trace tool that helps us dig further into the performance of highly concurrent programs.

Here is a program that draws the Mandelbrot set and supports several working modes. Its main job is to compute a value for each pixel from its coordinates and fill in the color. There is no need to study the details, since the focus of this article is trace:

// mandelbrot example code adapted from Francesc Campoy's mandelbrot package.
// https://github.com/campoy/mandelbrot
package main

import (
	"flag"
	"image"
	"image/color"
	"image/png"
	"log"
	"os"
	"sync"
)

func main() {
	var (
		height  = flag.Int("h", 1024, "height of the output image in pixels")
		width   = flag.Int("w", 1024, "width of the output image in pixels")
		mode    = flag.String("mode", "seq", "mode: seq, px, row, workers")
		workers = flag.Int("workers", 1, "number of workers to use")
	)
	flag.Parse()

	const output = "mandelbrot.png"

	// open a new file
	f, err := os.Create(output)
	if err != nil {
		log.Fatal(err)
	}

	// create the image
	c := make([][]color.RGBA, *height)
	for i := range c {
		c[i] = make([]color.RGBA, *width)
	}

	img := &img{
		h: *height,
		w: *width,
		m: c,
	}

	switch *mode {
	case "seq":
		seqFillImg(img)
	case "px":
		oneToOneFillImg(img)
	case "row":
		onePerRowFillImg(img)
	case "workers":
		nWorkersPerRowFillImg(img, *workers)
	default:
		panic("unknown mode")}// and encoding it
	iferr := png.Encode(f, img); err ! =nil {
		log.Fatal(err)
	}
}

type img struct {
	h, w int
	m    [][]color.RGBA
}

func (m *img) At(x, y int) color.Color { return m.m[x][y] }
func (m *img) ColorModel() color.Model { return color.RGBAModel }
func (m *img) Bounds() image.Rectangle { return image.Rect(0, 0, m.h, m.w) }

// SEQSTART OMIT
func seqFillImg(m *img) {
	for i, row := range m.m {
		for j := range row {
			fillPixel(m, i, j)
		}
	}
}

// SEQEND OMIT

func oneToOneFillImg(m *img) {
	var wg sync.WaitGroup
	wg.Add(m.h * m.w)
	for i, row := range m.m {
		for j := range row {
			go func(i, j int) {
				fillPixel(m, i, j)
				wg.Done()
			}(i, j)
		}
	}
	wg.Wait()
}

func onePerRowFillImg(m *img) {
	var wg sync.WaitGroup
	wg.Add(m.h)
	for i := range m.m {
		go func(i int) {
			for j := range m.m[i] {
				fillPixel(m, i, j)
			}
			wg.Done()
		}(i)
	}
	wg.Wait()
}

func nWorkersFillImg(m *img, workers int) {
	c := make(chan struct{ i, j int })
	for i := 0; i < workers; i++ {
		go func() {
			for t := range c {
				fillPixel(m, t.i, t.j)
			}
		}()
	}

	for i, row := range m.m {
		for j := range row {
			c <- struct{ i, j int }{i, j}
		}
	}
	close(c)
}

func nWorkersPerRowFillImg(m *img, workers int) {
	c := make(chan int, m.h)
	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			for row := range c {
				for col := range m.m[row] {
					fillPixel(m, row, col)
				}
			}
			wg.Done()
		}()
	}

	for row := range m.m {
		c <- row
	}
	close(c)
	wg.Wait()
}

func fillPixel(m *img, x, y int) {
	const n = 1000
	const Limit = 2.0
	Zr, Zi, Tr, Ti := 0.0, 0.0, 0.0, 0.0
	Cr := (2*float64(x)/float64(n) - 1.5)
	Ci := (2*float64(y)/float64(n) - 1.0)

	for i := 0; i < n && (Tr+Ti <= Limit*Limit); i++ {
		Zi = 2*Zr*Zi + Ci
		Zr = Tr - Ti + Cr
		Tr = Zr * Zr
		Ti = Zi * Zi
	}
	paint(&m.m[x][y], Tr, Ti)
}

func paint(c *color.RGBA, x, y float64) {
	n := byte(x * y * 2)
	c.R, c.G, c.B, c.A = n, n, n, 255
}

Let's run it; it's a little slow:

$ time go run mandelbrot.go
        1.93 real         1.70 user         0.27 sys

Because the default mode is sequential, and the program does no extra I/O beyond creating the file and encoding the image, there is limited room for further code-level optimization. So here we will go straight to trace and look at what it shows. First, add the snippet that generates the trace:

func main() {
	var (
		height  = flag.Int("h", 1024, "height of the output image in pixels")
		width   = flag.Int("w", 1024, "width of the output image in pixels")
		mode    = flag.String("mode", "seq", "mode: seq, px, row, workers")
		workers = flag.Int("workers", 1, "number of workers to use")
	)
	flag.Parse()

	// add the following snippet (import "runtime/trace")
	var fn string
	switch *mode {
	case "seq":
		fn = "trace.seq"
	case "px":
		fn = "trace.px"
	case "row":
		fn = "trace.row"
	case "workers":
		fn = "trace.workers"
	}
	traceFile, _ := os.Create(fn)
	if err := trace.Start(traceFile); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	...

Use the go tool trace trace.seq command and view the result in Chrome:

Tips: Shift + ? shows the keyboard shortcuts; w/s zoom in and out

Observant readers might notice that Proc 0 is working the whole time while the rest are mostly idle. Since the computations for different pixels are independent of one another, we can hand them off to different goroutines.

So let's change the working mode and compute each pixel concurrently instead. The implementation is already in the program:

func oneToOneFillImg(m *img) {
	var wg sync.WaitGroup
	wg.Add(m.h * m.w)
	for i, row := range m.m {
		for j := range row {
			go func(i, j int) {
				fillPixel(m, i, j)
				wg.Done()
			}(i, j)
		}
	}
	wg.Wait()
}
Copy the code

Let's switch to the corresponding px working mode and see:

$ time go run mandelbrot.go -mode px
        2.01 real         7.26 user         2.90 sys

Strangely enough, it is slower. Let's use go tool trace trace.px to see what's going on.

The size of the generated trace file scales with the number of goroutines, so running this command may be slow

Because of the sheer number of goroutines, the trace is divided into many fragments. Check out one of them:

No Proc is idle this time, but the execution flow looks intermittent, and zooming in makes the gaps obvious:

So here's the problem: the concurrency granularity is too fine. The amount of work per goroutine is too small to even cover its startup and scheduling overhead. There's no such thing as a free lunch, and too many goroutines become an added burden.

Next, we coarsen the concurrency granularity by assigning each row to a different goroutine. The corresponding mode and source code are as follows:

$ time go run mandelbrot.go -mode row
        0.85 real         1.85 user         0.32 sys
func onePerRowFillImg(m *img) {
	var wg sync.WaitGroup
	wg.Add(m.h)
	for i := range m.m {
		go func(i int) {
			for j := range m.m[i] {
				fillPixel(m, i, j)
			}
			wg.Done()
		}(i)
	}
	wg.Wait()
}

You can see that it is obviously much faster than the previous two modes. Let’s see how the trace looks:

The pixel-computation part of the execution flow looks extremely comfortable and makes good use of CPU resources, with all 12 threads running flat out (a hyper-threaded 6-core CPU). Zooming in, there is almost no gap between the goroutines, which is why Go is the best language in the world 😆.

Of course, this mode still has drawbacks. Notice the pronounced churn in the Goroutines row at the very top of the trace UI: even though we've coarsened the concurrency granularity, the program still creates a large number of goroutines in a short time and then works through them. Since the number of CPU cores is fixed, and with it the number of OS threads that can execute goroutines in parallel at any moment, we can pool the goroutines to save resources. This is what is usually called a Worker Pool.

There are many different Worker Pool implementations, but they are all very similar: start a fixed number of goroutines as workers, then let the workers consume the tasks to be executed from a channel. A simple implementation follows:

func nWorkersPerRowFillImg(m *img, workers int) {
	c := make(chan int, m.h)
	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func(a) {
			for row := range c {
				for col := range m.m[row] {
					fillPixel(m, row, col)
				}
			}
			wg.Done()
		}()
	}

	for row := range m.m {
		c <- row
	}
	close(c)
	wg.Wait()
}

Looking at the execution time, it is faster than row mode. Note that you need to specify the number of workers by hand; a sensible value is the number of logical CPUs the hardware offers, which you can check with fmt.Println(runtime.NumCPU()) in Go.

$ time go run mandelbrot.go -mode workers -workers 12
        0.74 real         1.86 user         0.26 sys
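
A hedged sketch of making that the default instead of hard-coding 12 (my assumption, not in the original program):

// requires importing "runtime"
workers = flag.Int("workers", runtime.NumCPU(), "number of workers to use")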

Checking the trace, we can see that a number of goroutines equal to the worker count stays running throughout the computation phase.

The Worker Pool is very useful but not a silver bullet; different solutions suit different scenarios, and I won't go through them here. Coming back to this program: all of these schemes ultimately speed up the computation part, while speeding up the image-encoding part would be a problem several grades harder.

Amdahl’s law

And the speedup that parallelism can bring is not unlimited. As shown above, Amdahl's law describes the relationship between the number of parallel processors and the achievable speedup. The key constraint on speeding up a program is the part that must execute sequentially (mandelbrot, for example, spends about 50% of its time encoding the image even in the fastest mode). When 95% of a program's execution can be parallelized, even with thousands of CPUs the speedup is capped at a factor of 20.
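
For reference, Amdahl's law in its usual form, where P is the parallelizable fraction of the program and N the number of processors:

    S(N) = 1 / ((1 - P) + P / N)

With P = 0.95, even as N grows without bound the speedup approaches 1 / (1 - 0.95) = 20, which is exactly the factor-of-20 ceiling mentioned above.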

conclusion

This was a great talk that introduced three ways to analyze Go performance and covered a lot of ground. If you want more, read the author's series 👉 dave.cheney.net/high-perfor…