The final piece for 2019.

Carlo Alberto Ferraris submitted a performance optimization for lockedSource in the math/rand package (CL 191538). The core of the change is only one line, yet it brings a substantial performance improvement. Studying it is a good way to pick up some low-level Go optimization techniques.

Carlo improved the performance of lockedSource (which wraps rngSource) through three ideas: avoiding interface calls, allowing inlining, and keeping the source in the same cache line:

The src field in the lockedSource struct is changed from the interface type Source64 to the concrete pointer *rngSource, so that the *rngSource methods Int63 and Uint64 can be inlined into the caller’s code.
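Simplified from the math/rand source, the heart of the CL is this one-line type change (annotations mine):

```diff
 type lockedSource struct {
 	lk  sync.Mutex
-	src Source64   // interface field: dynamic dispatch, calls cannot be inlined
+	src *rngSource // concrete pointer: direct calls the compiler can inline
 }
```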

In the CL's own testing, the third optimization, keeping the source in the same cache line, did not pay off: the pointer version actually performed slightly better. My own tests likewise showed no obvious effect from it, so the benchmarks below leave that optimization out.
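For reference, the cache-line idea would embed the rngSource by value so that the lock and the generator state sit adjacent in memory. A hypothetical sketch of that rejected variant:

```go
// Hypothetical "same cache line" variant (not what was merged):
// embedding rngSource by value keeps the mutex and the PRNG state
// adjacent in memory, but it benchmarked slightly slower than the
// *rngSource pointer version.
type lockedSource struct {
	lk  sync.Mutex
	src rngSource // embedded value instead of *rngSource
}
```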

Let’s use an example 🌰 to compare performance before and after this kind of change, focusing on two effects: the improvement from removing the interface alone, and the further improvement once the de-interfaced calls can be inlined.

Start by defining a DryFruit interface with a few generic methods such as Name, Price, and Increase; the methods exist purely for demonstration, so don’t read too much into their meanings:

```go
type DryFruit interface {
	Name() string
	Price() uint64
	Family() string
	Distribution() string
	Increase()
}
```

Next, let’s define a chestnut 🌰 object that implements the DryFruit interface:

```go
type Chestnut struct {
	name  string
	count uint64
}

// Name returns the name of the dried fruit.
func (c Chestnut) Name() string { return c.name }

// Price returns the price.
func (c Chestnut) Price() uint64 { return 10 }

// Family returns the family name.
func (c Chestnut) Family() string { return "Fagaceae" }

// Distribution returns where it grows.
func (c Chestnut) Distribution() string { return "East Asia" }

// Increase increments the internal counter.
func (c *Chestnut) Increase() { c.count++ }
```

With the interface and implementation defined, we need to define an object that uses them: the Gift.

The non-optimized version is an OriginGift object that holds an exclusive lock together with a dried-fruit interface field:

```go
type OriginGift struct {
	mu       sync.Mutex
	dryFruit DryFruit
}

// Access calls each method of the dried fruit through the interface.
func (g *OriginGift) Access() {
	g.dryFruit.Name()
	g.dryFruit.Price()
	g.dryFruit.Family()
	g.dryFruit.Distribution()
	g.dryFruit.Increase()
}
```

Our optimized ImprovedGift struct replaces the interface field with a pointer to the concrete Chestnut struct:

```go
type ImprovedGift struct {
	mu       sync.Mutex
	dryFruit *Chestnut
}

// Access calls each method of the dried fruit directly on *Chestnut.
func (g *ImprovedGift) Access() {
	g.dryFruit.Name()
	g.dryFruit.Price()
	g.dryFruit.Family()
	g.dryFruit.Distribution()
	g.dryFruit.Increase()
}
```

The Benchmark test code is as follows:

```go
func BenchmarkOriginGift(b *testing.B) {
	var nut = &OriginGift{
		dryFruit: &Chestnut{name: "Chestnut"},
	}
	for i := 0; i < b.N; i++ {
		nut.Access()
	}
}

func BenchmarkImprovedGift(b *testing.B) {
	var nut = &ImprovedGift{
		dryFruit: &Chestnut{name: "Chestnut"},
	}
	for i := 0; i < b.N; i++ {
		nut.Access()
	}
}

func BenchmarkOriginGiftParallel(b *testing.B) {
	var nut = &OriginGift{
		dryFruit: &Chestnut{name: "Chestnut"},
	}
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			nut.mu.Lock()
			nut.Access()
			nut.mu.Unlock()
		}
	})
}

func BenchmarkImprovedGiftParallel(b *testing.B) {
	var nut = &ImprovedGift{
		dryFruit: &Chestnut{name: "Chestnut"},
	}
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			nut.mu.Lock()
			nut.Access()
			nut.mu.Unlock()
		}
	})
}
```

We benchmark serial access first, and then measure performance under concurrent access.
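Assuming the standard toolchain flags, the two runs below can be reproduced like this (-gcflags "-l" disables inlining):

```sh
# First run: inlining disabled, isolating the effect of removing the interface.
go test -gcflags="-l" -bench=.

# Second run: inlining enabled (the compiler default).
go test -bench=.
```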

The first run, with inlining disabled, gives the following results on my machine:

```
goos: darwin
goarch: amd64
pkg: github.com/smallnest/study/perf_interface
BenchmarkOriginGift-4             34669898    31.0 ns/op
BenchmarkImprovedGift-4           58661895    17.9 ns/op
BenchmarkOriginGiftParallel-4      7292043     171 ns/op
BenchmarkImprovedGiftParallel-4    8718816     143 ns/op
```

As you can see, replacing the interface with a concrete struct is significant even without inlining: serial access drops from 31.0 to 17.9 ns/op, nearly half, and concurrent access improves noticeably as well (171 → 143 ns/op).

For the second run, enable inlining (the compiler default) and compare with the non-inlined results above:

```
goarch: amd64
pkg: github.com/smallnest/study/perf_interface
BenchmarkOriginGift-4              95278143    12.6 ns/op
BenchmarkImprovedGift-4           549471100    2.16 ns/op
BenchmarkOriginGiftParallel-4      11631438     115 ns/op
BenchmarkImprovedGiftParallel-4    13815229    86.3 ns/op
```

With inlining enabled, every benchmark gets faster, but the gain is most dramatic once the interface is gone: with a concrete type the compiler can inline the method calls, and serial access drops all the way to 2.16 ns/op.
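To see where the 2.16 ns/op comes from, here is a hand-inlined sketch of roughly what the compiler can reduce Access to once dryFruit is a known *Chestnut (accessInlined is my illustrative name, not actual compiler output):

```go
// Roughly what ImprovedGift.Access collapses to after inlining: the
// trivial getters become field reads or constants, and only the
// counter increment does visible work.
func (g *ImprovedGift) accessInlined() {
	_ = g.dryFruit.name // Name() inlines to a field read
	_ = uint64(10)      // Price() inlines to a constant
	g.dryFruit.count++  // Increase() inlines to a single increment
}
```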

By comparing these two benchmark runs, you should be able to see the substantial benefits of the two optimizations: devirtualization (removing the interface) and inlining.

Finally, run `go test -gcflags "-m -m" -bench .` to print the compiler's inlining decisions and learn the nuts and bolts of inlining for yourself.
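As an aside, if you want to pin down a single function while experimenting, the //go:noinline directive prevents the compiler from inlining the function that follows it. A small sketch (priceNoInline is a hypothetical helper, used only for demonstration):

```go
// priceNoInline exists only to demonstrate the directive: //go:noinline
// tells the compiler not to inline the function that immediately
// follows, which is useful for isolating inlining effects in benchmarks.
//
//go:noinline
func priceNoInline(c *Chestnut) uint64 {
	return 10
}
```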