Preface
The Go language can support real-time, highly concurrent messaging systems, keeping latency under 100ms even at the scale of millions of messages, thanks in large part to Go's efficient garbage collector.
For real-time systems, garbage collection can be a major concern, because traditionally the entire application is suspended while garbage is collected. So when designing our message bus system, we needed to choose our language carefully. Go keeps advertising its low latency, but does it really deliver? And if so, how does it do it?
In this article, we will see how Go's GC is implemented (the tricolor algorithm), why it achieves such low GC pauses, and most importantly, whether it really works (benchmark tests of these GC pauses, compared against other garbage-collected languages).
1. From Haskell to Go
We use a pub/sub message bus system as an example, which stores messages in memory at the time they are published. In the early days, we implemented the first version of the messaging system in Haskell, but after discovering that GHC's garbage collector had some fundamental latency issues, we abandoned that implementation and rewrote it in Go.
Here are some details about the Haskell messaging system. The key point about GHC is that its GC pause time is proportional to the size of the current working set (that is, GC time scales with the number of objects stored in memory). In our case, the number of objects stored in memory tends to be very large, resulting in GC pauses of hundreds of milliseconds. This causes the system to block during GC.
In Go, unlike GHC's stop-the-world approach, the garbage collector runs concurrently with the main program. This avoids long pauses. We were chiefly concerned with the low latency Go promises, and whether the latency improvements promised in each new release really live up to their hype.
2. How does parallel garbage collection work?
How does Go's GC achieve concurrency? The key is the tricolor mark-and-sweep algorithm. This algorithm makes the system's GC pause time predictable: the scheduler can run GC work in very short slices with minimal impact on the program itself. Here is how the tricolor mark-and-sweep algorithm works.
Suppose we have code for a list operation like this:
type LinkedListNode struct {
	next *LinkedListNode
}

var A LinkedListNode
var B LinkedListNode
// ...
B.next = &LinkedListNode{next: nil}
// ...
A.next = &LinkedListNode{next: nil}
B.next.next = &LinkedListNode{next: nil}
B.next = B.next.next
B.next = nil
2.1. The first step
var A LinkedListNode
var B LinkedListNode
// ...
B.next = &LinkedListNode{next: nil}
At first, assume we have three nodes: A, B, and C. A and B are root nodes and are therefore always reachable. Then we perform the assignment above. The garbage collector maintains three sets: black, grey, and white. At this point all three nodes are in the white set, because the garbage collector is not yet running.
2.2. The second step
Let's create a new node D and assign it to A.next. That is:
var D LinkedListNode;
A.next = &D;
Note that, as a newly created memory object, D must be placed in the grey set. Why grey? The rule is: whenever a pointer field changes, the object now being pointed to must change color. Since every newly created memory object immediately has its address assigned to a reference, it turns grey right away. (Which begs the question: why isn't C grey? Because B.next was assigned before the collector started running, so no shading took place then.)
2.3. The third step
At the beginning of GC, the root nodes are moved into the grey set. At this point, nodes A, B, and D are all grey. Since every goroutine (Go's lightweight threads, not exactly OS threads) runs either program logic or GC work, and the program and the GC run concurrently, the two alternate in their use of CPU resources.
2.4. Step 4 Scan memory objects
When scanning a memory object, the collector marks it black and then marks its children grey. At any stage, we can compute the number of moves the collector still has to make: 2*|white| + |grey|. Each scan performs at least one move, and the collector keeps scanning until the grey set holds zero memory objects.
2.5. Step 5
Now the program logic creates a new node E and assigns it to C.next:
var E LinkedListNode;
C.next = &E;
According to our previous rule, new memory objects must be placed in the grey set, as shown in the following figure:
Doing so creates more work for the collector, but it defers the final cleanup when many memory objects are being created. Note that the white set shrinks in this way until the collector actually sweeps the heap, after which it is refilled with new memory objects.
2.6. Step 6 Pointer reassignment
The program logic now assigns B.next.next (that is, E) to B.next. The code is as follows:
B.next.next = &LinkedListNode{next: nil}
// Reassign the pointer:
B.next = B.next.next
After doing so, C will be unreachable, as shown in the figure.
This means that C stays in the white set, and the collector will reclaim its memory during the sweep phase of this GC cycle.
2.7. Step 7
Move memory objects in the grey set whose children have all been scanned into the black set. At this point D has nothing left to scan, and A, the object that references it, is already in the black set, so D is moved into the black set as well.
2.8. Step 8
In the program logic, B.next is assigned nil, at which point E becomes unreachable. But E is in the grey set, so it will not be collected this cycle. Does this cause a memory leak? No: E will be reclaimed in the next GC cycle. The tricolor algorithm guarantees that if a memory object is unreachable at the start of a GC cycle, it will be reclaimed by the end of that cycle.
2.9. Step 9
On the second GC cycle, E is moved to the black set, but C is not, because C references E rather than the other way around: nothing reachable references C, so it stays in the white set.
2.10. Step 10
The collector then scans B, the last memory object in the grey set, and moves it to the black set.
2.11. Step 11 Recycle the white area
All memory objects still in the white set (here, C) are unreachable, so the collector reclaims their memory.
2.12. Step 12 Area discoloration
This step is the most interesting: there is no need to move every memory object back to the white set for the next GC cycle. Instead, the collector simply swaps the meanings of the black and white sets, which is simple and efficient.
3. GC three-color algorithm summary
Above are the details of the tricolor mark-and-sweep algorithm. Under the current implementation, two phases still require stop-the-world: the stack scan of root memory objects, and the mark termination phase. Excitingly, the mark termination pause is slated to be removed. In practice, we found that this algorithm can achieve GC pause times under 1ms even when reclaiming large heaps.
4. Delay vs. throughput
If a parallel GC collector can achieve extremely low latency when dealing with very large memory heaps, why would anyone use a stop-the-world GC collector? Isn’t Go’s GC collector good enough?
This is not absolute, because low latency has its costs. The main one is reduced throughput: concurrency requires extra synchronization and copying, which steals CPU time from the program logic. Haskell's GHC is optimized for throughput, while Go is optimized for latency. When choosing a language, we should weigh these trade-offs against our own needs. For a system with hard real-time requirements, such as a push system, Go is a sensible trade-off.
5. Actual performance
For now, Go seems able to meet the requirements of low-latency systems, but how does it perform in practice? We compare implementations of the same benchmark logic: the benchmark continuously pushes messages into a buffer of fixed size, so old messages constantly expire and become garbage that must be reclaimed. This keeps the heap large, which matters because the entire heap must be scanned during collection to determine which memory is still referenced. This is why GC running time is directly proportional to the number of live memory objects and pointers.
This is the benchmark code for the Go language version, where the buffer is implemented as an array:
package main

import (
	"fmt"
	"time"
)

const (
	windowSize = 200000
	msgCount   = 1000000
)

type (
	message []byte
	buffer  [windowSize]message
)

var worst time.Duration

func mkMessage(n int) message {
	m := make(message, 1024)
	for i := range m {
		m[i] = byte(n)
	}
	return m
}

func pushMsg(b *buffer, highID int) {
	start := time.Now()
	m := mkMessage(highID)
	(*b)[highID%windowSize] = m
	elapsed := time.Since(start)
	if elapsed > worst {
		worst = elapsed
	}
}

func main() {
	var b buffer
	for i := 0; i < msgCount; i++ {
		pushMsg(&b, i)
	}
	fmt.Println("Worst push time: ", worst)
}
The test results of the same logic in different languages are as follows:
Java performed surprisingly poorly, while OCaml did very well, with GC pause times of about 3ms. This is because OCaml uses an incremental GC algorithm. (OCaml is nonetheless rarely used for real-time systems, because the language does not support multiple cores well.)
As shown in the table, Go's GC pause time is around 7ms, which is good enough to meet our requirements.
Conclusion
The point of this investigation is that a GC is optimized either for low latency or for high throughput. Which works better depends on how our program uses heap space (do we hold many objects in memory? Do individual objects live long or briefly?).
Understanding the underlying GC algorithm is important to see if the system works for your test cases. Of course, the actual implementation of the GC system is also crucial. Your benchmark should have a memory footprint similar to the real program you are implementing so that you can test in practice how efficient a GC system is for your program. As mentioned earlier, Go’s GC system is not perfect, but it is acceptable for our system.
Despite some issues, Go’s GC performance is better than most other languages that also have GC systems, and the Go development team has optimized for GC latency and continues to do so. Go’s GC does have its merits, both in theory and in practice.
References
- Golang’s Real-time GC in Theory and Practice