Background
Recently, I wanted to implement clock_gettime's CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE in Go, so I studied the implementation of time.Now in depth. Along the way I happened to optimize time.Now (although it was Ian's version that was finally merged).
The whole process is recorded here for reference.
How time.Now is implemented
First, let's look at how time.Now is implemented, starting with the code (the following code is based on Go <= 1.16):
// Provided by package runtime.
func now() (sec int64, nsec int32, mono int64)

// Now returns the current local time.
func Now() Time {
	sec, nsec, mono := now()
	mono -= startNano
	sec += unixToInternal - minWall
	if uint64(sec)>>33 != 0 {
		return Time{uint64(nsec), sec + minWall, Local}
	}
	return Time{hasMonotonic | uint64(sec)<<nsecShift | uint64(nsec), mono, Local}
}
As you can see, time.Now calls now to get the raw time values and then post-processes them. That post-processing is well documented elsewhere and is not the focus of this article. Let's go to the runtime package to see how now is implemented:
//go:linkname time_now time.now
func time_now() (sec int64, nsec int32, mono int64) {
	sec, nsec = walltime()
	return sec, nsec, nanotime()
}
A quick keyword search turns up the code above in runtime's timestub.go file. It calls two functions, walltime and nanotime, which in turn call walltime1 and nanotime1; both of those are implemented in assembly. Because the two implementations are nearly identical, we'll look closely at walltime1 as the example:
// func walltime1() (sec int64, nsec int32)
// non-zero frame-size means BP is saved and restored
TEXT runtime·walltime1(SB),NOSPLIT,$16-12
	// We don't know how much stack space the VDSO code will need,
	// so switch to g0.
	// In particular, a kernel configured with CONFIG_OPTIMIZE_INLINING=n
	// and hardening can use a full page of stack space in gettime_sym
	// due to stack probes inserted to avoid stack/heap collisions.
	// See issue #20427.

	MOVQ	SP, R12	// Save old SP; R12 unchanged by C code.

	get_tls(CX)
	MOVQ	g(CX), AX
	MOVQ	g_m(AX), BX // BX unchanged by C code.

	// Set vdsoPC and vdsoSP for SIGPROF traceback.
	// Save the old values on stack and restore them on exit,
	// so this function is reentrant.
	MOVQ	m_vdsoPC(BX), CX
	MOVQ	m_vdsoSP(BX), DX
	MOVQ	CX, 0(SP)
	MOVQ	DX, 8(SP)

	LEAQ	sec+0(FP), DX
	MOVQ	-8(DX), CX
	MOVQ	CX, m_vdsoPC(BX)
	MOVQ	DX, m_vdsoSP(BX)

	CMPQ	AX, m_curg(BX)	// Only switch if on curg.
	JNE	noswitch

	MOVQ	m_g0(BX), DX
	MOVQ	(g_sched+gobuf_sp)(DX), SP	// Set SP to g0 stack

noswitch:
	SUBQ	$16, SP		// Space for results
	ANDQ	$~15, SP	// Align for C code

	MOVL	$0, DI // CLOCK_REALTIME
	LEAQ	0(SP), SI
	MOVQ	runtime·vdsoClockgettimeSym(SB), AX
	CMPQ	AX, $0
	JEQ	fallback
	CALL	AX
ret:
	MOVQ	0(SP), AX	// sec
	MOVQ	8(SP), DX	// nsec
	MOVQ	R12, SP		// Restore real SP
	// Restore vdsoPC, vdsoSP
	// We don't worry about being signaled between the two stores.
	// If we are not in a signal handler, we'll restore vdsoSP to 0,
	// and no one will care about vdsoPC. If we are in a signal handler,
	// we cannot receive another signal.
	MOVQ	8(SP), CX
	MOVQ	CX, m_vdsoSP(BX)
	MOVQ	0(SP), CX
	MOVQ	CX, m_vdsoPC(BX)
	MOVQ	AX, sec+0(FP)
	MOVL	DX, nsec+8(FP)
	RET
fallback:
	MOVQ	$SYS_clock_gettime, AX
	SYSCALL
	JMP	ret
The comments in this code are quite clear. It shows that the current time is actually obtained through a vDSO call. However, Go maintains its own goroutine stacks, and as the comments note, the vDSO code may need more stack space than a goroutine stack can safely offer on some kernels, so the code first switches to g0 (the system thread's stack). That is why there is so much extra work at the beginning and the end: saving state before the call and restoring it afterwards.
For readers unfamiliar with vDSO, here is a brief introduction. Originally, time information was obtained through a system call. But system calls are relatively expensive, and reading a timestamp is a high-frequency operation, so several optimization schemes were tried over time, and the vDSO is the one in use today. vDSO stands for virtual dynamic shared object: a small shared library that the kernel maps into the address space of every user process, so that certain calls, such as clock_gettime, can run in user space and avoid the overhead of a system call. For details, see: man7.org/linux/man-p…
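To make the "mapped into every process" part concrete, here is a minimal sketch (Linux only) that looks for the [vdso] entry in /proc/self/maps. The helper name findVDSO is made up for this illustration, and the sketch assumes the usual six-column layout of the maps file.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// findVDSO (hypothetical helper) scans text in /proc/<pid>/maps format and
// returns the address range of the [vdso] mapping, if there is one.
func findVDSO(maps string) (addrRange string, ok bool) {
	sc := bufio.NewScanner(strings.NewReader(maps))
	for sc.Scan() {
		// Columns: address perms offset dev inode pathname
		fields := strings.Fields(sc.Text())
		if len(fields) >= 6 && fields[len(fields)-1] == "[vdso]" {
			return fields[0], true
		}
	}
	return "", false
}

func main() {
	data, err := os.ReadFile("/proc/self/maps")
	if err != nil {
		fmt.Println("cannot read /proc/self/maps (probably not Linux):", err)
		return
	}
	if r, ok := findVDSO(string(data)); ok {
		fmt.Println("vDSO mapped at", r)
	} else {
		fmt.Println("no [vdso] mapping found")
	}
}
```

The address range usually differs from run to run, because the kernel randomizes where the vDSO is placed.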
After walltime1, let's look at nanotime1. Since the code that switches to g0 at the beginning is the same, only the remainder is excerpted here:
noswitch:
	SUBQ	$16, SP		// Space for results
	ANDQ	$~15, SP	// Align for C code

	MOVL	$1, DI // CLOCK_MONOTONIC
	LEAQ	0(SP), SI
	MOVQ	runtime·vdsoClockgettimeSym(SB), AX
	CMPQ	AX, $0
	JEQ	fallback
	CALL	AX
ret:
	MOVQ	0(SP), AX	// sec
	MOVQ	8(SP), DX	// nsec
	MOVQ	R12, SP		// Restore real SP
	// Restore vdsoPC, vdsoSP
	// We don't worry about being signaled between the two stores.
	// If we are not in a signal handler, we'll restore vdsoSP to 0,
	// and no one will care about vdsoPC. If we are in a signal handler,
	// we cannot receive another signal.
	MOVQ	8(SP), CX
	MOVQ	CX, m_vdsoSP(BX)
	MOVQ	0(SP), CX
	MOVQ	CX, m_vdsoPC(BX)
	// sec is in AX, nsec in DX
	// return nsec in AX
	IMULQ	$1000000000, AX
	ADDQ	DX, AX
	MOVQ	AX, ret+0(FP)
	RET
As you can see, there are only two changes: the clock ID passed to the call is CLOCK_MONOTONIC, and the logic before RET converts the returned result into a single nanosecond value.
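The IMULQ/ADDQ pair before RET is just sec*1e9 + nsec. A minimal Go sketch of the same arithmetic follows; toNanos is a name made up for this illustration and is not a runtime function.

```go
package main

import "fmt"

// toNanos (hypothetical helper) mirrors the tail of nanotime1:
//   IMULQ $1000000000, AX   // AX = sec * 1e9
//   ADDQ  DX, AX            // AX += nsec
func toNanos(sec, nsec int64) int64 {
	return sec*1_000_000_000 + nsec
}

func main() {
	fmt.Println(toNanos(1, 500)) // prints 1000000500
}
```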
Optimizing time.Now
As shown above, time.Now makes one call to walltime and one call to nanotime, and each of them carries almost the same code for switching to the g0 stack and restoring state afterwards. If we combine the two calls into one, we save the overhead of one stack switch together with its setup and teardown.
Ian from the Go team and I proposed changes to optimize this logic at almost the same time. Ian's version performed better (-20% versus -17% for mine), so Ian's version was the one that landed: go-review.googlesource.com/c/go/+/3142…
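To see what time.Now costs on your own machine and Go version, a rough measurement like the one below is enough. This is only a sketch: benchNow and sink are names made up here, and the numbers it prints are not the ones quoted above. Comparing a Go release from before the change with one after it should show the difference.

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

// sink keeps the compiler from discarding the result of time.Now.
var sink time.Time

// benchNow (hypothetical helper) measures the average cost of one time.Now call.
// testing.Benchmark can be used from an ordinary program, without "go test".
func benchNow() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = time.Now()
		}
	})
}

func main() {
	r := benchNow()
	fmt.Printf("time.Now: %d iterations, %d ns/op\n", r.N, r.NsPerOp())
}
```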
Calling vDSO outside of runtime?
Going back to the beginning: I wanted to implement clock_gettime's CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE myself, which meant performing the series of operations above outside the runtime package. To do that, however, you would need to copy all of the relevant runtime struct definitions (so that go_asm.h can be included to provide the field offsets used by the assembly code). That is not very maintainable: if a Go release changes the layout of those structs, the behavior is undefined and dangerous. The alternative is to maintain a separate copy for each Go version.
We discussed this problem with the Go team as well, but in the end no good solution emerged: Go does not support safely calling the vDSO from outside runtime.
Still, the discussion did lead to the optimization of time.Now, so the trip was worthwhile.