Product | technology drops

The author | chun-hui cao



Preface: System calls are the only way a language runtime interacts with the operating system. This article looks at syscall in Go: how Go programs reach the kernel, and some of the small optimizations the runtime applies around system calls underneath. It should give readers a deeper understanding of the Go language.

— — –

▎ Reading Index

  • concept

  • entrance

  • system call management

  • SYSCALL in runtime

  • Interaction with scheduling

    • entersyscall

    • exitsyscallfast

    • exitsyscall

    • entersyscallblock

    • entersyscallblock_handoff

    • entersyscall_sysmon

    • entersyscall_gcwait

  • conclusion

▎ concept

▎ entrance

Syscall has the following entry points, declared in syscall/asm_linux_amd64.s:

func Syscall(trap, a1, a2, a3 uintptr) (r1, r2 uintptr, err syscall.Errno)

func Syscall6(trap, a1, a2, a3, a4, a5, a6 uintptr) (r1, r2 uintptr, err syscall.Errno)

func RawSyscall(trap, a1, a2, a3 uintptr) (r1, r2 uintptr, err syscall.Errno)

func RawSyscall6(trap, a1, a2, a3, a4, a5, a6 uintptr) (r1, r2 uintptr, err syscall.Errno)

These functions are implemented in assembly. Following the Linux syscall calling convention, the assembly only needs to load the arguments into the right registers and execute the SYSCALL instruction to enter the kernel. When the system call finishes, its return value is in RAX.
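As a rough illustration of this convention seen from Go code (Linux only), the sketch below calls write(2) directly through syscall.Syscall. The helper name rawWrite is made up for this example. The trap number goes first, followed by three arguments; the kernel's result comes back in r1 and the error number in the third return value.

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// rawWrite is a hypothetical helper: it invokes write(2) through
// syscall.Syscall instead of going through os.File.
func rawWrite(fd int, p []byte) (int, syscall.Errno) {
	if len(p) == 0 {
		return 0, 0
	}
	// trap = SYS_WRITE, a1 = fd, a2 = buffer address, a3 = length.
	r1, _, errno := syscall.Syscall(
		syscall.SYS_WRITE,
		uintptr(fd),
		uintptr(unsafe.Pointer(&p[0])),
		uintptr(len(p)),
	)
	return int(r1), errno
}

func main() {
	n, errno := rawWrite(1, []byte("hello from SYS_WRITE\n"))
	fmt.Println(n, errno == 0)

	// An invalid descriptor makes the kernel fail with EBADF; the stub
	// reports that as r1 = -1 plus a non-zero errno.
	n, errno = rawWrite(-1, []byte("x"))
	fmt.Println(n, errno == syscall.EBADF)
}
```

Normal programs would use os.File or syscall.Write instead; the direct call is only meant to make the trap, argument, and return-value slots visible.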



The only difference between Syscall and Syscall6 is the number of arguments they pass:

// func Syscall(trap int64, a1, a2, a3 uintptr) (r1, r2, err uintptr);
TEXT ·Syscall(SB),NOSPLIT,$0-56
    CALL    runtime·entersyscall(SB)
    MOVQ    a1+8(FP), DI
    MOVQ    a2+16(FP), SI
    MOVQ    a3+24(FP), DX
    MOVQ    $0, R10
    MOVQ    $0, R8
    MOVQ    $0, R9
    MOVQ    trap+0(FP), AX    // syscall entry
    SYSCALL
    // 0xfffffffffffff001 is Linux's MAX_ERRNO negated and read as unsigned,
    // see http://lxr.free-electrons.com/source/include/linux/err.h#L17
    CMPQ    AX, $0xfffffffffffff001
    JLS    ok
    MOVQ    $-1, r1+32(FP)
    MOVQ    $0, r2+40(FP)
    NEGQ    AX
    MOVQ    AX, err+48(FP)
    CALL    runtime·exitsyscall(SB)
    RET
ok:
    MOVQ    AX, r1+32(FP)
    MOVQ    DX, r2+40(FP)
    MOVQ    $0, err+48(FP)
    CALL    runtime·exitsyscall(SB)
    RET

// func Syscall6(trap, a1, a2, a3, a4, a5, a6 uintptr) (r1, r2, err uintptr)
TEXT ·Syscall6(SB),NOSPLIT,$0-80
    CALL    runtime·entersyscall(SB)
    MOVQ    a1+8(FP), DI
    MOVQ    a2+16(FP), SI
    MOVQ    a3+24(FP), DX
    MOVQ    a4+32(FP), R10
    MOVQ    a5+40(FP), R8
    MOVQ    a6+48(FP), R9
    MOVQ    trap+0(FP), AX    // syscall entry
    SYSCALL
    CMPQ    AX, $0xfffffffffffff001
    JLS    ok6
    MOVQ    $-1, r1+56(FP)
    MOVQ    $0, r2+64(FP)
    NEGQ    AX
    MOVQ    AX, err+72(FP)
    CALL    runtime·exitsyscall(SB)
    RET
ok6:
    MOVQ    AX, r1+56(FP)
    MOVQ    DX, r2+64(FP)
    MOVQ    $0, err+72(FP)
    CALL    runtime·exitsyscall(SB)
    RET

The two functions barely differ, so why not use just one? My personal guess: Go passes function arguments on the stack, so the three-argument version probably exists to save a little stack space. A normal Syscall notifies the runtime before the system call by calling runtime·entersyscall, and calls runtime·exitsyscall when it returns.
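The error check that follows the SYSCALL instruction in the assembly (CMPQ against 0xfffffffffffff001, then NEGQ AX) can be sketched in plain Go. The function name decodeReturn is hypothetical and this is only a model of the comparison, not code from the syscall package. JLS is an unsigned "lower or same" jump, so the sketch uses <=.

```go
package main

import "fmt"

// threshold is -4095 (Linux's MAX_ERRNO negated) viewed as an unsigned
// 64-bit value, the same constant the assembly compares against.
const threshold uint64 = 0xfffffffffffff001

// decodeReturn models what the stub does with the raw value left in AX:
// small or ordinary values are results, values above the threshold are
// negated error numbers.
func decodeReturn(ax uint64) (r1 uint64, errno uint64) {
	if ax <= threshold { // JLS ok
		return ax, 0
	}
	// MOVQ $-1, r1 ; NEGQ AX ; MOVQ AX, err
	return ^uint64(0), -ax
}

func main() {
	fmt.Println(decodeReturn(42))                 // an ordinary result
	fmt.Println(decodeReturn(0xfffffffffffffffe)) // -2, i.e. errno 2 (ENOENT)
}
```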

// func RawSyscall(trap, a1, a2, a3 uintptr) (r1, r2, err uintptr)
TEXT ·RawSyscall(SB),NOSPLIT,$0-56
    MOVQ    a1+8(FP), DI
    MOVQ    a2+16(FP), SI
    MOVQ    a3+24(FP), DX
    MOVQ    $0, R10
    MOVQ    $0, R8
    MOVQ    $0, R9
    MOVQ    trap+0(FP), AX    // syscall entry
    SYSCALL
    CMPQ    AX, $0xfffffffffffff001
    JLS    ok1
    MOVQ    $-1, r1+32(FP)
    MOVQ    $0, r2+40(FP)
    NEGQ    AX
    MOVQ    AX, err+48(FP)
    RET
ok1:
    MOVQ    AX, r1+32(FP)
    MOVQ    DX, r2+40(FP)
    MOVQ    $0, err+48(FP)
    RET

// func RawSyscall6(trap, a1, a2, a3, a4, a5, a6 uintptr) (r1, r2, err uintptr)
TEXT ·RawSyscall6(SB),NOSPLIT,$0-80
    MOVQ    a1+8(FP), DI
    MOVQ    a2+16(FP), SI
    MOVQ    a3+24(FP), DX
    MOVQ    a4+32(FP), R10
    MOVQ    a5+40(FP), R8
    MOVQ    a6+48(FP), R9
    MOVQ    trap+0(FP), AX    // syscall entry
    SYSCALL
    CMPQ    AX, $0xfffffffffffff001
    JLS    ok2
    MOVQ    $-1, r1+56(FP)
    MOVQ    $0, r2+64(FP)
    NEGQ    AX
    MOVQ    AX, err+72(FP)
    RET
ok2:
    MOVQ    AX, r1+56(FP)
    MOVQ    DX, r2+64(FP)
    MOVQ    $0, err+72(FP)
    RET

The difference between RawSyscall and Syscall is subtle: RawSyscall does not notify the runtime on entry or exit, so the runtime has no chance to reschedule the P bound to the M that is running this g. If user code makes a blocking system call through RawSyscall, it can therefore block other g's.

Yes, if you call RawSyscall you may block other goroutines from running. The system monitor may start them up after a while, but I think there are cases where it won’t. I would say that Go programs should always call Syscall. RawSyscall exists to make it slightly more efficient to call system calls that never block, such as getpid. But it’s really an internal mechanism.
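To make the getpid case concrete, here is a small Linux-only sketch that issues the same system call through both entry points. The helper names getpidRaw and getpidNotified are invented for this example. Both return the same value; the only difference is whether the runtime is told about the call.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// getpidRaw goes through RawSyscall: the runtime is not notified.
// That is acceptable here only because getpid never blocks.
func getpidRaw() int {
	pid, _, _ := syscall.RawSyscall(syscall.SYS_GETPID, 0, 0, 0)
	return int(pid)
}

// getpidNotified goes through Syscall, so the runtime is told when the
// call starts and when it returns.
func getpidNotified() int {
	pid, _, _ := syscall.Syscall(syscall.SYS_GETPID, 0, 0, 0)
	return int(pid)
}

func main() {
	fmt.Println(getpidRaw() == os.Getpid(), getpidNotified() == os.Getpid())
}
```

In application code os.Getpid is the natural choice; the direct calls are shown only to contrast the two entry points.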

// func gettimeofday(tv *Timeval) (err uintptr)
TEXT ·gettimeofday(SB),NOSPLIT,$0-16
    MOVQ    tv+0(FP), DI
    MOVQ    $0, SI
    MOVQ    runtime·__vdso_gettimeofday_sym(SB), AX
    CALL    AX

    CMPQ    AX, $0xfffffffffffff001
    JLS    ok7
    NEGQ    AX
    MOVQ    AX, err+8(FP)
    RET
ok7:
    MOVQ    $0, err+8(FP)
    RET

▎ system call management

First, the file that defines the system calls:

/syscall/syscall_linux.go

System calls can be divided into three categories:

  • Blocking system calls

  • Non-blocking system calls

  • Wrapped system calls

Blocking system calls are defined as follows:

//sys   Madvise(b []byte, advice int) (err error)

Non-blocking system calls:

//sysnb    EpollCreate(size int) (fd int, err error)

Based on these comments, the mksyscall.pl script then generates the platform-specific implementations. mksyscall.pl is a Perl script; interested readers can read it for themselves.

Take a look at the generated code for a blocking and a non-blocking system call:

func Madvise(b []byte, advice int) (err error) {
    var _p0 unsafe.Pointer
    if len(b) > 0 {
        _p0 = unsafe.Pointer(&b[0])
    } else {
        _p0 = unsafe.Pointer(&_zero)
    }
    _, _, e1 := Syscall(SYS_MADVISE, uintptr(_p0), uintptr(len(b)), uintptr(advice))
    if e1 != 0 {
        err = errnoErr(e1)
    }
    return
}

func EpollCreate(size int) (fd int, err error) {
    r0, _, e1 := RawSyscall(SYS_EPOLL_CREATE, uintptr(size), 0, 0)
    fd = int(r0)
    if e1 != 0 {
        err = errnoErr(e1)
    }
    return
}

Clearly, system calls marked sys use Syscall or Syscall6, while those marked sysnb use RawSyscall or RawSyscall6.
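The mapping from marker to entry point can be sketched in a few lines of Go. This is not how mksyscall.pl is written; entryFor is a hypothetical helper, and its argument counting is a simplification that treats a slice as two machine words (pointer plus length) and everything else as one.

```go
package main

import (
	"fmt"
	"strings"
)

// entryFor reports which assembly entry point a "//sys" or "//sysnb"
// directive would end up calling under the simplified rules above.
func entryFor(directive string) (string, bool) {
	var entry, rest string
	switch {
	case strings.HasPrefix(directive, "//sysnb"):
		entry, rest = "RawSyscall", strings.TrimPrefix(directive, "//sysnb")
	case strings.HasPrefix(directive, "//sys"):
		entry, rest = "Syscall", strings.TrimPrefix(directive, "//sys")
	default:
		return "", false
	}
	open := strings.Index(rest, "(")
	end := strings.Index(rest, ")")
	if open < 0 || end < open {
		return "", false
	}
	words := 0
	for _, param := range strings.Split(rest[open+1:end], ",") {
		param = strings.TrimSpace(param)
		if param == "" {
			continue
		}
		words++
		if strings.Contains(param, "[]") {
			words++ // a slice is passed as pointer + length
		}
	}
	if words > 3 {
		entry += "6"
	}
	return entry, true
}

func main() {
	fmt.Println(entryFor("//sys   Madvise(b []byte, advice int) (err error)"))
	fmt.Println(entryFor("//sysnb    EpollCreate(size int) (fd int, err error)"))
}
```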

What about wrapped system calls?

func Rename(oldpath string, newpath string) (err error) {
    return Renameat(_AT_FDCWD, oldpath, _AT_FDCWD, newpath)
}

Perhaps the system call's own name is awkward, or it takes too many arguments, so it is simply given a thin wrapper. Nothing special.

▎ SYSCALL in runtime

Besides the blocking, non-blocking, and wrapped syscalls described above, the runtime defines some low-level syscalls of its own that are not exposed to users.

When the syscall library provided to users is used, the goroutine and its P enter the _Gsyscall and _Psyscall states respectively. The syscalls wrapped by the runtime itself, however, never call entersyscall and exitsyscall, whether or not they block. They are “low-level” syscalls, after all.

In essence they are the same as the syscalls exposed to users. The code lives in runtime/sys_linux_amd64.s; here is a concrete example:

TEXT runtime·write(SB),NOSPLIT,$0-28
    MOVQ    fd+0(FP), DI
    MOVQ    p+8(FP), SI
    MOVL    n+16(FP), DX
    MOVL    $SYS_write, AX
    SYSCALL
    CMPQ    AX, $0xfffffffffffff001
    JLS    2(PC)
    MOVL    $-1, AX
    MOVL    AX, ret+24(FP)
    RET

TEXT runtime·read(SB),NOSPLIT,$0-28
    MOVL    fd+0(FP), DI
    MOVQ    p+8(FP), SI
    MOVL    n+16(FP), DX
    MOVL    $SYS_read, AX
    SYSCALL
    CMPQ    AX, $0xfffffffffffff001
    JLS    2(PC)
    MOVL    $-1, AX
    MOVL    AX, ret+24(FP)
    RET

Here is the list of all the syscalls the runtime defines for itself:

#define SYS_read 0
#define SYS_write 1
#define SYS_open 2
#define SYS_close 3
#define SYS_mmap 9
#define SYS_munmap 11
#define SYS_brk 12
#define SYS_rt_sigaction 13
#define SYS_rt_sigprocmask 14
#define SYS_rt_sigreturn 15
#define SYS_access 21
#define SYS_sched_yield 24
#define SYS_mincore 27
#define SYS_madvise 28
#define SYS_setittimer 38
#define SYS_getpid 39
#define SYS_socket 41
#define SYS_connect 42
#define SYS_clone 56
#define SYS_exit 60
#define SYS_kill 62
#define SYS_fcntl 72
#define SYS_getrlimit 97
#define SYS_sigaltstack 131
#define SYS_arch_prctl 158
#define SYS_gettid 186
#define SYS_tkill 200
#define SYS_futex 202
#define SYS_sched_getaffinity 204
#define SYS_epoll_create 213
#define SYS_exit_group 231
#define SYS_epoll_wait 232
#define SYS_epoll_ctl 233
#define SYS_pselect6 270
#define SYS_epoll_create1 291

In theory, the scheduler does not take the P away while these syscalls execute, so the goroutine simply carries on once the call succeeds. A user goroutine, by contrast, may have its P taken away during a syscall and then has to queue for a P again.

▎ Interaction with scheduling

Since user code has to cooperate with the scheduler, it politely announces “I am about to make a syscall” with entersyscall, and “I am done” with exitsyscall.

The interaction meant here is between user code and the scheduler when the syscall library is used. The syscalls inside the runtime do not go through this process.
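The effect of this cooperation can be observed from user code. The Linux-only sketch below is an illustration under stated assumptions, not a test from the Go source tree: it limits the program to a single P, parks one goroutine in a blocking read on a pipe via the syscall package, and then checks that another goroutine still gets to run and write to the pipe. The function name blockedReadDoesNotStall is made up for this example.

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

// blockedReadDoesNotStall blocks one goroutine in read(2) while only one
// P exists, then writes from the calling goroutine. The write can only
// happen if the P is made available to run the caller again.
func blockedReadDoesNotStall() (string, error) {
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	var fds [2]int
	if err := syscall.Pipe(fds[:]); err != nil {
		return "", err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	type result struct {
		msg string
		err error
	}
	done := make(chan result, 1)
	go func() {
		buf := make([]byte, 16)
		n, err := syscall.Read(fds[0], buf) // blocks in the kernel until data arrives
		if err != nil {
			done <- result{err: err}
			return
		}
		done <- result{msg: string(buf[:n])}
	}()

	runtime.Gosched() // give the reader a chance to start and block
	if _, err := syscall.Write(fds[1], []byte("ping")); err != nil {
		return "", err
	}
	r := <-done
	return r.msg, r.err
}

func main() {
	msg, err := blockedReadDoesNotStall()
	fmt.Println(msg, err)
}
```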

▎ entersyscall

// Standard entry for the syscall library and cgo calls.
//go:nosplit
func entersyscall() {
    reentersyscall(getcallerpc(), getcallersp())
}

//go:nosplit
func reentersyscall(pc, sp uintptr) {
    _g_ := getg()

    // Preemption of g must be disabled.
    _g_.m.locks++

    _g_.stackguard0 = stackPreempt
    // Set throwsplit: if newstack finds throwsplit set to true,
    // it crashes immediately. The relevant code in newstack is:
    // if thisg.m.curg.throwsplit {
    //     throw("runtime: stack split at bad time")
    // }
    _g_.throwsplit = true

    // Leave SP around for GC and traceback.
    // Save the current context.
    save(pc, sp)
    _g_.syscallsp = sp
    _g_.syscallpc = pc
    casgstatus(_g_, _Grunning, _Gsyscall)
    if _g_.syscallsp < _g_.stack.lo || _g_.stack.hi < _g_.syscallsp {
        systemstack(func() {
            print("entersyscall inconsistent ", hex(_g_.syscallsp), "[", hex(_g_.stack.lo), ",", hex(_g_.stack.hi), "]\n")
            throw("entersyscall")
        })
    }

    if atomic.Load(&sched.sysmonwait) != 0 {
        systemstack(entersyscall_sysmon)
        save(pc, sp)
    }

    if _g_.m.p.ptr().runSafePointFn != 0 {
        // runSafePointFn may stack split if run on this stack
        systemstack(runSafePointFn)
        save(pc, sp)
    }

    _g_.m.syscalltick = _g_.m.p.ptr().syscalltick
    _g_.sysblocktraced = true
    _g_.m.mcache = nil
    _g_.m.p.ptr().m = 0
    atomic.Store(&_g_.m.p.ptr().status, _Psyscall)
    if sched.gcwaiting != 0 {
        systemstack(entersyscall_gcwait)
        save(pc, sp)
    }

    _g_.m.locks--
}

As you can see, G entering Syscall is guaranteed not to be preempted.

▎ exitsyscall

// g has exited its syscall and must be prepared to run on a CPU again.
// This function is only called from the syscall library;
// the low-level syscalls in the runtime do not use it.
// Write barriers are not allowed here.
//go:nosplit
//go:nowritebarrierrec
func exitsyscall(dummy int32) {
    _g_ := getg()

    _g_.m.locks++  // see comment in entersyscall
    if getcallersp(unsafe.Pointer(&dummy)) > _g_.syscallsp {
        // throw calls print which may try to grow the stack,
        // but throwsplit == true so the stack can not be grown;
        // use systemstack to avoid that possible problem.
        systemstack(func() {
            throw("exitsyscall: syscall frame is no longer valid")
        })
    }

    _g_.waitsince = 0
    oldp := _g_.m.p.ptr()
    if exitsyscallfast() {
        if _g_.m.mcache == nil {
            systemstack(func() {
                throw("lost mcache")
            })
        }
        // There is a P for us, so we can run.
        _g_.m.p.ptr().syscalltick++
        // Change the g status back to running.
        casgstatus(_g_, _Gsyscall, _Grunning)

        // The garbage collector is not running (because our logic is executing),
        // so it is safe to clear syscallsp.
        _g_.syscallsp = 0
        _g_.m.locks--
        if _g_.preempt {
            // Keep the preemption request in case newstack cleared the flag.
            _g_.stackguard0 = stackPreempt
        } else {
            // Otherwise restore the real _StackGuard that was overwritten
            // in entersyscall/entersyscallblock.
            _g_.stackguard0 = _g_.stack.lo + _StackGuard
        }
        _g_.throwsplit = false
        return
    }

    _g_.sysexitticks = 0
    _g_.m.locks--

    // Call the scheduler.
    mcall(exitsyscall0)

    if _g_.m.mcache == nil {
        systemstack(func() {
            throw("lost mcache")
        })
    }

    // The scheduler returned, so we can clear the syscallsp information
    // that was left for the garbage collector during the syscall.
    // We have to wait until gosched returns, because until then we cannot
    // be sure whether the garbage collector is running.
    _g_.syscallsp = 0
    _g_.m.p.ptr().syscalltick++
    _g_.throwsplit = false
}

exitsyscall in turn calls exitsyscallfast and exitsyscall0.

▎ exitsyscallfast

//go:nosplit
func exitsyscallfast() bool {
    _g_ := getg()

    // Freezetheworld sets stopwait but does not retake P's.
    if sched.stopwait == freezeStopWait {
        _g_.m.mcache = nil
        _g_.m.p = 0
        return false
    }

    // Try to re-acquire the last P.
    if _g_.m.p != 0 && _g_.m.p.ptr().status == _Psyscall && atomic.Cas(&_g_.m.p.ptr().status, _Psyscall, _Prunning) {
        // There's a cpu for us, so we can run.
        exitsyscallfast_reacquired()
        return true
    }

    // Try to get any other idle P.
    oldp := _g_.m.p.ptr()
    _g_.m.mcache = nil
    _g_.m.p = 0
    if sched.pidle != 0 {
        var ok bool
        systemstack(func() {
            ok = exitsyscallfast_pidle()
        })
        if ok {
            return true
        }
    }
    return false
}

In short, it tries to get a P on which to run the logic that follows the syscall. If no P is available anywhere, execution moves on to exitsyscall0:

mcall(exitsyscall0)

The call to exitsyscall0 switches to the g0 stack.
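The fast path hinges on a single compare-and-swap of the P's status. The toy model below illustrates that race with sync/atomic; toyP, reacquire, and retake are invented names, and only the status word is modelled, not the real p struct or the scheduler around it.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Toy P states, loosely named after _Pidle, _Prunning and _Psyscall.
const (
	pIdle uint32 = iota
	pRunning
	pSyscall
)

// toyP stands in for the runtime's p; only the status is modelled.
type toyP struct{ status uint32 }

// reacquire models the fast path: the returning M wins its P back only
// if the status is still pSyscall.
func reacquire(p *toyP) bool {
	return atomic.CompareAndSwapUint32(&p.status, pSyscall, pRunning)
}

// retake models another thread taking the P away during the syscall.
func retake(p *toyP) bool {
	return atomic.CompareAndSwapUint32(&p.status, pSyscall, pIdle)
}

func main() {
	quick := &toyP{status: pSyscall}
	fmt.Println(reacquire(quick)) // true: the syscall was short, the P is still ours

	slow := &toyP{status: pSyscall}
	retake(slow)
	fmt.Println(reacquire(slow)) // false: the P was retaken, so the slow path is needed
}
```

Whichever side performs its compare-and-swap first wins, which is why the fast path needs no lock.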

▎ exitsyscall0

// Set g's state to runnable.
//go:nowritebarrierrec
func exitsyscall0(gp *g) {
    _g_ := getg()

    casgstatus(gp, _Gsyscall, _Grunnable)
    dropg()
    lock(&sched.lock)
    _p_ := pidleget()
    if _p_ == nil {
        globrunqput(gp)
    } else if atomic.Load(&sched.sysmonwait) != 0 {
        atomic.Store(&sched.sysmonwait, 0)
        notewakeup(&sched.sysmonnote)
    }
    unlock(&sched.lock)
    if _p_ != nil {
        acquirep(_p_)
        execute(gp, false) // Never returns.
    }
    if _g_.m.lockedg != 0 {
        stoplockedm()
        execute(gp, false) // Never returns.
    }
    stopm()
    schedule() // Never returns.
}

▎ entersyscallblock

Here the caller knows in advance that it will block, so it hands over its P right away.

// Just like entersyscall, except that it hands over the P right away.
//go:nosplit
func entersyscallblock(dummy int32) {
    _g_ := getg()

    _g_.m.locks++ // see comment in entersyscall
    _g_.throwsplit = true
    _g_.stackguard0 = stackPreempt // see comment in entersyscall
    _g_.m.syscalltick = _g_.m.p.ptr().syscalltick
    _g_.sysblocktraced = true
    _g_.m.p.ptr().syscalltick++

    // Leave SP around for GC and traceback.
    pc := getcallerpc()
    sp := getcallersp(unsafe.Pointer(&dummy))
    save(pc, sp)
    _g_.syscallsp = _g_.sched.sp
    _g_.syscallpc = _g_.sched.pc
    if _g_.syscallsp < _g_.stack.lo || _g_.stack.hi < _g_.syscallsp {
        sp1 := sp
        sp2 := _g_.sched.sp
        sp3 := _g_.syscallsp
        systemstack(func() {
            print("entersyscallblock inconsistent ", hex(sp1), " ", hex(sp2), " ", hex(sp3), "[", hex(_g_.stack.lo), ",", hex(_g_.stack.hi), "]\n")
            throw("entersyscallblock")
        })
    }
    casgstatus(_g_, _Grunning, _Gsyscall)
    if _g_.syscallsp < _g_.stack.lo || _g_.stack.hi < _g_.syscallsp {
        systemstack(func() {
            print("entersyscallblock inconsistent ", hex(sp), " ", hex(_g_.sched.sp), " ", hex(_g_.syscallsp), "[", hex(_g_.stack.lo), ",", hex(_g_.stack.hi), "]\n")
            throw("entersyscallblock")
        })
    }

    systemstack(entersyscallblock_handoff)

    // Resave for traceback during blocked call.
    save(getcallerpc(), getcallersp(unsafe.Pointer(&dummy)))

    _g_.m.locks--
}

This function has only one caller, notetsleepg, which I won't go into here.

▎ entersyscallblock_handoff

func entersyscallblock_handoff() {
    handoffp(releasep())
}

This one is simple: it releases the current P and hands it off.

▎ entersyscall_sysmon

func entersyscall_sysmon() {
    lock(&sched.lock)
    if atomic.Load(&sched.sysmonwait) != 0 {
        atomic.Store(&sched.sysmonwait, 0)
        notewakeup(&sched.sysmonnote)
    }
    unlock(&sched.lock)
}

▎ entersyscall_gcwait

func entersyscall_gcwait() {
    _g_ := getg()
    _p_ := _g_.m.p.ptr()

    lock(&sched.lock)
    if sched.stopwait > 0 && atomic.Cas(&_p_.status, _Psyscall, _Pgcstop) {
        _p_.syscalltick++
        if sched.stopwait--; sched.stopwait == 0 {
            notewakeup(&sched.stopnote)
        }
    }
    unlock(&sched.lock)
}
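The countdown in entersyscall_gcwait (decrement stopwait under the lock, wake the waiter when it reaches zero) can be modelled in ordinary Go. This is a toy sketch with invented names (toySched, gcwait, stopped); it leaves out the status compare-and-swap on the P and uses a channel where the runtime uses a note.

```go
package main

import (
	"fmt"
	"sync"
)

// toySched stands in for the scheduler fields involved; the field names
// only loosely follow sched.stopwait and sched.stopnote.
type toySched struct {
	mu       sync.Mutex
	stopwait int           // number of Ps the stopper is still waiting for
	stopnote chan struct{} // closed when the last P has checked in
}

func newToySched(nprocs int) *toySched {
	return &toySched{stopwait: nprocs, stopnote: make(chan struct{})}
}

// gcwait models one P that enters a syscall while a stop is pending: it
// counts itself as stopped and wakes the stopper if it was the last one.
func (s *toySched) gcwait() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stopwait > 0 {
		s.stopwait--
		if s.stopwait == 0 {
			close(s.stopnote)
		}
	}
}

// stopped reports whether every P has checked in.
func (s *toySched) stopped() bool {
	select {
	case <-s.stopnote:
		return true
	default:
		return false
	}
}

func main() {
	s := newToySched(3)
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.gcwait()
		}()
	}
	wg.Wait()
	fmt.Println(s.stopped())
}
```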

▎ summary

The runtime is notified of every system call made through the interfaces provided to users, via entersyscall and exitsyscall. If a syscall blocks, the runtime decides whether to take the P away and give it to another M; the unbinding here is between the M and the P. If they were unbound, then when the syscall returns the g is put on a run queue (runq) to wait for execution.

Meanwhile, the runtime keeps for itself the privilege of not having its P taken away while its own logic is executing, which ensures that the syscalls used at Go's “low level” are handled as soon as they return.

So even for the same epoll_wait, the runtime's own call enjoys this privilege, while syscall.EpollWait called from user code does not.

▎ END

References are as follows

z.didi.cn/1HecgP



        

Xargin, open source enthusiast. Active on GitHub and in various tech communities, with a passion for technical debate. Author of the open source book Advanced Programming for Go