Preface
The high availability of services has become increasingly important. High availability usually means operations such as failing over to redundant modules so the system stays available externally, along with finer-grained operations like taking a program offline or restarting it. How does Go handle these? Today we'll talk about graceful shutdown and restart of a Go application: how to let the application finish handling old connections before it closes or restarts, so that the switch is as invisible to callers as possible.
Concepts
Interprocess communication
We know that processes commonly communicate in several ways:
- Pipes
- Signals
- Network sockets
- Shared memory
Today let's talk about signals. (Don't confuse them with P/V semaphores, which are used to block and wake processes contending for a critical section.) A signal is essentially an interrupt mechanism through which the operating system notifies a process. A common scenario: pressing Ctrl+C to tell a process to exit sends it an interrupt signal, also known as SIGINT.
In Go, the signal definitions for Windows look like this:
```go
const (
	// More invented values for signals
	SIGHUP  = Signal(0x1)
	SIGINT  = Signal(0x2)
	SIGQUIT = Signal(0x3)
	SIGILL  = Signal(0x4)
	SIGTRAP = Signal(0x5)
	SIGABRT = Signal(0x6)
	SIGBUS  = Signal(0x7)
	SIGFPE  = Signal(0x8)
	SIGKILL = Signal(0x9)
	SIGSEGV = Signal(0xb)
	SIGPIPE = Signal(0xd)
	SIGALRM = Signal(0xe)
	SIGTERM = Signal(0xf)
)

var signals = [...]string{
	1:  "hangup",
	2:  "interrupt",
	3:  "quit",
	4:  "illegal instruction",
	5:  "trace/breakpoint trap",
	6:  "aborted",
	7:  "bus error",
	8:  "floating point exception",
	9:  "killed",
	10: "user defined signal 1",
	11: "segmentation fault",
	12: "user defined signal 2",
	13: "broken pipe",
	14: "alarm clock",
	15: "terminated",
}
```
Fifteen signals, numbered 1 through 15 (written in hexadecimal). So how do we listen for system signals in Go?
```go
func Notify(c chan<- os.Signal, sig ...os.Signal) {
	if c == nil {
		panic("os/signal: Notify using nil channel")
	}
	// omit some code...
	add := func(n int) {
		if n < 0 {
			return
		}
		if !h.want(n) {
			h.set(n)
			if handlers.ref[n] == 0 {
				enableSignal(n)

				// Start the listener exactly once, ensuring the handling
				// logic is registered before signals start arriving
				watchSignalLoopOnce.Do(func() {
					if watchSignalLoop != nil {
						// Poll for incoming signals in a new goroutine
						go watchSignalLoop()
					}
				})
			}
			handlers.ref[n]++
		}
	}
	// omit some code...
}
```
watchSignalLoop, in the Unix implementation, is a polling function:
```go
func loop() {
	for {
		process(syscall.Signal(signal_recv()))
	}
}

func init() {
	watchSignalLoop = loop
}
```
Now we know the general flow of registering and listening for signals: register a context bound to the target signals, and a goroutine is created asynchronously to listen for system signals.
Let's take the interrupt as an example. To listen for interrupt requests from the system, Go can register as follows:
```go
// Register a ctx bound to os.Interrupt
ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
// ...
// stop() unbinds the context from the signal
defer stop()
```
After registering the context returned for os.Interrupt, if the system delivers an interrupt, the ctx is canceled; ctx.Done() then fires, and we can use that as the trigger for our subsequent handling.
Graceful shutdown
So once we receive the interrupt signal, how do we exit gracefully? Let's look at this function:
```go
// Shutdown gracefully shuts down the server without interrupting any
// active connections. Shutdown works by first closing all open
// listeners, then closing all idle connections, and then waiting
// indefinitely for connections to return to idle and then shut down.
// If the provided context expires before the shutdown is complete,
// Shutdown returns the context's error, otherwise it returns any
// error returned from closing the Server's underlying Listener(s).
func (srv *Server) Shutdown(ctx context.Context) error {
	// ...
}
```
As the comment explains, Shutdown() first closes all open listeners, then closes idle connections, and then waits for active connections to return to idle before shutting down. In addition, Shutdown() returns the context's error if the supplied ctx expires before the shutdown completes.
So we can use Shutdown() to let the program finish its work at the interrupt point, and use the context's lifetime to bound the grace period for closing.
A code example:
```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"time"
)

var (
	server http.Server
)

// Graceful stop demo
func main() {
	// Register a ctx bound to os.Interrupt
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stop()

	server = http.Server{
		Addr: ":8080",
	}

	// Register the route
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(time.Second * 3)
		fmt.Fprint(w, "Hello World")
	})

	// Start listening
	go server.ListenAndServe()

	// Block here until the interrupt signal arrives
	<-ctx.Done()

	// Unbind the context from the signal
	stop()
	fmt.Println("shutting down gracefully, press Ctrl+C again to force")

	// Give in-flight connections up to 10 seconds to finish
	timeoutCtx, cancelFunc := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancelFunc()
	if err := server.Shutdown(timeoutCtx); err != nil {
		fmt.Println(err)
	}
}
```
- Register a simple route that sleeps 3 seconds and then returns "Hello World"
- Bind the system signal os.Interrupt (SIGINT) to a context
- Use the context to become aware of the interrupt
- Create a new context with a 10-second lifetime
- Pass that context to Shutdown() to bound the shutdown
Example output: we start the program and press Ctrl+C; with no request in flight, the program terminates quickly.

```
$ go run main.go
2021/11/06 23:58:03
2021/11/06 23:58:03 Program shutdown complete.
```
Then we make a request while the program is running, so it has time-consuming processing in flight:

```
$ curl 127.0.0.1:8080
Hello World
```
and press Ctrl+C on the server side:

```
$ go run main.go
2021/11/06 23:58:33
2021/11/06 23:58:35 Program shutdown complete.
```
As the log output shows, instead of exiting immediately, the program waits for the request to finish before closing. If we adjust the handler to take longer, so that the processing time exceeds the lifetime of the context passed to Shutdown(), the program returns a context timeout error:

```
Received SIGINT signal, performing graceful stop, waiting to finish...
2021/11/07 00:02:51 Graceful stop error: context deadline exceeded
```
Discussion
That's a rough implementation of graceful exit. To extend the idea: this mainly covers the processing done before going offline. In production, taking a service offline usually comes with other detection measures, such as heartbeat timeouts, or Kubernetes deregistration, which can be detected by periodically polling a local file or handle. A signal is just one way for a program to perceive an interruption; once we know the service is going offline, we can use Shutdown() to perform the final close. In addition, if context deadline exceeded occurs when the grace period runs out, the business layer can usually archive the unfinished requests, putting them in a retry queue or logging them so they can be repaired later.
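As a sketch of that last point (the pending tracker and its request IDs below are hypothetical, not from the original article), in-flight work can be drained into a retry log when the shutdown deadline is exceeded:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// pending tracks in-flight request IDs; a real service would hold richer state.
type pending struct {
	mu  sync.Mutex
	ids map[string]struct{}
}

func (p *pending) add(id string)  { p.mu.Lock(); p.ids[id] = struct{}{}; p.mu.Unlock() }
func (p *pending) done(id string) { p.mu.Lock(); delete(p.ids, id); p.mu.Unlock() }

// archive returns every request that was still in flight when shutdown
// gave up, so it can be pushed to a retry queue or logged for later repair.
func (p *pending) archive() []string {
	p.mu.Lock()
	defer p.mu.Unlock()
	var out []string
	for id := range p.ids {
		out = append(out, id)
	}
	return out
}

func main() {
	p := &pending{ids: make(map[string]struct{})}
	p.add("req-42") // imagine this request never finished

	// Simulate Shutdown's context running out of time
	ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	<-ctx.Done()
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		fmt.Println("archiving unfinished requests:", p.archive())
	}
}
```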
Graceful restart
Now that we've covered graceful exit, let's look at graceful restart. Some time ago I read an article on implementing it through signal interaction. I found it very interesting, so I've sorted it out here; the link is in the references.
The core of a graceful restart is a hand-off of connections: while the outgoing service still has unfinished connections, we need a new service/process to take over handling them as much as possible, and to keep listening for new requests so the system stays available, with the requester noticing nothing.
In short, achieving a graceful restart requires solving two problems:
- At the operating-system level, how to keep the created socket alive so that the newly restarted process can continue listening on it
- How to ensure that all in-flight requests either get a response or time out
This sounds ideal, but let’s break it down step by step to see how it works.
Breaking down the core steps
- Fork a child process from the process currently listening on the socket
- The new (child) process takes over, reusing the old socket
- The new (child) process tells the old (parent) process to stop accepting requests and close
State transition
We won't go into the details of what happens after a process starts; first, let's look at which states the program needs to watch for, that is, which events it must respond to at restart time. A running service is really in one of only two states:
- First startup
- A version change, where a new process starts to replace the old one
State one is no different from an ordinary service: just start up and listen.
State two is really an extension of state one, so the program needs to watch for both states simultaneously, and the event it watches for is the signal we discussed above under graceful shutdown.
I've drawn a rough flow chart to aid understanding.
Preliminary concepts
Let's first get familiar with some concepts from network socket programming, so we know how connections can be reused.
We know that in a network environment, an end-to-end TCP connection is identified by a four-tuple, <src addr, src port, dst addr, dst port>, which locks a unique connection.
When a TCP connection is closed, it passes through the TIME_WAIT state, which waits for the close to complete on both sides and for buffered data to drain. This state lasts a fixed period (by default about two minutes), during which the same TCP four-tuple cannot be reused unless SO_REUSEADDR is set.
There are two key parameters: SO_REUSEADDR and SO_REUSEPORT.

| Parameter | Meaning |
|---|---|
| SO_REUSEADDR | Allows an address to be rebound before the old connection on it is completely closed (e.g., still in TIME_WAIT) |
| SO_REUSEPORT | Building on SO_REUSEADDR, allows multiple sockets to bind the same IP address and port |
If reuse is enabled and multiple file handles are successfully bound to the same IP and port at the operating-system level, how does the system dispatch connections? The answer is load balancing: the system distributes incoming connections for that IP and port across the bound sockets, roughly like random round-robin.
Some may ask: wouldn't this let a different client process bind our port and overstep its privileges? Indeed it would, so there is a convention based on security considerations:

To prevent "port hijacking", there is one special limitation: all sockets that want to share the same address and port combination must belong to processes that share the same effective user ID.

That is, all sockets that reuse the same address and port must belong to the same effective user ID. In our scenario they are created by the same process or the same user, so the original connection can be reused without opening the door to malicious hijacking.
The sample program
Let's see how this works in code.
- Pass in a Control function that sets the socket options for connection reuse

```go
func control(network, address string, c syscall.RawConn) error {
	var err error
	c.Control(func(fd uintptr) {
		err = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEADDR, 1)
		if err != nil {
			return
		}
		err = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
		if err != nil {
			return
		}
	})
	return err
}
```
- Check whether the TCP address is already being listened on (and reuse it if so)

```go
func listener() (net.Listener, error) {
	lc := net.ListenConfig{
		Control: control,
	}
	if l, err := lc.Listen(context.TODO(), "tcp", ":8080"); err != nil {
		// The address could not be listened on; return err
		return nil, err
	} else {
		return l, nil
	}
}
```
- Watch for system signals

```go
func upgradeLoop(l *net.Listener, s *http.Server) {
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGQUIT, syscall.SIGUSR2)
	for t := range sig {
		switch t {
		case syscall.SIGUSR2:
			// Upgrade signal received
			log.Println("Received SIGUSR2 upgrading binary")
			// Spawn a child process for the graceful upgrade
			if err := spawnChild(); err != nil {
				log.Println(
					"Cannot perform binary upgrade, when starting process: ",
					err.Error(),
				)
				continue
			}
		case syscall.SIGQUIT:
			// Quit signal received: shut down the current process
			s.Shutdown(context.Background())
			os.Exit(0)
		}
	}
}

// spawnChild forks a child process; the parent's PID is recorded in the
// environment so the child can kill it after starting
func spawnChild() error {
	// Resolve the executable the current process was started with, e.g. ./main
	argv0, err := exec.LookPath(os.Args[0])
	if err != nil {
		return err
	}
	wd, err := os.Getwd()
	if err != nil {
		return err
	}
	files := make([]*os.File, 0)
	files = append(files, os.Stdin, os.Stdout, os.Stderr)
	// Record the current PID; the child will kill this process once it starts
	ppid := os.Getpid()
	os.Setenv("APP_PPID", strconv.Itoa(ppid))
	// Start the new process
	os.StartProcess(argv0, os.Args, &os.ProcAttr{
		Dir:   wd,
		Env:   os.Environ(),
		Files: files,
		Sys:   &syscall.SysProcAttr{},
	})
	return nil
}
```
- Main goroutine logic

```go
func main() {
	log.Println("Started HTTP API, PID: ", os.Getpid())
	var l net.Listener
	if fd, err := listener(); err != nil {
		// First startup
		log.Println("Parent does not exist, starting the normal way")
		l, err = net.Listen("tcp", ":8080")
		if err != nil {
			panic(err)
		}
	} else {
		// The port is already being listened on; take over the socket
		l = fd
		// Tell the parent process to quit
		killParent()
		time.Sleep(time.Second)
	}
	// Start serving on the listener
	s := &http.Server{}
	http.HandleFunc("/", func(rw http.ResponseWriter, r *http.Request) {
		log.Printf("New request! From: %d, path: %s, method: %s", os.Getpid(),
			r.URL, r.Method)
	})
	go s.Serve(l)
	// Watch for signals
	upgradeLoop(&l, s)
}
```
Code reference:
The code references the zero-downtime-application project. The core is the graceful hand-off from the old process to the new one; once you understand that migration, the code reads much more clearly. This is also why the main function has to handle both the normal server startup flow and taking over from (and then terminating) the old process.
Practical application
The trigger for the graceful restart above is a user manually sending the signal to the process (e.g. pkill -SIGUSR2 <process name>) to perform a seamless manual upgrade.
This can be extended: a monitoring service could add connection probes, alerts on request response times, and so on, and trigger a graceful restart automatically when a threshold is hit, achieving a dynamic self-healing effect. Of course, you still need to track down where the service's problem actually lies afterwards; restarting doesn't solve everything.
References
- Graceful Shutdowns in Golang with signal.NotifyContext
- Zero downtime API in Golang
- Graceful Restart in Golang
- K8s-based elegant closure
- Discussion of BSD socket parameters