Preface

Error handling, both within a system and between systems, matters throughout a system's whole life cycle, from development through operation to retirement, and it is a particularly important part of writing code. An error has been reported somewhere: do I return it directly, or log a line and then return? When errors pass through nested function calls, how do I find the root cause? Should the error code of an HTTP or RPC interface be defined in each response structure, or returned through the HTTP status code and the RPC error? This article discusses how to define and handle errors, and the reasoning behind each choice, from two angles: within a system and between systems. Since I mostly develop in Go, the discussion of error handling within a system is written largely from a Go perspective.

Error handling within a system

"Go Proverbs"

Defining error variables

Disadvantage 1: cannot wrap error information

If a business layer adds context to io.EOF by returning fmt.Errorf("xx file: %v", err), the result is a new error value, so the comparison against io.EOF in the outermost layer fails.

Disadvantage 2: error variables become part of the public API

Because an error variable has to be compared against wherever it is used, it must be exported. If an interface names such a variable in the contract of one of its methods, every struct that implements the interface must recognize and handle that error, and every implementation is restricted to returning it, even when a different error would describe the failure better. Implementers end up doing extra design and coding work just to satisfy the interface, which is inconvenient. Another effect is that each module defines its own error variables, so packages become coupled to one another through them, and as the variables multiply it becomes easy to create logical dependencies and import cycles.

Therefore, use error variables as little as possible.

Using the error text

The text of an error is meant for people rather than for code, yet matching on it is common in everyday use (Dave Cheney recommends avoiding it, though this is partly a matter of personal preference).

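Using error assertions

An error assertion relies on a struct type that implements the error interface. The caller uses a type assertion or a type switch to recover the concrete type and read its fields. The drawback resembles that of error variables: the type must be exported so that callers can assert on it, which couples them to the package that defines it.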

Therefore, use non-exported error assertions as little as possible.

Pass errors as black boxes and assert on behavior (handling each error only once)

Dave's preferred approach is opaque errors. The idea is as follows: as the caller of a function or method, you know only whether the operation succeeded; you cannot anticipate which errors may occur, so you return the error directly (adding context on the way). Errors carrying context are thus passed from caller to caller as black boxes. We do not care about the content of the error returned by each layer, only about the behavior that the error implements.

Errors are just values

The approaches described above also show that an error is just a value, if a somewhat special one. Each of them can be used where it fits, once its advantages and disadvantages are understood. The opaque errors approach is particularly worth trying: it programs against behavior rather than against specific errors. It also answers some of the questions in the preface: we do not need to log at every error site; we only need to pass the error along with its context and handle it centrally in one place. So how is context passed along with an error? This leads to the pkg/errors library below (Dave not only pointed the way but also wrote the implementation, which was a delight to discover).

Passing context in an error

The most common pattern in day-to-day development is: if err != nil { return err }. It returns quickly and leaves the handling to the outer layer, but it carries no richer information, such as "xx module failed" or "xx file open failed". If we add that information with fmt.Errorf, comparisons against sentinel errors fail at the top level. We therefore need a way to keep the original cause of the error discoverable while still passing along the context added by each layer, which is what the pkg/errors library does for us.

Wrap: wrapping an error to pass context

Wrap is simple to use. Dave's examples show it in the following steps:

  • Reading a file (wrapper one)
  • Calling the file-reading function (wrapper two)
  • The resulting error output
  • Printing the call stack (errors.Print in Dave's examples; recent releases of pkg/errors print it with the %+v format verb)

Cause: unwrapping an error to get its source

Cause returns the original error underneath all the layers of wrapping. When we need to behave differently depending on the source of the error, we call Cause first and then inspect the result.
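The unwrapping step can be sketched as below. This is a stand-in written for illustration, assuming wrapped errors expose a Cause method; wrapped, cause, and errNotFound are hypothetical names, not the library's code.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel used as the root cause.
var errNotFound = errors.New("not found")

// wrapped is a stand-in for a wrapped error that remembers its cause.
type wrapped struct {
	msg   string
	cause error
}

func (w *wrapped) Error() string { return w.msg + ": " + w.cause.Error() }
func (w *wrapped) Cause() error  { return w.cause }

// cause follows Cause links until it reaches an error that has none.
func cause(err error) error {
	type causer interface{ Cause() error }
	for err != nil {
		c, ok := err.(causer)
		if !ok {
			break
		}
		err = c.Cause()
	}
	return err
}

func main() {
	err := &wrapped{"load user", &wrapped{"query db", errNotFound}}
	fmt.Println(err)                       // load user: query db: not found
	fmt.Println(cause(err) == errNotFound) // true
}
```

Unlike the fmt.Errorf approach, the context is added without losing the original value, so the top level can still compare against it.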

Summary of error handling within a system

Drawing on Dave's articles and my own understanding, the ways of handling errors in Go and their trade-offs can be summarized as follows:

  • Minimize reliance on error variables, error assertions, and error text (they can be used, but with caution).
  • Treat errors as opaque values while they travel through the program, and prefer asserting on behavior over asserting on type.
  • Handle an error only once; when handling it, choose the behavior according to what the error reports.
  • Wrap errors with Wrap to pass context, and recover the source of the error with Cause.
  • The proverbs are just stories, not rules.

Error handling between systems

The first part described how to define, pass, and handle errors within a Go system; this second part looks at how errors are defined and passed between systems. When handling HTTP or RPC requests, we may wonder whether the HTTP status code should always be 200, with the error code carried in a custom structure. How should errors be passed between RPC services? Should network errors travel in the same structure as business errors? Should error codes be uniform across the company? Should the error text shown in the app be configured centrally in one system?

Define errors at the outer layer

In Thrift services, we often define responses like this:

struct DeleteProductRes {
    1: optional DeleteProductData data
    1000: optional ThriftUtil.ErrInfo errinfo
}

errinfo contains an error code and an error message, and every response structure carries a similar field. The problem is that the framework layer then has difficulty computing the business system's SLIs, such as availability and quality. For example, suppose A calls B and B calls C, and the circuit breaker between B and C trips because C is overloaded. The circuit-breaker information that B returns to A is buried in errinfo. A's SLA is in fact affected, yet we have no way to show this to the responsible owner promptly and visibly. So errinfo should be moved to the outermost layer. By contrast, here is how gRPC defines a response structure and handles errors:

message QueryChangeResponse {
    message Item {
        string service_name = 1;
    }
    message Data {
        repeated Item items = 1;
    }
    Data data = 1;
}
rpc QueryChange(QueryChangeRequest) returns (QueryChangeResponse);

The gRPC IDL above contains no error definition in the response message. Instead, gRPC natively carries a Status between client and server, and that Status can be converted to and from Go's standard error.

  • Protobuf definition (shipped with gRPC, no custom implementation required):
package google.rpc;

message Status {
    // A simple error code that can be easily handled by the client. The
    // actual error code is defined by `google.rpc.Code`.
    int32 code = 1;

    // A developer-facing human-readable error message in English. It should
    // both explain the error and offer an actionable resolution to it.
    string message = 2;

    // Additional error information that the client code can use to handle
    // the error, such as retry delay or a help link.
    repeated google.protobuf.Any details = 3;
}
  • Returning an error on the server:
// SayHello implements helloworld.GreeterServer
func (s *server) SayHello(ctx context.Context, in *pb.HelloRequest) (*pb.HelloReply, error) {
	log.Printf("Received: %v", in.GetName())
	// Failure: return a Status error built from a gRPC code and a message.
	return nil, status.Errorf(codes.Unimplemented, "method SayHello not implemented")
	// Success (alternative to the line above): return the reply and a nil error.
	// return &pb.HelloReply{Message: "Hello " + in.GetName()}, nil
}
  • Handling the error on the client:
r, err := c.SayHello(ctx, &pb.HelloRequest{Name: name})
if err != nil {
	s, ok := status.FromError(err)
	if ok {
		// The error converts to a gRPC Status.
		log.Println(s.Code())
		log.Println(s.Message())
		log.Println(s.Details())
	} else {
		// An ordinary (non-Status) error.
		log.Println(err)
	}
} else {
	// No error: the request succeeded.
	log.Printf("Greeting: %s", r.GetMessage())
}
  • The interceptor:
// server rpc cost, record to log and prometheus
func monitorServerInterceptor() grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (resp interface{}, err error) {
		resp, err = handler(ctx, req)
		// Common framework-level processing goes here ...
		return resp, err
	}
}

With an interceptor, the framework layer can easily obtain the err (Status) information and process it uniformly, for monitoring, alerting, assessing the system's SLA, and so on. HTTP achieves a similar effect by setting different HTTP status codes (and can pass information like the gRPC Status uniformly through headers). For system-level errors unrelated to the business, the status library also converts the error into a Status and retains the cause. We can then handle errors according to the Status code or message. This is why the error is better defined at the outer layer.

Error code design

A well-defined error code makes it easy to locate the system that reported the error.

  • A good approach is to assign a code segment to each line of business and each system. For example, let the code be an 8-digit integer such as 10000400: the first four digits (1000) identify the service line, and the last four digits (0400) are the error code defined by that service line. Together they form the complete error code.
  • The error codes of all business systems are defined in one place, in a base library, so that error information is easy to share. We can also define some common error codes, similar to HTTP's 400 and 500, with the specific details carried in the message field. The framework layer recognizes these codes and applies unified metrics reporting and alerting to them.
  • In practice, error codes are hard to promote across a whole company, and harder still to unify when several programming languages are in use. The definition of error codes is therefore a convention rather than a mandate. Standardized error codes reduce the communication cost of troubleshooting, and systems that adopt them also get the benefits provided by the framework layer. Our idea is to let these benefits draw people toward standardized error codes gradually. Not adopting them is fine too, since normal business processing is not affected.

Unified configuration of error text

Error text is what the user actually sees, and we definitely do not want users to see an error such as "127.0.0.1:8000 I/O timeout" in our app. The interface that the user calls is also where the error is finally handled and the resulting behavior decided, so the error code must be translated there into a message the user can accept. The content and templates of error text also change frequently, so a unified configuration system for it is worthwhile. The text can be looked up by the standardized error code described above, or by a key-to-content mapping maintained in the text system itself. The design is relatively simple, so I will not expand on it here.
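A lookup from error code to user-facing text might be sketched like this. The in-memory map stands in for a real configuration system, and the codes, messages, and the UserMessage name are all invented for illustration.

```go
package main

import "fmt"

// messages is a hypothetical stand-in for a centrally configured table
// that maps standardized error codes to user-facing text.
var messages = map[int]string{
	10000400: "This product no longer exists.",
	10000500: "The service is busy. Please try again in a moment.",
}

// fallback is shown for any code without configured text, so raw
// internal errors never reach the user.
const fallback = "Something went wrong. Please try again later."

// UserMessage translates an error code into text safe to display.
func UserMessage(code int) string {
	if msg, ok := messages[code]; ok {
		return msg
	}
	return fallback
}

func main() {
	fmt.Println(UserMessage(10000400)) // configured text
	fmt.Println(UserMessage(20000001)) // fallback text
}
```

Keeping a fallback means a newly introduced error code degrades to a generic message rather than leaking an address or a timeout string.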

Closing thoughts

A well-designed error handling scheme shows clearly the path along which an error occurred within a system, reduces the communication cost when errors cross system boundaries, and helps locate the cause quickly when troubleshooting production problems. The two parts above, on error handling within and between systems, are my own reflections on the topic. For reasons of length, some points are not covered, such as the encapsulation between error codes and Status to improve ease of use, and real-world examples of behavior assertions. As long as error handling is done reasonably well both within each system and between systems, the result should be close to ideal.

References

Dave.cheney.net/2019/01/27/…

Dave.cheney.net/2014/11/04/…

Dave.cheney.net/paste/gocon…

Github.com/grpc/grpc.i…

Cloud.google.com/apis/design…