Design and implementation of Golang Failpoint

A large, complex system usually consists of multiple modules or components, so simulating faults in each subsystem is an indispensable part of testing. Fault simulation must also be integrated non-intrusively into the automated test system: the tests activate failure points automatically to simulate faults, then check whether the final result matches expectations to judge the correctness and stability of the system. If a colleague had to plug and unplug network cables to simulate network anomalies in a distributed system, or destroy hard disks to simulate disk damage in a storage system, the cost would turn testing into a disaster, and tests that require fine-grained control would be hard to simulate at all. So we need an automated way to do deterministic failure testing.

The Failpoint project (github.com/pingcap/failpoint) was born for this purpose. It is a Golang implementation of FreeBSD failpoints that allows errors or abnormal behavior to be injected into code and triggered dynamically through environment variables or code. Failpoint can be used to simulate error handling in a variety of complex systems in order to improve fault tolerance, correctness and stability. For example:

A microservice experiences random delays or becomes unavailable.
In a storage system, disk I/O latency increases, I/O throughput drops, and flushing to disk takes a long time.
A hotspot appears in a scheduling system and a scheduling command fails.
In a recharge (top-up) system, the callback interface receives repeated successful-recharge requests from a third party.
In game development, simulating unstable player networks, dropped frames, excessive latency and all kinds of abnormal input (requests from cheating plug-ins) to verify that the system still works correctly.
…

Why reinvent the wheel?

The etcd team made a great contribution to the Golang ecosystem by developing gofail in 2016, which greatly simplified error injection. We introduced gofail for error-injection testing in 2018, but ran into problems with its functionality and convenience, so we decided to build a better “wheel.”

How gofail is used

Use comments to inject a failpoint into a program:

    // gofail: var FailIfImportedChunk int
    // if merger, ok := scp.merger.(*ChunkCheckpointMerger); ok && merger.Checksum.SumKVS() >= uint64(FailIfImportedChunk) {
    //     rc.checkpointsWg.Done()
    //     rc.checkpointsWg.Wait()
    //     panic("forcing failure due to FailIfImportedChunk")
    // }
    // goto RETURN1

    // gofail: RETURN1:

    // gofail: var FailIfStatusBecomes int
    // if merger, ok := scp.merger.(*StatusCheckpointMerger); ok && merger.EngineID >= 0 && int(merger.Status) == FailIfStatusBecomes {
    //     rc.checkpointsWg.Done()
    //     rc.checkpointsWg.Wait()
    //     panic("forcing failure due to FailIfStatusBecomes")
    // }
    // goto RETURN2

    // gofail: RETURN2:

The code after conversion with gofail enable:

    if vFailIfImportedChunk, __fpErr := __fp_FailIfImportedChunk.Acquire(); __fpErr == nil { defer __fp_FailIfImportedChunk.Release(); FailIfImportedChunk, __fpTypeOK := vFailIfImportedChunk.(int); if !__fpTypeOK { goto __badTypeFailIfImportedChunk}
        if merger, ok := scp.merger.(*ChunkCheckpointMerger); ok && merger.Checksum.SumKVS() >= uint64(FailIfImportedChunk) {
            rc.checkpointsWg.Done()
            rc.checkpointsWg.Wait()
            panic("forcing failure due to FailIfImportedChunk")
        }
        goto RETURN1; __badTypeFailIfImportedChunk: __fp_FailIfImportedChunk.BadType(vFailIfImportedChunk, "int"); };

    /* gofail-label */ RETURN1:

    if vFailIfStatusBecomes, __fpErr := __fp_FailIfStatusBecomes.Acquire(); __fpErr == nil { defer __fp_FailIfStatusBecomes.Release(); FailIfStatusBecomes, __fpTypeOK := vFailIfStatusBecomes.(int); if !__fpTypeOK { goto __badTypeFailIfStatusBecomes}
        if merger, ok := scp.merger.(*StatusCheckpointMerger); ok && merger.EngineID >= 0 && int(merger.Status) == FailIfStatusBecomes {
            rc.checkpointsWg.Done()
            rc.checkpointsWg.Wait()
            panic("forcing failure due to FailIfStatusBecomes")
        }
        goto RETURN2; __badTypeFailIfStatusBecomes: __fp_FailIfStatusBecomes.BadType(vFailIfStatusBecomes, "int"); };

    /* gofail-label */ RETURN2:

Problems we encountered with gofail

Failpoints are injected through comments, which is error-prone and cannot be checked by the compiler.
A failpoint can only take effect globally. Large projects run tests in parallel to shorten automated test time, and different parallel tasks then interfere with each other.
A blank line is required between // goto RETURN2 and // gofail: RETURN2:.

What kind of failpoint should we design?

What would the ideal failpoint implementation look like?

Ideally, a failpoint should be defined in code and be non-intrusive to the business logic. In a language that supports macros (such as Rust), we could define failpoints with a fail_point macro:

    fail_point!("transport_on_send_store", |sid| if let Some(sid) = sid {
        let sid: u64 = sid.parse().unwrap();
        if sid == store_id {
            self.raft_client.wl().addrs.remove(&store_id);
        }
    })

But in Golang we run into some problems:

Golang does not support macros.
Golang does not support compiler plug-ins.
Golang build tags do not offer an elegant implementation either (go build -tags="enable-failpoint-a").

Failpoint design criteria

Define failpoints in Golang code, not in comments or any other form.
Failpoint code must not add any overhead:

It must not affect normal functional logic or intrude on functional code in any way.
Injecting failpoint code must not cause a performance regression.
Failpoint code must not appear in the final release binary.

Failpoint code must be readable, easy to write, and checked by the compiler.
The generated code must be readable.
Line numbers of the functional code must not change in the generated code (to keep debugging easy).
Parallel testing must be supported: context.Context can be used to control whether a specific failpoint is activated.

How can Golang implement a failpoint macro?

What is the essence of a macro? Going back to basics, we found that a failpoint meeting the criteria above can be implemented in Golang through AST rewriting, as shown in the following figure:

For any Golang source file, we can parse the file into a syntax tree, traverse the tree to find all failpoint injection points, and then rewrite the tree into the desired logic.

Relevant concepts

Failpoint

A failpoint is a code snippet that is executed only when the failpoint with the corresponding name is activated. If it is disabled with failpoint.Disable("failpoint-name-for-demo"), the corresponding failpoint will never trigger. Failpoint code snippets are never compiled into the final binary. For example, to simulate a file system permission error:

    func saveTo(path string) error {
        failpoint.Inject("mock-permission-deny", func() error {
            // It's OK to access outer scope variable
            return fmt.Errorf("mock permission deny: %s", path)
        })
    }

Marker function

Marker functions are used in the AST rewriting phase to mark the parts that need to be rewritten. They have the following properties:

They hint the rewriter to rewrite the call into an equivalent IF statement.

The parameters of a marker function are the parameters needed by the rewrite.
A marker function is an empty function; the compiler inlines it and then eliminates it entirely.
The failpoint injected through a marker function is a closure. If the closure accesses outer-scope variables, closure syntax lets it capture them without compile errors, and the converted code is an IF statement, which can access the same outer-scope variables without any problem. Closure capture is therefore only there to keep the syntax legal and adds no overhead in the end.

They are simple, easy to read and easy to write.
They bring in compiler checks: if the arguments of a marker function are wrong, the program does not compile, which guarantees the correctness of the converted code.
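
To make the idea of an empty marker function concrete, here is a small stand-in. The function inject below is a local, hypothetical imitation written for this example (it is not the library's code). It shows that a marker with an empty body accepts a closure that captures outer variables, yet never runs it, so the surrounding logic is unaffected as long as the call has not been rewritten.

```go
package main

import "fmt"

// inject is a hypothetical stand-in for a marker function: its body is empty,
// so the closure passed to it is checked by the compiler but never called.
func inject(fpname string, fpbody interface{}) {}

// saveTo captures the outer variable path inside the closure. Because inject
// does nothing, saveTo always takes the normal path and returns nil.
func saveTo(path string) error {
	inject("mock-permission-deny", func() error {
		return fmt.Errorf("mock permission deny: %s", path)
	})
	return nil
}

func main() {
	// The closure is never executed, so no mock error is produced.
	fmt.Println(saveTo("/tmp/demo"))
}
```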

List of marker functions currently supported:

    func Inject(fpname string, fpblock func(val Value)) {}
    func InjectContext(ctx context.Context, fpname string, fpblock func(val Value)) {}
    func Break(label ...string) {}
    func Goto(label string) {}
    func Continue(label ...string) {}
    func Fallthrough() {}
    func Return(results ...interface{}) {}
    func Label(label string) {}

How do you inject failpoints into your program?

Inject a failpoint at the call site with failpoint.Inject. The failpoint.Inject call is rewritten into an IF statement in which mock-io-error decides whether the failpoint is triggered; when it is, the logic in the failpoint closure is executed. For example, to inject an I/O error into a function that reads a file:

    failpoint.Inject("mock-io-error", func(val failpoint.Value) error {
        return fmt.Errorf("mock error: %v", val.(string))
    })

The final converted code looks like this:

    if ok, val := failpoint.Eval(_curpkg_("mock-io-error")); ok {
        return fmt.Errorf("mock error: %v", val.(string))
    }

Use failpoint.Enable("mock-io-error", "return(\"disk error\")") to activate the failpoint in the program. To give failpoint.Value a custom value, pass a failpoint expression such as return("disk error"). For more syntax, refer to the failpoint syntax.
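
The interplay between enabling a failpoint and evaluating it at the injection point can be pictured with a toy in-memory registry. This is a simplified sketch, not the library's implementation: the registry type and its Enable, Disable and Eval methods are assumptions made for illustration, and the value is passed in directly instead of being parsed from a failpoint expression.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical, minimal failpoint table: name -> injected value.
type registry struct {
	mu     sync.RWMutex
	active map[string]interface{}
}

func newRegistry() *registry {
	return &registry{active: make(map[string]interface{})}
}

// Enable activates the named failpoint with the given value.
func (r *registry) Enable(name string, val interface{}) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.active[name] = val
}

// Disable deactivates the named failpoint.
func (r *registry) Disable(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.active, name)
}

// Eval reports whether the failpoint is active and, if so, returns its value.
func (r *registry) Eval(name string) (bool, interface{}) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	val, ok := r.active[name]
	return ok, val
}

var fp = newRegistry()

// readFile mimics the shape of the converted code: an IF statement guarded by Eval.
func readFile(path string) (string, error) {
	if ok, val := fp.Eval("mock-io-error"); ok {
		return "", fmt.Errorf("mock error: %v", val.(string))
	}
	return "contents of " + path, nil
}

func main() {
	fmt.Println(readFile("a.txt")) // failpoint inactive: normal path

	fp.Enable("mock-io-error", "disk error")
	fmt.Println(readFile("a.txt")) // failpoint active: mock error

	fp.Disable("mock-io-error")
	fmt.Println(readFile("a.txt")) // back to the normal path
}
```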

The closure can be nil. For example, with failpoint.Enable("mock-delay", "sleep(1000)") the goal is only to sleep for one second at the injection point, without running any additional logic.

    failpoint.Inject("mock-delay", nil)
    failpoint.Inject("mock-delay", func() {})

This results in the following code:

    failpoint.Eval(_curpkg_("mock-delay"))
    failpoint.Eval(_curpkg_("mock-delay"))

If we only want to trigger a panic in the failpoint and do not need the failpoint.Value, we can ignore or omit the value in the closure's parameters. For example:

    failpoint.Inject("mock-panic", func(_ failpoint.Value) error {
        panic("mock panic")
    })

    // OR
    failpoint.Inject("mock-panic", func() error {
        panic("mock panic")
    })

The best practice is:

    failpoint.Enable("mock-panic", "panic")
    failpoint.Inject("mock-panic", nil)

    // GENERATED CODE
    failpoint.Eval(_curpkg_("mock-panic"))

To prevent interference between different test tasks in parallel testing, we can attach a callback function to a context.Context to finely control which failpoints are activated and which are not:

    failpoint.InjectContext(ctx, "failpoint-name", func(val failpoint.Value) {
        fmt.Println("unit-test", val)
    })

Converted code:

    if ok, val := failpoint.EvalContext(ctx, _curpkg_("failpoint-name")); ok {
        fmt.Println("unit-test", val)
    }

An example using failpoint.WithHook:

    func (s *dmlSuite) TestCRUDParallel() {
        sctx := failpoint.WithHook(context.Background(), func(ctx context.Context, fpname string) bool {
            return ctx.Value(fpname) != nil // Determine by ctx key
        })
        insertFailpoints := map[string]struct{}{
            "insert-record-fp": {},
            "insert-index-fp":  {},
            "on-duplicate-fp":  {},
        }
        ictx := failpoint.WithHook(context.Background(), func(ctx context.Context, fpname string) bool {
            _, found := insertFailpoints[fpname] // Only enables some failpoints.
            return found
        })
        deleteFailpoints := map[string]struct{}{
            "tikv-is-busy-fp":   {},
            "fetch-tso-timeout": {},
        }
        dctx := failpoint.WithHook(context.Background(), func(ctx context.Context, fpname string) bool {
            _, found := deleteFailpoints[fpname] // Only disables failpoints.
            return !found
        })
        // other DML parallel test cases.
        s.RunParallel(buildSelectTests(sctx))
        s.RunParallel(buildInsertTests(ictx))
        s.RunParallel(buildDeleteTests(dctx))
    }

If we use failpoints inside a loop, we may need other marker functions:

    failpoint.Label("outer")
    for i := 0; i < 100; i++ {
    inner:
        for j := 0; j < 1000; j++ {
            switch rand.Intn(j) + i {
            case j / 5:
                failpoint.Break()
            case j / 7:
                failpoint.Continue("outer")
            case j / 9:
                failpoint.Fallthrough()
            case j / 10:
                failpoint.Goto("outer")
            default:
                failpoint.Inject("failpoint-name", func(val failpoint.Value) {
                    fmt.Println("unit-test", val.(int))
                    if val == j/11 {
                        failpoint.Break("inner")
                    } else {
                        failpoint.Goto("outer")
                    }
                })
            }
        }
    }

The above code will eventually be rewritten as follows:

    outer:
        for i := 0; i < 100; i++ {
        inner:
            for j := 0; j < 1000; j++ {
                switch rand.Intn(j) + i {
                case j / 5:
                    break
                case j / 7:
                    continue outer
                case j / 9:
                    fallthrough
                case j / 10:
                    goto outer
                default:
                    if ok, val := failpoint.Eval(_curpkg_("failpoint-name")); ok {
                        fmt.Println("unit-test", val.(int))
                        if val == j/11 {
                            break inner
                        } else {
                            goto outer
                        }
                    }
                }
            }
        }

Why do we need the label, break, continue and fallthrough marker functions? Why not just use the keywords?

Golang refuses to compile if a variable or label is unused.

    label1: // compiler error: unused label1
        failpoint.Inject("failpoint-name", func(val failpoint.Value) {
            if val.(int) == 1000 {
                goto label1 // illegal to use goto here
            }
            fmt.Println("unit-test", val)
        })

break and continue can only be used in the context of a loop, so they are not legal inside a closure.

Some complex injection examples

Example 1: Inject failpoints into the INITIAL and CONDITIONAL parts of an IF statement

    if a, b := func() {
        failpoint.Inject("failpoint-name", func(val failpoint.Value) {
            fmt.Println("unit-test", val)
        })
    }, func() int { return rand.Intn(200) }(); b > func() int {
        failpoint.Inject("failpoint-name", func(val failpoint.Value) int {
            return val.(int)
        })
        return rand.Intn(3000)
    }() && b < func() int {
        failpoint.Inject("failpoint-name-2", func(val failpoint.Value) int {
            return rand.Intn(val.(int))
        })
        return rand.Intn(6000)
    }() {
        a()
        failpoint.Inject("failpoint-name-3", func(val failpoint.Value) {
            fmt.Println("unit-test", val)
        })
    }

The above code will eventually be rewritten as:

    if a, b := func() {
        if ok, val := failpoint.Eval(_curpkg_("failpoint-name")); ok {
            fmt.Println("unit-test", val)
        }
    }, func() int { return rand.Intn(200) }(); b > func() int {
        if ok, val := failpoint.Eval(_curpkg_("failpoint-name")); ok {
            return val.(int)
        }
        return rand.Intn(3000)
    }() && b < func() int {
        if ok, val := failpoint.Eval(_curpkg_("failpoint-name-2")); ok {
            return rand.Intn(val.(int))
        }
        return rand.Intn(6000)
    }() {
        a()
        if ok, val := failpoint.Eval(_curpkg_("failpoint-name-3")); ok {
            fmt.Println("unit-test", val)
        }
    }

Example 2: Inject failpoint into a SELECT statement CASE to dynamically control whether a CASE is blocked

    func (s *StoreService) ExecuteStoreTask() {
        select {
        case <-func() chan *StoreTask {
            failpoint.Inject("priority-fp", func(_ failpoint.Value) {
                return make(chan *StoreTask)
            })
            return s.priorityHighCh
        }():
            fmt.Println("execute high priority task")
        case <-s.priorityNormalCh:
            fmt.Println("execute normal priority task")
        case <-s.priorityLowCh:
            fmt.Println("execute normal low task")
        }
    }

The above code will eventually be rewritten as:

    func (s *StoreService) ExecuteStoreTask() {
        select {
        case <-func() chan *StoreTask {
            if ok, _ := failpoint.Eval(_curpkg_("priority-fp")); ok {
                return make(chan *StoreTask)
            }
            return s.priorityHighCh
        }():
            fmt.Println("execute high priority task")
        case <-s.priorityNormalCh:
            fmt.Println("execute normal priority task")
        case <-s.priorityLowCh:
            fmt.Println("execute normal low task")
        }
    }
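
The reason this blocks a CASE is that a receive from a freshly made channel that nobody writes to can never proceed. The following self-contained sketch demonstrates the trick without the failpoint library; the boolean flag block stands in for an activated priority-fp failpoint, and the names highCh and execute are made up for this example.

```go
package main

import "fmt"

// highCh returns the channel the first select case should wait on. When block
// is true it returns a fresh channel that nobody writes to, so that case can
// never fire.
func highCh(block bool, realCh chan string) chan string {
	if block {
		return make(chan string)
	}
	return realCh
}

// execute waits for a task and reports which case handled it.
func execute(block bool, high, normal chan string) string {
	select {
	case t := <-highCh(block, high):
		return "high: " + t
	case t := <-normal:
		return "normal: " + t
	}
}

func main() {
	high := make(chan string, 1)
	normal := make(chan string, 1)

	// Only the high-priority channel has a task, and it is not blocked.
	high <- "task-a"
	fmt.Println(execute(false, high, normal))

	// Both channels have a task, but the high-priority case is blocked,
	// so the normal-priority case is the only one that can proceed.
	high <- "task-b"
	normal <- "task-c"
	fmt.Println(execute(true, high, normal))
}
```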

Example 3: Dynamically injecting a SWITCH CASE

    switch opType := operator.Type(); {
    case opType == "balance-leader":
        fmt.Println("create balance leader steps")
    case opType == "balance-region":
        fmt.Println("create balance region steps")
    case opType == "scatter-region":
        fmt.Println("create scatter region steps")
    case func() bool {
        failpoint.Inject("dynamic-op-type", func(val failpoint.Value) bool {
            return strings.Contains(val.(string), opType)
        })
        return false
    }():
        fmt.Println("do something")
    default:
        panic("unsupported operator type")
    }

The above code will eventually be rewritten as follows:

    switch opType := operator.Type(); {
    case opType == "balance-leader":
        fmt.Println("create balance leader steps")
    case opType == "balance-region":
        fmt.Println("create balance region steps")
    case opType == "scatter-region":
        fmt.Println("create scatter region steps")
    case func() bool {
        if ok, val := failpoint.Eval(_curpkg_("dynamic-op-type")); ok {
            return strings.Contains(val.(string), opType)
        }
        return false
    }():
        fmt.Println("do something")
    default:
        panic("unsupported operator type")
    }

Besides the examples above, more complex cases can be written:

The INITIAL statement, CONDITIONAL expression and POST statement of a loop
The RANGE statement of a FOR loop
The INITIAL statement of a SWITCH
Slice construction and indexing
Dynamic initialization of structs
…

In fact, a failpoint can be injected anywhere a function can be called, so use your imagination.

Failpoint naming best practices

The generated code above automatically wraps the failpoint name in a _curpkg_ call. Because names are global, the package name is included in the final name to avoid naming conflicts. _curpkg_ works like a macro that is expanded with the package name at run time. You do not need to implement _curpkg_ in your application: it is generated and added automatically by failpoint-ctl enable and removed by failpoint-ctl disable.

    package ddl // ddl's parent package is `github.com/pingcap/tidb`

    func demo() {
        // _curpkg_("the-original-failpoint-name") will be expanded as `github.com/pingcap/tidb/ddl/the-original-failpoint-name`
        if ok, val := failpoint.Eval(_curpkg_("the-original-failpoint-name")); ok {...}
    }
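
One possible way to obtain such a run-time expansion is to ask reflection for the package path of a type declared in the package. The sketch below is an assumption for illustration only (the helpers qualify and curpkg and the marker type are hypothetical, and the generated _curpkg_ may be implemented differently).

```go
package main

import (
	"fmt"
	"reflect"
)

// marker exists only so that reflection can report the package it lives in.
type marker struct{}

// qualify joins a package path and a failpoint name into the global name.
func qualify(pkgPath, name string) string {
	return pkgPath + "/" + name
}

// curpkg is a hypothetical stand-in for _curpkg_: it prefixes the name with
// the import path of the package in which marker is declared.
func curpkg(name string) string {
	return qualify(reflect.TypeOf(marker{}).PkgPath(), name)
}

func main() {
	// What the expansion would look like inside the ddl package.
	fmt.Println(qualify("github.com/pingcap/tidb/ddl", "the-original-failpoint-name"))
	// The actual prefix depends on the package this file is compiled in.
	fmt.Println(curpkg("the-original-failpoint-name"))
}
```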

Since all failpoints under the same package are in the same namespace, careful naming is required to avoid naming conflicts. Here are some recommended rules to improve this situation:

Ensure that the name is unique within the package.
Use a self-explanatory name.

Failpoints can be activated using environment variables:

    GO_FAILPOINTS="github.com/pingcap/tidb/ddl/renameTableErr=return(100);github.com/pingcap/tidb/planner/core/illegalPushDown=return(true);github.com/pingcap/pd/server/schedulers/balanceLeaderFailed=return(true)"
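
The value is a semicolon-separated list of name=expression pairs. As a rough illustration of that format (not the library's parser), the hypothetical helper parseFailpoints below splits such a string into a map. It assumes that neither names nor expressions contain a semicolon.

```go
package main

import (
	"fmt"
	"strings"
)

// parseFailpoints (hypothetical helper) splits "name=expr;name=expr" into a
// map from failpoint name to expression. Expressions that contain ';' are
// not supported by this simplified sketch.
func parseFailpoints(env string) (map[string]string, error) {
	result := make(map[string]string)
	for _, item := range strings.Split(env, ";") {
		item = strings.TrimSpace(item)
		if item == "" {
			continue
		}
		// Split at the first '=' only, so the expression may contain '='.
		parts := strings.SplitN(item, "=", 2)
		if len(parts) != 2 || parts[0] == "" {
			return nil, fmt.Errorf("malformed failpoint %q", item)
		}
		result[parts[0]] = parts[1]
	}
	return result, nil
}

func main() {
	env := "github.com/pingcap/tidb/ddl/renameTableErr=return(100);" +
		"github.com/pingcap/tidb/planner/core/illegalPushDown=return(true)"
	fps, err := parseFailpoints(env)
	if err != nil {
		panic(err)
	}
	for name, expr := range fps {
		fmt.Printf("%s -> %s\n", name, expr)
	}
}
```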

Thank you

Thanks to gofail for providing the initial implementation and the inspiration, which allowed us to iterate on failpoint while standing on the shoulders of giants.
Thanks to FreeBSD for defining the syntax specification.

Finally, we welcome you to join the discussion and improve the Failpoint project together with us.
