Large, complex systems usually consist of multiple modules or components, so simulating faults in each subsystem is an indispensable part of testing. Fault simulation should be integrated into the automated test system non-intrusively: the test automatically activates failure points to simulate a fault, then checks whether the final result matches the expected result in order to judge the correctness and stability of the system. If a colleague has to plug and unplug network cables to simulate network anomalies in a distributed system, or destroy a hard disk to simulate disk damage in a storage system, the cost turns testing into a disaster, and tests that need fine-grained control are hard to simulate this way. We therefore need an automated way to do deterministic failure testing.
The Failpoint project (github.com/pingcap/failpoint) was created for exactly this purpose. It is a Golang implementation of FreeBSD failpoints that lets you inject errors or abnormal behavior into code and trigger that behavior dynamically through environment variables or code. Failpoint can be used to simulate error handling in a variety of complex systems to improve their fault tolerance, correctness, and stability. For example:
-
A random delay or unavailability occurs in a microservice.
-
In a storage system, disk I/O latency increases, I/O throughput drops, and flushing data to disk takes a long time.
-
A hotspot occurs in the scheduling system, and a scheduling command fails.
-
In a recharge (top-up) system, a third party repeatedly calls the callback interface that reports a successful recharge.
-
In game development, simulating unstable player networks, dropped frames, excessive latency, and all kinds of abnormal input (requests from cheating plug-ins) to check whether the system still works correctly.
-
…
Why reinvent the wheel?
The etcd team developed gofail in 2016. It greatly simplified error injection and was a great contribution to the Golang ecosystem. We introduced gofail for error-injection testing in 2018, but we ran into problems with its functionality and convenience, so we decided to build a better "wheel."
How gofail is used
-
Use comments to inject a failpoint into a program:
// gofail: var FailIfImportedChunk int
// if merger, ok := scp.merger.(*ChunkCheckpointMerger); ok && merger.Checksum.SumKVS() >= uint64(FailIfImportedChunk) {
//     rc.checkpointsWg.Done()
//     rc.checkpointsWg.Wait()
//     panic("forcing failure due to FailIfImportedChunk")
// }
// goto RETURN1

// gofail: RETURN1:

// gofail: var FailIfStatusBecomes int
// if merger, ok := scp.merger.(*StatusCheckpointMerger); ok && merger.EngineID >= 0 && int(merger.Status) == FailIfStatusBecomes {
//     rc.checkpointsWg.Done()
//     rc.checkpointsWg.Wait()
//     panic("forcing failure due to FailIfStatusBecomes")
// }
// goto RETURN2

// gofail: RETURN2:
-
The code generated by gofail enable:
if vFailIfImportedChunk, __fpErr := __fp_FailIfImportedChunk.Acquire(); __fpErr == nil { defer __fp_FailIfImportedChunk.Release(); FailIfImportedChunk, __fpTypeOK := vFailIfImportedChunk.(int); if !__fpTypeOK { goto __badTypeFailIfImportedChunk}
    if merger, ok := scp.merger.(*ChunkCheckpointMerger); ok && merger.Checksum.SumKVS() >= uint64(FailIfImportedChunk) {
        rc.checkpointsWg.Done()
        rc.checkpointsWg.Wait()
        panic("forcing failure due to FailIfImportedChunk")
    }
    goto RETURN1; __badTypeFailIfImportedChunk: __fp_FailIfImportedChunk.BadType(vFailIfImportedChunk, "int"); };

/* gofail-label */ RETURN1:

if vFailIfStatusBecomes, __fpErr := __fp_FailIfStatusBecomes.Acquire(); __fpErr == nil { defer __fp_FailIfStatusBecomes.Release(); FailIfStatusBecomes, __fpTypeOK := vFailIfStatusBecomes.(int); if !__fpTypeOK { goto __badTypeFailIfStatusBecomes}
    if merger, ok := scp.merger.(*StatusCheckpointMerger); ok && merger.EngineID >= 0 && int(merger.Status) == FailIfStatusBecomes {
        rc.checkpointsWg.Done()
        rc.checkpointsWg.Wait()
        panic("forcing failure due to FailIfStatusBecomes")
    }
    goto RETURN2; __badTypeFailIfStatusBecomes: __fp_FailIfStatusBecomes.BadType(vFailIfStatusBecomes, "int"); };

/* gofail-label */ RETURN2:
Problems we encountered when using gofail
-
Injecting failpoints through comments is error-prone, and the compiler cannot detect the mistakes.
-
Failpoints can only take effect globally. Large projects run tests in parallel to shorten automated test time, and different parallel tasks then interfere with one another.
-
// goto RETURN2 and // gofail: RETURN2: must be separated by a blank line.
How should we design a failpoint?
What would the ideal failpoint implementation look like?
Ideally, a failpoint should be defined in code and be non-intrusive to business logic. In a language that supports macros (such as Rust), we can define a fail_point macro to declare failpoints:
fail_point!("transport_on_send_store", |sid| if let Some(sid) = sid {
    let sid: u64 = sid.parse().unwrap();
    if sid == store_id {
        self.raft_client.wl().addrs.remove(&store_id);
    }
})
But we ran into some problems:
-
Golang does not support macro language features.
-
Golang does not support compiler plug-ins.
-
Golang build tags do not provide an elegant implementation either (go build -tags="enable-failpoint-a").
Failpoint design criteria
-
Define failpoint using Golang code, not comments or other forms.
-
Failpoint code should not have any overhead:
-
It must not affect normal functional logic or intrude on functional code.
-
Injecting failpoint code must not cause a performance regression.
-
Failpoint code must not appear in the final release binary.
-
Failpoint code must be easy to read and write, and must be checked by the compiler.
-
The generated code must be readable.
-
The line numbers of functional logic code must not change in the generated code (for ease of debugging).
-
Parallel testing must be supported: context.Context can be used to control whether a specific failpoint is activated.
How does Golang implement a failpoint macro?
What is the essence of a macro? Going back to first principles, we found that a failpoint meeting the criteria above can be implemented in Golang through AST rewriting, as shown in the following figure:
For any Golang source file, we can parse the file into a syntax tree, traverse the whole tree to find all failpoint injection points, and then rewrite the tree to turn them into the desired logic.
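The "find all injection points" half of that process can be sketched with the standard go/parser and go/ast packages. This is only an illustration under stated assumptions, not the project's actual rewriter: the input source, the file name demo.go, and the function findInjectionPoints are made up for the example, and the rewriting step itself is left out.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// src is a hypothetical input file used only for this demonstration.
const src = `package demo

func readFile() error {
	failpoint.Inject("mock-io-error", func() error { return nil })
	failpoint.Inject("mock-delay", nil)
	return nil
}
`

// findInjectionPoints parses Go source and returns the name argument of every
// failpoint.Inject call it finds, in source order.
func findInjectionPoints(source string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "demo.go", source, 0)
	if err != nil {
		return nil, err
	}
	var names []string
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		// Match calls of the form failpoint.Inject(...).
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok || sel.Sel.Name != "Inject" {
			return true
		}
		pkg, ok := sel.X.(*ast.Ident)
		if !ok || pkg.Name != "failpoint" || len(call.Args) == 0 {
			return true
		}
		// The first argument is expected to be a string literal.
		if lit, ok := call.Args[0].(*ast.BasicLit); ok && lit.Kind == token.STRING {
			if name, err := strconv.Unquote(lit.Value); err == nil {
				names = append(names, name)
			}
		}
		return true
	})
	return names, nil
}

func main() {
	names, err := findInjectionPoints(src)
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```

A real rewriter would go one step further and replace each matched call with an IF statement before printing the tree back out.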
Relevant concepts
Failpoint
A failpoint is a code snippet that is executed only when the corresponding failpoint name is activated. Once it has been disabled with failpoint.Disable("failpoint-name-for-demo"), the corresponding failpoint is never triggered. Failpoint code snippets are not compiled into the final binary. For example, to simulate a file system permission error:
func saveTo(path string) error {
    failpoint.Inject("mock-permission-deny", func() error {
        // It's OK to access outer scope variable
        return fmt.Errorf("mock permission deny: %s", path)
    })
    return nil // added so the snippet compiles; the normal save logic would go here
}
Marker function
Marker functions flag the parts of the code that need to be rewritten during the AST rewriting phase. They have the following properties:
-
They tell the rewriter to rewrite the call as an equivalent IF statement.
-
The arguments of a marker function are the arguments needed in the rewrite process.
-
A marker function is an empty function, so the compiler inlines it and then eliminates it.
-
The failpoint injected through a marker function is a closure. If the closure accesses outer-scope variables, closure syntax lets it capture them without compile errors. The converted code is an IF statement, which can access the same outer-scope variables without any problem. Closure capture therefore only keeps the syntax legal and adds no overhead in the end.
-
They are simple and easy to read and write.
-
They bring in compiler checks: if the arguments of a marker function are wrong, the program does not compile, which guarantees the correctness of the converted code.
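The properties above can be seen in a small self-contained sketch. The Inject function here is a local stand-in written for this example, not the library's function, and its interface{} parameter is an assumption chosen so that closures of any shape are accepted. It shows that the marker form compiles, that the closure may capture an outer variable, and that nothing runs before rewriting.

```go
package main

import "fmt"

// Inject is a local stand-in for a marker function: its body is empty, so a
// call to it does nothing at run time. (Hypothetical; defined here only so the
// example is self-contained.)
func Inject(fpname string, fpblock interface{}) {}

// saveTo shows the marker form. The closure captures `path` from the outer
// scope, which is legal Go, but the closure is never invoked because Inject
// has an empty body.
func saveTo(path string) error {
	Inject("mock-permission-deny", func() error {
		return fmt.Errorf("mock permission deny: %s", path)
	})
	return nil
}

func main() {
	fmt.Println(saveTo("/tmp/demo")) // prints "<nil>"
}
```

Passing a wrong argument, for example an integer as the failpoint name, would be rejected by the compiler, which is the kind of check that comment-based injection cannot offer.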
List of Marker functions currently supported:
-
func Inject(fpname string, fpblock func(val Value)) {}
-
func InjectContext(ctx context.Context, fpname string, fpblock func(val Value)) {}
-
func Break(label ...string) {}
-
func Goto(label string) {}
-
func Continue(label ...string) {}
-
func Fallthrough() {}
-
func Return(results ...interface{}) {}
-
func Label(label string) {}
How do you inject failpoints in your application?
Call failpoint.Inject at the place where the fault should occur. The call is rewritten as an IF statement in which the name mock-io-error determines whether the failpoint fires; when it does, the logic in the failpoint closure is executed. For example, to inject an I/O error into a function that reads a file:
failpoint.Inject("mock-io-error", func(val failpoint.Value) error {
    return fmt.Errorf("mock error: %v", val.(string))
})
The final converted code looks like this:
if ok, val := failpoint.Eval(_curpkg_("mock-io-error")); ok {
    return fmt.Errorf("mock error: %v", val.(string))
}
Call failpoint.Enable("mock-io-error", "return(\"disk error\")") to activate the failpoint in the program. To assign a custom value to failpoint.Value, pass a failpoint expression such as return("disk error"). For more syntax, refer to the failpoint syntax.
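To make the activate-then-evaluate flow concrete, here is a minimal model built only on the standard library. The names registry, enable, disable, and eval are hypothetical stand-ins and do not reflect how the library is implemented internally; the value is passed directly instead of being parsed from a failpoint expression.

```go
package main

import (
	"fmt"
	"sync"
)

// registry maps an activated failpoint name to its value. It is a hypothetical
// stand-in for whatever the real library keeps internally.
var (
	mu       sync.RWMutex
	registry = map[string]interface{}{}
)

// enable activates a failpoint and attaches a value to it.
func enable(name string, val interface{}) {
	mu.Lock()
	defer mu.Unlock()
	registry[name] = val
}

// disable deactivates a failpoint.
func disable(name string) {
	mu.Lock()
	defer mu.Unlock()
	delete(registry, name)
}

// eval reports whether the failpoint is active and returns its value.
func eval(name string) (bool, interface{}) {
	mu.RLock()
	defer mu.RUnlock()
	val, ok := registry[name]
	return ok, val
}

// readFile has the shape of the converted code: a plain IF statement guards
// the injected logic.
func readFile() error {
	if ok, val := eval("mock-io-error"); ok {
		return fmt.Errorf("mock error: %v", val.(string))
	}
	return nil // the normal read path would go here
}

func main() {
	fmt.Println(readFile()) // failpoint inactive
	enable("mock-io-error", "disk error")
	fmt.Println(readFile()) // failpoint active
	disable("mock-io-error")
	fmt.Println(readFile()) // inactive again
}
```

When the failpoint is inactive, the only cost on the normal path in this model is one map lookup.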
The closure can be nil. For example, with failpoint.Enable("mock-delay", "sleep(1000)") the intent is to sleep for one second at the injection point without running any additional logic:
failpoint.Inject("mock-delay", nil)
failpoint.Inject("mock-delay", func() {})
This results in the following code:
failpoint.Eval(_curpkg_("mock-delay"))
failpoint.Eval(_curpkg_("mock-delay"))
If we only want to trigger a panic in the failpoint and do not need to receive failpoint.Value, we can ignore or omit the value in the closure's parameters. For example:
failpoint.Inject("mock-panic", func(_ failpoint.Value) error {
    panic("mock panic")
})
// OR
failpoint.Inject("mock-panic", func() error {
    panic("mock panic")
})
The best practice is as follows:
failpoint.Enable("mock-panic", "panic")
failpoint.Inject("mock-panic", nil)
// GENERATED CODE
failpoint.Eval(_curpkg_("mock-panic"))
To prevent interference between different test tasks in parallel testing, we can put a callback function into context.Context to finely control when a failpoint is activated or deactivated:
failpoint.InjectContext(ctx, "failpoint-name", func(val failpoint.Value) {
    fmt.Println("unit-test", val)
})
Converted code:
if ok, val := failpoint.EvalContext(ctx, _curpkg_("failpoint-name")); ok {
    fmt.Println("unit-test", val)
}
An example using failpoint.WithHook:
func (s *dmlSuite) TestCRUDParallel() {
    sctx := failpoint.WithHook(context.Background(), func(ctx context.Context, fpname string) bool {
        return ctx.Value(fpname) != nil // Determine by ctx key
    })
    insertFailpoints := map[string]struct{}{
        "insert-record-fp": {},
        "insert-index-fp":  {},
        "on-duplicate-fp":  {},
    }
    ictx := failpoint.WithHook(context.Background(), func(ctx context.Context, fpname string) bool {
        _, found := insertFailpoints[fpname] // Only enables some failpoints.
        return found
    })
    deleteFailpoints := map[string]struct{}{
        "tikv-is-busy-fp":   {},
        "fetch-tso-timeout": {},
    }
    dctx := failpoint.WithHook(context.Background(), func(ctx context.Context, fpname string) bool {
        _, found := deleteFailpoints[fpname] // Only disables failpoints.
        return !found
    })
    // other DML parallel test cases.
    s.RunParallel(buildSelectTests(sctx))
    s.RunParallel(buildInsertTests(ictx))
    s.RunParallel(buildDeleteTests(dctx))
}
If we use a failpoint inside a loop, we may need other marker functions:
failpoint.Label("outer")
for i := 0; i < 100; i++ {
inner:
    for j := 0; j < 1000; j++ {
        switch rand.Intn(j) + i {
        case j / 5:
            failpoint.Break()
        case j / 7:
            failpoint.Continue("outer")
        case j / 9:
            failpoint.Fallthrough()
        case j / 10:
            failpoint.Goto("outer")
        default:
            failpoint.Inject("failpoint-name", func(val failpoint.Value) {
                fmt.Println("unit-test", val.(int))
                if val == j/11 {
                    failpoint.Break("inner")
                } else {
                    failpoint.Goto("outer")
                }
            })
        }
    }
}
The above code will eventually be rewritten as follows:
outer:
for i := 0; i < 100; i++ {
inner:
    for j := 0; j < 1000; j++ {
        switch rand.Intn(j) + i {
        case j / 5:
            break
        case j / 7:
            continue outer
        case j / 9:
            fallthrough
        case j / 10:
            goto outer
        default:
            if ok, val := failpoint.Eval(_curpkg_("failpoint-name")); ok {
                fmt.Println("unit-test", val.(int))
                if val == j/11 {
                    break inner
                } else {
                    goto outer
                }
            }
        }
    }
}
Why do we need marker functions for label, break, continue, and fallthrough? Why not just use the keywords?
-
Golang fails to compile if a variable or label is unused.
label1: // compiler error: unused label1
    failpoint.Inject("failpoint-name", func(val failpoint.Value) {
        if val.(int) == 1000 {
            goto label1 // illegal to use goto here
        }
        fmt.Println("unit-test", val)
    })
-
break and continue can only be used in the context of a loop; they are not legal inside a closure.
Some complex injection examples
Example 1: Inject failpoints into the INITIAL and CONDITIONAL parts of an IF statement
if a, b := func() {
    failpoint.Inject("failpoint-name", func(val failpoint.Value) {
        fmt.Println("unit-test", val)
    })
}, func() int { return rand.Intn(200) }(); b > func() int {
    failpoint.Inject("failpoint-name", func(val failpoint.Value) int {
        return val.(int)
    })
    return rand.Intn(3000)
}() && b < func() int {
    failpoint.Inject("failpoint-name-2", func(val failpoint.Value) int {
        return rand.Intn(val.(int))
    })
    return rand.Intn(6000)
}() {
    a()
    failpoint.Inject("failpoint-name-3", func(val failpoint.Value) {
        fmt.Println("unit-test", val)
    })
}
The above code will eventually be rewritten as:
if a, b := func() {
    if ok, val := failpoint.Eval(_curpkg_("failpoint-name")); ok {
        fmt.Println("unit-test", val)
    }
}, func() int { return rand.Intn(200) }(); b > func() int {
    if ok, val := failpoint.Eval(_curpkg_("failpoint-name")); ok {
        return val.(int)
    }
    return rand.Intn(3000)
}() && b < func() int {
    if ok, val := failpoint.Eval(_curpkg_("failpoint-name-2")); ok {
        return rand.Intn(val.(int))
    }
    return rand.Intn(6000)
}() {
    a()
    if ok, val := failpoint.Eval(_curpkg_("failpoint-name-3")); ok {
        fmt.Println("unit-test", val)
    }
}
Example 2: Inject a failpoint into a CASE of a SELECT statement to dynamically control whether that CASE is blocked
func (s *StoreService) ExecuteStoreTask() {
    select {
    case <-func() chan *StoreTask {
        failpoint.Inject("priority-fp", func(_ failpoint.Value) chan *StoreTask {
            return make(chan *StoreTask)
        })
        return s.priorityHighCh
    }():
        fmt.Println("execute high priority task")
    case <-s.priorityNormalCh:
        fmt.Println("execute normal priority task")
    case <-s.priorityLowCh:
        fmt.Println("execute normal low task")
    }
}
The above code will eventually be rewritten as:
func (s *StoreService) ExecuteStoreTask() {
    select {
    case <-func() chan *StoreTask {
        if ok, _ := failpoint.Eval(_curpkg_("priority-fp")); ok {
            return make(chan *StoreTask)
        }
        return s.priorityHighCh
    }():
        fmt.Println("execute high priority task")
    case <-s.priorityNormalCh:
        fmt.Println("execute normal priority task")
    case <-s.priorityLowCh:
        fmt.Println("execute normal low task")
    }
}
Example 3: Dynamically injecting a SWITCH CASE
switch opType := operator.Type(); {
case opType == "balance-leader":
    fmt.Println("create balance leader steps")
case opType == "balance-region":
    fmt.Println("create balance region steps")
case opType == "scatter-region":
    fmt.Println("create scatter region steps")
case func() bool {
    failpoint.Inject("dynamic-op-type", func(val failpoint.Value) bool {
        return strings.Contains(val.(string), opType)
    })
    return false
}():
    fmt.Println("do something")
default:
    panic("unsupported operator type")
}
The above code will eventually be rewritten as follows:
switch opType := operator.Type(); {
case opType == "balance-leader":
    fmt.Println("create balance leader steps")
case opType == "balance-region":
    fmt.Println("create balance region steps")
case opType == "scatter-region":
    fmt.Println("create scatter region steps")
case func() bool {
    if ok, val := failpoint.Eval(_curpkg_("dynamic-op-type")); ok {
        return strings.Contains(val.(string), opType)
    }
    return false
}():
    fmt.Println("do something")
default:
    panic("unsupported operator type")
}
In addition to the examples above, more complex cases can be written:
-
The INITIAL statement, CONDITIONAL expression, and POST statement of a loop
-
The FOR RANGE statement
-
The INITIAL statement of a SWITCH
-
Slice construction and indexing
-
Dynamic initialization of structs
-
…
In fact, a failpoint can be injected anywhere a function can be called, so use your imagination.
Failpoint naming best practices
The generated code above automatically wraps the failpoint name in a _curpkg_ call. Because failpoint names are global, the package name is included in the final name to avoid naming conflicts. _curpkg_ works like a macro that automatically expands to the package name at run time. You do not need to implement _curpkg_ in your application: it is generated and added automatically when you run failpoint-ctl enable, and deleted when you run failpoint-ctl disable.
package ddl // ddl's parent package is `github.com/pingcap/tidb`

func demo() {
    // _curpkg_("the-original-failpoint-name") will be expanded as `github.com/pingcap/tidb/ddl/the-original-failpoint-name`
    if ok, val := failpoint.Eval(_curpkg_("the-original-failpoint-name")); ok {...}
}
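The effect of the package prefix can be sketched in a few lines. The helper newCurpkg and the failpoint name "timeout" are hypothetical and only model the expansion described above; the generated _curpkg_ may well be implemented differently.

```go
package main

import "fmt"

// newCurpkg returns a function that prefixes a failpoint name with the given
// package path, modelling what the generated _curpkg_ helper does for one
// package. (Hypothetical helper, written for illustration.)
func newCurpkg(pkgPath string) func(name string) string {
	return func(name string) string {
		return pkgPath + "/" + name
	}
}

func main() {
	ddl := newCurpkg("github.com/pingcap/tidb/ddl")
	planner := newCurpkg("github.com/pingcap/tidb/planner/core")

	// The same short name in two packages yields two distinct global names.
	fmt.Println(ddl("timeout"))
	fmt.Println(planner("timeout"))
}
```

Names therefore only need to be unique within a single package.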
Since all failpoints under the same package are in the same namespace, careful naming is required to avoid naming conflicts. Here are some recommended rules to improve this situation:
-
Ensure that the name is unique within the package.
-
Use a self-explanatory name.
Failpoints can be activated with an environment variable:
GO_FAILPOINTS="github.com/pingcap/tidb/ddl/renameTableErr=return(100);github.com/pingcap/tidb/planner/core/illegalPushDown=return(true);github.com/pingcap/pd/server/schedulers/balanceLeaderFailed=return(true)"
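The structure of that value is a semicolon-separated list of name=expression pairs. The sketch below shows one way such a string could be split using only the standard library; parseFailpoints is a hypothetical helper and not the library's parser, and it assumes that no expression contains a semicolon.

```go
package main

import (
	"fmt"
	"strings"
)

// parseFailpoints splits a GO_FAILPOINTS-style string into a map from
// failpoint name to failpoint expression. It is a naive sketch: it assumes
// that ";" never appears inside an expression.
func parseFailpoints(env string) (map[string]string, error) {
	result := make(map[string]string)
	for _, entry := range strings.Split(env, ";") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		// Split on the first "=" only, so the expression may contain "=".
		parts := strings.SplitN(entry, "=", 2)
		if len(parts) != 2 || parts[0] == "" {
			return nil, fmt.Errorf("malformed failpoint entry: %q", entry)
		}
		result[parts[0]] = parts[1]
	}
	return result, nil
}

func main() {
	env := "github.com/pingcap/tidb/ddl/renameTableErr=return(100);" +
		"github.com/pingcap/tidb/planner/core/illegalPushDown=return(true)"
	fps, err := parseFailpoints(env)
	if err != nil {
		panic(err)
	}
	for name, expr := range fps {
		fmt.Printf("%s -> %s\n", name, expr)
	}
}
```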
Acknowledgments
-
Thanks to gofail for providing the initial implementation and the inspiration, which let us iterate on failpoint while standing on the shoulders of giants.
-
Thanks to FreeBSD for defining the syntax specification.
Finally, we welcome you to discuss the Failpoint project with us and improve it together.