Original address: impracticalprogrammer.com/TestLikeHar…

Original author: impracticalprogrammer.com/

Release date: August 2021

Test like a hardware engineer

I'm a software engineer who spends most of my time working with hardware engineers of one kind or another, especially CPU designers and their validation teams. Hardware and software engineering are superficially very similar: architect a design, implement it in a coding language, write tests, run everything through some tools, turn the code into something useful, and ship it. I don't usually envy my computer-engineering friends. Release timelines are inherently long, expertise tends to be narrow, waterfall processes reign, and the tools have a reputation for being substandard. But I do envy testing in the hardware world. To build chips they can be confident in, hardware companies like Intel and ARM combine very ordinary techniques with esoteric but powerful ones. Software projects rarely use these testing methods, but the underlying principles can make your next test plan far more comprehensive.

Generate the correct data

For tests to be meaningful, there must be some definition of correctness. Software projects typically use a combination of hand-coded expected values for each input and internal assertions. Hardware companies use these too (especially assertions), but the expected results of a complex program running on a modern CPU are often impractical to work out by hand. Instead, a CPU company might have an emulator. Or, rather, a lot of emulators.

An emulator is typically a C/C++ behavioral model of the architecture being tested, and its quality and how closely it corresponds to the real device vary by company and by requirement. Many companies keep a whole range of simulators: for example, a cycle-accurate simulator that is ridiculously slow but lets you check that everything happens exactly when you expect it to. A faster emulator might just do dynamic binary translation and run the target architecture's code on x86; the exact timing won't match, but you get the expected data values at close to native speed, especially for loop-heavy tests where the translations can be reused.
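
To give a flavor of what an architectural model looks like, here is a minimal sketch in Python of a behavioral model for a made-up three-operand ISA. The instruction names and the four-register file are invented for illustration; the point is that the model defines only what each instruction should do to architectural state, with no pipelines or timing, and its register values become the expected results for the real device.

def run(program, registers=None):
    # Architectural state is just four 32-bit registers; no pipeline, no timing.
    regs = registers or [0] * 4
    for op, dst, a, b in program:
        if op == "addi":                      # add immediate: regs[dst] = regs[a] + b
            regs[dst] = (regs[a] + b) & 0xFFFFFFFF
        elif op == "add":                     # register add: regs[dst] = regs[a] + regs[b]
            regs[dst] = (regs[a] + regs[b]) & 0xFFFFFFFF
        elif op == "xor":
            regs[dst] = regs[a] ^ regs[b]
    return regs

# Expected register values for a tiny program; the real device must produce the same.
expected = run([("addi", 0, 0, 5), ("addi", 1, 1, 7), ("add", 2, 0, 1)])
print(expected)   # [5, 7, 12, 0]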

If a chip's documentation is structured well, the emulator doesn't even need to be written by hand: the specification itself can be executed to generate expected values. This is obviously not easy, especially where the specification includes undefined behavior, but it is a good way to arbitrate between emulators that disagree. It is also a good way to experiment with possible architectural changes, and it helps formally verify components with techniques we'll see later.

Most software projects can't maintain a second implementation just for comparison, but a reference implementation doesn't have to double the development cost. It can skip most performance and extensibility concerns, be written in a higher-level language, or use off-the-shelf components that would never be acceptable in production. In fact, because many refactorings preserve behavior while improving performance at the expense of clarity, the reference implementation may even be just the unoptimized, pre-refactoring version of the code.
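
Here is a minimal sketch of the idea in Python, with invented function names: compare a clever production routine against a slow but obviously correct reference on a pile of random inputs.

import random

def count_set_bits_reference(x: int) -> int:
    # Obviously correct, makes no attempt to be fast.
    return bin(x).count("1")

def count_set_bits_optimized(x: int) -> int:
    # The "production" version: clears the lowest set bit each iteration.
    count = 0
    while x:
        x &= x - 1
        count += 1
    return count

for _ in range(100_000):
    value = random.getrandbits(64)
    assert count_set_bits_optimized(value) == count_set_bits_reference(value)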

Make it interesting

Now that we have a way to check whether the code did the right thing in a given situation, we need to put it into situations that reveal bugs. For a CPU this usually means "interesting" little programs; for a software project it can mean any sequence of API calls. CPU companies do hand-write plenty of unit and integration tests for specific features, but that process is essentially the same as writing white-box integration tests for software, so I'll take it for granted and focus on random and formal testing.

Random testing

Random test generators (often called fuzzers) understand the universe of possible inputs (for example, sequences of 32-bit instructions for a CPU, or JSON documents for many software systems) and draw test cases from that universe. The level of sophistication here varies enormously.

A simple, fast, and sometimes surprisingly effective generator might just read from /dev/random and throw the bytes at an API. That works fine if a random byte stream is a meaningful input, but most purely random inputs don't even pass basic input validation, so you end up testing your input validation very thoroughly and your core functionality not at all.
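
A sketch of that dumbest possible fuzzer in Python, here aimed at the standard library's JSON parser: almost every input is rejected by validation, which is exactly the weakness described above.

import json
import os

for _ in range(100_000):
    blob = os.urandom(64)                 # 64 purely random bytes
    try:
        json.loads(blob)
    except ValueError:
        pass                              # rejected by input validation: the common case
    # Any other exception type escaping here would be a bug worth investigating.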

Alternatively, a mutation-based generator takes a known-good input, randomly modifies it, and uses the modified input as a test case. The more drastically the input is modified, the more varied the resulting tests, but also the more likely they are to fail input validation. Mutation-based generators can be very effective, but they depend on a broad set of representative seed inputs to reach many parts of the system.
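
A minimal mutation-based generator might look like this sketch (the seed document is invented): start from a valid input, flip a few random bits, and keep only the mutants that still get past validation.

import json
import random

seed = json.dumps({"op": "add", "args": [1, 2, 3]}).encode()

def mutate(data: bytes, flips: int = 4) -> bytes:
    out = bytearray(data)
    for _ in range(flips):
        i = random.randrange(len(out))
        out[i] ^= 1 << random.randrange(8)   # flip one random bit
    return bytes(out)

for _ in range(100_000):
    candidate = mutate(seed)
    try:
        json.loads(candidate)
    except ValueError:
        continue                             # mutated too far: failed validation
    # A candidate that still parses can now be fed to the real system under test.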

To explore a wide range of internal states, the best approach is to build a fuzzer that knows what the input "should" look like and randomly generates mostly valid inputs. For example, a calculator's fuzzer can understand that the "+" operator goes between two expressions, that an open parenthesis requires a closing parenthesis, and so on. These requirements can be written down formally:

Number = {[0-9], [0-9]$Number} // A Number is a digit, optionally followed by more digits
Binary_Operator = {-, +, *, /, ^} // e.g. 4 ^ 3
Unary_Operator = {-, sin, cos, tan} // Allow negative numbers, trig functions
Statement = {$Number, "(" $Statement ")", $Statement $Binary_Operator $Statement, $Unary_Operator $Statement}

A fuzzer can read these definitions and generate a random statement (a test case) by selecting one of the four kinds of Statement and repeatedly replacing each $variable with a generated instance of that type. For instance:

Test case = $Statement
Test case = $Statement $Binary_Operator $Statement
Test case = ( $Statement ) ^ $Number
Test case = ( $Number ) ^ 4$Number
Test case = ( 8 ) ^ 42$Number
Test case = ( 8 ) ^ 428
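
A toy grammar-driven generator for exactly this calculator grammar might look like the following Python sketch (the depth limit and digit counts are arbitrary choices to keep the output finite):

import random

def number():
    # Number: one digit, optionally followed by more digits.
    return "".join(random.choice("0123456789") for _ in range(random.randint(1, 3)))

def statement(depth=0):
    if depth > 4:                                    # bound the recursion
        return number()
    kind = random.randrange(4)
    if kind == 0:                                    # $Number
        return number()
    if kind == 1:                                    # ( $Statement )
        return "( " + statement(depth + 1) + " )"
    if kind == 2:                                    # $Statement $Binary_Operator $Statement
        op = random.choice(["-", "+", "*", "/", "^"])
        return statement(depth + 1) + " " + op + " " + statement(depth + 1)
    op = random.choice(["-", "sin", "cos", "tan"])   # $Unary_Operator $Statement
    return op + " " + statement(depth + 1)

print(statement())    # e.g. "( 8 ) ^ 428" or "- ( - 2 )"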

This kind of generator can produce very complex, interesting tests that explore areas a human wouldn't think to try (for example, "- ( - 2 )" is a valid test in this grammar). More interesting areas can be given more weight: if the tan operator is more error-prone than exponentiation, say, it deserves more testing. However, making the generator produce tests quickly can be tricky, and the most interesting test cases can be hard to express, especially when there are dependencies between different parts of the input.

So far we've talked about building one input (one test case) at a time. In a CPU, the most interesting bugs actually come from sequences of inputs. For example, a program might contain an instruction that overwrites another instruction later in the program flow (a form of self-modifying code). Then another instruction might cause an exception, temporarily flushing a lot of state. When the program reaches the overwritten instruction, does it see the original code or the modified code? Which one should it see? Because a CPU maintains so much state, interesting bugs tend to live in the interaction of several instructions, and may even depend on the exact timing between events.

To find sequence-dependent bugs, CPU validation teams use both static and dynamic generator tools. A static random sequence generator simply produces a large number of instructions using one of the methods above and hopes to stumble onto interesting sequences through sheer volume. A dynamic generator simulates the internal state of the device under test and chooses the next item in the sequence based on that state. For example, if the memory unit is already overloaded, you are more likely to find a bug if the next instruction flushes the cache line at the most recently used address than if it is completely random. Most sequence-based tools fall somewhere in between, for instance generating most instructions from static templates but carefully choosing loads, stores, and branches so that most of them go to valid addresses. Dynamic generation is loosely related to symbolic execution in the software world, although it generally relies less on inspecting the system under test and more on tester-written code that says what kind of input to generate given the current system state.
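
A toy dynamic generator, sketched in Python with an entirely invented instruction set and a crude single-counter model of the store buffer, just to show the shape of the idea: the fuller the modeled buffer, the more the generator biases toward instructions that stress it.

import random

class DynamicGenerator:
    def __init__(self, store_buffer_capacity=8):
        self.capacity = store_buffer_capacity
        self.pending_stores = 0            # crude model of store-buffer occupancy
        self.recent_addresses = []

    def next_instruction(self):
        pressure = self.pending_stores / self.capacity
        if self.recent_addresses and random.random() < pressure:
            # Buffer under pressure: flush a cache line we touched recently.
            self.pending_stores = 0                   # assume the flush drains it
            return ("cache_flush", random.choice(self.recent_addresses))
        kind = random.choice(["load", "store", "alu"])
        addr = random.randrange(0, 4096, 64)          # cache-line-aligned address
        if kind == "store":
            self.pending_stores = min(self.capacity, self.pending_stores + 1)
            self.recent_addresses.append(addr)
        elif kind == "load":
            self.recent_addresses.append(addr)
        return (kind, addr)

gen = DynamicGenerator()
program = [gen.next_instruction() for _ in range(20)]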

Formal testing

If you've generated a pile of random inputs and shaken out the bugs, the obvious next step is to try all possible inputs and make sure there can be no bugs at all. Taken literally, this is rarely feasible. Instead, automated "formal" analysis tools can exhaustively prove specific properties of a design, such as that a component can never deadlock. The most common formal tools in hardware prove assertions from assumptions: for each assertion, the tool examines the code and searches for any combination of inputs, consistent with the assumptions, that could violate it. Each assertion is either proven (and can then serve as an assumption for further proofs) or the tool produces a counterexample. The underlying technique is closely related to satisfiability solving, which is NP-complete but finishes in reasonable time for many practical problems. In practice, the hardest part of formal verification is writing enough assumptions and assertions that the tool makes progress in reasonable time while still genuinely covering all feasible inputs and all of the system's correctness properties. Formal tools are used on selected hardware subsystems, but rarely on software, with a few notable exceptions.
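
You can get a small taste of this style of proof in software with an off-the-shelf SMT solver. The sketch below assumes the z3-solver Python bindings and proves that a classic branch-free absolute-value trick matches the obvious definition for every 32-bit input (in wraparound two's-complement arithmetic) by asking the solver for a counterexample and getting none.

from z3 import BitVec, If, Solver, unsat

x = BitVec("x", 32)
mask = x >> 31                       # arithmetic shift: all ones if x is negative, else zero
branchless_abs = (x + mask) ^ mask   # the branch-free absolute value trick
reference_abs = If(x < 0, -x, x)     # the obvious definition (signed comparison)

s = Solver()
s.add(branchless_abs != reference_abs)   # search for any input where the two differ
if s.check() == unsat:
    print("proved: the trick matches the reference for all 2**32 inputs")
else:
    print("counterexample:", s.model())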

Checking the tests

You've spent a lot of effort on these tests, but how do you know whether the tests themselves are any good?

Checking coverage

In both hardware and software, the most common measure of test quality is coverage of some description: how much of what could happen actually happened? For example, how many lines of code were executed, or how many of the if and else branches of each if() statement were taken (as an exercise, why is 100% line coverage not the same as 100% branch coverage?).

Coverage in the hardware world usually consists of hand-coded "cover points," each of which declares the possible states of some property. For example, there might be a cover point for experiencing each possible exception type, a cover point for running in each possible privilege ring, a cover point for a buffer being full, and so on. CPU test plans typically also require crosses, meaning that every value one cover point might take should be observed together with every possible value of another cover point, with logically impossible combinations exempted by hand (for example, certain privilege exceptions cannot occur if the CPU is running at the most privileged level). The tendency to cross cover points leads to an exponential explosion in the total amount of coverage. To keep coverage a useful metric, a CPU company may maintain teams of engineers whose entire job is to write coverage requirements. Writing good coverage requirements is key to test quality, because the coverage team needs to demand crosses for the interactions that might actually reveal bugs, without demanding so many crosses that the test plan becomes completely unworkable.
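
A rough Python sketch of a cover point cross, with made-up cover points and an assumed exemption, shows the bookkeeping involved: every combination must be observed except the ones explicitly exempted.

import itertools

EXCEPTION_TYPES = ["none", "page_fault", "illegal_instruction", "privilege_violation"]
PRIVILEGE_LEVELS = ["user", "supervisor"]
# Assumed to be architecturally impossible, so exempted from the cross.
EXEMPT = {("privilege_violation", "supervisor")}

seen = set()

def record(exception_type, privilege_level):
    # Called from the test bench every time a test observes this pair of states.
    seen.add((exception_type, privilege_level))

def report():
    required = set(itertools.product(EXCEPTION_TYPES, PRIVILEGE_LEVELS)) - EXEMPT
    missing = required - seen
    print(f"cross coverage: {len(required) - len(missing)}/{len(required)} bins hit")
    for bin_ in sorted(missing):
        print("  missing:", bin_)

record("page_fault", "user")
record("none", "supervisor")
report()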

Even when coverage is chosen carefully, it can be hard to write directed or random tests that hit certain coverage crosses. For example, a cross might require a memory access fault while the store buffer is full. Filling a very large buffer is inherently tricky in the first place, and sometimes requires carefully tuned code that maximizes pressure on one part of the CPU while throttling another. Piling additional requirements on top can make a coverage cross nearly impossible to hit, and it is also hard to prove that it is impossible (which would justify exempting that cross).

To reach these hard-to-hit situations, a handy trick is to test a different system that is closely related to the one you really care about. For example, make the buffer size a template parameter and build a version of the system whose buffer is a tenth the size of the production one. Test that modified version; if it shows no errors, you can at least have some confidence that the real system has no bugs related to full buffers.
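
In software the same trick can be as simple as a constructor parameter. A sketch with an invented bounded queue: production code uses the default capacity, while the test builds a tiny version so the buffer-full path is trivial to reach.

class BoundedQueue:
    def __init__(self, capacity=4096):         # production default
        self.capacity = capacity
        self.items = []

    def push(self, item):
        if len(self.items) >= self.capacity:
            raise OverflowError("queue full")   # the hard-to-reach path
        self.items.append(item)

def test_full_queue_behaviour():
    q = BoundedQueue(capacity=8)                # "1/10 size" test build: full after 8 pushes
    for i in range(8):
        q.push(i)
    try:
        q.push("one too many")
        assert False, "expected OverflowError"
    except OverflowError:
        pass

test_full_queue_behaviour()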

Checking bugs

Rather than measuring how much of the design gets exercised, you can assess the completeness of your tests by which bugs you find and how you find them. The simplest heuristic is to stop testing when continued testing stops turning up bugs. Early in a project you find lots of bugs, and you can't make much further testing progress until they are fixed. Later on, each hand-written or randomly generated test becomes less and less likely to find anything. At that point you can declare that whatever bugs remain are unlikely to matter much, and start shipping the product.

If staring at bug curves and deciding they look flat isn't as data-driven as you'd like, you can instead estimate what fraction of bugs your tests would find in a synthetic setting and guess that they find the same fraction in your real system. In mutation testing, for example, you deliberately introduce a defect into the design and run the test suite; if the tests catch the defect, the mutant is "killed." The rate at which mutants are killed approximates the rate at which the test suite catches real bugs. The biggest problem with mutation testing is realism: naive mutations can introduce errors so blatant that they would never have made it into the codebase in the first place, so mutation testing tends to overestimate the effectiveness of the test suite.
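
A deliberately crude mutation-testing sketch in Python; the target file, test command, and operator swaps are all hypothetical, and real tools mutate the syntax tree rather than raw text.

import pathlib
import subprocess

TARGET = pathlib.Path("calculator.py")                 # hypothetical module under test
TEST_CMD = ["python", "-m", "pytest", "-q", "tests/"]  # hypothetical test suite
MUTATIONS = [("<=", "<"), (">=", ">"), ("+", "-"), ("*", "+")]

original = TARGET.read_text()
killed = total = 0
for old, new in MUTATIONS:
    if old not in original:
        continue
    total += 1
    TARGET.write_text(original.replace(old, new, 1))   # introduce exactly one defect
    result = subprocess.run(TEST_CMD, capture_output=True)
    if result.returncode != 0:                          # the suite caught the mutant
        killed += 1
    TARGET.write_text(original)                         # restore the original code
print(f"mutation score: {killed}/{total}")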

The idea behind mutation testing is sound, but it needs a more realistic set of defects. Fortunately, every project has an entire team constantly introducing realistic bugs into the code: the engineers who wrote it in the first place. Instead of generating new defects automatically, reintroduce old bugs identified from version-control commit messages, or bugs from related projects, and check that this project's tests catch all of them. There is obviously some sample bias here: you will never learn that your tests miss a class of bug you have never found.

An interesting, if probably impractical, idea for estimating test completeness is to infer the total number of bugs from the overlap between two independent test suites. If two different test suites find only the same bugs, it may be because those are all the bugs there are. If they find completely different bugs, it may be because there are a great many bugs and each suite finds only a fraction of them. In the capture-mark-recapture approach, you run both test suites, record how many bugs each suite catches on its own and how many are caught by both, and assume the suites are independent. The estimated total number of bugs is:

estimated total bugs = (bugs caught by suite A * bugs caught by suite B) / bugs caught by both suites
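
For example, with made-up numbers: if suite A catches 40 bugs, suite B catches 30, and 20 bugs are caught by both, the estimate is 40 * 30 / 20 = 60 bugs in total. Since the two suites together found 40 + 30 - 20 = 50 distinct bugs, that suggests roughly 10 bugs that neither suite has caught yet.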

Like mutation testing, this approach tends to overestimate the effectiveness of the test suites. Some bugs are simply easier to catch than others, and those bugs are over-represented in the "caught by both" count, which drags the estimate of the total downward.

Handling errors

Despite all these efforts, bugs still get found after the fact. Recalling and replacing a CPU is extremely expensive, so computer engineering companies plan for the possibility of errors. Usually the errors are minor: every company publishes errata documents listing small bugs that can generally be ignored outside of specialized workloads.

Sometimes CPU designers suspect in advance that an optimization might be wrong, so they add undocumented functionality to disable that optimization and fall back to another path. If a real problem turns up, these "chicken bits" can later be set, by the kernel for example, sacrificing a bit of performance for correctness. After Spectre, Intel released microcode updates exposing new registers that control the optimizations that had made the CPUs exploitable. Some software projects have a similar fallback process: think of setting a secret environment variable to work around a buggy code path.

Finally, sometimes the design of a CPU can be changed partway through manufacturing. After all the transistors have been built, metal layers are deposited on top of the wafer to wire the transistors together. Redesigning only these metal layers is much cheaper than redoing every layer in the process, so companies sometimes fix a bug with a "metal change." The closest software analogy is fixing a bug by hacking the linker script because recompiling the software is too expensive. I don't recommend taking inspiration from metal changes for your next bug fix.
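
The software equivalent of a chicken bit can be as small as this sketch, with invented names: an environment variable that swaps a suspect fast path for a boring, known-good one.

import os

def transform_reference(data):
    return sorted(data)                 # boring, obviously correct implementation

def transform_optimized(data):
    return sorted(data)                 # stand-in for the clever, faster implementation

def transform(data):
    # "Chicken bit": a hypothetical escape hatch that disables the risky fast path.
    if os.environ.get("MYAPP_DISABLE_FAST_TRANSFORM") == "1":
        return transform_reference(data)
    return transform_optimized(data)

print(transform([3, 1, 2]))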


