Brief Intro Into Random/Stochastic/Probabilistic/Simulation/Property-Based Testing

June 28, 2013

John Hughes and Stuart Halloway had very interesting talks about random testing at NDC Oslo, a topic I have been ignorant of but want to explore more now. Contrary to the typical example-based unit tests where the developer specifies inputs, interactions, and specific validations, random testing generates random input data and/or sequences of interactions and the verification is based on more general checks. Random testing can check many more cases than a developer would ever write and cases that a human would never think of. It can thus discover defects impossible to find by the traditional testing, as has been demonstrated f.ex. on Riak.

Random testing typically starts by creating (likely a very simplified) model of the system under test. The model is then used to generate the random data inputs and/or sequences of actions (method calls). Then the tests are executed and their input and output data captured. Finally the results are validated, either against predefined "system properties," i.e. invariants that should always hold true, or manually by the developer.

Related/also known as: generative testing, property-based testing (a paper).

Disclaimer: My knowledge of random testing and the tools described below is still quite shallow so there are likely to be inaccuracies, if not mistakes. Corrections are highly welcomed.

Case 1: Stochastic Testing With QuickCheck

(Video: John Hughes: Race Conditions, Distribution, Interactions--Testing the Hard Stuff and Staying Sane. Also Certifying your car with Erlang.)

John Hughes, one of the key people behind Haskell, created the famous Haskell testing library QuickCheck in 1999. He has later founded a commercial company QuviQ that uses an Erlang version of QuickCheck to test critical systems, such as those in cars, written in Erlang, C, or other supported languages.

QuickCheck generates random sequences of calls with random input data.

If we wanted to test an implementation of a stack, we would:

Define a model describing the available methods and when it is legal to call them (e.g. we cannot pop an empty stack)
Define system properties/postconditions that should always be true, f.ex. that the length of the stack after pop is one less than before
Run QuickCheck - it would generate in a controlled random manner and execute many test sequences, likely using some smart logic
In the case of a failure, QuickCheck would try to find the smallest subset of the input leading to the failure (shrinking); it would also save the state to make it possible to rerun the same failed test again

According to Hughes, already surprisingly simple models can discover a number of defects. For example in the case of the key-value store Riak, which is nowadays perhaps one of the best tested databases thanks to QuviQ, a failing corner case was found using only a single key and few concurrent clients.

Randomly generated operation sequences can discover corner cases that a human would never think of, as demonstrated on the slide below, where the sequence "open file - close file - open file - lookup - insert - insert" seems to be OK but renders the file invalid, throwing premature eof upon next access.

QuivQ has successfully used QuickCheck f.ex. to check embedded software in Volvo cars, to track notorious race conditions in Erlang's built-in database (the dets slide above), to improve Riak and in many other cases.

QuickCheck has been ported to many other languages - there is f.ex. ClojureCheck, ScalaCheck, Quickcheck for Java. See also introduction into Haskell QuickCheck2.

Note: The Certifying your car with Erlang introduces QuickCheck a little more and than describes its application to testing car software stacks and components from different vendors and their conformance to a standard (found 200+ bugs in the implementations and 100+ ambiguities in the standard and thus different implementations, with relatively little code). Also introduces mocking (to test components in isolation) and testing of clusters of components.

Case 2: Simulation/Probabilistic Testing With Simulant

(Video: Stuart Halloway: Simulation Testing.)

Simulant is a Clojure library for simulation testing based on a probabilistic model, developed by Stuart Halloway. All inputs, outputs, runtime and other information is stored in a database and it is thus possible to compare runs in different times and even to perform new validations on old data.

Using Simulant involves the following steps, demonstrated on the example of a simple system of traders:

You construct a model and corresponding test descriptions that describes what agents there are in the system, what actions they can perform, and the type, frequency, and intensity of those actions in probabilistic terms. In the example, there is only one type of agent, a typical trader, that has start amount $1000, mean time between trades 1 hour, and mean trade amount $100. We want to have 1000 of them.
Next you run the simulation, on one or multiple machines, likely using the ability to "speed up time." Simulant will simulate the agents, recording everything into Datomic.
Next you can manually explore the end state of the system and the data collected during and about the simulation and/or run automated checks. F.ex. we expect that there is still $100k in the system, that no trader has negative balance, that there has been no failed trades, and that the number of trades is approximately 400 +- 10%.

In a typical system you would have more than one agents (i.e. different user profiles - frequent trader, small trader, big trader, ...) and more actions and more randomness (different start balances, trades could depend on a probability function changing in time etc.)

The fact that everything is recorded in a DB makes it possible to for example go later back and compare execution performance of the trade function over time.

Other Tools

Randoop is a java tool that "randomly, but smartly, generates sequences of methods and constructor invocations for the classes under test, and uses the sequences to create [JUnit] tests." It "uses feedback obtained from executing the sequence as it is being constructed, in order to guide the search toward sequences that yield new and legal object states. Inputs that create redundant or illegal states are never extended; this has the effect of pruning the search space." The paper Feedback-directed random test generation (2007) describes both the principles behind Randoop and also results from using it and its .NET sibling for testing of 14 widely-deployed, well-tested Java and .NET libraries totaling 780KLOC and a comparison to other types of random testing, f.ex. JPF.

NASA's Java Path Finder (JPF) - "JPF is a highly customizable execution environment for verification of Java™ bytecode programs." Essentially, as far as I understand it, JPF provides a special JVM running on top of normal JVM, that collects additional information about the executing program and modifies its execution, for example by executing all - instead of just one - branches of the code, to make it possible to check them all. It is very flexible and supports different execution modes and heuristics for what to skip. Read What is JPF? and Testing vs. Model Checking. (Model checking, contrary to testing, is a "method that exhaustively explores all possible SUT behaviors" - at least in theory. In practice the state space is too large and must be pruned, leading to blanding of the two.)

Conclusion

Random testing is a very powerful and irreplacable technique and we will hopefully see its increased application. There are many approches targetting different situation and trying to deal with the problem of an effective selection of inputs from an impractically large input space. It is worth starting to explore them now.

Resources

D. Hamlet. Random testing. In Encyclopedia of Software Engineering. John Wiley and Sons, 1994 - the seminal paper about random testing, as it seems

Tags: clojure