It turns out you can be told and don’t have to experience it for yourself. However it also turns out that in the intervening year and a bit since I wrote this code, I’d mostly forgotten how it worked, so I thought I would complete a long ago promise and document its behaviour as a way to help me remember.
First, some history
Around the beginning of 2014 I was working on a C project called intmap. It was more or less a C implementation of Chris Okasaki and Andy Gill, “Fast Mergeable Integer Maps”, the basis for Haskell’s intmap type, with an interesting reference counting scheme borrowed from jq internals. It also has some stuff where it takes advantage of lazy evaluation to optimize expressions (e.g. if you do a bunch of expensive operations and then intersect with an empty set you can throw away the expensive operations without ever computing them). I had some vague intention of trying to use this to give jq a better table and array implementation (it has immutable types, but internally they’re really copy on write), but I never actually got around to that and was mostly treating it as a fun hack project.
The internals of the project are quite fiddly in places. There are a bunch of low level optimisations where it takes advantage of the the reference counting semantics to go “Ah, I own this value, I can totally mutate it instead of returning a new one”, and even without that the core algorithms are a bit tricky to get right. So I needed lots of tests for correctness.
I wrote a bunch, but I wasn’t really very confident that I had written enough – both because I had coverage telling me I hadn’t written enough and because even if I had 100% coverage I didn’t really believe I was testing all the possible interactions.
So I turned to my old friend, randomized testing. Only it wasn’t just enough to generate random data to feed to simple properties, I wanted to generate random programs. I knew this was possible because I’d had quite good luck with Hypothesis and Scalacheck’s stateful testing, but this wasn’t a stateful system, so what to do?
Well the answer was simple: Turn it into a stateful system.
Thus was born the baby version of testmachine. Its core was a stack of intmaps mapping to string values. It generated random stack operations: Pushing singleton maps onto the stack, performing binary operations on the top two elements of the stack, rotations, deletions, copies, etc. It also remembered what keys and values it had previously used and sometimes reused them – testing deletion isn’t very interesting if the key you’re trying to delete is in the map with probability ~0.
It then had a model version of what should be on the stack written in python and generated assertions by comparing the two. So it would generate a random stack program, run the stack program, and if the results differed between the model and the reality, it would minimize the stack into a smaller program and spit out a corresponding C test case. It also did some shenanigans with forking first so it could recover from segfaults and assertion failures and minimize things that produced those too.
The initial versions of the C test cases explicitly ran the stack machine, but this was ugly and hard to reason about, so I wrote a small compiler that turned the stack machine into SSA (this is much easier than it sounds because there are no non-trivial control structures in the language, so a simple algorithm where you just maintain a single stack of variable labels in parallel was entirely sufficient for doing so). It then spat out the C program corresponding to this SSA representation with no explicit stack. You can see some of the generated tests here.
At this point I looked at what I had wrought for testing intmap and concluded this was much more exciting than intmap itself and started to figure out how to generalise it. And from that was born testmachine.
I experimented with it for a while, and it was looking really exciting, but then uh 2014 proper happened and all my side projects got dropped on the floor.
Then end of 2014 happened and it looked like Hypothesis was where the smart money was, so I picked it up again and left testmachine mostly abandoned.
And finally I realised that actually what Hypothesis really needed for some of the things I wanted to do was the ideas from testmachine. So soon it will live again.
What’s the relation between testmachine and Hypothesis?
At the time of this writing, there is no relation between testmachine and Hypothesis except that they are both randomized testing libraries written in Python by me.
Originally I was considering testmachine as the beginnings of a successor to Hypothesis, but I about faced on that and now I’m considering it a prototype that will inspire some new features in Hypothesis. As per my post on the new plans for Hypothesis 1.0, the concepts presented in this post should soon (it might take ~2 weeks) be making it into a version of Hypothesis near you. The result should be strictly better than either existing Hypothesis or testmachine – a lot of new things become possible that weren’t previously possible in Hypothesis, a lot of nice things like value simplification, parametrized data generation and an example database that weren’t present in testmachine become available.
So what is testmachine?
Testmachine is a form of randomized testing which rather than generating data generates whole programs, by combining sequences of operations until it finds one that fails. It doesn’t really distinguish assertions from data transforming operations – any operation can fail, and any failing operation triggers a minimizing example.
To cycle back to our intmap example: We have values which are essentially a pair (intmap, mirror_intmap), where mirror_intmap is a python dictionary of ints to strings, and intmap is our real intmap type. We can define a union operation which performs a union on each and fails if the resulting mirror does correspond to the resulting map.
There are approximately three types of testmachine operation:
- Internal data shuffling operations
- Directly generate some values
- Take 1 or more previously generated values and either fail or generate 0 or more new values
(You can unify the latter two but it turns out to be slightly more convenient not to)
TestMachine generates a valid sequence of operations of a fixed size (200 operations by default). It then runs it. If it fails it finds a minimalish (it can’t be shrunk by deleting fewer than two instructions and still get a failing test case) subsequence of the program that also fails.
The internal representation for this is a collection of named stacks, usually corresponding to types of variables (e.g. you might have a stack for strings, a stack for ints, and a stack for intmaps). An operation may push, pop, or read data from the stacks, or it may just shuffle them about a bit. Testmachine operations are then just instructions on this multi-stack machine.
Simple, right?
Well… there are a few complications.
The first is that they’re not actually stacks. They’re what I imaginatively named varstacks. A varstack is just a stack of values paired with integer labels. Whenever a variable is pushed onto the stack it gets a fresh label. When you perform a dup operation on the stack (pushing the head of the stack onto it a second time) or similar the variable label comes along with it for free.
This is important both for the later compilation stage and it also adds an additional useful operation: Because we can track which values on the stack are “really the same” we can invalidate them all at once. This allows us to implement things like “delete” operations which mark all instances of a value as subsequently invalid. In the intmap case this is an actual delete and free the memory operation (ok, it’s actually a “decrement the reference count and if it hits zero delete and free some memory”, but you’re supposed to pretend it’s a real delete). It could also be e.g. closing a collection, or deleting some object from the database, or similar.
The variable structure of the varstack also lets us define one of the most important invariants that lets testmachine do its work: An operation must have a consistent stack effect that depends only on the variable structure of the stacks. Further an operation’s validity must depend only on the variable structure of the stacks (whether or not it fails obviously depends on the values on the stack. The point is that it can’t have been invalid to execute that operation in the first place depending on the available data).
This is a bit restrictive, and definitely eliminates some operations you might want to perform – for example “push all the values from this list onto that stack” – but in practice it seems to work well for the type of things that you actually want to test.
The reason it’s important is that it allows you to decouple program generation from program execution. This is desirable because it lets you do things like forking before executing the program to protect against native code, but it also just allows a much cleaner semantics for the behaviour of the library.
In fact the actual invariant that testmachine maintains is even more restrictive. All testmachine operations have one of the following stack effects:
- They operate on a single stack and do not change the set of available variables on that stack (but may cause the number of times a variable appears on the stack to change)
- They perform a read argspec operation, followed by a series of push operations
What is a read argspec operation?
An argspec (argument specifier. I’m really not very good at names) is a tuple of pairs (stack name, bool), where the bool is a flag that indicates whether that value is consumed.
The interpretation of an argspec like e.g. ((intmaps, True), (ints, False), (strings, False)) is that it reads the top intmap, int and string from each of their stacks, and then invalidates all intmaps with that label. An argspec like ((intmaps, True), (intmaps, True)) reads and invalidates the top two intmaps. An operation like ((ints, False), (ints, False)) reads the to two ints and doesn’t invalidate them.
It’s a little under-specified what happens when you have something like ((intmaps, True), (intmaps, False)). Assume something sensible and consistent happens that if I’d continued with the project I’d totally have pinned down.
The reason for this super restrictive set of operations is a) It turned out to be enough for everything I cared about and b) It made the compilation process and thus the resulting program generation much cleaner. Every operation is either invisible in the compiled code or can be written as r1, …, rn = some_function(t1, …, tn).
So we have the testmachine (the collection of varstacks), and we have a set of permitted operations on it. How do we go from that to a program?
And the answer is that we then introduce the final concept: That of a testmachine language.
A language is simply any function that takes a random number generator and the variable structure of a testmachine program and produces an operation that would be valid given that variable structure.
Note that a language does not have to have a consistent stack effect, only the operations it generates – it’s possible (indeed, basically essential) for a single language to generate operations with a wide variety of different stack effects on a wide variety of different stacks.
So we have a testmachine and a language, and that’s enough to go on.
We now create a simulation testmachine. We repeatedly ask the language to generate an operation, simulate the result of that operation (which we can do because we know the stack effect) by just putting None values in everywhere we’d actually run the operation, and iterate this process until we have a long enough program. We then run the program and see what happens. If it fails, great! We have a failing test case. Minimize that and spit it out as output. If not, start again from scratch until you’ve tried enough examples.
And that’s pretty much it, modulo a bunch of details around representation and forking that aren’t really relevant to the interesting parts of the algorithm.
This is strictly more general than quickcheck, or Hypothesis or Scalacheck’s stateful testing: You can represent a simple quickcheck program as a language that pushes values onto a stack an an operation that reads values from that stack and fails if the desired property isn’t true.
With a little bit of work you can even make it just better at Quickcheck style.
Suppose you want to test the property that for all lists of integers xs, after xs.remove(y), y not in xs. This property is false because remove only removes the first element.
So in Hypothesis you could write the test:
@given([int], int) def test_not_present_after_remove(xs, y): try: xs.remove(y) except ValueError: pass assert y not in xs |
And this would indeed fail.
But this would probably pass:
@given([int], int) def test_not_present_after_remove(xs, y): try: xs.remove(y) xs.remove(y) except ValueError: pass assert y not in xs |
Because it only passes due to some special cases for being able to generate low entropy integers, and it’s hard to strike a balance between being low entropy enough to generate a list containing the same value three times and being high enough entropy enough to generate an interesting range of edge cases.
Testmachine though? No sweat. Because the advantage of testmachine is that generation occurs knowing what values you’ve already generated.
You could produce a testmachine with stacks ints and intlists, and a language which can generate ints, generate lists of ints from the ints already generated, and can perform the above test, and it will have very little difficulty falsifying it, because it has a high chance of sharing values between examples.
The future
Testmachine is basically ideal for testing things like ORM code where there’s a mix of returning values and mutating the global state of the database, which is what caused me to make the decision to bring it forward and include it in (hopefully) the next version of Hypothesis. It’s not going to be an entirely easy fit, as there are a bunch of mismatches between how the two work, but I think it will be worth it. As well as allowing for a much more powerful form of testing it will also make the quality of Hypothesis example generation go up, while in turn Hypothesis will improve on testmachine by improving the quality of the initial data generation and allowing simplification of the generated values.