Notes on the implementation of Hypothesis stateful testing

I’ve just released a version of Hypothesis with an implementation of a form of state machine based testing.

This is pretty exciting for me as it’s something that I’ve wanted to do for about a year and a half now. Back in the beginning of 2014 I wrote testmachine, which at the time I thought was revolutionary. I’ve since realised that it’s just a variation on the theme of state machine based testing that’s been popular in the Erlang world for a while. I’ve been wanting to merge the ideas for it with those in Hypothesis for some time.

In theory this should have been easy. To be honest, it should have been easy in practice too. The problem was two-fold: each project had somewhat incompatible assumptions about how the world works, and I wanted to put it all into a single uniform framework rather than half-arsing it.

Well, in the end I’ve half-arsed it, but that was because I came at it from another angle: there are a bunch of things I wanted to test that were fundamentally impossible in either Hypothesis or testmachine. Thinking hard about what sort of API I would actually need in order to test these led me to a very simple generic state machine testing API, which made building a testmachine-like API so incredibly easy that it seemed a shame not to do it.

I’d like to talk about the design and the implementation of that underlying API in this piece. First though I’d like to note some motivating examples:

  1. Testing a game where the decisions available to you at any point depend heavily on the game state
  2. Testing a website using selenium webdriver
  3. Testing your data access API against a database (e.g. in Django)

The crucial thing about all of these is that you simply can’t provide a valid spec for the steps to execute in advance of executing them – at each stage the set of actions available to you is so dependent on the steps you’ve executed so far that there’s really no way to decide in advance what it is you’re doing. The only way to know what you can do for step N + 1 is to run the first N steps.

So I tried to think about what API I would need to make something like this work, and what I came up with was this:

class GenericStateMachine(object):
    def steps(self):
        # Return a SearchStrategy describing the actions currently available.
        raise NotImplementedError('%r.steps()' % (self,))
    def execute_step(self, step):
        # Run a single action drawn from the strategy returned by steps().
        raise NotImplementedError('%r.execute_step()' % (self,))
    def teardown(self):
        # Clean up any resources at the end of a run.
        pass

The idea is that steps returns a SearchStrategy over the set of possible actions, which can depend on the current state in essentially arbitrary ways. execute_step runs a value drawn from the search strategy, and then teardown shuts your machine down at the end. So you’re running something that looks more or less like:

machine = MyMachine()
try:
    for i in range(n_steps):
        machine.execute_step(machine.steps().example())
finally:
    machine.teardown()

(what is actually run is notably more complicated for a variety of reasons, but that’s the essence of it)

The system then basically just generates programs of this form until one of them errors.

This will obviously work: at each point you choose a random action from those available to you, then you execute it. If any approach like this is going to work for testing your stateful system, this one will.
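To make that concrete, here is a minimal sketch of what a subclass might look like (my own illustration rather than anything in the Hypothesis documentation): a machine modelling a list, where the actions on offer depend on whether the list currently has anything in it.

import hypothesis.strategies as st
from hypothesis.stateful import GenericStateMachine

class ListMachine(GenericStateMachine):
    def __init__(self):
        super(ListMachine, self).__init__()
        self.model = []

    def steps(self):
        # What we can do depends on the current state: pushing is always
        # allowed, but popping only makes sense once the list is non-empty.
        push = st.tuples(st.just('push'), st.integers())
        if not self.model:
            return push
        pop = st.tuples(st.just('pop'), st.sampled_from(range(len(self.model))))
        return st.one_of(push, pop)

    def execute_step(self, step):
        action, value = step
        if action == 'push':
            self.model.append(value)
        else:
            self.model.pop(value)
        # A real machine would assert interesting invariants about its state here.

    def teardown(self):
        del self.model[:]

The important part is simply that steps() is free to look at self.model when deciding what to offer.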

There’s a problem though: How can you possibly simplify such a thing? Your steps() function can depend in arbitrary ways on the state of the system. If you’re unlucky it might even depend on an external source of randomness! (Normal Hypothesis’s answer to this is “Well don’t do that then”. Stateful testing has to be a little more robust, although it will still complain that your test is flaky if the same program doesn’t fail twice).

The first part is easy: We save the random seeds we use for examples. This means that we can always delete intermediate steps without affecting the draw for later steps. So we can start by considering the stateful test execution as just a list of random number seeds and attempt to find a minimal such list by deleting elements from it.
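As a rough sketch of that idea (again my own illustration, not the actual Hypothesis internals), deletion-based minimisation over a list of seeds might look something like this:

import random

def draw_step(strategy, rnd):
    # Stand-in for Hypothesis's internal seeded draw. Taking an example here
    # loses the determinism the real implementation gets from reusing the seed.
    return strategy.example()

def program_fails(machine_factory, seeds):
    # Replay the program described by a list of seeds; report whether it fails.
    machine = machine_factory()
    try:
        for seed in seeds:
            step = draw_step(machine.steps(), random.Random(seed))
            machine.execute_step(step)
    except AssertionError:
        return True
    finally:
        machine.teardown()
    return False

def shrink_seeds(machine_factory, seeds):
    # Greedily delete seeds for as long as the shorter program still fails.
    i = 0
    while i < len(seeds):
        candidate = seeds[:i] + seeds[i + 1:]
        if program_fails(machine_factory, candidate):
            seeds = candidate
        else:
            i += 1
    return seeds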

It is important to note that just because we’ve saved the seed at each step doesn’t mean that the sequence of step values is going to be the same! e.g. if earlier steps populate some list and a later one draws from it, even though the seed is the same the choice from the now smaller list might not be.

In theory this can work arbitrarily badly: e.g. if each step were generated by taking a cryptographic hash of the state, turning that into a random number generator, and returning the strategy just(random.random()), essentially any change you make to your execution by deleting previous steps might break the failure. In practice however it seems to work rather well – for most natural choices of strategies and execution approaches, deleting intermediate steps rarely changes the meaning of later ones.

However we still want to be able to minimize the values in the steps. If we started by generating the value “aweraweraiouuovawenlmnlkewar” and actually all we cared about was that it was a non-empty string of Latin alphabet characters, we’d sure like to be able to shrink it down to the value “a”.

But there’s a problem. Shrinking in Hypothesis happens on strategy templates (the intermediate representation that gets turned into values) and is defined by the strategy, not by the type of the value. This is not a technical limitation, it’s genuinely an important feature: otherwise you might shrink values to ones not permitted by the strategy. e.g. if our strategy was actually sampled_from([“aweraweraiouuovawenlmnlkewar”, 2]) it’s pretty important we don’t try to shrink the string to “a”.

Additionally, what do we do if the strategy changes under us? If on a previous run we got to this point and demanded a string to progress, but now it wants an int?

So, uh, this is where it gets a little weird.

I realised an important trick a little while ago: Hypothesis templates don’t have to be immutable. They have to do a pretty good job of faking it in a lot of roles, but it’s actually perfectly OK for a bit of mutation to happen behind the scenes, and to take advantage of that. This is used in the support for generating infinite streams, where we keep track of how much of the stream has been evaluated so we know how far it’s worth simplifying (there’s no point in simplifying element #10 if we’ve only looked at the first 9 values).

And we can also use this here, combined with another Hypothesis feature.

Hypothesis has the ability to serialize and deserialize all templates. Moreover it makes very strong guarantees: you can feed it literally any data and you will get either a valid template for the strategy with all invariants satisfied or a BadData exception. This is an important feature for compatibility with older databases – Hypothesis can’t necessarily use a serialized example from a previous version, but it will never break because of it.

And this means that we can convert templates between arbitrary strategies if we know both strategies: we serialize the template on one end and then deserialize it on the other. No problem. Either we got a valid template and everything is fine, or we didn’t and we know about it.
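In sketch form (with to_basic and from_basic standing in for the template serialization interface, so treat the exact names as illustrative), converting a template from one strategy to another looks roughly like this:

from hypothesis.errors import BadData

def convert_template(template, old_strategy, new_strategy):
    # Re-home a template by round-tripping it through the serialization layer.
    data = old_strategy.to_basic(template)
    try:
        return new_strategy.from_basic(data)
    except BadData:
        return None  # incompatible: the caller has to fall back to something else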

So the first step is that our templates for executing a state machine actually have two parts: The seed and a serialized blob of data. When deciding on a step to take, we first attempt to deserialize the blob. If that works, we use that as the template for the step. If it doesn’t, we draw a template from the random seed we have.
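Continuing the sketch above, deciding on the template for a single step then looks something like this, where draw_template_from_seed is a hypothetical stand-in for drawing a fresh template deterministically from the stored seed:

def template_for_step(strategy, saved_blob, seed):
    # Prefer the saved serialized template; fall back to the seed if it no longer fits.
    if saved_blob is not None:
        try:
            return strategy.from_basic(saved_blob)
        except BadData:
            pass  # the strategy has changed under us and the blob is unusable
    return draw_template_from_seed(strategy, seed)  # hypothetical seeded draw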

And this, combined with the mutability, allows us to simplify individual elements.

As we execute the state machine, we store the strategies and the serialized form of their template on the template for our execution. If we started with a serialized blob this will often leave it unchanged, but it may refresh the strategy.

This means that at the time we come to simplify the template, we have a strategy and a representation of a template for each value. So we use that strategy to simplify the value and serialize the results back. This means that the next time we try the example we have a candidate for something simpler to run with.
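Still speaking in sketches: with the strategy and the serialized template recorded for each executed step, proposing simpler candidates for one step is roughly a matter of deserializing, shrinking, and serializing back, with simpler_templates here standing in for whatever hook the strategy’s simplification machinery exposes:

def simpler_blobs_for_step(strategy, blob, simpler_templates):
    # Yield serialized candidates that are simpler than the recorded one.
    template = strategy.from_basic(blob)
    for simpler in simpler_templates(strategy, template):
        yield strategy.to_basic(simpler)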

This works fairly well, albeit partly due to the choice of fairly stable strategies and compatible serialization formats – e.g. I’ve used this with sampled_from() based strategies a lot, and these just serialize as integer indices, so if you’ve picked something early in the list it will remain valid when you simplify it, even if you remove an element or two from the list elsewhere in the simplification.

This is still a work in progress, and there are a bunch more things I’d like to do to improve it – for example, one problem that would be relatively easy to solve is when there are several incompatible strategies you could get at a given stage. Currently this will not be correctly simplified, but by storing a couple of different serialized representations you could try each until one works.

All told I’m not really sure where I stand on this, on a scale of great to guilty. I’m really happy with the end result, and I’m somewhat amazed that this was even possible, but I’m looking forward to great improvements in both the implementation and the API.
