Category Archives: Hypothesis

Speeding up Hypothesis simplification by warming up

This post might be a bit too much Hypothesis inside baseball, but it’s something I did this morning that I found pretty interesting, so I thought I’d share it. Hypothesis has a pretty sophisticated system for example simplification. A simplification pass looks roughly like:

def simplify_such_that(search_strategy, random, value, condition):
    changed = True
    while changed:
        changed = False
        for simplify in search_strategy.simplifiers(random, value):
            # Apply this pass repeatedly until it stops producing
            # improvements, then move on to the next one.
            while True:
                for s in simplify(random, value):
                    if condition(s):
                        changed = True
                        value = s
                        break
                else:
                    # No simpler value satisfied the condition, so this
                    # pass is exhausted for now.
                    break
    return value

Instead of having a single simplify function we split our simplification up into multiple passes. Each pass is repeatedly applied until it stops working, then we move on to the next one. If any pass succeeded at simplifying the value we start again from the beginning; otherwise we’re done and return the current best value. This works well for a number of reasons, but the principal goal is to avoid spending time on steps that we’re pretty sure are going to be useless: if one pass deletes elements of a list and another simplifies them, we don’t want to try deleting again every time we successfully shrink a value, because usually we’ve already found the smallest list possible. (The reason we start again from the beginning if any pass worked is that later passes can sometimes unblock earlier ones, but generally speaking this doesn’t happen much.)

This generally works pretty well, but there’s an edge case where it has pathologically bad performance: a pass which is useless but has a lot of possible values. This happens e.g. when you have large lists of complicated values. One of my example quality tests involves finding lists of strings with high unicode codepoints and long strictly increasing subsequences. This normally works pretty well, but I was finding it sometimes hit this pathological case, and the test would fail because it used up all its time trying simplification passes that couldn’t work because they were blocked by the previous step.

This morning I figured out a trick which seems to improve this behaviour a lot. The idea is to spend a few passes (5 is my highly scientifically derived number here) where we cut each simplifier off early: we give it a few attempts to improve matters, and if it doesn’t we bail immediately rather than running it to the end. The new code looks roughly like this:

from itertools import islice

max_warmups = 5

def simplify_such_that(search_strategy, random, value, condition):
    changed = True
    warmup = 0
    while changed or warmup < max_warmups:
        warmup += 1
        changed = False
        for simplify in search_strategy.simplifiers(random, value):
            while True:
                simpler = simplify(random, value)
                if warmup < max_warmups:
                    # During warmup, give each pass only a handful of
                    # attempts before moving on to the next one.
                    simpler = islice(simpler, warmup)
                for s in simpler:
                    if condition(s):
                        changed = True
                        value = s
                        break
                else:
                    break
    return value

This seems to avoid the pathological case: rather than getting stuck on a useless simplifier we skip over it fairly quickly and give other simplifiers that are more likely to work a chance to shine. Then, once the warmup phase is over, we run the full simplification algorithm as before, but because we’ve already chewed the value down to something much less complicated than we started with there isn’t as much of a problem – we tend not to have individually long simplifier passes because most of the really complex structure has already been thrown away.

Empirically, for the test cases this was designed to improve, this is more than an order of magnitude speed improvement (even when they’re not hitting the pathological case where they fail altogether), going from giving up at the 30 second timeout to completing the full pass in 3 seconds, and for everything else it seems about the same or a little better. So, yeah, I’m pretty pleased with this. This is definitely in the category of “Hypothesis’s example simplification is an overly sophisticated solution to problems that are mostly self-inflicted”, but who doesn’t like nice examples?


Notes on the implementation of Hypothesis stateful testing

I’ve just released a version of Hypothesis with an implementation of a form of state machine based testing.

This is pretty exciting for me as it’s something that I’ve wanted to do for about a year and a half now. Back in the beginning of 2014 I wrote testmachine, which at the time I thought was revolutionary. I’ve since realised that it’s just a variation on the theme of state machine based testing that’s been popular in the Erlang world for a while. I’ve been wanting to merge the ideas for it with those in Hypothesis for some time.

In theory this should have been easy. To be honest, it should have been easy in practice too. The twofold problem was that the two projects had somewhat incompatible assumptions about how the world works, and that I wanted to put it all into a single uniform framework rather than half-arsing it.

Well, in the end I’ve half-arsed it, but that was because I came at it from another angle: there are a bunch of things I wanted to test that were fundamentally impossible in either Hypothesis or testmachine. Thinking hard about what sort of API I would actually need in order to test these led me to a very simple generic state machine testing API, which made building a testmachine-like API so incredibly easy that it seemed a shame not to do it.

I’d like to talk about the design and the implementation of that underlying API in this piece. First though I’d like to note some motivating examples:

  1. Testing a game where the decisions available to you at any point depend heavily on the game state
  2. Testing a website using selenium webdriver
  3. Testing your data access API against a database (e.g. in Django)

The crucial thing about all of these is that you simply can’t provide a valid spec for the steps to execute in advance of executing them – at each stage the set of actions available to you is so dependent on the steps you’ve executed so far that there’s really no way to decide in advance what it is you’re doing. The only way to know what you can do for step N + 1 is to run the first N steps.

So I tried to think what the API I would need to make something like this work would be and what I came up with was this:

class GenericStateMachine(object):
    def steps(self):
        # Returns a SearchStrategy over the actions currently available.
        raise NotImplementedError('%r.steps()' % (self,))

    def execute_step(self, step):
        # Runs a single action drawn from steps().
        raise NotImplementedError('%r.execute_step()' % (self,))

    def teardown(self):
        pass

The idea is that steps returns a SearchStrategy over the set of possible actions, which can depend on the current state in essentially arbitrary ways. execute_step runs a value drawn from the search strategy, and then teardown shuts your machine down at the end. So you’re running something that looks more or less like:

machine = MyMachine()
try:
    for i in range(n_steps):
        machine.execute_step(machine.steps().example())
finally:
    machine.teardown()

(what is actually run is notably more complicated for a variety of reasons, but that’s the essence of it)

The system then basically just generates programs of this form until one of them errors.

This will obviously work: at each point you choose a random action available to you, then you execute it. If anything like this is going to work for testing your stateful system, this will.
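
To make that concrete, here is a hedged sketch (not from the post) of what a user-defined machine might look like, testing a trivial stack against a Python list. The class name and the strategy imports (integers and just from hypothesis.strategies) are my own illustrative choices, and the exact import paths and strategy names have varied between Hypothesis versions:

from hypothesis.stateful import GenericStateMachine
from hypothesis.strategies import integers, just


class StackMachine(GenericStateMachine):
    """Illustrative only: drives a list-backed stack through random steps."""

    def __init__(self):
        self.stack = []

    def steps(self):
        # The available actions depend on the current state: popping is
        # only offered once something has actually been pushed.
        push = integers().map(lambda i: ('push', i))
        if not self.stack:
            return push
        return push | just(('pop', None))

    def execute_step(self, step):
        action, value = step
        if action == 'push':
            self.stack.append(value)
            assert self.stack[-1] == value
        else:
            assert self.stack, 'pop was offered on an empty stack'
            self.stack.pop()

    def teardown(self):
        self.stack = []

How such a machine gets hooked into a test runner has also changed over time, so treat the above purely as a sketch of the shape of the API.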

There’s a problem though: How can you possibly simplify such a thing? Your steps() function can depend in arbitrary ways on the state of the system. If you’re unlucky it might even depend on an external source of randomness! (Normal Hypothesis’s answer to this is “Well don’t do that then”. Stateful testing has to be a little more robust, although it will still complain that your test is flaky if the same program doesn’t fail twice).

The first part is easy: We save the random seeds we use for examples. This means that we can always delete intermediate steps without affecting the draw for later steps. So we can start by considering the stateful test execution as just a list of random number seeds and attempt to find a minimal such list by deleting elements from it.
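
As a toy illustration (my own sketch, not Hypothesis’s actual code), treating an execution as a list of seeds means the first simplification pass can be as simple as yielding every copy of the list with one step removed:

def delete_single_steps(seeds):
    # Yield each copy of the seed list with exactly one step deleted.
    # A shorter program that still fails is a strictly better example.
    for i in range(len(seeds)):
        yield seeds[:i] + seeds[i + 1:]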

It is important to note that just because we’ve saved the seed at each step doesn’t mean that the sequence of step values is going to be the same! e.g. if earlier steps populate some list and a later one draws from it, even though the seed is the same the choice from the now smaller list might not be.

In theory this can work arbitrarily badly: e.g. if each step was generated by taking a cryptographic hash of the state, turning that into a random number generator, and returning the strategy just(random.random()), essentially any change you make to your execution by deleting previous steps might break the failure. In practice however it seems to work rather well – with most natural choices of strategies and execution approaches, deleting intermediate steps rarely changes the meaning of later ones.

However we still want to be able to minimize the values in the steps. If we started out having generated the value “aweraweraiouuovawenlmnlkewar” and actually all we cared about was that it was a non-empty string of latin alphabet characters, we’d sure like to be able to shrink it down to the value “a”.

But there’s a problem. Shrinking in Hypothesis happens on strategy templates (the intermediate representation that gets turned into values) and is defined by the strategy, not by the type of the value. This is not a technical limitation, it’s genuinely an important feature: otherwise you might shrink values into things not permitted by the strategy. e.g. if our strategy was actually sampled_from([“aweraweraiouuovawenlmnlkewar”, 2]) it’s pretty important we don’t try to shrink the string to “a”.

Additionally, what do we do if the strategy changes under us? If on a previous run we got to this point and demanded a string to progress, but now it wants an int?

So, uh, this is where it gets a little weird.

I realised an important trick a little while ago: Hypothesis templates don’t have to be immutable. They have to do a pretty good job of faking it in a lot of roles, but it’s actually perfectly OK for a bit of mutation to be happening behind the scenes, and to make use of this. This is used in the support for generating infinite streams, where we keep track of how much of the stream has been evaluated so we know how far it’s worth simplifying (no point in simplifying element #10 if we’ve only looked at the first 9 values).
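
Here is a hedged sketch of that trick, with illustrative names rather than Hypothesis’s real internals: the stream’s template quietly records how much of it has been forced, and the simplifier consults that counter.

class StreamTemplate(object):
    """Looks immutable from the outside but tracks evaluation internally."""

    def __init__(self, element_templates):
        self.elements = list(element_templates)
        self.evaluated = 0  # mutated as the generated stream is consumed

    def mark_evaluated(self, index):
        self.evaluated = max(self.evaluated, index + 1)


def simplify_stream(simplify_element, template):
    # Only bother simplifying elements the test actually looked at.
    for i in range(template.evaluated):
        for simpler in simplify_element(template.elements[i]):
            new_elements = list(template.elements)
            new_elements[i] = simpler
            yield StreamTemplate(new_elements)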

And we can also use this here, combined with another Hypothesis feature.

Hypothesis has the ability to serialize and deserialize all templates. Moreover it makes very strong guarantees: You can feed it literally any data in and you will get either a valid template for the strategy with all invariants satisfied or a BadData exception. This is an important feature for compatibility with older databases – Hypothesis can’t necessarily use a serialized example from a previous version, but it will never break because of it.

And this means that we can convert a template between two arbitrary strategies, as long as we know both of them: we serialize the template on one end and then deserialize it on the other. No problem. Either we got a valid template and everything is fine, or we didn’t and we know about it.

So the first step is that our templates for executing a state machine actually have two parts: The seed and a serialized blob of data. When deciding on a step to take, we first attempt to deserialize the blob. If that works, we use that as the template for the step. If it doesn’t, we draw a template from the random seed we have.
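
In sketch form the per-step decision looks something like the following, where from_basic and draw_template are stand-ins for whatever the strategy actually exposes rather than Hypothesis’s real method names:

from random import Random


class BadData(Exception):
    """Raised when serialized data can't be turned into a valid template."""


def template_for_step(strategy, seed, serialized_blob):
    if serialized_blob is not None:
        try:
            # Reuse the previously saved (possibly already simplified) step.
            return strategy.from_basic(serialized_blob)
        except BadData:
            pass  # incompatible with the current strategy; fall through
    # Otherwise draw a fresh template deterministically from the saved seed.
    return strategy.draw_template(Random(seed))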

And this, combined with the mutability, allows us to simplify individual elements.

As we execute the state machine, we store the strategies and the serialized form of their template on the template for our execution. If we started with a serialized blob this will often leave it unchanged, but it may refresh the strategy.

This means that at the time we come to simplify the template, we have a strategy and a representation of a template for each value. So we use that strategy to simplify the value and serialize the results back. This means that the next time we try the example we have a candidate for something simpler to run with.
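
Again as a hedged sketch with stand-in method names, simplifying a single stored step then amounts to:

def simpler_serialized_steps(strategy, serialized_blob):
    # Deserialize with the strategy we recorded, ask it for simpler
    # templates, and hand back their serialized forms so the next run
    # can try them in place of the original step.
    template = strategy.from_basic(serialized_blob)
    for simpler in strategy.simplify(template):
        yield strategy.to_basic(simpler)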

This works fairly well, albeit partly due to the choice of fairly stable strategies and compatible serialization formats – e.g. I’ve used this with sampled_from() based strategies a lot, and these just serialize as integer indices, so if you’ve picked something early in the list it will remain valid when you simplify it, even if you remove an element or two from the list elsewhere in the simplification.
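
To see why, here is a hedged sketch of a sampled_from-style strategy with illustrative method names: the template is just an index into the list, so it serializes as a small integer, survives most edits to the list, and naturally simplifies towards the front.

class BadData(Exception):  # as in the earlier sketch
    pass


class SampledFrom(object):
    def __init__(self, elements):
        self.elements = list(elements)

    def reify(self, template):
        return self.elements[template]

    def to_basic(self, template):
        return template  # the template is just an integer index

    def from_basic(self, data):
        if not isinstance(data, int) or not 0 <= data < len(self.elements):
            raise BadData('%r is not a valid index' % (data,))
        return data

    def simplify(self, template):
        # Shrink towards earlier elements of the list.
        for i in range(template):
            yield i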

This is still a work in progress, and there are a bunch more things I’d like to do to improve it – for example, one problem that would be relatively easy to solve is when there are several incompatible strategies you could get at a given stage. Currently this will not be simplified correctly, but by storing a couple of different serialized representations you could try each until one works.

All told I’m not really sure where this sits on a scale of great to guilty. I’m really happy with the end result, and I’m somewhat amazed that this was even possible, but I’m looking forward to great improvements in both the implementation and the API.


How to improve your Quickcheck implementation with this one weird trick

When asked what’s genuinely novel in Hypothesis as opposed to mere optimisation, I usually cite three things:

  1. Strategies carry simplification and serialization with them but still compose monadically.
  2. The data generation can hit corners of the search space that standard Quickcheck can not, and adapt dynamically to requirements.
  3. The simplification is much more sophisticated than the standard Quickcheck interface and can adapt to the structure of the problem much better.

It took me a surprisingly long time to notice that all three of these rely on essentially the same tweak to the standard API.

The tweak is this: Introduce an intermediate representation.

  1. By introducing an intermediate template type which is then reified to the final result, you can simplify a mapped object by simplifying the argument.
  2. Instead of generating a random value, you first generate a random parameter and then draw a conditional value given that parameter. As well as producing more interesting results this way (because of how you compose parameters for e.g. lists) this also lets you reuse parameters which are more likely to give you a useful result.
  3. Instead of having a single simplify function, you have a list of simplifiers which you apply in turn.

Each of these has taken what was a single step and broken it up into two steps which you can compose naturally to give the original. We then don’t compose them in the obvious way, but instead change things around just enough to give something strictly more powerful while still mostly retaining the original characteristics.
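
For point 1, here is a hedged sketch, with illustrative names rather than any particular library’s API, of how the template/reify split lets a mapped strategy simplify without knowing anything about its output type:

class MappedStrategy(object):
    """Wraps a base strategy and applies fn to whatever it produces."""

    def __init__(self, base, fn):
        self.base = base
        self.fn = fn

    def draw_template(self, random, parameter):
        # Templates are the base strategy's templates, untouched.
        return self.base.draw_template(random, parameter)

    def reify(self, template):
        # Values only come into existence at the last possible moment.
        return self.fn(self.base.reify(template))

    def simplify(self, template):
        # Simplification is delegated entirely to the base strategy, so
        # map() never needs to know how to shrink its output type.
        return self.base.simplify(template)

Point 2 is the same move applied to generation: draw_template above accepts a separately drawn parameter, which is what allows repeated draws to share one.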

I’m not sure what the broader significance of this is yet, but it seems likely that this approach has other areas to which you could apply it.


Some empirically derived testing principles

Here are some principles I have found useful when testing Hypothesis. I don’t promise any of these are universally good, all I promise is that all of them have resulted in my finding bugs that I would otherwise have missed, and that together they seem to give me a lot more confidence in my software than I otherwise would have.

1. 100% coverage is mandatory.

This is a slight lie in that I do have the occasional nocover or ignore pragma sprinkled around the code, but those generally occur in places that I’ve tried really hard to cover and that are just not possible to hit reliably.
People seem to think that coverage is a bad metric that doesn’t really tell you much about your testing. There’s a line I like here, which I believe is stolen from someone on the SQLite project:

100% coverage tells you nothing, but less than 100% coverage tells you something.

Code that is not tested almost certainly contains bugs. Code that is tested probably still contains bugs, but the probability is lower. Therefore, all other things being equal, code with 100% coverage is likely to contain fewer bugs.
This is essentially operating on the principle that metrics make great constraints and lousy goals. The goal is not to maximize coverage, the goal is to test our project well. 100% coverage is the starting point, not the finish line.
(“Coverage” here means “branch coverage”. In reality it means “The finest grained coverage metric that you can get out of your tooling and that it’s meaningful to claim 100% coverage on”)

2. All bugs result in tests.

The core principle of this is hopefully both self-explanatory and uncontroversial. Step 1 to fixing a bug is to write a test for it. Otherwise the bug might reoccur.
But it goes deeper. Every bug is actually two bugs: A bug in your software, and a bug in your test suite.
The former is obvious. A bug in your software is a bug in your software. The latter is more subtle: It’s a bug in your test suite because your test suite didn’t find this bug, and that indicates a failure of your testing.
So. You’ve found a bug. Why didn’t your test suite catch it? What general failure of testing does this indicate, and can you fix that failure in a way that would catch other instances of similar bugs?
As a concrete example: In Hypothesis’s data serialization layer, one of the invariants is that an attempt to deserialize bad data can only raise a BadData exception. Any other exception is a bug. This invariant is used to guarantee that Hypothesis can always read from old databases – the worst case scenario is that you can’t reuse that data, not that it crashes your tests.
In preparing for the 1.1 release I found an instance where this wasn’t the case. This then caused me to write some generic tests that tried fuzzing the data Hypothesis reads in order to find exceptions that weren’t BadData. I found 5 more.
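
Here is a hedged sketch of that kind of test, with from_basic standing in for whatever deserialization entry point the library exposes and the junk generator being entirely my own invention:

import random


def fuzz_deserializer(from_basic, BadData, n_examples=1000, seed=0):
    # Feed structurally arbitrary junk to the deserializer and insist that
    # the only exception allowed to escape is BadData.
    rnd = random.Random(seed)

    def junk(depth=0):
        choice = rnd.randrange(5 if depth < 3 else 4)
        if choice == 0:
            return rnd.randrange(-10 ** 6, 10 ** 6)
        if choice == 1:
            return rnd.random()
        if choice == 2:
            return ''.join(
                chr(rnd.randrange(32, 1000)) for _ in range(rnd.randrange(10)))
        if choice == 3:
            return None
        return [junk(depth + 1) for _ in range(rnd.randrange(4))]

    for _ in range(n_examples):
        try:
            from_basic(junk())
        except BadData:
            pass  # the only acceptable failure mode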

3. Builds should fail fast.

There’s nothing more frustrating than getting to the end of a long build only to have it fail on something trivial you could have found out about 30 seconds in.
Linting is a major culprit here: lint should run at the beginning of builds, not at the end. I also have a separate entry in the build matrix which runs only a fast subset of the tests and checks the results for 100% coverage. This means that if I’ve forgotten to test some area of the code I find out fast rather than at the end.

4. Principles should never be satisfied by accident.

Suppose your test suite catches a bug in your change with a failing test. Is that test actually the one that should have caught it? I find this particularly comes up with internal APIs. Testing internals is important, and if a bug in an internal was only caught because of a bug it caused in the public API, that internal could use a more specific test.
This also plays well with the faster build step for coverage. By forcing coverage to happen on simple examples it ensures that code is covered deliberately rather than as a result of testing other things. This helps offset the problem that you can often get to 100% coverage without ever really testing anything.
Flaky tests are often an example of something happening by accident. Sometimes it’s just that the test is bad, but often it’s that there’s some hard to trigger condition that could use a dedicated test that hits it reliably.

5. You need both very general and very specific tests.

Tests need both breadth and depth. Broad tests are things like Hypothesis which ensure that your code is well behaved in every scenario (or at least as close to every scenario as they can), but they tend to have very shallow definitions of “well behaved”. Highly specific tests can assert a lot more about what the code should do and as a result are true in a much narrower set of environments.
These tend to catch different classes of bugs.
So far to the best of my knowledge nobody has come up with a way of testing things that achieves both breadth and depth at the same time other than “do a massive amount of work manually writing tests”, but you can do both by writing both kinds of tests, and this is worth doing.

6. Keep trying new ways of testing things.

It’s been my experience that every time I try to come up with a new way of testing Hypothesis I find at least one new bug in it. Keeping on coming up with new and creative ways of testing things is a great way to keep ahead of your users in finding bugs.

7. There is no substitute for hard work.

Unfortunately, quality software is hard. I spend at least as much effort testing Hypothesis as I do writing it, and there’s probably no way around that. If you’re taking quality seriously, you need to be spending as much work on quality as functionality.


Tests are a license to delete

I’ve spent the majority of my career working on systems that can loosely be described as “Take any instance of this poorly specified and extremely messy type of data found in the wild and transform it into something structured enough for us to use”.

If you’ve never worked on such a system, yes they’re about as painful as you might imagine. Probably a bit more. If you have worked on such a system you’re probably wincing in sympathy right about now.

One of the defining characteristics of such systems is that they’re full of kludges. You end up with lots of code with comments like “If the audio track of this video is broken in this particular way, strip it out, pass it to this external program to fix it, and then replace it in the video with the fixed version” or “our NLP code doesn’t correctly handle wikipedia titles of this particular form, so first apply this regex which will normalize it down to something we can cope with” (Both of these are “inspired by” real examples rather than being direct instances of this sort of thing).

This isn’t surprising. Data found in the wild is messy, and your code tends to become correspondingly messy to deal with all its edge cases. However kludges tend to accumulate over time, making the code base harder and harder to work with, even if you’re familiar with it.

It has however historically made me very unhappy. I used to think this was because I hate messy code.

Fast-forward to Hypothesis however. The internals are full of kludges. They’re generally hidden behind relatively clean APIs and abstraction layers, but there’s a whole bunch of weird heuristics with arbitrary magic numbers in them and horrendous workarounds for obscure bugs in other peoples’ software (Edit: This one is now deleted! Thanks to Marius Gedminas for telling me about a better way of doing it).

I’m totally fine with this.

Some of this is doubtless because I wrote all these kludges, but it’s not like I didn’t write a lot of the kludges in the previous system! I have many failings and many virtues as a developer, but an inability to write terrible code is emphatically not either of them.

The real reason why I’m totally fine with these kludges is that I know how to delete them: every single one of these kludges was introduced to make a test pass. Obviously the weird workarounds for bugs all have tests (what do you take me for?), but all the kludges for simplification or generation have tests too. There are tests for the quality of minimized examples and tests for the probability of various events occurring. Tuning those is where the two major kinds of kludge come from.

And I’m pretty sure that this is what makes the difference: the problem with the previous kludges is that they could never go away. A lot of these systems were fairly severely under-tested – sometimes for good reasons (we didn’t have any files smaller than 5TB that could reproduce a problem), sometimes for code quality reasons (our pipeline was impossible to detangle), and sometimes just as a general reflection of the company’s culture around testing (are you saying we write bugs??).

This meant that the only arbiter for whether you could remove a lot of those kludges was “does it make things break/worse on the production system?”, which meant it was always vastly easier to leave the kludges in than to remove them.

With Hypothesis, and with other well tested systems, the answer to “Can I replace this terrible code with this better code?” is always “Sure, if the tests pass”, and that’s immensely liberating. A kludge is no longer a thing you’re stuck with, it’s code that you can make go away if it causes you problems and you come up with a better solution.

I’m sure there will always be kludges in Hypothesis, and I’m sure that many of the existing kludges will stay in it for years (I basically don’t see myself stopping supporting versions of Python with that importlib bug any time in the near future), but the knowledge that every individual kludge can be removed if I need to is very liberating, and it takes away a lot of the things about them that previously made me unhappy.
