A new approach to property based testing

Edit: I’ve put together a prototype of these ideas called Conjecture. The design document may be a better place to start. I’ve had multiple people say “I had no idea what you were talking about when I read that blog post but it’s suddenly incredibly obvious now I’ve read the design document” or similar.

Do you ever have one of those thought processes where you go:

  1. Hmm. That’s an interesting idea.
  2. Ooh. And then I can do this.
  3. And that.
  4. Holy shit this solves so many problems.
  5. …by invalidating almost all the work I’ve done on this project.
  6. …and a lot of work other people have done over the last few years.
  7. I don’t know whether to be happy or sad.
  8. I think this might change literally everything.

Well, I’m having one right now.

Through a collision of multiple ideas – some mine, some other peoples’ – I’ve come up with an entirely new backend and API for property based testing. It’s drastically simpler to use, drastically simpler to implement, and works more or less in any language because it requires basically no advanced features. I’m pretty sure I can rebuild Hypothesis on top of it and that in the course of doing so I will be able to throw away approximately half the code. I’m also strongly considering doing a port to C and rebuilding Hypothesis as a set of bindings to that.

Here’s how it works:

We introduce a type TestContext. A TestContext is three pieces of data:

  1. An immutable sequence of bytes.
  2. An index into that sequence.
  3. A file handle, which may be None/NULL/whatever.

Given a test case, we work with testing functions. A testing function is any function which takes a TestContext as its first argument and then does one of three things:

  1. Returns a value
  2. Rejects the TestContext
  3. Fails

And in the course of doing so writes some data to the test context’s file handle.

A test case is then just a test function which takes no extra arguments and returns an ignored value. We generate a fresh test context, feed it to the test function and see what happens. If we get a failure, the test fails. We then try again until we get some fixed number of examples that are neither failures or rejections.

The primitive functions we build everything on top of are:

  1.  draw_bytes(context, n): If there are more than n bytes left in the buffer starting from the current index, return the next n bytes and then increment the index by n. Otherwise, reject the TestContext.
  2. fail(context): Essentially an ‘assert False’. Mark the current context as a failure.
  3. report(context, message): Print a message to the context’s file handle.

Reporting is used to show intermediate values so you can track what your test case does. Some additional state for controlling quality of reporting may also be useful, as are some helper functions (e.g. for my prototype I ended up defining declare(context, name, value) which prints name=value and then returns value)

We can build everything on top of this in the usual way that you would build an random number generator: e.g. you can generate an n-byte integer by drawing n bytes and combining them.

The important realisation is that this interface supports minimization! Once we have generated a sequence of bytes that produces a failure, we can start to perform standard quickcheck style shrinking on it, but because we’re only interested in sequences of bytes we can do a lot of clever things that are normally not open to us that are specialised to shrinking byte sequences.

So we generate, find a failure, then shrink the failure, then we run one final time with a real file handle passed into the test context this time so the log of our minimized test run is printed.

And this seems to work really rather well. It unifies the concepts of strategy and test case, and handles the problem you can have when defining your own generators of how to display data

You need to arrange your generators a little carefully, and some special shrinks are useful: For example, in my experiments I’ve found that generating integers in big-endian order is important, but you then also need a shrinking operation that lets you swap adjacent bytes x, y when you have x > y. Given that, integer shrinking appears to work quite well. I haven’t yet got a floating point generator working properly (there’s the obvious one where you generate a word of the appropriate size and then reinterpret it as a float, but this is a terrible distribution which is unable to find interesting bugs).

I suspect in general the shrinks this produce will often be a bit worse than that in classic quickcheck because they don’t have access to the structure, but in some cases that may actually improve matters – there are a lot of things where classic quickcheck shrinking can get stuck in a local optimum which this essentially allows you to redraw for.

This supports example saving like the normal Hypothesis interface does, and if anything it supports it better because your examples are literally a sequence of bytes and there’s almost nothing to do.

It supports composition of strategies with side effects in the middle in a way that is currently incredibly complicated.

Another nice feature is that it means you can drive your testing with american fuzzy lop if you like, because you can just write a program that reads in a buffer from standard in, feeds it to your test case and sees what happens.

In general, I’m really very excited by this whole concept and think it is the future of both the Hypothesis internals and the API. I don’t yet know what the path to implementing that is, and I’m currently considering doing the C implementation first, but I’ve put together a prototype which seems to work pretty well. Watch this space to see what comes next.

This entry was posted in Hypothesis on by .

3 thoughts on “A new approach to property based testing

  1. Christian Long

    Fascinating. It reminds me of this talk Margo Seltzer gave at RICON

    Automatically Scalable Computation

    She talks about scaling single-threaded programs to multiple cores by using the extra cores to speculatively explore the possible future state of the program. When the current state of the program matches one of the speculatively-calculated states, they can jump straight to the end of the chain of work that the speculative core has already done.

    The similarity is in treating a program and its state as just a stream of bits, amenable to manipulation just like any other stream.

    Thank you for the post. Fun stuff!

    1. david Post author

      I really haven’t, you’ve just not understood the post or its context at all.

      The “isomer of delta debugging” you’re talking about is example shrinking, which is an integral part of this sort of software. Of course it’s not novel. It’s the application of it that is.

Comments are closed.