Author Archives: david

A new approach to property based testing

Do you ever have one of those thought processes where you go:

  1. Hmm. That’s an interesting idea.
  2. Ooh. And then I can do this.
  3. And that.
  4. Holy shit this solves so many problems.
  5. …by invalidating almost all the work I’ve done on this project.
  6. …and a lot of work other people have done over the last few years.
  7. I don’t know whether to be happy or sad.
  8. I think this might change literally everything.

Well, I’m having one right now.

Through a collision of multiple ideas – some mine, some other people’s – I’ve come up with an entirely new backend and API for property based testing. It’s drastically simpler to use, drastically simpler to implement, and works in more or less any language because it requires basically no advanced features. I’m pretty sure I can rebuild Hypothesis on top of it and that in the course of doing so I will be able to throw away approximately half the code. I’m also strongly considering doing a port to C and rebuilding Hypothesis as a set of bindings to that.

Here’s how it works:

We introduce a type TestContext. A TestContext is three pieces of data:

  1. An immutable sequence of bytes.
  2. An index into that sequence.
  3. A file handle, which may be None/NULL/whatever.

Given a TestContext, we work with testing functions. A testing function is any function which takes a TestContext as its first argument and then does one of three things:

  1. Returns a value
  2. Rejects the TestContext
  3. Fails

And in the course of doing so writes some data to the test context’s file handle.

A test case is then just a testing function which takes no extra arguments and returns an ignored value. We generate a fresh test context, feed it to the test function and see what happens. If we get a failure, the test fails. We then try again until we get some fixed number of examples that are neither failures nor rejections.
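As a rough sketch, the top-level loop might look like the following. The Rejected and Failed exception names and the fixed buffer size are my own assumptions, not details from the design itself:

```python
import os

class Rejected(Exception):
    """Raised when a context is rejected (e.g. it runs out of bytes)."""

class Failed(Exception):
    """Raised when a test case fails."""

def run_tests(test_function, max_examples=100, buffer_size=1024):
    # Keep generating fresh random buffers until we have max_examples
    # runs that are neither failures nor rejections.
    accepted = 0
    while accepted < max_examples:
        data = os.urandom(buffer_size)
        try:
            test_function(data)
            accepted += 1
        except Rejected:
            continue  # rejections don't count towards the total
        except Failed:
            return data  # a failing buffer, ready to be shrunk
    return None
```

If this returns a buffer, the failure has been found and the next step is shrinking; if it returns None, the test passed.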

The primitive functions we build everything on top of are:

  1. draw_bytes(context, n): If there are at least n bytes left in the buffer starting from the current index, return the next n bytes and then increment the index by n. Otherwise, reject the TestContext.
  2. fail(context): Essentially an ‘assert False’. Mark the current context as a failure.
  3. report(context, message): Print a message to the context’s file handle.
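A minimal Python sketch of the context and the three primitives (the exception names are my own assumption):

```python
class Rejected(Exception):
    pass

class Failed(Exception):
    pass

class TestContext:
    def __init__(self, data, log=None):
        self.data = bytes(data)  # immutable sequence of bytes
        self.index = 0           # current position in the buffer
        self.log = log           # file handle, may be None

def draw_bytes(context, n):
    # Reject if fewer than n bytes remain from the current index.
    if context.index + n > len(context.data):
        raise Rejected()
    result = context.data[context.index:context.index + n]
    context.index += n
    return result

def fail(context):
    # Essentially an 'assert False': mark this context as a failure.
    raise Failed()

def report(context, message):
    # Print a message to the context's file handle, if there is one.
    if context.log is not None:
        context.log.write(message + "\n")
```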

Reporting is used to show intermediate values so you can track what your test case does. Some additional state for controlling the quality of reporting may also be useful, as are some helper functions (e.g. for my prototype I ended up defining declare(context, name, value), which prints name=value and then returns value).

We can build everything on top of this in the usual way that you would build on a random number generator: e.g. you can generate an n-byte integer by drawing n bytes and combining them.
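For example, a sketch of combining drawn bytes into an integer, interpreting them big-endian:

```python
def integer_from_bytes(data):
    # Combine the bytes big-endian, so lexicographically smaller
    # byte sequences map to smaller integers.
    result = 0
    for b in data:
        result = (result << 8) | b
    return result
```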

The important realisation is that this interface supports minimization! Once we have generated a sequence of bytes that produces a failure, we can start to perform standard quickcheck style shrinking on it, but because we’re only interested in sequences of bytes we can use a lot of clever tricks, specialised to shrinking byte sequences, that are normally not open to us.
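A sketch of what such shrinking might look like; the specific passes here (delete a byte, lower a byte) are my own illustration rather than the design’s exact algorithm:

```python
def shrink(data, is_failure):
    # Greedy shrinker over byte sequences: repeatedly try deleting
    # bytes and lowering byte values while the failure still occurs.
    data = bytearray(data)
    improved = True
    while improved:
        improved = False
        # Pass 1: try deleting each byte.
        i = 0
        while i < len(data):
            candidate = data[:i] + data[i + 1:]
            if is_failure(bytes(candidate)):
                data = candidate
                improved = True
            else:
                i += 1
        # Pass 2: try replacing each byte with a smaller value.
        for i in range(len(data)):
            for smaller in (0, data[i] // 2, data[i] - 1):
                if not 0 <= smaller < data[i]:
                    continue
                candidate = data[:i] + bytearray([smaller]) + data[i + 1:]
                if is_failure(bytes(candidate)):
                    data = candidate
                    improved = True
                    break
    return bytes(data)
```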

So we generate, find a failure, shrink the failure, then run one final time with a real file handle passed into the test context, so that the log of our minimized test run is printed.

And this seems to work really rather well. It unifies the concepts of strategy and test case, and handles the problem, familiar to anyone defining their own generators, of how to display data.

You need to arrange your generators a little carefully, and some special shrinks are useful: For example, in my experiments I’ve found that generating integers in big-endian order is important, but you then also need a shrinking operation that lets you swap adjacent bytes x, y when you have x > y. Given that, integer shrinking appears to work quite well. I haven’t yet got a floating point generator working properly (there’s the obvious one where you generate a word of the appropriate size and then reinterpret it as a float, but this is a terrible distribution which is unable to find interesting bugs).
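The swap-adjacent-bytes operation described above might look like this sketch, which yields candidate buffers with one out-of-order pair swapped:

```python
def swap_shrink_candidates(data):
    # For big-endian integers, swapping adjacent bytes x, y with x > y
    # always produces a lexicographically (and numerically) smaller buffer.
    for i in range(len(data) - 1):
        if data[i] > data[i + 1]:
            candidate = bytearray(data)
            candidate[i], candidate[i + 1] = candidate[i + 1], candidate[i]
            yield bytes(candidate)
```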

I suspect that in general the shrinks this produces will often be a bit worse than those in classic quickcheck, because they don’t have access to the structure, but in some cases that may actually improve matters – there are a lot of situations where classic quickcheck shrinking can get stuck in a local optimum, which this essentially allows you to redraw your way out of.

This supports example saving like the normal Hypothesis interface does, and if anything it supports it better because your examples are literally a sequence of bytes and there’s almost nothing to do.

It supports composition of strategies with side effects in the middle in a way that is currently incredibly complicated.

Another nice feature is that it means you can drive your testing with american fuzzy lop if you like, because you can just write a program that reads in a buffer from standard in, feeds it to your test case and sees what happens.
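A sketch of such a driver; the property checked here is a made-up example:

```python
import sys

def check(data):
    # Hypothetical property: the buffer never starts with byte 255.
    assert not (data and data[0] == 255)

if __name__ == "__main__":
    # afl-fuzz feeds mutated buffers on standard input; an uncaught
    # exception is a crash that the fuzzer records as a finding.
    check(sys.stdin.buffer.read())
```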

In general, I’m really very excited by this whole concept and think it is the future of both the Hypothesis internals and the API. I don’t yet know what the path to implementing that is, and I’m currently considering doing the C implementation first, but I’ve put together a prototype which seems to work pretty well. Watch this space to see what comes next.

This entry was posted in Hypothesis on by .

A survey of Quickchecks

It’s commonly claimed that every language has a Quickcheck port. This is one of those statements that is simultaneously true and really unhelpful. Every language has a Quickcheck port that you really don’t want to use. Some languages have Quickcheck ports that you do.

I thought I’d do a quick survey of the current state of play for different languages.

First, a disclaimer: I do not write all of these languages, and many of them I barely read. As such, some of this list is based on an outsider making conjectures and if you do write the language and have better information than me I would very much appreciate correction.

Now, let’s start with the “just use this one” table. These are languages for which there is more or less unambiguously a single Quickcheck you should be using.

Minimum criteria for being on this list:

  1. Must support feeding randomly generated data to a test function. Note that I consider Smallcheck its own thing, so Smallcheck ports don’t count for inclusion unless they also support random generation.
  2. It must be fairly straightforward to generate your own custom types.
  3. It must support shrinking.
  4. It must either be under active development or have a reasonable claim to being finished.
  5. Must be under an OSI approved license (note: I haven’t checked this one too carefully, but these all claim to be open source and I’ve checked the license in detail on about half of them).
  6. There isn’t another implementation for the language also satisfying the above 5 conditions which also gets a comparable amount of active use.

Note that the presence of an item on this list does not mean that I have used it, only that I have inspected it to see what it supports. In particular I can’t comment on the data quality or how buggy or stable each of these is, though most of them appear to be pretty good.

Language | Library | Notes
C | theft | Provides nothing in the way of built-in generators: you’ve always got to write your own from scratch. However, the API for doing so is pretty easy.
C++ | CppQuickCheck |
Clojure | test.check |
Coq | QuickChick |
F# | FsCheck |
Haskell | QuickCheck | The original. Probably no longer the best.
JavaScript | jsverify |
PHP | Eris |
Python | Hypothesis | You knew I was going to say this, so don’t even pretend to be surprised.
Ruby | Rantly | I really wanted to disqualify this one because the way you support shrinking is by monkey patching a ‘shrink’ method onto the type. I couldn’t quite justify doing so, because monkey patching common and easily clashing words onto types is the standard Ruby idiom. I am unclear on how active development on Rantly is.
Rust | quickcheck |
Scala | ScalaCheck |
Swift | SwiftCheck |

Other Languages


You may have been surprised that Erlang is not in the table despite being one of the most well known languages for Quickcheck. The reason is that there isn’t an unambiguous choice there. There is eqc, which is unambiguously the best but is not open source, and there are Triq and PropEr, both of which are under active development. Apparently people use PropEr unless they have to avoid the GPL, in which case they use Triq.


As far as I can tell, people who want good Quickcheck for Java write their tests in Clojure or Scala. There is junit-quickcheck which is disqualified on grounds of not implementing shrinking. functionaljava does have a Quickcheck implementation that supports shrinking, but it seems unclear that people actually know this exists and are using it. It is possible it should be on the above table, but I’m softly considering it to fail condition 6 because test.check and ScalaCheck just seem to completely dominate it. I am very open to arguments that this is not the case.

Similarly, other .NET languages just seem to adopt the policy that you should just use FsCheck to test your code rather than trying to port it to the current language.


Go has a deeply inadequate version of Quickcheck which doesn’t support shrinking baked into its standard library. This is quite surprising as Quickcheck is from the 90s rather than the 70s, but they make up for it by not acknowledging where the concept is actually from. There do not appear to be any alternatives.


OCaml seems to be suffering from a problem of being close enough to Haskell that people try to do a straight port of Quickcheck but far enough from Haskell that this doesn’t work. The result is that there is a “mechanical port” of Quickcheck which is completely abandoned and a fork of it that uses more idiomatic OCaml. I am insufficiently familiar with OCaml or its community to know if either is used or whether there is another one that is.


There are of course many other languages than these, and many of them have something that looks a bit like Quickcheck if you squint hard enough. The Quickcheck page on Wikipedia lists a bunch of others, but a not entirely thorough sampling of the remainder suggests that most of them fail the tests of “does it implement shrinking” and “is it actively developed or finished”, and are not widely used enough to be worth an honourable mention.

If you know of a reasonably good Quickcheck that I have not included here, please do let me know.

This entry was posted in Hypothesis, programming on by .

Mighty morphing power strategies

The Hypothesis API is a bit of a bizarre sleight of hand. It pretends to be very clean and simple, but that’s mostly a distraction to stop you thinking too hard about it. Everything looks easy, so you aren’t surprised when it just works, and you don’t think too hard and realise that what it just did should actually have been impossible.

Take for example this piece of code using the Django integration (this comes straight from the docs):

from hypothesis.extra.django.models import models
from hypothesis.strategies import lists, just

def generate_with_shops(company):
    return lists(models(Shop, company=just(company))).map(lambda _: company)

company_with_shops_strategy = models(Company).flatmap(generate_with_shops)

We take a strategy that generates a model inserted into the database, then use flatmap to create a strategy for children of that and bind that in. Everything just works neatly and everything will simplify just fine – the list of children can be simplified, both by throwing away children and by simplifying individual children, the original element can be simplified, everything lives in the database, and all is good with the world. Examples will be persisted into the Hypothesis example database as normal. Everything works great, no reason to think twice about it.

Except it is completely ridiculous that this works, and it’s certainly unique to Hypothesis. No other Quickcheck has anything like it.

Some of this is just the magic of Hypothesis templating at work. There’s a lot more information available than is present in the end result, and this also explains how you can mutate the value by adding children to it and have simplification still work, etc.

But there’s something that should make you very suspicious going on here: We cannot create the strategy we need to generate the children until we have already performed some side effects (i.e. put some things in the database). What could the template for this possibly be?

The answer to this is quite bad. But it’s quite bad and hidden behind another abstraction layer!

The answer is that we have a type called Morpher. As far as we are concerned for now, a Morpher has one method called “become”. You call my_morpher.become(my_strategy) and you get a value that could have been drawn from my_strategy.

You can think of Morphers as starting out as a reproducible way of getting examples from strategies, but there’s more to it than that, for one very important reason: A Morpher can be simplified and serialized.

This gives us a very easy implementation of flatmap:

def flatmap(self, f):
    return builds(lambda s, m: m.become(f(s)), self, morphers())

i.e. we generate an element of the strategy, apply f to it to get a new strategy, and then tell the morpher to become an instance of that. Easy!

Easy, that is, except for the fact that it still looks completely impossible, because we’re no closer to understanding how morphers work.

I’m not going to explain how they work in too much detail, because the details are still in flux, but I’ll sketch out how the magic works and if you want the gory details you can check the code.

As it starts out, a Morpher is simple: It contains a random seed for a parameter value and a template value, and become() just draws parameters and templates with those standard seeds and then reifies the result.
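As a toy illustration of this (assuming a strategy object with hypothetical draw_template(random) and reify(template) methods, and ignoring parameters entirely):

```python
import random

class Morpher:
    def __init__(self, seed):
        self.seed = seed
        self.last_strategy = None
        self.last_template = None

    def become(self, strategy):
        # Drawing with a fixed seed makes this reproducible: the same
        # Morpher always yields the same value from the same strategy.
        rnd = random.Random(self.seed)
        template = strategy.draw_template(rnd)
        # Remember what we became, so simplification can use it later.
        self.last_strategy = strategy
        self.last_template = template
        return strategy.reify(template)

class IntegerStrategy:
    # A stand-in strategy: templates are just integers.
    def draw_template(self, rnd):
        return rnd.randint(0, 10 ** 6)

    def reify(self, template):
        return template
```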

This would achieve all of the desired results except for simplification: You can save the seeds and you can now generate.

So how do we simplify, given that each time we may have to become a different strategy, and that templates are not compatible between strategies?

There are several parts to the trick:

  1. The template for a Morpher (which is actually the Morpher itself) is secretly mutable (actually quite a lot of Hypothesis template strategies are mutable). When we call become() on a Morpher, the strategy used is stored on the template for later so we have access to it when we want to simplify, as is the template that was produced in the course of the become() call.
  2. As well as storing a random seed we also store a list of serialized representations of possible templates. These are the representations that would be used when saving in the database. The reason for this is that the Hypothesis database has the following really helpful invariant: Any serialized representation can either be turned into a valid template for the strategy or rejected as invalid. Moreover the representations are quite straightforward, so usually similar strategies will have compatible representations.
  3. When we wish to become a strategy, we first try our serialized representations in order to see if one of them produces a valid template. If it does, we use that template, otherwise if we reach the end we generate a fresh one using the random seed method mentioned above.
  4. When we simplify, we try to simplify the last template generated with the last strategy used, and then replace the representation that generated that strategy with the simplified form of it, thus generating a new morpher with the same seed and parameter but a simplified serialized representation.

If you’re reading that thinking that it’s horrifying, you’re not wrong. It’s also quite fragile – there are definitely some corner cases of it that I haven’t quite shaken out yet, and it’s why flatmap is somewhat slower and buggier than things using more normal methods of generation.

But it’s also really powerful, because you can use this technology for things other than flatmap. I originally intended it as a shared implementation between flatmap and the stateful testing, although for various reasons I haven’t got around to rebuilding the stateful testing on top of it yet. It’s also what powers a really cool new feature I released today (I know I said I wasn’t doing new features, but I couldn’t resist and it only took me an hour), which is essentially a form of do notation for Hypothesis strategies (only more pythonic).

from hypothesis.strategies import composite, lists, sampled_from

@composite
def list_and_sample(draw, elements):
    values = draw(lists(elements, min_size=1))
    redraw = draw(lists(sampled_from(values)))
    return (values, redraw)

list_and_sample(integers()) now gives you a strategy which draws a list of at least one integer, then draws another sample from that list (with replacement).

What this does is give you a magic “draw” function that produces examples from a strategy, and composes these all together into a single strategy built on repeatedly calling that draw function as many times as you like.

This is also black magic, but it’s not novel black magic: it’s just morphers again. We generate an infinite stream of morphers, then map a function over it. draw maintains a counter, and the nth time you call it, it gets the nth element from the stream and tells it to become the strategy that you’ve just passed in. There’s a bit more fiddling to get all the details to line up right, but that’s more or less it. This vastly simplifies the definition of strategies that you would previously have built up with an ugly chain of lambdas and flatmaps.
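A self-contained caricature of that mechanism, using a list of seeds in place of the morpher stream (all names here are my own, not the real API):

```python
import random

def composite_sketch(definition):
    # Each call to draw() consumes the next seed from the run's seed
    # list, so a whole composite run is reproducible from its seeds.
    def run(seeds, *args):
        counter = [0]

        def draw(strategy):
            # Here a 'strategy' is just a function from a Random to a value.
            rnd = random.Random(seeds[counter[0]])
            counter[0] += 1
            return strategy(rnd)

        return definition(draw, *args)
    return run

@composite_sketch
def list_and_sample(draw, elements):
    values = draw(lambda r: [elements(r) for _ in range(1 + r.randrange(5))])
    redraw = draw(lambda r: [r.choice(values) for _ in range(r.randrange(4))])
    return (values, redraw)
```

Running the same seed list twice gives the same result, which is the property that makes saving and replaying examples trivial.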

If you want to read more about this feature, I commend you to the documentation. It’s available in the latest release (1.11.0), so have fun playing with it.

This entry was posted in Hypothesis, Python on by .

The two voices of progress

There are two voices in my head.

Both are just my inner monologue of course. But sometimes it speaks with one voice, sometimes another.

One voice says: Here’s how you can make this happen.

The other says: Here’s what will happen if you do that.

Historically I often described the former as the software developer’s role and the latter as the ops role. I think the reality is that everyone needs both, and not just because we’re all devops now.

There are many ways to fail to make progress, but ultimately all of them come down to hitting some sort of obstacle and failing to be able to get past it.

Sometimes the obstacle is a creative block: You could get past it if you just had the right solution. Other times it’s a foresight block: You needed to plan for this six months ago, and if you didn’t there’s actually not much you can do here. Different voices tell you how to overcome different obstacles.

I think most people are predisposed to listen to and value one voice over the other. Certainly I’ve historically been much more of a “here’s how you can make this happen” type of person, and have gradually learned that the other voice has useful contributions – ones that would have prevented many a debacle if I’d only listened.

But you don’t have to be good at listening to both voices when they’re inside your head, because the voices come from outside your head too: Your team mates.

The problem is not being bad at one of the two ways of thinking; the problem is dismissing as unimportant what the people who are good at the way you’re not are saying.

This is probably familiar to anyone who has tried to be the voice of reason in the room: a bunch of cowboys dismiss your concerns as probably not very important in practice – and how likely is that to happen anyway? Let’s be pragmatic here.

My impression is that right now the tech industry is heavily biased towards the “let’s make it work” voice, as exemplified by the “move fast and break things” attitude, but it goes the other way too. When you see everything constantly being broken around you, it’s tempting to think that if only everyone would just listen to you, everything would be great.

I think this is a mistake too. The danger of being too far on the predictive side is that it prevents you from experimenting. It also tends to produce a sort of analysis paralysis: Many ideas that have proven to be very successful would never have been started if their creators had realised how much work they were going to be.

I think in many ways the optimal way to make progress is to have the two voices handing off to each other. Experiment, then stabilize, then repeat. Skip the experiment step and you don’t get anywhere; skip the stabilize step and you get somewhere you didn’t want to be.

It seems very hard to get these two approaches in balance. I’d like to fix that. I’m not sure how to make it happen, but I’m pretty sure that what will happen if we do is pretty great.

This entry was posted in Uncategorized on by .

Apparently I’m still doing this fanfic thing

After a discussion on Twitter about the mechanics of wishing (a niche interest of mine, and apparently of others) I somehow ended up being talked into / talking myself into writing some fanfic of Disney’s Aladdin.

Yeah, I know.

I keep thinking “It’s completely impossible to write this because once (event happens) then (character) has too much power and you just can’t construct a plot around that”, and then I come up with a great solution to that problem, and a scene around it, and then I write that scene.

The result is that it’s not really a whole fiction so much as a set of isolated scenes that probably make sense if you’ve seen the movie and definitely don’t if you haven’t. There will almost certainly be a certain amount of backfilling and it may eventually turn into a complete story in its own right.

Current features:

  1. The Genie features a very large number of rules, which are dynamically altered as people try to work around them. Wishing is powerful, but you simply can’t bootstrap into godhood with it. Mostly because every time I think of a way of bootstrapping into godhood, I write a scene where someone tries to do that and then the rules are added to in a way that prevents it.
  2. Everyone who gets their hands on the lamp (and it’s not just Aladdin and Jafar) is severely clued in and makes intelligent use of their wishes.
  3. Jasmine is a character with a great deal of agency and does not do the whole “I’m not just a prize to be won!” thing followed by having all her agency taken away and acting as a prize to be won in a fight between two men.
  4. Jasmine is scary. Don’t mess with her. You’ll regret it.

If that sounds appealing, here’s the work in progress.

This entry was posted in Uncategorized on by .