David R. MacIver's Blog: The three stage value pipeline for Hypothesis generation

The three stage value pipeline for Hypothesis generation

22 February 2015

Note: This describes a work in progress rather than something that is in a released Hypothesis version. As such it’s liable to change quite a lot before it’s done. I’m partly writing this as a form of thinking through the design, partly as a way of procrastinating from the fact that I’ve literally broken the entire test suite in the course of moving to this and trying to fix it is really making me wish I’d written Hypothesis in Haskell.

I’ve previously talked about how generating data in Hypothesis is a two step process: Generate a random parameter, then for a given parameter value generate a random value.

I’ve introduced a third step to the process because I thought that wasn’t complicated enough. The process now goes:

1. Generate a parameter.
2. From a parameter, generate a template.
3. Reify that template into a value.
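
As a rough sketch of what those three stages might look like for a list of integers: the class and method names below are made up for illustration and are not the actual Hypothesis API, and the distributions are arbitrary choices.

```python
import random


class IntListStrategy:
    """Hypothetical strategy for lists of small integers (illustration only)."""

    def produce_parameter(self, rng):
        # Stage 1: the parameter shapes the distribution. Here it is
        # just a rough target for the average list length.
        return rng.uniform(0.0, 10.0)

    def produce_template(self, rng, parameter):
        # Stage 2: the template is immutable data describing the value.
        length = min(int(rng.expovariate(1.0 / (parameter + 1.0))), 50)
        return tuple(rng.randint(-100, 100) for _ in range(length))

    def reify(self, template):
        # Stage 3: build a fresh (and possibly mutable) value from the template.
        return list(template)


rng = random.Random(0)
strategy = IntListStrategy()
parameter = strategy.produce_parameter(rng)
template = strategy.produce_template(rng, parameter)
value = strategy.reify(template)
```

The template is a tuple, so nothing a test does to `value` can leak back into it, and calling `reify` again gives a new list every time.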

Why?

The idea is that rather than working with values which might have state we instead always work with merely the information that could be used to construct the values. The specific impetus for this change was the Django integration, but it also allows me to unify two features of Hypothesis that... honestly it would never have even occurred to me to unify.

In order to motivate this, allow me to present some examples:

1. When generating, say, a list of integers, we want to be able to pass it to user-provided functions without worrying about it being mutated. How do we do this?
2. If we have a strategy that produces things for one_of((integers_in_range(1, 10), integers_in_range(9, 20))) and we have found a counter-example of 10, how do we determine which strategy to pass it to for simplification?
3. How does one simplify Django models we’ve previously generated once we’ve rolled back the database?

Previously my answers to these would have been respectively:

1. We copy it.
2. We have a method which says whether a strategy could have produced a value, and arbitrarily pick one that could have produced it.
3. Oh god I don’t know can I have testmachine back?

But the templates provide a solution to all three problems! The new answers are:

1. Each time we reify the template it produces a fresh list.
2. A template for a one_of strategy encodes which strategy the value came from, and we use that one.
3. We instantiate the template for the Django model outside the transaction and reify it inside the transaction.
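
To make the one_of answer concrete, here is a minimal sketch in which the template is a pair of (branch index, child template). The classes are hypothetical stand-ins rather than the real Hypothesis ones, the parameter stage is left out for brevity, and I’m assuming the second branch is integers_in_range(9, 20).

```python
import random


class IntegersInRange:
    """Hypothetical stand-in for integers_in_range (illustration only)."""

    def __init__(self, lower, upper):
        self.lower, self.upper = lower, upper

    def produce_template(self, rng):
        # For plain integers the template can just be the integer itself.
        return rng.randint(self.lower, self.upper)

    def reify(self, template):
        return template


class OneOf:
    """Hypothetical one_of whose template is (branch index, child template)."""

    def __init__(self, strategies):
        self.strategies = tuple(strategies)

    def produce_template(self, rng):
        index = rng.randrange(len(self.strategies))
        return (index, self.strategies[index].produce_template(rng))

    def source_strategy(self, template):
        # The template remembers its branch, so there is no guessing.
        return self.strategies[template[0]]

    def reify(self, template):
        index, child = template
        return self.strategies[index].reify(child)


strategy = OneOf((IntegersInRange(1, 10), IntegersInRange(9, 20)))

# The value 10, recorded as having come from the second branch.
template = (1, 10)
source = strategy.source_strategy(template)
```

Both branches could have produced 10, but the template says which one actually did, so simplification can be handed to `source` directly.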

For the third answer to work, simplification has to happen on templates, not values. In general, most things that would previously have happened to values now happen to templates. The only point at which we need values is when we actually run the test.

As I said, this is still all a bit in pieces on the floor, but I think it’s a big improvement in the design. I just wish I’d thought of it earlier so I didn’t have to fix all this code that’s now broken by the change.

(Users who do not have their own SearchStrategy implementations will be mostly unaffected. Users who do have their own SearchStrategy implementations will probably suffer quite a lot in this next release. Sorry. That’s why it says 0.x in the version.)

Comments

Monadic data generation strategies and why you should care | David R. MacIver on 2015-02-24 16:08:35:

[…] posted this gist earlier. It’s a toy port of one of the new templatized data generation for Hypothesis to […]

Stable serialization and cryptographic hashing for tracking seen objects | David R. MacIver on 2015-03-19 09:54:50:

[…] is the clue. These days Hypothesis doesn’t need to track arbitrary Python objects. It tracks templates, which I can require to have a much more specific type than the objects tracked. In particular I […]