Conjecture, parametrization and data distribution

Up front warning: This is a very inside baseball post, and I’m the only person who plays this particular variant of the game. This blog post is mostly a mix of notes to self and sharing my working.

I’m in the process of trying to rewrite the Hypothesis backend to use the Conjecture approach.

At this point the thing I originally worried was intractable – shrinking of data – is basically solved. Conjecture shrinks as well as or better than Hypothesis. There are still a few quirks to pay attention to – the shrinking can always be improved, and I’m still on the fence as to whether some of the work I have on explicit costing and output-based shrink control is useful (I think it’s probably not) – but basically I could ship what I have today for shrinking and it would be fine.

However I’m discovering another problem: the other major innovative area of Hypothesis is its parametrized approach to data generation, and that’s proving harder to reproduce. More generally, I’m finding that getting great quality initial data out of Conjecture is hard.

This manifests in two major ways:

  1. It can be difficult to get good data when you also have good shrinking, because you want to try nasty distributions. e.g. just generating 8 bytes and converting them to an IEEE 754 binary float representation produces great shrinking, but a fairly sub-par distribution – e.g. the probability of generating NaN is 1 in 2048 (actually very slightly lower). There’s a small sketch of this after the list.
  2. The big important feature of Hypothesis’s parametrization is correlated output. e.g. you can’t feasibly generate a list of 100 positive integers by chance if you’re generating each element independently. Correlated output is good for finding bugs.
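
To make the first point concrete, here’s a minimal sketch of the bytes-to-float approach. draw_bytes here is just a stand-in for whatever the test data object provides, not the real Conjecture API:

```python
import struct

def draw_float(draw_bytes):
    # Reinterpret 8 uniformly random bytes as an IEEE 754 binary64 float.
    # This shrinks nicely (simpler bytes give simpler floats), but the
    # distribution is poor: NaN requires all 11 exponent bits set plus a
    # non-zero mantissa, so it shows up just under 1 time in 2048.
    return struct.unpack('!d', draw_bytes(8))[0]
```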

1 is relatively easily solved by letting data generators participate in the initial distribution: Instead of having the signature draw_bytes(self, n) you have the signature draw_bytes(self, n, distribution=uniform). So you can let the floating point generator specify an alternative distribution that is good at hitting special case floating point numbers without worrying about how it affects distributions. Then, you run the tests in two modes: The first where you’re building the data as you go and use the provided distributions, the second where you’re drawing from a pre-allocated block of data and ignore the distribution entirely.
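
Here’s a rough sketch of what such a two-mode test data object might look like. All the names are illustrative rather than the actual Conjecture internals:

```python
import os

def uniform(n):
    # The default distribution: n uniformly random bytes.
    return os.urandom(n)

class TestData:
    def __init__(self, prefix=None):
        self.prefix = prefix       # pre-allocated block when replaying/shrinking
        self.index = 0
        self.record = bytearray()  # everything drawn so far

    def draw_bytes(self, n, distribution=uniform):
        if self.prefix is None:
            # Generation mode: honour the strategy's preferred distribution.
            result = distribution(n)
        else:
            # Replay mode: read from the pre-allocated block and ignore the
            # distribution entirely, so shrinking stays purely byte-oriented.
            result = bytes(self.prefix[self.index:self.index + n])
            self.index += n
        self.record.extend(result)
        return result
```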

This is a bit low-level unfortunately, but I think it’s mostly a very low level problem. I’m still hoping for a better solution. Watch this space.

For the second part… I think I can just steal Hypothesis’s solution to some degree. Instead of the current case where strategies expose a single function draw_value(self, data) they can now expose functions draw_parameter(self, data) and draw_value(self, data, parameter). A normal draw call then just does strategy.draw_value(data, strategy.draw_parameter(data)), but you can use alternate calls to induce correlation.
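
As a sketch of what that might look like (the strategy classes here are illustrative, not the real Hypothesis ones):

```python
class Strategy:
    def draw_parameter(self, data):
        raise NotImplementedError

    def draw_value(self, data, parameter):
        raise NotImplementedError

    def draw(self, data):
        # A normal draw: a fresh parameter, then a value drawn from it.
        return self.draw_value(data, self.draw_parameter(data))


class Booleans(Strategy):
    def draw_parameter(self, data):
        # The parameter is a bias towards True.
        return data.draw_bytes(1)[0] / 255.0

    def draw_value(self, data, parameter):
        return data.draw_bytes(1)[0] / 255.0 <= parameter


class Lists(Strategy):
    def __init__(self, elements):
        self.elements = elements

    def draw_parameter(self, data):
        # A single element parameter shared by the whole list: this is where
        # the correlation comes from. A heavily biased parameter gives you
        # e.g. a list of booleans that is nearly all True, which independent
        # draws would essentially never produce.
        return self.elements.draw_parameter(data)

    def draw_value(self, data, parameter):
        length = data.draw_bytes(1)[0] % 10
        return [self.elements.draw_value(data, parameter) for _ in range(length)]
```

So Lists(Booleans()).draw(data) draws a single bias and then reuses it for every element, which is exactly the correlated output described above.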

There are a couple problems with this:

  1. It significantly complicates the usage pattern: I think the parametrization is one of the bits of Hypothesis that people who look at the internals understand least, and one of the selling points of Conjecture was “You just write functions”. On the other hand I’m increasingly not sold on “You just write functions” as a good thing: a lot of the value of Hypothesis is the strategies library, and having a slightly more structured data type there is quite useful. It’s still easy to go from a function from test data to a value to a strategy, so this isn’t a major loss.
  2. It’s much less language agnostic. In statically typed languages you need some way to encode different strategies having different parameter types, ideally without this being exposed in the strategy (because then strategies don’t form a monad, or even an applicative). You can solve this problem a bit by making parameters an opaque identifier and keeping track of them in some sort of state dictionary on the strategy, but that’s a bit gross.
  3. Much more care with parameter design is needed than in Hypothesis, because the parameter affects the shrinking. As long as shrinking of the parameter works sensibly this should be OK, but it can become much more complicated. An example of where this gets complicated appears later in the post.
  4. I currently have no good ideas how parameters should work for flatmap, and only some bad ones. This isn’t a major problem because you can fall back to a slightly worse distribution but it’s annoying because Conjecture previously had the property that the monadic and applicative interfaces were equivalently good.

Here’s an example of where parametrization can be a bit tricky:

Suppose you have the strategy one_of(s1, …, sn) – that is, you have n strategies and you want to pick a random one and then draw from that.

One natural way to parametrize this is as follows: Pick a random non-empty subset of {1, …, n}. Those are the enabled alternatives. Now pick a parameter for each of these options. Drawing a value is then picking a random one of the enabled alternatives and feeding it its parameter.

There are a couple major problems with this, but the main one is that it shrinks terribly.

First off: The general approach to shrinking directions Hypothesis takes for alternation is that earlier branches are preserved. e.g. if I do integers() | text() we’ll prefer integers. If I do text() | integers() we’ll prefer text. This generally works quite well. Conjecture’s preference for things that consume less data slightly ruins this (e.g. The integer 1 will always be preferred to the string “antidisestablishmentarianism” regardless of the order), but not to an intolerable degree, and it would be nice to preserve this property.

More generally, we don’t want a bad initial parameter draw to screw things up for us. So for example if we have just(None) | something_really_complicated() and we happen to draw a parameter which only allows the second, but it turns out this value doesn’t matter at all, we really want to be able to simplify to None.

So what we need is a parameter that shrinks in a way that makes it more permissive. The way to do this is to:

  1. Draw n bits.
  2. Invert those n bits.
  3. If the result is zero, try again.
  4. Else, return a parameter that allows all set bits.

The reason for this is that the initially drawn n bits will shrink towards zero, so as you shrink, the parameter will have more set bits.
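
A sketch of that, assuming draw_bytes gives back uniformly random bytes that shrink towards zero:

```python
def draw_enabled_set(data, n):
    # Returns a bitmask of enabled alternatives that gets *more* permissive
    # as the underlying bytes shrink towards zero.
    nbytes = (n + 7) // 8
    while True:
        raw = int.from_bytes(data.draw_bytes(nbytes), 'big')
        mask = ~raw & ((1 << n) - 1)  # invert, keeping only the low n bits
        if mask:                      # reject the empty set and try again
            return mask               # bit i set => alternative i is enabled
```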

This then presents two further problems that need solving.

The first of these is that if we pick options through choice(enabled_parameters), the choice will change as we enable more things. This may sometimes work, but in general will require difficult-to-manage simultaneous shrinks to work well. We want to be able to shrink the parameter and the elements independently if at all possible.

So what we do is rejection sampling: we generate a random number from one to n, and if the corresponding bit is set we accept it; if not we start again. If the number of set bits is very low this can be horrendously inefficient, but we can short-circuit that problem by using the control over the distribution of bytes suggested above!

The nice thing about doing it this way is that we can mark the intermediate draws as deletable, so they get discarded and if you pay no attention to the instrumentation behind the curtain it looks like our rejection sampling magically always draws the right thing on its first draw. We can then try bytewise shrinking of the parameter, which leads to a more permissive set of options (that could then later allow us to shrink this), and the previously chosen option remains stable.
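
A sketch of that rejection loop, reusing the hypothetical enabled-set mask from above:

```python
def choose_alternative(data, n, enabled_mask):
    # Rejection sampling: keep drawing an index until its bit is set in the
    # parameter. The rejected draws would be marked as deletable so the
    # shrinker can discard them, making it look as though we drew the right
    # index on the first attempt.
    while True:
        i = data.draw_bytes(1)[0] % n
        if enabled_mask & (1 << i):
            return i
```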

This then leads to the final problem: if we draw all the parameters up front, adding in more bits will cause us to read more data, because we’ll have to draw parameters for them. This is forbidden: Conjecture requires shrinks to read no more data than the example you started from (for good reason – this both helps guarantee the termination of the shrink process and keeps you in areas where shrinking is fast).

The solution here is to generate parameters lazily. When you pick alternative i, you first check if you’ve already generated a parameter for it. If you have you use that, if not you generate a new one there and then. This keeps the number and location of generated parameters relatively stable.
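
Putting the pieces together, a lazy one_of might look roughly like this, building on the hypothetical Strategy, draw_enabled_set and choose_alternative sketches above:

```python
class OneOf(Strategy):
    def __init__(self, alternatives):
        self.alternatives = alternatives

    def draw_parameter(self, data):
        return {
            'enabled': draw_enabled_set(data, len(self.alternatives)),
            'element_parameters': {},  # filled in lazily, keyed by index
        }

    def draw_value(self, data, parameter):
        i = choose_alternative(data, len(self.alternatives), parameter['enabled'])
        params = parameter['element_parameters']
        if i not in params:
            # Only draw a parameter for an alternative the first time we
            # actually pick it, so enabling more alternatives while shrinking
            # never forces us to read more data than the original example.
            params[i] = self.alternatives[i].draw_parameter(data)
        return self.alternatives[i].draw_value(data, params[i])
```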

In writing this, a natural generalization occurred to me. It’s a little weird, but it nicely solves this problem in a way that also generalizes to monadic bind:

  1. Parameters are generated from data.new_parameter(). All this is under the hood is an incrementing integer counter.
  2. There is a function data.parameter_value(parameter, strategy) which does the same lazy calculation keyed off the parameter ID: If we already have a parameter value for this ID and strategy, use that. If we don’t, draw a new one and store that.
  3. Before drawing from it, all strategies are interned. That is, replaced with an equivalent strategy we’ve previously seen in this test run. This means that if you have something like booleans().flatmap(lambda b: lists(just(b))), both lists(just(False)) and lists(just(True)) will be replaced with stable strategies from a pool when drawing, so parameters get reused (there’s a rough sketch of this below).
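
A rough sketch of the bookkeeping this implies. In the framing above, new_parameter and parameter_value would live on the test data object; here they are written as a standalone helper, and key_for is a hypothetical stand-in for whatever structural equality of strategies ends up meaning:

```python
def key_for(strategy):
    # Crude stand-in: a real version would compare strategies structurally.
    return repr(strategy)


class ParameterStore:
    def __init__(self):
        self.counter = 0
        self.values = {}       # (parameter id, interned strategy) -> value
        self.intern_pool = {}  # canonical strategies seen in this test run

    def new_parameter(self):
        # A parameter is nothing more than a fresh integer identifier.
        self.counter += 1
        return self.counter

    def intern_strategy(self, strategy):
        # Replace a strategy with an equivalent one already seen this run, so
        # that e.g. the lists(just(b)) strategies rebuilt on every flatmap
        # call collapse to one canonical instance whose parameters get reused.
        return self.intern_pool.setdefault(key_for(strategy), strategy)

    def parameter_value(self, data, parameter, strategy):
        strategy = self.intern_strategy(strategy)
        key = (parameter, strategy)
        if key not in self.values:
            # Lazily draw and cache a parameter value per (id, strategy) pair.
            self.values[key] = strategy.draw_parameter(data)
        return self.values[key]
```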

I think this might be a good idea. It’s actually a better API, because it becomes much harder to use the wrong parameter value, and there’s no worry about leaking values or state on strategy objects, because the life cycle is fairly sharply confined to that of the test. It doesn’t solve the problem of typing this well, but it solves the problem of using it incorrectly well enough that an unsafe cast is probably fine if you can’t express the types properly.

Anyway, brain dump over. I’m not sure this made sense to anyone but me, but it helped me think through the problems quite a lot.
