# Using multi-armed bandits to satisfy property based tests

This is a post about Hypothesis’s data generation. It’s much more sophisticated than any other property based testing system whose implementation I’m aware of. Quviq’s Erlang Quickcheck may be more advanced, I don’t know – it’s proprietary so I haven’t looked at the source code – but as far as I know the way this works is entirely novel to me (with inspiration heavily drawn from other people’s ideas – the novelty here is about 70% just that it’s applied to property based testing and 30% some interesting twists on how it works).

Hypothesis’s approach to data generation generates a wide variety of things that are really hard to do produce in a traditional property based testing model without a lot of careful hand tuning, and more interestingly it adaptively seeks out novel examples and takes user feedback to prune bad examples without you having to write custom strategies to support it. If you ask for a list of integers but really only want long lists of positive integers you can write something like assume(len(xs) >= 100 and all(x >= 0 for x in xs)) and Hypothesis will rapidly learn the shape of the data that you want and avoid wasting your time by giving you too many examples that don’t satisfy it. Even without using assume it will try you on lists of all positive integers, lists of all negative integers, and a mix in between.

Lets see how this works.

The general approach of data generation in property based testing is that you have a random number generator and some parameter that can control the shape of the distribution. For traditional quickcheck this parameter is an integer called “size” and controls the shape of the distribution by just generating larger examples when size is larger. The first version of Hypothesis did this (I tried to do a thing where size corresponded to the entropy of the distribution but this didn’t really work out that well), and it worked OK.

The problem is that there’s very little you can do to tune the shape of this distribution beyond trying larger and smaller examples, and that can miss out on a bunch of interesting things. There’s a concept called Swarm Testing (full disclosure: I learned about it it from a blog post rather than the paper) which is basically as follows: In any given run of a random test, turn off some of the features of what you can generate. This lets you get a much higher degree of coverage of interaction between the features that you chose to leave in.

I originally took this idea verbatim and it worked OK, but I’m now going to skip the rest of the history lesson and tell you what Hypothesis does now which is essentially a unification and significant generalisation of both the size parameter and the feature flags.

Any given strategy for data generation defines two probability distributions. One of these is a parameter distribution. You simply give it a random number generator and you get back a parameter. This is an opaque value which you do not know anything about. It could be an integer, it could be a giant opaque binary blob. It’s probably a tuple of numbers but you’re not supposed to know that shh. The only valid operation on a parameter value is to feed it back to the strategy it came from (for statically typed fans tuning in from home imagine that a strategy for producing type a is parameterised as (Strategy b a) and you only have access to an (exists b. Strategy b a). For non statically typed fans ignore this parenthetical). The second distribution is the conditional data distribution. You feed it a random number generator and a previously generated value of the parameter and you get a value out.

Why this distinction? You can just draw a value by drawing from the parameter and feeding it straight back in. Why not just do that?

The reason is that you can reuse the same parameter multiple times in a given run. So for example when generating a list of integers, you draw the parameter for the integer once, then generate all elements from the list according to that single parameter value. This lets you produce things like “long lists of positive integers” and “long lists of negative integers” which would have previously happened with probability $$\approx 0$$ when you chose all of the elements completely independently (they’re still independent conditional on the parameter in this model).

This on its own is powerful enough. Doing our top level data generation as “draw a parameter, pass it back in”  produces a much lumpier distribution which can get at those hard to reach corners of the search space.

But what’s really powerful is what this then enables: Despite the completely opaque nature of the parameter object you can use it to guide the search space. This is because although you can’t look at the parameter you can look at the generated data.

This means that if you have some way of determining whether a particular example is good or bad you can use a multi-armed bandit algorithm to select parameter values which tend to produce better results.

So we keep track of how many examples from each parameter we’ve generated and how many of those turned out to be bad and then we do Thompson Sampling to select the next parameter.

That’s pretty much it! There are a few important details though:

1. the Thompson sampling implementation I use has a special arm which is “draw a new parameter value”. It’s drawn from a beta distribution which has as alpha + beta the number of parameters drawn so far and beta the number of times that the first value drawn from a parameter was bad. If this arm is selected we pick a fresh parameter value and add it to the list.
2. we start with a specified minimum number of parameters and just draw fresh parameters until we reach that number (it’s currently set to 25 but that’s pretty arbitrary and will likely change)

What also matters is what counts as bad. Hypothesis uses two criteria to determine in an example is bad:

1. If it fails an assume check
2. If we’ve seen this exact example before

The second one is… slightly dubious in that it means that the assumptions of the multi armed bandit aren’t actually holding true – the success of an arm depends on what’s come before, and so the probability changes over time. I can only offer in defence that it seems to work pretty well in practice.

Anyway, the result of failing things for the first is that Hypothesis will automatically be more likely to pick values that pass your assume checks – it will pick regions of the parameter space that have a higher probability of doing so. It isn’t able to magically understand the shape of your assume and pick perfectly, but assuming there is any relation at all between the structure of the example that it understands and the thing you’re assuming it will tend to do better a uniform draw across all the values.

The second criteria essentially forces it to seek out novelty. It’s the analogue of the size parameter – when things start getting too similar to what it’s seen before it is drawn to pick a new parameter and add some more variety to the mix.

This approach is still quite new and will doubtless see more refinement, but I’m pretty happy with it so far. It’s also going to be very useful in the 0.5 release when I start adding coverage driven example discovery because it will allow rewarding branches that produce new and interesting examples as well as penalising ones that are actively bad.

At the moment it’s hard for me to tell how much better this is in practice. It’s certainly the case that the quality of the distributions is required to make some tests pass in Hypothesis itself, but that may be artificial enough that in real live usage it’s a lot less of an advantage. However experiments with testing Hypothesis itself suggests that with the right sorts of tests there’s a lot of mileage you can get out of covering obscure corners of the test space and that anything that can raise the probability of hitting those corners is worth doing. This is at the very least a very big step in the right direction.

This entry was posted in Hypothesis, Uncategorized on by .

## 4 thoughts on “Using multi-armed bandits to satisfy property based tests”

Neat!

Would it be an improvement to use sum of fractions of bad examples for each existing arm as beta? I have vague feeling that fractions will be more robust in case of bad start. Additionally it will automatically make some sort of prediction about search space diversity for new arm, for example if some existing “barren” arms will begin producing many bad runs because of repeated samples after a while.

1. david Post author

Hm. I’m not 100% sure what you’re suggesting but I think it’s what’s already happening? The way Thompson sampling works is basically that you’re drawing from a distribution where the expected value is the fraction of successes you’ve seen from that arm and the variance decreases towards zero as the number of times you’ve tried that arm increases.

If you mean for the “pick a new parameter” arm, the reasoning for the current choice is that you don’t want to weight it too heavily and that whether or not the first run of any parameter is bad is a decent prediction whether for the first run of a new one will be bad, so the expected value for the thompson sample for that is the fraction of runs which produce a good result on their first run.

Yes, I’m talking about “new parameter” arm.
Let’s say there were 100 parameter values explored so far. For half of them there were 90 good test runs and 10 bad ones, for the other half ratio is reversed (let’s say 9 bad tests and 1 good). If we only look at first run for every parameter value, we’ll get about 50 good and 50 bad test runs. So your “pick new parameter” arm with modeled reward beta(50, 50) will compete against 50 instances of beta(10, 90), and chances for it to be picked are astronomically low.

And it seems wrong. Somewhat more “inductive” way to model its reward would be “pick one of 50 gamma(90, 10) and 50 gamma(1, 9), and then pick from this distribution”, which gives new parameter 1 chance in 102.

With this approach probability to take new parameter value never drops below 1 / (2 * number_of_parameter_values_generated_so_far), but I think that’s fine.

1. david Post author

Apparently I thought I’d replied to this and hadn’t. Sorry!

In summary, you persuaded me and I ended up implementing this algorithm. Unclear at the moment if it works better, but it is certainly more appealing.