I mentioned in the Hypothesis short-term road map that one of the things I wanted to do was improve the data generation.
What’s wrong with the data generation as it currently stands?
Well, in classic quickcheck-style testing, what you basically have is a size parameter that controls the shape of the distribution. A larger value for the size parameter results in a more variable distribution – larger elements, longer lists, etc.
How, given this, would you generate a list of more than 100 floating point numbers in which every element had the same sign?
Assume your floating point generator has probability p of producing a non-negative number, with 0 < p < 1. The probability of a list of at least 100 elements all having the same sign is then about max(p^100, (1-p)^100). Or, as I like to call it, as close to zero as makes no difference.
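To put a rough number on it (taking p = 0.5 as an example):

```python
# Chance that 100 independent draws all share a sign, when each draw
# is non-negative with probability p = 0.5.
p = 0.5
prob_all_same_sign = p ** 100 + (1 - p) ** 100
print(prob_all_same_sign)  # roughly 1.6e-30
```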
Hypothesis has historically solved this by introducing into the mix a notion of flags, which are boolean variables that can be turned on and off for a whole data generation run. For example, whether negative floating point numbers are allowed is a flag. This means there’s a 50/50 chance your list will consist entirely of non-negative numbers. This idea was inspired by John Regehr’s post “Better Random Testing By Leaving Features Out”.
But it’s not quite enough. What if, for example, you want to generate a long list with a low average value? Or a high one? Depending on how you use the size parameter, one of these will be hard to do (in Hypothesis it was the latter that was hard – the size was split amongst the elements, so the elements of a long list would tend to be much smaller).
It’s also very non-orthogonal: there are two different notions of parameter which are similar but not quite the same, and which don’t share any implementation.
So what I wanted to do (and now have done! It proved surprisingly easy) is to introduce a common notion of parameter which unifies the two.
A parameter is just any sort of distribution you can draw from. It can contain any sort of values – booleans, floats, integers, composite values, etc. Each strategy has a parameter you can draw from, and data generation then occurs by first drawing a parameter value and then drawing data conditional on that parameter value.
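As a rough illustration of the shape of this, here is a minimal sketch in Python. The names (Strategy, draw_parameter, draw_value, FloatStrategy) and the particular distributions are mine for exposition, not Hypothesis’s actual internals:

```python
import random


class Strategy:
    """A strategy can draw a parameter value, and then draw data
    conditional on that parameter value."""

    def draw_parameter(self, rng):
        raise NotImplementedError

    def draw_value(self, rng, parameter):
        raise NotImplementedError


class FloatStrategy(Strategy):
    """Floats whose parameter fixes a sign mode and a scale for a
    whole data generation run."""

    def draw_parameter(self, rng):
        sign = rng.choice(["non-negative", "negative", "either"])
        scale = rng.expovariate(1.0) + 0.001
        return (sign, scale)

    def draw_value(self, rng, parameter):
        sign, scale = parameter
        x = rng.expovariate(1.0 / scale)
        if sign == "negative":
            return -x
        if sign == "either" and rng.random() < 0.5:
            return -x
        return x
```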
A list then has a composite parameter: one component is an average length (actual lengths are drawn from a geometric distribution with that average), and the other is whatever parameter its elements take.
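Continuing the same sketch, a hypothetical list strategy might pair an average length with its element strategy’s parameter:

```python
class ListStrategy(Strategy):
    """Lists whose parameter is (average length, element parameter)."""

    def __init__(self, elements):
        self.elements = elements

    def draw_parameter(self, rng):
        average_length = rng.expovariate(0.1)  # mean 10, purely illustrative
        return (average_length, self.elements.draw_parameter(rng))

    def draw_value(self, rng, parameter):
        average_length, element_parameter = parameter
        # Geometric length with the requested mean.
        stop = 1.0 / (1.0 + average_length)
        length = 0
        while rng.random() > stop:
            length += 1
        # Crucially, every element is drawn with the *same* element parameter.
        return [
            self.elements.draw_value(rng, element_parameter)
            for _ in range(length)
        ]
```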
The reason this is importantly different from just drawing from the marginal distribution is that all of the draws then use the same parameter value. So for floats we have one parameter component which determines the sign of the value (it can be always non-negative, always negative, or either) and components that control the shape of the value given that sign. This means that when generating a list, the chances of all elements being high or all elements being low are much better (it does mean that the chances of getting a mix are lower, but you can fix that with a more complicated parameter shape).
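In terms of the sketch above, a single parameter draw fixes the sign mode for the entire list:

```python
rng = random.Random(0)
lists = ListStrategy(FloatStrategy())

parameter = lists.draw_parameter(rng)
xs = lists.draw_value(rng, parameter)
# If this parameter happened to pick the "non-negative" sign mode, every
# element of xs is >= 0, no matter how long the list is.
```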
The overall effect is that this allows you to generate much more interesting large structured examples. With the classic approach you end up with a sort of curse of dimensionality, where everything looks pretty flat because the differences average out.
Another nice perk of this is that you can use the parameters to guide rejection better. A test in Hypothesis can essentially fail for two reasons: one is that it’s a legitimate failure, and the other is that a precondition has not been met. Generally speaking, when the latter happens you don’t really want to keep generating a whole bunch of data like that.
To continue with our sign example, suppose I want to generate lists of non-negative numbers. I could do this by defining a special strategy for non-negative floats, but that’s an annoyingly large amount of work (I’d like to make it less work and have some ideas around that, but there will always be cases like this). It would be nice if I could write something like assume(all(x >= 0 for x in xs)) and have any chance of it succeeding for large lists.
Well, with the new system it will succeed for about half of all parameter values. So the logic becomes: generate a parameter, then make some number of draws of data from it. Eventually draw a new parameter anyway, but if the rejection rate is high, draw a new one much sooner.
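In sketch form (again, an illustration of the idea rather than Hypothesis’s actual implementation, and the thresholds are made up):

```python
def find_satisfying_examples(strategy, predicate, rng,
                             n_examples=100, max_parameters=1000,
                             draws_per_parameter=10):
    """Draw a parameter, try some data under it, and give up on a
    parameter early if it keeps producing rejected data."""
    examples = []
    for _ in range(max_parameters):
        if len(examples) >= n_examples:
            break
        parameter = strategy.draw_parameter(rng)
        rejections = 0
        for _ in range(draws_per_parameter):
            value = strategy.draw_value(rng, parameter)
            if predicate(value):
                examples.append(value)
            else:
                rejections += 1
                if rejections > draws_per_parameter // 2:
                    break  # this parameter looks unpromising; resample it
    return examples


# e.g. lists of non-negative floats, using the sketch strategies above:
# find_satisfying_examples(
#     ListStrategy(FloatStrategy()),
#     lambda xs: all(x >= 0 for x in xs),
#     random.Random(0),
# )
```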
This has all landed in master as of yesterday evening, and I’m pretty pleased with it so far. I’m going to do a bit more work on Hypothesis over the next few days, but expect to see it in a released version near you quite soon.