Category Archives: Hypothesis

Revised Hypothesis 1.0 plan

So in my previous post about Hypothesis 1.0 I stated the following priorities:

  1. A stable API
  2. Solid documentation
  3. The ability to get reports about a run out – number of examples tried, number rejected, etc.
  4. Coverage driven example discovery

Well, uh, I’m still planning on the first two (don’t worry, they’re hard requirements. The release date will slip before I compromise on them).

The latter two though? Eh. They just don’t seem that important for a 1.0 release.

The reports would be nice to have. They’re definitely helpful, and they’re certainly important for avoiding useless tests, but I wouldn’t say they’re essential.

But more importantly, I know how to do them without breaking the current API.

After some thought I’ve realised that there are essentially two major categories of things that I need to have in the 1.0 release.

  1. Things that would require me to break the API
  2. Things that will get people excited enough about Hypothesis to use it

2 is pure marketing, but marketing is important here. My goal with Hypothesis is to make the world of Python a better tested environment, and if people aren’t actually using Hypothesis it will have failed at that goal, regardless of what an amazing piece of software it is.

And the 1.0 release is the big marketing opportunity. It’s not quite go big or go home, but it’s the point at which I’m throwing Hypothesis out to the world and saying “Here is my testing library. It’s awesome. You should use it”, and at that point it had better have a pretty compelling story for getting people to use it.

So those are the things I need for 1.0: excitement and stability.

And… well the coverage driven stuff is exciting to me, and it’s exciting to other quickcheck nerds, but honestly we’re not the target demographic here. Any quickcheck nerds who are writing Python should be just taking one look at Hypothesis and going “oh yes. It must be mine. Yes”, because honestly… the other quickcheck libraries for Python are terrible. I don’t like to trash talk (OK, I love to trash talk, but it’s uncouth), but if your quickcheck library doesn’t do example minimization then it’s a toy. Hypothesis started better than all the other quickcheck libraries, and it’s gone from strength to strength.

So coverage doesn’t break the API and doesn’t excite people, which means it’s out. I’d like to do it at some point, but for now it’s just not important.

Reporting is useful, but it’s both API non-breaking and boring, so that’s out too.

So what’s in instead?

It’s a short list:

  1. Django integration

OK. That’s a slight lie, because it turns out there’s a feature I need for the Django integration (the details of why will come in another post). So the actual list:

  1. Bringing in the ideas from testmachine
  2. Django integration

And this is actually very compatible with both requirements, because the testmachine ideas are among the ideas most likely to completely break the API. I… am not actually sure what I was thinking when trying to relegate them to a post-1.0 release – I have some ideas for how they could have been integrated without changing the API much, but I think the result would have been really painfully limited in comparison to what I could do.

So what will the Django integration look like?

It will look like, if you will excuse me briefly violating my blog’s editorial guidelines, fucking magic is what it will look like.

You see, Django models come with everything Hypothesis needs to know how to generate them. They’re basically giant typed records of values. Hypothesis loves typed records of values. When this works (and I pretty much know how to make all the details work), the Django integration is going to look like:

from hypothesis.extra.django import TestCase
from hypothesis import given
from myproject.mymodels import SomeModel
from myproject.myforms import SomeForm
 
class TestSomeStuff(TestCase):
    @given(SomeModel)
    def test_some_stuff(self, model_instance):
        ...
 
    @given(SomeForm)
    def test_some_other_stuff(self, form_instance):
        ...

That’s it. No fixtures, no annoying specifications of different edge cases for your model to be in, just ask Hypothesis for a model instance and it will say “sure. Here you go”. Given the upcoming fake factory integration it will even do it with nice data.

And, well, this may not be research level innovation on the theme of Quickcheck, but as far as compelling marketing messages for bringing property based testing to the masses I can’t really think of a better one than “Hey, all of you people using that popular web framework, would you like to do more testing with not only zero extra effort but you can also throw away a bunch of laborious boring stuff you’re already doing?”

So that’s the plan. I suspect this is going to cause a bit of slippage in the timeline I’d originally intended but, well, I was also planning to be looking for jobs this month and I’ve mostly thrown myself into Hypothesis instead, so it’s not like I don’t have the time. I’m going to try to still be on track for an end of February release, but it’s no big deal if it ends up creeping into March.

Changes to the Hypothesis example database

One of the most notable features of Hypothesis is the example database. Unlike classic Quickcheck, when Hypothesis finds a falsifying example it saves it in its database. The next time it needs an example of that type it tries the previously saved example before moving on to random generation.

The initial version of this was really limited though. The problem is that this happens at the falsify level. So, for example, a failure in @given(int, int) would save the failing pair for (int, int), and attempts to falsify other things with @given(int, int) would reuse it. However, if you asked for @given((int, int), str) it would not reuse the previous example.

The result is that you don’t get to share any interesting features between examples – only exact matches. This seems a shame.

This turns out to be fixable, and the fix is both fairly straightforward and quite interesting in that it only really works because of the ridiculously dynamic nature of Hypothesis.

Conceptually all that happens is that whenever you save something you also save all the subcomponents, and whenever you generate something you have a chance of generating any previously saved examples for it. It’s easy to say, but the details are a little complicated. Maybe there’s a better way to do it, but the approach I ended up taking is very specific to Hypothesis.

The main component is that Hypothesis’s weird little home-brewed version of type classes is made up of first-class values in the language. Both the “types” that it maps from and the values it produces are just Python values, and can be manipulated as such.

We use this in two ways: First, Hypothesis strategies grew a decompose method. This iterates over (descriptor, value) pairs. So for example when saving (1, “foo”) as an (int, str), this would yield (int, 1) and then (str, “foo”), and we would then save those values for those types.

(There’s a slight wart here. Sometimes these values decompose as non-serializable types, so we need to check for that case and not save when that happens).
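
To make that concrete, here’s roughly what the saving side might look like. All the names here (InMemoryDatabase, AtomicStrategy, save_example and so on) are mine for illustration; this is a sketch of the idea, not Hypothesis’s actual internals:

class InMemoryDatabase:
    def __init__(self):
        self.storage = {}

    def save(self, descriptor, value):
        self.storage.setdefault(descriptor, []).append(value)


class AtomicStrategy:
    def __init__(self, descriptor):
        self.descriptor = descriptor  # e.g. int or str

    def decompose(self, value):
        return []  # atomic values have no subcomponents


class TupleStrategy:
    def __init__(self, element_strategies):
        self.element_strategies = element_strategies
        self.descriptor = tuple(s.descriptor for s in element_strategies)

    def decompose(self, value):
        # Decomposing (1, "foo") as (int, str) yields (int, 1) then (str, "foo").
        return [(s.descriptor, v) for s, v in zip(self.element_strategies, value)]


def save_example(database, strategy_table, strategy, value):
    # Save the value under its own descriptor, then recurse into its parts.
    # (A real version would also skip non-serializable components, per the
    # wart mentioned above.)
    database.save(strategy.descriptor, value)
    for descriptor, sub_value in strategy.decompose(value):
        save_example(database, strategy_table, strategy_table[descriptor], sub_value)


db = InMemoryDatabase()
table = {int: AtomicStrategy(int), str: AtomicStrategy(str)}
save_example(db, table, TupleStrategy([table[int], table[str]]), (1, "foo"))
# db.storage is now {(int, str): [(1, "foo")], int: [1], str: ["foo"]}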

So that’s the saving, but what next? How do we get these values back out? This is particularly important because arguments to @given are generated as a single strategy, so these will never actually help – a @given(int) isn’t using the strategy for int, it’s using the strategy for ((int,), {}) (one positional arg, no keyword args).

The next step takes advantage of the fact that the strategies are first class values, with a little bit of an assist from the parametrized data generation. We introduce the idea of an ExampleAugmentedStrategy. This is a strategy that wraps another strategy with some examples that could have come from it. It’s essentially biasing a strategy towards some favoured examples. Sometimes it picks from those examples, sometimes it generates some fresh ones (which ones, and the probability of generating them, are decided by the parameter value).

The strategy table is then modified to look up previous examples and if any are found return an augmented strategy.
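
Here’s a rough sketch of that, continuing with the illustrative names from the snippet above and assuming strategies expose a draw_parameter / draw_value pair for the parametrized data generation mentioned above. Again, the names and the way the reuse probability is chosen are made up for the example, not the real implementation:

class ExampleAugmentedStrategy:
    def __init__(self, base_strategy, saved_examples):
        self.base_strategy = base_strategy
        self.saved_examples = saved_examples

    def draw_parameter(self, rng):
        # The parameter carries both a probability of reusing a saved example
        # and whatever parameter the wrapped strategy wants.
        return (rng.random() * 0.5, self.base_strategy.draw_parameter(rng))

    def draw_value(self, rng, parameter):
        reuse_probability, base_parameter = parameter
        if self.saved_examples and rng.random() < reuse_probability:
            return rng.choice(self.saved_examples)  # one of the favoured examples
        return self.base_strategy.draw_value(rng, base_parameter)  # a fresh value


def lookup_strategy(strategy_table, database, descriptor):
    # The strategy table lookup: if the database has saved examples for this
    # descriptor, wrap the base strategy so they get a chance of being replayed.
    base = strategy_table[descriptor]
    saved = database.storage.get(descriptor, [])
    return ExampleAugmentedStrategy(base, saved) if saved else base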

Which brings us to the desired end point: The examples we previously extracted can now be reassembled into other examples, allowing Hypothesis to essentially learn pieces of the problem from other tests.

In theory this should allow the construction of much harder to find examples. In practice I haven’t seen all that much benefit from it yet, because I’m still figuring out what a good workflow for the database is (I don’t really have a way to save it between Travis runs, which makes it of limited use), but I think it’s starting to show a lot of promise.

A plan for Hypothesis 1.0

So I’ve just released Hypothesis 0.4. It implements the example database I discussed in The Pain of Randomized Testing. Every time Hypothesis finds a failure it will save it for later use, and every property you test in Hypothesis will use the saved database to reproduce any failures it’s seen before (and sometimes find new ones, because it allows test cases to cross between different properties). This is pretty cool.

It has been almost exactly two weeks since I wrote my short term roadmap. In that time I’ve knocked out two of the features I wanted to get done: The example database and the improved data generation. I’ve also seriously improved the quality of the testing of Hypothesis itself and found a whole bunch of bugs as a result. So things are going pretty well.

But I’m away next week, and I’m basically giving myself until the end of February to get 1.0 out. So it’s time to get serious about a concrete plan for the roadmap.

So the following are what I’m declaring requirements for 1.0:

  1. A stable API
  2. Solid documentation
  3. The ability to get reports about a run out – number of examples tried, number rejected, etc.
  4. Coverage driven example discovery

I could maaaaybe drop one of the latter two but not both, but I’m very reluctant to do so. The first two are hard requirements. If it doesn’t have a stable well documented API then it’s not 1.0. This isn’t one of those clownshoes open source hack projects where the specified API is whatever the code happens to implement and the documentation is the source code.

From the original list I’m explicitly taking testmachine off the table for now. I want to do it, and I will do it, but I’m officially relegating it to the giant pile of “Oooh wouldn’t it be neat if- gah, no David, you can do that after 1.0” along with “generating strings matching a regular expression”, “automatically generating Django model object graphs” and “driving Selenium from hypothesis” and a whole bunch of other ideas. It will most likely be the 1.1 flagship feature.

A top level driver for Hypothesis will probably shake out of the wash as part of the reporting work. We’ll see how that goes. It’s neither a requirement nor an explicit not-for-release.

This leaves what should be an achievable amount of work. I have prototypes demonstrating the feasibility of the coverage driven example discovery, so based on past experience it will probably take me about two hours to actually get it hooked in and most of a week to get it solid enough that I’m happy with it. Reports should be a bit less than that.

Which gives me about two weeks to stabilise the API and write documentation. I think I can do that. I’ve been doing some of it as I go, and will do more of it as I implement the other two. The main difficulty I have here is basically deciding how much of Hypothesis’s internal weirdness should be exposed as API and how much I should hide behind a more pythonic surface. I suspect I’m going to go full weird and just tidy it up so that it’s as nice to use as possible.

So that’s the plan. I’m not getting everything I wanted to get done, but if I had then the end result wouldn’t be nearly as solid. I’m opting to strike a balance between production worthy and interestingly novel, and I think this is pretty much the right one.

And it’s a pretty exciting place to be. Right now it’s teetering there already, but if all this goes according to plan I’m basically prepared to state that by the end of February randomized property based testing will have three categories: Quviq’s Quickcheck, Hypothesis, and everything else. Because the former is proprietary (even the documentation!) I don’t really know what it’s capable of, but Hypothesis’s capabilities are going to be significantly beyond anything that seems to exist in the open source world.

On taking the ideas in Hypothesis

Last night on twitter, Eiríkr Åsheim asked me about the plausibility of taking some of the ideas from Hypothesis and putting them into Scalacheck. I thought I’d summarize and elaborate on my response.

Basically, I’m incredibly for people taking ideas from Hypothesis. The new developments on Hypothesis are mostly about taking other peoples’ good ideas from other areas, cross-pollinating them, and putting in a ton of work to make them as effective as possible, so it would be entirely churlish for me to object to other people taking the ideas.

I think Eiríkr was mostly asking about the possibility of doing so though, to which again my response is basically “yes this is 100% possible and if you want to do this I will help you”.

Some of the work on Hypothesis is basically how to make the best possible Quickcheck port in a dynamic language. If you’re working in Scala you probably don’t care about that that much, but maybe it would be interesting for Clojure I don’t know. To be honest I feel like this section of Hypothesis is a bit weird and I’m not really that interested in spreading it. If you want to know how it works we can talk about it but for now I’ll continue to just look faintly embarrassed about the subject.

The rest of Hypothesis though is just about how to make the best possible version of Quickcheck and the fact that I’m doing it in Python is practically a coincidence. The parametrization, derandomization, multi-armed bandits, example database and coverage driven discovery are all stuff you could write in any language you cared to name (though coverage driven discovery depends on how good your access to profiling/coverage APIs is, and the parametrization may be kinda annoying in most statically typed languages). There’s nothing Python-specific about them, and if you’re prepared to put in the work your favourite Quickcheck port could easily grow them, and I will help you do so.

One of my requirements for a 1.0 release of Hypothesis is a vast amount of documentation. In particular there will be design documents for all these features which should make this a ton easier. In the meantime if there’s something that I haven’t written about that you want to know how it works, just ask me. Maybe prepare to be there for a while though as I probably won’t shut up about it.

Using multi-armed bandits to satisfy property based tests

This is a post about Hypothesis’s data generation. It’s much more sophisticated than any other property based testing system whose implementation I’m aware of. Quviq’s Erlang Quickcheck may be more advanced, I don’t know – it’s proprietary so I haven’t looked at the source code – but as far as I know the way this works is entirely novel (with inspiration heavily drawn from other people’s ideas – the novelty here is about 70% just that it’s applied to property based testing and 30% some interesting twists on how it works).

Hypothesis’s approach to data generation generates a wide variety of things that are really hard to produce in a traditional property based testing model without a lot of careful hand tuning, and more interestingly it adaptively seeks out novel examples and takes user feedback to prune bad examples without you having to write custom strategies to support it. If you ask for a list of integers but really only want long lists of positive integers you can write something like assume(len(xs) >= 100 and all(x >= 0 for x in xs)) and Hypothesis will rapidly learn the shape of the data that you want and avoid wasting your time by giving you too many examples that don’t satisfy it. Even without using assume it will try you on lists of all positive integers, lists of all negative integers, and a mix in between.
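
For instance, a made-up test along these lines (the exact import and the list-of-integers descriptor spelling may vary between versions):

from hypothesis import assume, given

@given([int])
def test_long_positive_lists_sum_to_something_non_negative(xs):
    # Tell Hypothesis which examples we actually care about. Rather than
    # rejecting most of what it generates, it learns to favour parameter
    # values whose examples tend to satisfy this.
    assume(len(xs) >= 100 and all(x >= 0 for x in xs))
    assert sum(xs) >= 0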

Let’s see how this works.

The general approach of data generation in property based testing is that you have a random number generator and some parameter that can control the shape of the distribution. For traditional quickcheck this parameter is an integer called “size” and controls the shape of the distribution by just generating larger examples when size is larger. The first version of Hypothesis did this (I tried to do a thing where size corresponded to the entropy of the distribution but this didn’t really work out that well), and it worked OK.

The problem is that there’s very little you can do to tune the shape of this distribution beyond trying larger and smaller examples, and that can miss out on a bunch of interesting things. There’s a concept called Swarm Testing (full disclosure: I learned about it from a blog post rather than the paper) which is basically as follows: In any given run of a random test, turn off some of the features of what you can generate. This lets you get a much higher degree of coverage of interaction between the features that you chose to leave in.
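
As a tiny illustration of the swarm idea (a standalone toy, nothing to do with Hypothesis’s code): pick a random subset of “features” at the start of each run and only generate from those.

import random

FEATURES = ["negative", "zero", "small", "huge"]

def generate_integer(rng, enabled):
    kind = rng.choice(sorted(enabled))
    if kind == "negative":
        return -rng.randint(1, 100)
    if kind == "zero":
        return 0
    if kind == "huge":
        return rng.randint(10 ** 6, 10 ** 9)
    return rng.randint(1, 100)  # "small"

def run_of_examples(seed):
    rng = random.Random(seed)
    # Turn a random subset of features off for this entire run, so the ones
    # left on get exercised much more heavily together.
    enabled = {f for f in FEATURES if rng.random() < 0.5} or set(FEATURES)
    return [generate_integer(rng, enabled) for _ in range(10)]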

I originally took this idea verbatim and it worked OK, but I’m now going to skip the rest of the history lesson and tell you what Hypothesis does now which is essentially a unification and significant generalisation of both the size parameter and the feature flags.

Any given strategy for data generation defines two probability distributions. One of these is a parameter distribution. You simply give it a random number generator and you get back a parameter. This is an opaque value which you do not know anything about. It could be an integer, it could be a giant opaque binary blob. It’s probably a tuple of numbers but you’re not supposed to know that shh. The only valid operation on a parameter value is to feed it back to the strategy it came from (for statically typed fans tuning in from home imagine that a strategy for producing type a is parameterised as (Strategy b a) and you only have access to an (exists b. Strategy b a). For non statically typed fans ignore this parenthetical). The second distribution is the conditional data distribution. You feed it a random number generator and a previously generated value of the parameter and you get a value out.
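
In code, you can picture a strategy as exposing roughly this interface. The method names (draw_parameter, draw_value) and the particular shape of the parameter are mine for illustration, not the actual Hypothesis API:

import random

class IntStrategy:
    def draw_parameter(self, rng):
        # Opaque to the caller. Here it happens to be a pair of
        # (probability of a value being negative, typical magnitude).
        return (rng.random(), rng.expovariate(0.01))

    def draw_value(self, rng, parameter):
        p_negative, magnitude = parameter
        value = int(rng.expovariate(1.0 / (magnitude + 1)))
        return -value if rng.random() < p_negative else value


rng = random.Random(0)
strategy = IntStrategy()
parameter = strategy.draw_parameter(rng)
examples = [strategy.draw_value(rng, parameter) for _ in range(5)]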

Why this distinction? You can just draw a value by drawing from the parameter and feeding it straight back in. Why not just do that?

The reason is that you can reuse the same parameter multiple times in a given run. So for example when generating a list of integers, you draw the parameter for the integer once, then generate all elements from the list according to that single parameter value. This lets you produce things like “long lists of positive integers” and “long lists of negative integers” which would have previously happened with probability \(\approx 0\) when you chose all of the elements completely independently (they’re still independent conditional on the parameter in this model).
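
So a list strategy draws the element parameter once and reuses it for every element, roughly like this (building on the illustrative IntStrategy above):

class ListStrategy:
    def __init__(self, element_strategy):
        self.element_strategy = element_strategy

    def draw_parameter(self, rng):
        # One element parameter, drawn once, plus an average length.
        return (self.element_strategy.draw_parameter(rng), rng.expovariate(0.05))

    def draw_value(self, rng, parameter):
        element_parameter, average_length = parameter
        length = int(rng.expovariate(1.0 / (average_length + 1)))
        # Every element shares the same element parameter, so a parameter that
        # leans towards positive integers biases the whole list that way.
        return [
            self.element_strategy.draw_value(rng, element_parameter)
            for _ in range(length)
        ]

With that, ListStrategy(IntStrategy()) produces runs of mostly-positive lists and runs of mostly-negative lists, rather than every list being an even mixture.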

This on its own is powerful enough. Doing our top level data generation as “draw a parameter, pass it back in” produces a much lumpier distribution which can get at those hard to reach corners of the search space.

But what’s really powerful is what this then enables: Despite the completely opaque nature of the parameter object you can use it to guide the search space. This is because although you can’t look at the parameter you can look at the generated data.

This means that if you have some way of determining whether a particular example is good or bad you can use a multi-armed bandit algorithm to select parameter values which tend to produce better results.

So we keep track of how many examples from each parameter we’ve generated and how many of those turned out to be bad and then we do Thompson Sampling to select the next parameter.

That’s pretty much it! There are a few important details though:

  1. the Thompson sampling implementation I use has a special arm which is “draw a new parameter value”. Its score is drawn from a beta distribution whose alpha + beta is the number of parameters drawn so far and whose beta is the number of times that the first value drawn from a parameter was bad. If this arm is selected we pick a fresh parameter value and add it to the list (there’s a sketch of this after the list).
  2. we start with a specified minimum number of parameters and just draw fresh parameters until we reach that number (it’s currently set to 25 but that’s pretty arbitrary and will likely change)
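
Here’s a simplified sketch of how those pieces fit together. The priors, bookkeeping and names are my own simplification rather than the exact implementation:

class ParameterBandit:
    def __init__(self, strategy, rng, min_parameters=25):
        self.strategy = strategy        # anything with a draw_parameter(rng) method
        self.rng = rng                  # e.g. a random.Random instance
        self.min_parameters = min_parameters
        self.parameters = []            # parameter values drawn so far
        self.good = []                  # good examples seen per parameter
        self.bad = []                   # bad examples seen per parameter
        self.first_draw_bad = 0         # parameters whose first value was bad

    def _fresh_parameter(self):
        self.parameters.append(self.strategy.draw_parameter(self.rng))
        self.good.append(0)
        self.bad.append(0)
        return len(self.parameters) - 1

    def choose(self):
        n = len(self.parameters)
        if n < self.min_parameters:
            return self._fresh_parameter()
        # The special "draw a new parameter" arm: alpha + beta is the number
        # of parameters drawn so far and beta is how many of them gave a bad
        # first value (clamped so the beta distribution stays well defined).
        best_score = self.rng.betavariate(
            max(n - self.first_draw_bad, 1), max(self.first_draw_bad, 1)
        )
        best_index = None
        for i in range(n):
            score = self.rng.betavariate(self.good[i] + 1, self.bad[i] + 1)
            if score > best_score:
                best_score, best_index = score, i
        return self._fresh_parameter() if best_index is None else best_index

    def record(self, index, was_bad):
        # Update the counts for the parameter the example was drawn from.
        if was_bad:
            self.bad[index] += 1
            if self.good[index] + self.bad[index] == 1:
                self.first_draw_bad += 1
        else:
            self.good[index] += 1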

What also matters is what counts as bad. Hypothesis uses two criteria to determine if an example is bad:

  1. If it fails an assume check
  2. If we’ve seen this exact example before

The second one is… slightly dubious in that it means that the assumptions of the multi-armed bandit aren’t actually holding true – the success of an arm depends on what’s come before, and so the probability changes over time. I can only offer in defence that it seems to work pretty well in practice.

Anyway, the result of failing things for the first criterion is that Hypothesis will automatically be more likely to pick values that pass your assume checks – it will pick regions of the parameter space that have a higher probability of doing so. It isn’t able to magically understand the shape of your assume and pick perfectly, but assuming there is any relation at all between the structure of the example that it understands and the thing you’re assuming, it will tend to do better than a uniform draw across all the values.

The second criterion essentially forces it to seek out novelty. It’s the analogue of the size parameter – when things start getting too similar to what it’s seen before it is drawn to pick a new parameter and add some more variety to the mix.

This approach is still quite new and will doubtless see more refinement, but I’m pretty happy with it so far. It’s also going to be very useful in the 0.5 release when I start adding coverage driven example discovery because it will allow rewarding branches that produce new and interesting examples as well as penalising ones that are actively bad.

At the moment it’s hard for me to tell how much better this is in practice. It’s certainly the case that the quality of the distributions is required to make some tests pass in Hypothesis itself, but that may be artificial enough that in real live usage it’s a lot less of an advantage. However experiments with testing Hypothesis itself suggest that with the right sorts of tests there’s a lot of mileage you can get out of covering obscure corners of the test space and that anything that can raise the probability of hitting those corners is worth doing. This is at the very least a very big step in the right direction.
