Category Archives: Uncategorized

Changes to the Hypothesis example database

One of the most notable features of Hypothesis is the example database. Unlike classic Quickcheck, when Hypothesis finds a falsifying example it saves it in its database. The next time it needs an example of that type it tries the previously saved examples before moving on to random generation.

The initial version of this was really limited though. The problem is that this happens at the falsify level, so for example a failure in @given(int, int) would save the failing pair of ints under the descriptor (int, int), and further attempts to falsify things with @given(int, int) would reuse it. However if you then asked for, say, @given((int, int), str), it would not reuse the previous example.

The result is that you don’t get to share any interesting features between examples – only exact matches. This seems a shame.

This turns out to be fixable, and the fix is both fairly straightforward and quite interesting, in that it only really works because of the ridiculously dynamic nature of Hypothesis.

Conceptually all that happens is that whenever you save something you also save all the subcomponents, and whenever you generate something you have a chance of generating any previously saved examples for it. It’s easy to say, but the details are a little complicated. Maybe there’s a better way to do it, but the approach I ended up taking is very specific to Hypothesis.

The main component is that Hypothesis’s weird little home-brewed version of type classes is built out of first-class values in the language. Both the “types” that it maps from and the values it produces are just Python values, and can be manipulated as such.

We use this in two ways: First, Hypothesis strategies grew a decompose method. This iterates over (descriptor, value) pairs. So for example when saving (1, “foo”) as an (int, str), this would yield (int, 1) and then (str, “foo”), and we would then save those values for those types.

(There’s a slight wart here. Sometimes these values decompose as non-serializable types, so we need to check for that case and not save when that happens).
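To make that a bit more concrete, here is roughly what the decompose-and-save step might look like for a tuple. The names here are made up for the sketch; the actual Hypothesis internals differ.

```python
# Illustrative sketch only; method and attribute names are invented here.
class TupleStrategy:
    def __init__(self, element_strategies, descriptor):
        self.element_strategies = element_strategies
        self.descriptor = descriptor  # e.g. (int, str)

    def decompose(self, value):
        # Yield (descriptor, value) pairs for each component, so that saving
        # (1, "foo") as (int, str) also yields (int, 1) and (str, "foo").
        for strategy, component in zip(self.element_strategies, value):
            yield strategy.descriptor, component


def save_with_components(database, strategy, value):
    # Save the value under its own descriptor, then save each component
    # under its descriptor, skipping anything that can't be serialized.
    database.save(strategy.descriptor, value)
    for descriptor, component in strategy.decompose(value):
        try:
            database.save(descriptor, component)
        except TypeError:  # assumed signal for "not serializable"
            continue
```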

So that’s the saving, but what next? How do we get these values back out? This is particularly important because arguments to @given are generated as a single strategy, so these will never actually help – a @given(int) isn’t using the strategy for int, it’s using the strategy for ((int,), {}) (one positional arg, no keyword args).

The next step takes advantage of the fact that the strategies are first class values, with a little bit of an assist from the parametrized data generation. We introduce the idea of an ExampleAugmentedStrategy. This is a strategy that wraps another strategy with some examples that could have come from it. It’s essentially biasing a strategy towards some favoured examples. Sometimes it picks from those examples, sometimes it generates some fresh ones (which ones, and the probability of generating them, are decided by the parameter value).

The strategy table is then modified to look up previous examples and if any are found return an augmented strategy.
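In simplified form the idea is something like this (again with invented names: parameter, produce, lookup and fetch are just for the sketch, not the real API):

```python
# Simplified sketch of the idea; not Hypothesis's actual implementation.
class ExampleAugmentedStrategy:
    def __init__(self, base_strategy, examples):
        self.base_strategy = base_strategy
        self.examples = list(examples)  # previously saved values

    def parameter(self, rng):
        # The parameter controls how strongly we favour the saved examples.
        return (rng.random(), self.base_strategy.parameter(rng))

    def produce(self, rng, parameter):
        reuse_probability, base_parameter = parameter
        if self.examples and rng.random() <= reuse_probability:
            return rng.choice(self.examples)
        return self.base_strategy.produce(rng, base_parameter)


def strategy_for(descriptor, table, database):
    # Look up the normal strategy, then wrap it with any saved examples.
    base = table.lookup(descriptor)
    saved = database.fetch(descriptor)
    return ExampleAugmentedStrategy(base, saved) if saved else base
```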

Which brings us to the desired end point: The examples we previously extracted can now be reassembled into other examples, allowing Hypothesis to essentially learn pieces of the problem from other tests.

In theory this should allow the construction of much harder to find examples. In practice I haven’t yet seen all that much benefit from it, because I’m still figuring out what a good workflow for the database is (I don’t really have a way to save it between Travis runs, which makes it of limited use), but I think it’s starting to show a lot of promise.


Better prediction of probabilities through auctions

I bet you think this is another post about Hypothesis, don’t you?

Nope! This is about something completely different. This is about prediction markets and proper scoring rules.

Which aren’t actually something I know that much about, so it’s very likely everything in this post is either obvious or a bad idea. This should just be considered a “here’s some stuff I thought of while reading about this and I couldn’t find anyone who wrote this down before so I’ll write it down now” post.

A scoring rule is a way of rewarding predictions. You have some possibilities A1, …, An which are disjoint and cover everything. So e.g. you could have A1 = “there will be a Labour majority in the UK 2015 elections”, A2 = “there will be a Conservative majority”, A3 = “none of the above”. You want people to assign probabilities to these events. They give you a probability distribution and, depending on which outcome occurs, you give them a reward (or a penalty, for negative scoring rules, which a lot of them are).

A scoring rule is proper if predicting the correct probabilities is an optimal strategy and strictly proper if it’s the only optimal strategy. For example a scoring rule that gives you a score of log(r_i), where r_k is the probability you assigned to outcome k and i is the outcome that occurs, is a strictly proper scoring rule. A scoring rule that assigns you a score of 0 regardless of what happens is proper but not strictly proper (you can’t do better than 0, but no other prediction does worse than that either).
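(For concreteness, the standard argument that the log rule is strictly proper, under the usual constraint that \(\sum r_k = 1\): if the true probabilities are \(p\), your expected score is

\[ \sum_k p_k \log r_k = \sum_k p_k \log p_k - \sum_k p_k \log \frac{p_k}{r_k}, \]

and the second sum is the Kullback-Leibler divergence, which is non-negative and zero only when \(r = p\), so reporting the true probabilities is the unique maximiser.)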

When thinking about these things I noticed a particularly simple sort of proper scoring rule. It’s not strictly proper, but under a certain natural extension of the definitions can be made into a strictly proper extended scoring rule.

The rule is this:

Let \(q_1, \ldots, q_n\) be any sequence of numbers in \([0, 1]\). They do not have to sum to \(1\). Let \(r_1, \ldots, r_n\) be your prediction. Your total cost is \(C = \sum\limits_{i: q_i < r_i} q_i\). Let \(i\) be the winning outcome. Then you score \(1 - C\) if \(r_i > q_i\) and \(-C\) otherwise.
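Spelled out in code, just to pin the definition down (a purely illustrative sketch):

```python
def auction_score(q, r, winner):
    """Score the prediction r against thresholds q when outcome `winner` occurs.

    You pay q[i] for every coordinate where you bid above the threshold
    (r[i] > q[i]), and you collect 1 if you were above the threshold on
    the winning outcome.
    """
    cost = sum(q_i for q_i, r_i in zip(q, r) if r_i > q_i)
    reward = 1 if r[winner] > q[winner] else 0
    return reward - cost
```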

You might notice that you are basically conducting a simultaneous Vickrey auction on each coordinate.

Theorem: Under this scoring rule, a distribution \(r\) obtains the optimal expected score if and only if \(r_i \leq q_i\) whenever \(p_i < q_i\) and \(r_i > q_i\) whenever \(p_i > q_i\).

Proof:

In order to simplify this proof we will show that this is true without the constraint that \(\sum r_i = 1\). Showing that these are the optimal solutions in the general case also implies that they are the optimal solutions in the specific case.

Let \(t_i = 1\) if \(r_i > q_i\) and \(t_i = 0\) otherwise. Then your expected score is \(\sum t_i (p_i - q_i)\), because you pay a cost of \(q_i\) for each case where \(t_i = 1\) and nothing for each case where \(t_i = 0\), and you get a benefit of \(1\) with probability \(p_i\) if \(t_i = 1\) and \(0\) otherwise.

This means that the coefficient of \(t_i\) is negative if \(p_i < q_i\), positive if \(p_i > q_i\) and \(0\) when \(p_i = q_i\). So the only way to achieve optimality is to set \(t_i = 0\) when the coefficient is negative and \(t_i = 1\) when it is positive. It doesn’t matter what you set it to when the coefficient is \(0\) because the expected value is the same either way.

QED

Note that this uses very little about your chosen distribution: All you’re actually doing is picking a set of values and saying “I think your current choice of estimates undervalues these and I’m prepared to stake money on it”. This is why it’s not a strictly proper scoring rule: Any distribution which correctly identifies which values are underconfident will do.

There are two things you can do with this. One of them turns it into a strictly proper scoring rule of sorts, the other just uses that feature as is. I’ll now explain both:

Randomizing for fun and expected profit

The solution which makes it a strictly proper scoring rule is as follows: Let \(Q\) be any random vector taking values in \([0, 1]^n\) with a smooth and strictly positive density.

Now play exactly the above scoring rule, but you make your prediction and then \(Q\) is drawn and you play the scoring rule against the drawn value.

Let’s see that the only optimal choice of \(r_i\) is \(p_i\):

Consider again the profit you make from component \(i\). Call this profit \(T_i\). Then \(E(T_i \mid Q_i = q_i) = p_i - q_i\) if \(r_i > q_i\) and \(0\) if \(r_i \leq q_i\).

So \(E(T_i) = (p_i - E(Q_i \mid Q_i < r_i)) P(Q_i < r_i) = p_i P(Q_i < r_i) - \int\limits_0^{r_i} x f_i(x)\, dx = \int\limits_0^{r_i} (p_i - x) f_i(x)\, dx\)

Where \(f_i\) is the marginal density of \(Q_i\). Now, by assumption \(f_i\) is strictly positive. That means the integrand \((p_i - x) f_i(x)\) is strictly positive for \(x < p_i\) and strictly negative for \(x > p_i\), so the integral is strictly increasing in \(r_i\) up to \(p_i\) and strictly decreasing beyond it, and the optimal value has to be at \(r_i = p_i\).

QED

We can also derandomize this given any explicit choice of \(Q\): You make your prediction and then when the results come in we give you what would have been the expected value of your choice.

Let’s consider how that would work for the simple case where each \(q_i\) is chosen uniformly at random. Splitting it up once more into benefit and cost, your expected reward is the probability that you would have beaten the threshold on the winning index \(i\), which is \(P(Q_i < r_i) = r_i\).

Meanwhile your total expected cost is \(\sum E(Q_i \mid Q_i \leq r_i) P(Q_i \leq r_i) = \sum \frac{1}{2} r_i^2\). So your reward is \(r_i - \frac{1}{2} \sum r_i^2\). And this is just our old friend the quadratic scoring rule! So it’s possible that by picking other interesting probability distributions for \(Q\) we could produce other interesting proper scoring rules. I don’t know. I haven’t really investigated.
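If you don’t feel like trusting the algebra, a quick simulation makes the same point (a rough sketch, nothing clever):

```python
import random

def simulated_score(r, winner, trials=100_000):
    # Play the randomized rule with each Q_i uniform on [0, 1] and average.
    total = 0.0
    for _ in range(trials):
        q = [random.random() for _ in r]
        cost = sum(q_i for q_i, r_i in zip(q, r) if r_i > q_i)
        reward = 1 if r[winner] > q[winner] else 0
        total += reward - cost
    return total / trials

def quadratic_score(r, winner):
    # The closed form derived above: r_i minus half the sum of squares.
    return r[winner] - 0.5 * sum(x * x for x in r)

# simulated_score([0.2, 0.5, 0.3], 1) should come out close to
# quadratic_score([0.2, 0.5, 0.3], 1) == 0.5 - 0.5 * 0.38 == 0.31.
```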

Prediction markets

As Robin Hanson famously observed, you can turn any proper scoring rule into a prediction market. I don’t think what I’m about to describe is actually an instance of his scheme, but I don’t feel like I’ve sufficiently digested said scheme to be sure that it’s not. It’s certainly very close even if it’s not exactly the same thing.

You can use the original non-random form of the scoring rule to create what’s basically the world’s simplest prediction market:

You have n disjoint events as above that you want to predict, and you want to solicit the wisdom of the crowd in predicting which will occur.

You do so by placing an initial non-zero bid on each option, \(b_i\). The market is now open for bidding.

A bid consists of simply picking one of the outcomes. When bidding on outcome \(i\) you pay the current market prediction, which is \(\frac{b_i}{\sum b_j}\) and increment \(b_i\) by one. If this is the outcome that occurs then you will get a reward of \(1\) for your bid. People may bid as many times as they want on as many options as they want.
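Here is the whole market sketched in code, to make the mechanics concrete (illustrative only):

```python
class SimpleMarket:
    def __init__(self, initial_bids):
        # One non-zero starting bid per outcome, placed by the market maker.
        assert all(b > 0 for b in initial_bids)
        self.bids = list(initial_bids)
        self.positions = []  # (bidder, outcome, price paid)

    def prediction(self, outcome):
        # The current market prediction for this outcome.
        return self.bids[outcome] / sum(self.bids)

    def bid(self, bidder, outcome):
        # Pay the current prediction, then increment the bid count, which
        # raises this outcome's prediction and lowers all the others.
        price = self.prediction(outcome)
        self.positions.append((bidder, outcome, price))
        self.bids[outcome] += 1
        return price

    def settle(self, winning_outcome):
        # Each bid on the winning outcome pays out 1; the rest pay nothing.
        payouts = {}
        for bidder, outcome, price in self.positions:
            reward = (1 if outcome == winning_outcome else 0) - price
            payouts[bidder] = payouts.get(bidder, 0.0) + reward
        return payouts
```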

This scheme should converge to something corresponding to the true beliefs of the market. When the market believes that the current prediction is underconfident they will buy shares in it, raising its value. When they believe it is overconfident they must also believe that some other value is underconfident (because the probabilities sum to 1, so if \(q_i > p_i\) for some \(i\) then \(q_j < p_j\) for some \(j\)), so they will bid on that one, causing all the other probabilities to fall.

Convergence in theory is quite slow, with the price basically oscillating like \(\frac{1}{n}\) in the number of bids on either side of the true probability, but I think in practice that still quickly takes it to a finer precision than people’s predictions are going to have anyway.

I’ve no idea if this is a good prediction mechanism in general, but it’s appealingly simple and might be worth consideration in some contexts for that alone.


A plan for Hypothesis 1.0

So I’ve just released Hypothesis 0.4. It implements the example database I discussed in The Pain of Randomized Testing. Every time Hypothesis finds a failure it will save it for later use, and every property you test in Hypothesis will use the saved database to reproduce any failures it’s seen before (and sometimes find new ones, because it allows test cases to cross between different properties). This is pretty cool.

It has been almost exactly two weeks since I wrote my short term roadmap. In that time I’ve knocked out two of the features I wanted to get done: The example database and the improved data generation. I’ve also seriously improved the quality of the testing of Hypothesis itself and found a whole bunch of bugs as a result. So things are going pretty well.

But I’m away next week, and I’m basically giving myself until the end of February to get 1.0 out. So it’s time to get serious about a concrete plan for the roadmap.

So the following are what I’m declaring requirements for 1.0:

  1. A stable API
  2. Solid documentation
  3. The ability to get reports about a run out – number of examples tried, number rejected, etc.
  4. Coverage driven example discovery

I could maaaaybe drop one of the latter two but not both, but I’m very reluctant to do so. The first two are hard requirements. If it doesn’t have a stable well documented API then it’s not 1.0. This isn’t one of those clownshoes open source hack projects where the specified API is whatever the code happens to implement and the documentation is the source code.

From the original list I’m explicitly taking testmachine off the table for now. I want to do it, and I will do it, but I’m officially relegating it to the giant pile of “Oooh wouldn’t it be neat if- gah, no David, you can do that after 1.0” along with “generating strings matching a regular expression”, “automatically generating Django model object graphs” and “driving Selenium from hypothesis” and a whole bunch of other ideas. It will most likely be the 1.1 flagship feature.

A top level driver for Hypothesis will probably shake out of the wash as part of the reporting work. We’ll see how that goes. It’s neither a requirement nor an explicit not-for-release.

This leaves what should be an achievable amount of work. I have prototypes demonstrating the feasibility of the coverage driven example discovery, so based on past experience it will probably take me about two hours to actually get it hooked in and most of a week to get it solid enough that I’m happy with it. Reports should be a bit less than that.

Which gives me about two weeks to stabilise the API and write documentation. I think I can do that. I’ve been doing some of it as I go, and will do more of it as I implement the other two. The main difficulty I have here is basically deciding how much of Hypothesis’s internal weirdness should be exposed as API and how much I should hide behind a more pythonic surface. I suspect I’m going to go full weird and just tidy it up so that it’s as nice to use as possible.

So that’s the plan. I’m not getting everything I wanted to get done, but if I had then the end result wouldn’t be nearly as solid. I’m opting to strike a balance between production worthy and interestingly novel, and I think this is pretty much the right one.

And it’s a pretty exciting place to be. Right now it’s teetering there already, but if all this goes according to plan I’m basically prepared to state that by the end of February randomized property based testing will have three categories: Quviq’s Quickcheck, Hypothesis, and everything else. Because the former is proprietary (even the documentation!) I don’t really know what it’s capable of, but Hypothesis’s capabilities are going to be significantly beyond anything that seems to exist in the open source world.


On taking the ideas in Hypothesis

Last night on twitter, Eiríkr Åsheim asked me about the plausibility of taking some of the ideas from Hypothesis and putting them into Scalacheck. I thought I’d summarize and elaborate on my response.

Basically, I’m incredibly for people taking ideas from Hypothesis. The new developments on Hypothesis are mostly about taking other peoples’ good ideas from other areas, cross-pollinating them, and putting in a ton of work to make them as effective as possible, so it would be entirely churlish for me to object to other people taking the ideas.

I think Eiríkr was mostly asking about the possibility of doing so though, to which again my response is basically “yes this is 100% possible and if you want to do this I will help you”.

Some of the work on Hypothesis is basically about how to make the best possible Quickcheck port in a dynamic language. If you’re working in Scala you probably don’t care much about that, but maybe it would be interesting for Clojure, I don’t know. To be honest I feel like this section of Hypothesis is a bit weird and I’m not really that interested in spreading it. If you want to know how it works we can talk about it but for now I’ll continue to just look faintly embarrassed about the subject.

The rest of Hypothesis though is just about how to make the best possible version of Quickcheck and the fact that I’m doing it in Python is practically a coincidence. The parametrization, derandomization, multi-armed bandits, example database and coverage driven discovery are all stuff you could write in any language you cared to name (though coverage driven discovery depends on how good your access to profiling/coverage APIs is, and the parametrization may be kinda annoying in most statically typed languages). There’s nothing Python specific about them and if you’re prepared to put in the work your favourite Quickcheck port could easily grow them and if you want to put in the work I will help you do so.

One of my requirements for a 1.0 release of Hypothesis is a vast amount of documentation. In particular there will be design documents for all these features which should make this a ton easier. In the meantime if there’s something that I haven’t written about that you want to know how it works, just ask me. Maybe prepare to be there for a while though as I probably won’t shut up about it.


Using multi-armed bandits to satisfy property based tests

This is a post about Hypothesis’s data generation. It’s much more sophisticated than that of any other property based testing system whose implementation I’m aware of. Quviq’s Erlang Quickcheck may be more advanced (I don’t know; it’s proprietary so I haven’t looked at the source code), but as far as I know the way this works is entirely novel, with inspiration heavily drawn from other people’s ideas: the novelty here is about 70% that it’s applied to property based testing and 30% some interesting twists on how it works.

Hypothesis’s approach to data generation produces a wide variety of things that are really hard to produce in a traditional property based testing model without a lot of careful hand tuning, and more interestingly it adaptively seeks out novel examples and takes user feedback to prune bad examples without you having to write custom strategies to support it. If you ask for a list of integers but really only want long lists of positive integers you can write something like assume(len(xs) >= 100 and all(x >= 0 for x in xs)) and Hypothesis will rapidly learn the shape of the data that you want and avoid wasting your time by giving you too many examples that don’t satisfy it. Even without using assume it will try you on lists of all positive integers, lists of all negative integers, and a mix in between.
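A test that only cares about long lists of positive integers might look roughly like this (using the descriptor style of @given from this post; the exact spelling varies between versions):

```python
from hypothesis import given, assume

@given([int])  # "a list of integers", in the descriptor style used above
def test_long_lists_of_positive_integers(xs):
    # Reject anything that isn't the shape of data we care about; Hypothesis
    # learns to favour parameters whose examples tend to pass this.
    assume(len(xs) >= 100 and all(x >= 0 for x in xs))
    assert sum(xs) >= 0
```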

Let’s see how this works.

The general approach of data generation in property based testing is that you have a random number generator and some parameter that can control the shape of the distribution. For traditional quickcheck this parameter is an integer called “size” and controls the shape of the distribution by just generating larger examples when size is larger. The first version of Hypothesis did this (I tried to do a thing where size corresponded to the entropy of the distribution but this didn’t really work out that well), and it worked OK.

The problem is that there’s very little you can do to tune the shape of this distribution beyond trying larger and smaller examples, and that can miss out on a bunch of interesting things. There’s a concept called Swarm Testing (full disclosure: I learned about it from a blog post rather than the paper) which is basically as follows: In any given run of a random test, turn off some of the features of what you can generate. This lets you get a much higher degree of coverage of interaction between the features that you chose to leave in.

I originally took this idea verbatim and it worked OK, but I’m now going to skip the rest of the history lesson and tell you what Hypothesis does now which is essentially a unification and significant generalisation of both the size parameter and the feature flags.

Any given strategy for data generation defines two probability distributions. One of these is a parameter distribution. You simply give it a random number generator and you get back a parameter. This is an opaque value which you do not know anything about. It could be an integer, it could be a giant opaque binary blob. It’s probably a tuple of numbers but you’re not supposed to know that shh. The only valid operation on a parameter value is to feed it back to the strategy it came from (for statically typed fans tuning in from home imagine that a strategy for producing type a is parameterised as (Strategy b a) and you only have access to an (exists b. Strategy b a). For non statically typed fans ignore this parenthetical). The second distribution is the conditional data distribution. You feed it a random number generator and a previously generated value of the parameter and you get a value out.
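As a rough illustration (the method names here are invented, not Hypothesis’s actual ones), an integer strategy in this model might look like:

```python
# Illustrative only: an integer strategy with a parameter distribution and a
# conditional data distribution.
class IntStrategy:
    def draw_parameter(self, rng):
        # An opaque parameter: a bias towards negative values and a rough
        # scale. Callers never look inside it.
        return (rng.random(), rng.expovariate(0.05))

    def draw_value(self, rng, parameter):
        negative_bias, scale = parameter
        magnitude = int(rng.expovariate(1.0 / (scale + 1.0)))
        return -magnitude if rng.random() < negative_bias else magnitude
```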

Why this distinction? You can just draw a value by drawing from the parameter and feeding it straight back in. Why not just do that?

The reason is that you can reuse the same parameter multiple times in a given run. So for example when generating a list of integers, you draw the parameter for the integer once, then generate all elements from the list according to that single parameter value. This lets you produce things like “long lists of positive integers” and “long lists of negative integers” which would have previously happened with probability \(\approx 0\) when you chose all of the elements completely independently (they’re still independent conditional on the parameter in this model).
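Continuing the same illustrative sketch, the list strategy draws one element parameter and shares it across the whole list:

```python
class ListStrategy:
    def __init__(self, element_strategy):
        self.element_strategy = element_strategy

    def draw_parameter(self, rng):
        # One shared element parameter plus an average length for this run.
        return (self.element_strategy.draw_parameter(rng), rng.expovariate(0.05))

    def draw_value(self, rng, parameter):
        element_parameter, average_length = parameter
        length = int(rng.expovariate(1.0 / (average_length + 1.0)))
        # Every element uses the *same* element parameter, so a "mostly
        # negative, large" parameter yields a long list of mostly negative ints.
        return [self.element_strategy.draw_value(rng, element_parameter)
                for _ in range(length)]
```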

This on its own is powerful enough. Doing our top level data generation as “draw a parameter, pass it back in” produces a much lumpier distribution which can get at those hard to reach corners of the search space.

But what’s really powerful is what this then enables: Despite the completely opaque nature of the parameter object you can use it to guide the search space. This is because although you can’t look at the parameter you can look at the generated data.

This means that if you have some way of determining whether a particular example is good or bad you can use a multi-armed bandit algorithm to select parameter values which tend to produce better results.

So we keep track of how many examples from each parameter we’ve generated and how many of those turned out to be bad and then we do Thompson Sampling to select the next parameter.
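A stripped-down sketch of that selection step (illustrative only, and ignoring for the moment the special “draw a fresh parameter” arm described below):

```python
def choose_parameter(rng, stats):
    # stats maps each previously drawn parameter to (good, bad) counts, and
    # is assumed non-empty (some parameters have already been drawn).
    # Thompson sampling: sample a plausible success rate for each parameter
    # from Beta(good + 1, bad + 1) and pick the parameter that sampled highest.
    best_parameter, best_sample = None, -1.0
    for parameter, (good, bad) in stats.items():
        sample = rng.betavariate(good + 1, bad + 1)
        if sample > best_sample:
            best_parameter, best_sample = parameter, sample
    return best_parameter
```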

That’s pretty much it! There are a few important details though:

  1. the Thompson sampling implementation I use has a special arm which is “draw a new parameter value”. Its score is drawn from a beta distribution whose alpha + beta is the number of parameters drawn so far and whose beta is the number of times that the first value drawn from a parameter was bad. If this arm is selected we pick a fresh parameter value and add it to the list.
  2. we start with a specified minimum number of parameters and just draw fresh parameters until we reach that number (it’s currently set to 25 but that’s pretty arbitrary and will likely change)

What also matters is what counts as bad. Hypothesis uses two criteria to determine whether an example is bad:

  1. If it fails an assume check
  2. If we’ve seen this exact example before

The second one is… slightly dubious in that it means that the assumptions of the multi-armed bandit aren’t actually holding true – the success of an arm depends on what’s come before, and so the probability changes over time. I can only offer in defence that it seems to work pretty well in practice.

Anyway, the result of failing examples on the first criterion is that Hypothesis will automatically become more likely to pick values that pass your assume checks – it will pick regions of the parameter space that have a higher probability of doing so. It isn’t able to magically understand the shape of your assume and pick perfectly, but assuming there is any relation at all between the structure of the example as it understands it and the thing you’re assuming, it will tend to do better than a uniform draw across all the values.

The second criterion essentially forces it to seek out novelty. It’s the analogue of the size parameter – when things start getting too similar to what it has seen before, it is drawn to pick a new parameter and add some more variety to the mix.

This approach is still quite new and will doubtless see more refinement, but I’m pretty happy with it so far. It’s also going to be very useful in the 0.5 release when I start adding coverage driven example discovery because it will allow rewarding branches that produce new and interesting examples as well as penalising ones that are actively bad.

At the moment it’s hard for me to tell how much better this is in practice. It’s certainly the case that the quality of the distributions is needed to make some of Hypothesis’s own tests pass, but that may be artificial enough that in real-life usage it’s a lot less of an advantage. However, experiments with testing Hypothesis itself suggest that with the right sorts of tests there’s a lot of mileage you can get out of covering obscure corners of the test space, and that anything that can raise the probability of hitting those corners is worth doing. This is at the very least a very big step in the right direction.
