Better prediction of probabilities through auctions

I bet you think this is another post about Hypothesis, don’t you?

Nope! This is about something completely different. This is about prediction markets and proper scoring rules.

Which aren’t actually something I know that much about, so it’s very likely everything in this post is either obvious or a bad idea. This should just be considered a “here’s some stuff I thought of while reading about this and I couldn’t find anyone who wrote this down before so I’ll write it down now” post.

A scoring rule is a way of rewarding predictions. You have some possibilities A1, …, An which are disjoint and cover everything. So e.g. you could have A1 = “there will be a labour majority in the UK 2015 elections”, A2 = “there will be a conservative majority”, A3 = “none of the above”. You want people to assign probabilities to these events. They give you a probability distribution and depending on which outcome occurs you give them a reward (or penalise them for negative rules, which a lot of them are).

A scoring rule is proper if predicting the correct probabilities is an optimal solution and strictly proper if it’s the only optimal solution. For example a scoring rule that gives you a score of log(r_i) where r_k is the probability you assigned to outcome k and i is the outcome that occurs is a strictly proper scoring rule. A scoring rule that assigns you a score of 0 regardless of what happens is proper but not strictly proper (You can’t do better than 0, but anything you want to predict at all doesn’t do worse than that either).

When thinking about these things I noticed a particularly simple sort of proper scoring rule. It’s not strictly proper, but under a certain natural extension of the definitions can be made into a strictly proper extended scoring rule.

The rule is this:

Let \(q_1, \ldots, q_n\) be any sequence of numbers in \([0, 1]\). They do not have to sum to \(1\). Let \(r_1, \ldots, r_n\) be your prediction. Your total cost is \(C = \sum\limits_{i, q_i < r_i} q_i\). Let \(i\) be the winning outcome. Then you score \(1 – C\) if \(r_i > q_i\) and \(-C\) otherwise.

You might notice that you are basically conducting a simultaneous Vickrey auction on each coordinate.

Theorem: Under this scoring rule, a distribution \(r\) obtains the optimal score if and only if \(r_i < q_i\) whenever \(p_i < q_i\) and \(q_i < r_i\) whenever \(p_i < r_i\).


In order to simplify this proof we will show that this is true without the constraint that \(\sum r_i = 1\). Showing that these are the optimal solutions in the general case also implies that they are the optimal solutions in the specific case.

Let \(t_i = 1\) if \(r_i > q_i\) else \(t_i = 0\). Then your expected score is \(\sum t_i (p_i – r_i)\), because you pay a cost of \(r_i\) for each case where \(t_i = 1\) and a cost of \(0\) for each case where \(t_i = 0\), and you get a benefit of \(1\) with probability \(p_i\) if \(t_i\) = 1 and 0 otherwise.

This means that the multiplier for \(t_i\) is negative if \(p_i < r_i\), positive \(p_i > r_i\) and \(0\) when \(p_i = r_i\). So the only sensible choices to achieve optimality are to set \(t_i = 0\) when the coefficient is negative and \(1\) when it is positive. It doesn’t matter what you set it to when it’s 0 because the expected value is the same either way.


Note that this uses very little about your chosen distribution: All you’re actually doing is picking a set of values and saying “I think your current choice of estimates undervalues these and I’m prepared to stake money on it”. This is why it’s not a proper scoring rule: Any distribution which correctly identifies which values are underconfident will do.

There are two things you can do with this. One of them turns it into a proper scoring rule of sorts, the other just uses that feature as is. I’ll now explain both:

Randomizing for fun and expected profit

The solution which makes it a proper scoring rule is as follows: Let \(Q\) be any random vector taking values in \([0, 1]^n\) with a smooth and strictly positive distribution function.

Now play exactly the above scoring rule, but you make your prediction and then \(Q\) is drawn and you play the scoring rule against the drawn value.

Lets see that the only optimal choice of \(r_i\) is \(p_i\):

Consider again the profit you make from component \(i\). Call this profit \(T_i\). Then \(E(T_i|Q_i = q_i) = p_i – q_i\) if \(r_i > q_i\) or 0 if \(r_i \leq q_i)\).

So \(E(T_i) = (p_i – E(q_i | q_i < r_i) ) P(q_i < r_i) = p_i P(q_i < r_i) – \int\limits_0^{r_i} x f_i(x) dx = \int\limits_0^{r_i} (p_i – x) f_i(x)\)

Where \(f_i\) is the marginal distribution of \(Q_i\). Now, as assumed, \(f_i\) is strictly positive. That means that this integrand is strictly increasing for as long as \(p_i \leq x\) and strictly decreasing once \(x > p_i\), so the optimal value has to be at \(r_i = p_i\).


We can also derandomize this given any explicit choice of \(Q\): You make your prediction and then when the results come in we give you what would have been the expected value of your choice.

Lets consider how that would work for the simple case where each \(q_i\) is chosen uniformly at random. Splitting it up once more into benefit and cost, your expected reward is the probability that you would have won the winning index \(i\). That is, it’s \(r_i\).

Meanwhile your total cost is \( \sum E(Q_i | Q_i \leq r_i) P(Q_i \leq r_i  = \sum \frac{1}{2} r_i^2\). So your reward is \(r_i – \frac{1}{2} \sum r_i^2\). And this is just our old friend the quadratic scoring rule! So it’s possible that by picking other interesting probability distributions for \(q\) we could produce other interesting proper scoring rules. I don’t know. I haven’t really investigated.

Prediction markets

As Robin Hanson famously observed, you can turn any proper scoring rule into a prediction market. I don’t think what I’m about to describe is actually an instance of his scheme, but I don’t feel like I’ve sufficiently digested said scheme to be sure that it’s not. It’s certainly very close even if it’s not exactly the same thing.

You can use the original non-random form of the scoring rule to create what’s basically the world’s simplest prediction market:

You have n disjoint events as above that you want to predict, and you want to solicit the wisdom of the crowds in predicting it.

You do so by placing an initial non-zero bid on each option, \(b_i\). The market is now open for bidding.

A bid consists of simply picking one of the outcomes. When bidding on outcome \(i\) you pay the current market prediction, which is \(\frac{b_i}{\sum b_j}\) and increment \(b_i\) by one. If this is the outcome that occurs then you will get a reward of \(1\) for your bid. People may bid as many times as they want on as many options as they want.

This scheme should converge to something corresponding to the true beliefs of the market. When the market believes that the current prediction is under confident they will buy shares in it, raising its value. When they believe it is overconfident they must also believe that some other value is underconfident (because the probabilities sum to 1 so if \(q_i > p_i\) for some \(i\) then \(q_j < p_j\) for some \(j\)), so they will bid on that one, causing all the other probabilities to fall.

Convergence in theory is quite slow, basically oscillating like \(\frac{1}{n}\) in the number of bids on either side of the true probabilities, but I think in practice that still quickly takes it within the realms of being finer precision than peoples’ predictions are going to be anyway.

I’ve no idea if this is a good prediction mechanism in general, but it’s appealingly simple and might be worth consideration in some contexts for that alone.

This entry was posted in Uncategorized on by .

A plan for Hypothesis 1.0

So I’ve just released Hypothesis 0.4. It implements the example database I discussed in The Pain of Randomized Testing. Every time Hypothesis finds a failure it will save it for later use, and every property you test in Hypothesis will use the saved database to reproduce any failures it’s seen before (and sometimes find new ones, because it allows test cases to cross between different properties). This is pretty cool.

It has been almost exactly two weeks since I wrote my short term roadmap. In that time I’ve knocked out two of the features I wanted to get done: The example database and the improved data generation. I’ve also seriously improved the quality of the testing of Hypothesis itself and found a whole bunch of bugs as a result. So things are going pretty well.

But I’m away next week, and I’m basically giving myself until the end of February to get 1.0 out. So it’s time to get serious about  a concrete plan for the road map.

So the following are what I’m declaring requirements for 1.0:

  1. A stable API
  2. Solid documentation
  3. The ability to get reports about a run out – number of examples tried, number rejected, etc.
  4. Coverage driven example discovery

I could maaaaybe drop one of the latter two but not both, but I’m very reluctant to do so. The first two are hard requirements. If it doesn’t have a stable well documented API then it’s not 1.0. This isn’t one of those clownshoes open source hack projects where the specified API is whatever the code happens to implement and the documentation is the source code.

From the original list I’m explicitly taking testmachine off the table for now. I want to do it, and I will do it, but I’m officially relegating it to the giant pile of “Oooh wouldn’t it be neat if- gah, no David, you can do that after 1.0″ along with “generating strings matching a regular expression”, “automatically generating Django model object graphs” and “driving Selenium from hypothesis” and a whole bunch of other ideas. It will most likely be the 1.1 flagship feature.

A top level driver for Hypothesis will probably shake out of the wash as part of the reporting work. We’ll see how that goes. It’s neither a requirement nor an explicit not-for-release.

This leaves what should be an achievable amount of work. I have prototypes demonstrating the feasibility of the coverage driven example discovery, so based on past experience it will probably take me about two hours to actually get it hooked in and most of a week to get it solid enough that I’m happy with it. Reports should be a bit less than that.

Which gives me about two weeks to stabilise the API and write documentation. I think I can do that. I’ve been doing some of it as I go, and will do more of it as I implement the other two. The main difficulty I have here is basically deciding how much of Hypothesis’s internal weirdness should be exposed as API and how much I should hide behind a more pythonic surface. I suspect I’m going to go full weird and just tidy it up so that it’s as nice to use as possible.

So that’s the plan. I’m not getting everything I wanted to get done, but if I had then the end result wouldn’t be nearly as solid. I’m opting to strike a balance between production worthy and interestingly novel, and I think this is pretty much the right one.

And it’s a pretty exciting place to be. Right now it’s teetering there already, but if all this goes according to plan I’m basically prepared to state that by the end of February randomized property based testing will have three categories: Quviq’s Quickcheck, Hypothesis, and everything else. Because the former is proprietary (even the documentation!) I don’t really know what it’s capable of, but Hypothesis’s capabilities are going to be significantly beyond anything that seems to exist in the open source world.

This entry was posted in Uncategorized on by .

On taking the ideas in Hypothesis

Last night on twitter, Eiríkr Åsheim asked me about the plausibility of taking some of the ideas from Hypothesis and putting them into Scalacheck. I thought I’d summarize elaborate on my response.

Basically, I’m incredibly for people taking ideas from Hypothesis. The new developments on Hypothesis are mostly about taking other peoples’ good ideas from other areas, cross-pollinating them, and putting in a ton of work to make them as effective as possible, so it would be entirely churlish for me to object to other people taking the ideas.

I think Eiríkr was mostly asking about the possibility of doing so though, which again my response is basically “yes this is 100% possible and if you want to do this I will help you”.

Some of the work on Hypothesis is basically how to make the best possible Quickcheck port in a dynamic language. If you’re working in Scala you probably don’t care about that that much, but maybe it would be interesting for Clojure I don’t know. To be honest I feel like this section of Hypothesis is a bit weird and I’m not really that interested in spreading it. If you want to know how it works we can talk about it but for now I’ll continue to just look faintly embarrassed about the subject.

The rest of Hypothesis though is just about how to make the best possible version of Quickcheck and the fact that I’m doing it in Python is practically a coincidence. The parametrization, derandomization, multi-armed bandits, example database and coverage driven discovery are all stuff you could write in any language you cared to name (though coverage driven discovery depends on how good your access to profiling/coverage APIs is, and the parametrization may be kinda annoying in most statically typed languages). There’s nothing Python specific about them and if you’re prepared to put in the work your favourite Quickcheck port could easily grow them and if you want to put in the work I will help you do so.

One of my requirements for a 1.0 release of Hypothesis is a vast amount of documentation. In particular there will be design documents for all these features which should make this a ton easier. In the meantime if there’s something that I haven’t written about that you want to know how it works, just ask me. Maybe prepare to be there for a while though as I probably won’t shut up about it.

This entry was posted in Uncategorized on by .

Using multi-armed bandits to satisfy property based tests

This is a post about Hypothesis’s data generation. It’s much more sophisticated than any other property based testing system whose implementation I’m aware of. Quviq’s Erlang Quickcheck may be more advanced, I don’t know – it’s proprietary so I haven’t looked at the source code – but as far as I know the way this works is entirely novel to me (with inspiration heavily drawn from other people’s ideas – the novelty here is about 70% just that it’s applied to property based testing and 30% some interesting twists on how it works).

Hypothesis’s approach to data generation generates a wide variety of things that are really hard to do produce in a traditional property based testing model without a lot of careful hand tuning, and more interestingly it adaptively seeks out novel examples and takes user feedback to prune bad examples without you having to write custom strategies to support it. If you ask for a list of integers but really only want long lists of positive integers you can write something like assume(len(xs) >= 100 and all(x >= 0 for x in xs)) and Hypothesis will rapidly learn the shape of the data that you want and avoid wasting your time by giving you too many examples that don’t satisfy it. Even without using assume it will try you on lists of all positive integers, lists of all negative integers, and a mix in between.

Lets see how this works.

The general approach of data generation in property based testing is that you have a random number generator and some parameter that can control the shape of the distribution. For traditional quickcheck this parameter is an integer called “size” and controls the shape of the distribution by just generating larger examples when size is larger. The first version of Hypothesis did this (I tried to do a thing where size corresponded to the entropy of the distribution but this didn’t really work out that well), and it worked OK.

The problem is that there’s very little you can do to tune the shape of this distribution beyond trying larger and smaller examples, and that can miss out on a bunch of interesting things. There’s a concept called Swarm Testing (full disclosure: I learned about it it from a blog post rather than the paper) which is basically as follows: In any given run of a random test, turn off some of the features of what you can generate. This lets you get a much higher degree of coverage of interaction between the features that you chose to leave in.

I originally took this idea verbatim and it worked OK, but I’m now going to skip the rest of the history lesson and tell you what Hypothesis does now which is essentially a unification and significant generalisation of both the size parameter and the feature flags.

Any given strategy for data generation defines two probability distributions. One of these is a parameter distribution. You simply give it a random number generator and you get back a parameter. This is an opaque value which you do not know anything about. It could be an integer, it could be a giant opaque binary blob. It’s probably a tuple of numbers but you’re not supposed to know that shh. The only valid operation on a parameter value is to feed it back to the strategy it came from (for statically typed fans tuning in from home imagine that a strategy for producing type a is parameterised as (Strategy b a) and you only have access to an (exists b. Strategy b a). For non statically typed fans ignore this parenthetical). The second distribution is the conditional data distribution. You feed it a random number generator and a previously generated value of the parameter and you get a value out.

Why this distinction? You can just draw a value by drawing from the parameter and feeding it straight back in. Why not just do that?

The reason is that you can reuse the same parameter multiple times in a given run. So for example when generating a list of integers, you draw the parameter for the integer once, then generate all elements from the list according to that single parameter value. This lets you produce things like “long lists of positive integers” and “long lists of negative integers” which would have previously happened with probability \(\approx 0\) when you chose all of the elements completely independently (they’re still independent conditional on the parameter in this model).

This on its own is powerful enough. Doing our top level data generation as “draw a parameter, pass it back in”  produces a much lumpier distribution which can get at those hard to reach corners of the search space.

But what’s really powerful is what this then enables: Despite the completely opaque nature of the parameter object you can use it to guide the search space. This is because although you can’t look at the parameter you can look at the generated data.

This means that if you have some way of determining whether a particular example is good or bad you can use a multi-armed bandit algorithm to select parameter values which tend to produce better results.

So we keep track of how many examples from each parameter we’ve generated and how many of those turned out to be bad and then we do Thompson Sampling to select the next parameter.

That’s pretty much it! There are a few important details though:

  1. the Thompson sampling implementation I use has a special arm which is “draw a new parameter value”. It’s drawn from a beta distribution which has as alpha + beta the number of parameters drawn so far and beta the number of times that the first value drawn from a parameter was bad. If this arm is selected we pick a fresh parameter value and add it to the list.
  2. we start with a specified minimum number of parameters and just draw fresh parameters until we reach that number (it’s currently set to 25 but that’s pretty arbitrary and will likely change)

What also matters is what counts as bad. Hypothesis uses two criteria to determine in an example is bad:

  1. If it fails an assume check
  2. If we’ve seen this exact example before

The second one is… slightly dubious in that it means that the assumptions of the multi armed bandit aren’t actually holding true – the success of an arm depends on what’s come before, and so the probability changes over time. I can only offer in defence that it seems to work pretty well in practice.

Anyway, the result of failing things for the first is that Hypothesis will automatically be more likely to pick values that pass your assume checks – it will pick regions of the parameter space that have a higher probability of doing so. It isn’t able to magically understand the shape of your assume and pick perfectly, but assuming there is any relation at all between the structure of the example that it understands and the thing you’re assuming it will tend to do better a uniform draw across all the values.

The second criteria essentially forces it to seek out novelty. It’s the analogue of the size parameter – when things start getting too similar to what it’s seen before it is drawn to pick a new parameter and add some more variety to the mix.

This approach is still quite new and will doubtless see more refinement, but I’m pretty happy with it so far. It’s also going to be very useful in the 0.5 release when I start adding coverage driven example discovery because it will allow rewarding branches that produce new and interesting examples as well as penalising ones that are actively bad.

At the moment it’s hard for me to tell how much better this is in practice. It’s certainly the case that the quality of the distributions is required to make some tests pass in Hypothesis itself, but that may be artificial enough that in real live usage it’s a lot less of an advantage. However experiments with testing Hypothesis itself suggests that with the right sorts of tests there’s a lot of mileage you can get out of covering obscure corners of the test space and that anything that can raise the probability of hitting those corners is worth doing. This is at the very least a very big step in the right direction.


This entry was posted in Uncategorized on by .

Anatomy of a bug hunt

TLDR, this is another Hypothesis post. If you’re bored of those you can press back now.

I discovered a bug today that has been in Hypothesis from the beginning. I’m really really glad that I found it the way I did: Partly because I feel all smug and validated about my development practices as a result, but mostly because finding and figuring this one out in the wild would have been a nightmare, and it probably would have pissed off a ton of users who just wrote the library off as inherently flaky before I found out about it.

In Hypothesis you can define strategies for instances. This is how things like @given([int]) and so on work – there’s a rule that says how to build strategies for lists given strategies for their elements.

But in Python you have subtyping. There’s a similar strategy definition for tuples, but you might want more specific ones for your custom namedtuples (e.g. insisting on some relation between the fields).

So you need to be able to define instance strategy mappings for a subtype of something that already has a mapping defined and have it work. But you also want to have it inherit the supertype’s definition if a subtype one has not been defined (e.g. if you have a specific namedtuple you want it to inherit the generic namedtuple logic).

This is basically a method dispatch problem. You have to find the most specific type with a definition for a supertype of the current type (note: There’s potentially an ambiguity in “most specific” which is currently not well handled. I plan to look into this but it’s not high on my priority list).

The Hypothesis implementation of this is ridiculously inefficient, but it shouldn’t be used as a general purpose object system so that’s basically not a huge problem. I did however think that it was correct – I’d frequently said things along the lines of “Yes it’s ridiculous that Hypothesis has its own object system but honestly it’s been super useful and never given me any problems”.

Ah, hubris.

So yeah, there turns out to be a bug in this logic which can cause things to really confusingly wrong in surprising and hard to debug ways.

The bug is this:

Due to reasons that are honestly a bit silly and probably entirely unnecessary (I need to revisit this code properly at some point), the core of the algorithm Hypothesis uses for this requires doing a topological sort of the instance handlers with respect to the subset relation.

So the way I did this was to do a normal sort by the relation that x < y if x is a subtype of y, x > y if y is a subtype of x and otherwise sort by comparing names.

And this is so very much not how you do a topological sort. I don’t know what past David was thinking but apparently I’m just really bad at extending partial orders. This produces a non-transitive order, and as a result sorting it can basically do anything it wants.

What it wants in practice seems to be to almost work. Most of the time it produces the right results, but depending on the order you started with it can be almost arbitrarily wrong.

I spotted this because my builds are designed to fail at less than 100% branch coverage. The tests I had around this were flaky in the sense that sometimes they would randomly fail to produce enough coverage and I couldn’t figure out why. Investigating caused me to find this dodgy code and do a double take.

I then really struggled to find an example that actually demonstrated it was dodgy. So I just wrote some code to throw lots of examples at it and see what happened, and indeed got lots of super complicated examples demonstrating that it really was dodgy but not providing any great deal of insight.

Then I realised that I was literally sitting in the middle of a library that was about example generation and minimization, so I once more turned Hypothesis’s terrible gaze upon itself and it immediately started giving me small examples of 3 or 4 classes.

Which on trying to reproduce in a deterministic test or in ipython didn’t work.

And this is where the bug gets super funny and where I spent a good half an hour finding test cases then trying and failing to reproduce them and getting increasingly convinced that I’d found a new and exciting different bug in hypothesis before I finally figured out what was going on.

You see, as I mentioned above the behaviour depends on the starting order. In this case, the items we start with are the entries in a dict. This means that given a specific order of operations to build up the mapping, the starting order is entirely deterministic if basically arbitrary – it’s just the dict iteration order.

Entirely deterministic, that is, within any given process. As of Python 3, hash randomization is on by default. Each new python process will have a different hash secret and will thus have its own iteration order.

This means that that lovingly hand-crafted example of this will probably only produce the problem some fraction of the time, and that no matter how many times you run it in a given test run it will always produce the same answer.

Hypothesis on the other hand? No problem. It just throws examples at the problem until it finds one that works with the current hash iteration order.

This is really interesting to me because it’s an example where property based testing is not just the same as normal testing but you don’t have to explicitly provide the examples. In this case the system is actively fighting against your ability to produce examples but Hypothesis just shrugs, does its thing and gives you one anyway.

All told, despite the ridiculous terrible code that was the source of this, I think this is pretty much a success story for Hypothesis: The policy of failing the build if there’s not 100% coverage is working out really well because it makes me find areas of the codebase which are merely erratically covered and given them some serious attention, and the ability of Hypothesis to find and reduce interesting examples continues to impress.

If you’re interested, here is the commit that fixes this. Somewhat embarrassingly, it only works because of a typo, but it does still work rather well.

This entry was posted in Uncategorized on by .