Revised Hypothesis 1.0 plan

So in my previous post about Hypothesis 1.0 I stated the following priorities:

  1. A stable API
  2. Solid documentation
  3. The ability to get reports about a run out – number of examples tried, number rejected, etc.
  4. Coverage driven example discovery

Well, uh, I’m still planning on the first two (don’t worry, they’re hard requirements. The release date will slip before I compromise on them).

The second two though? Eh. They just don’t seem that important for a 1.0 release.

The reports would be nice to have. They’re definitely helpful, and they’re certainly important for avoiding useless tests, but I wouldn’t say they’re essential.

But more importantly, I know how to do them without breaking the current API.

After some thought I’ve realised that there are essentially two major categories of things that I need to be in the 1.0 release.

  1. Things that would require me to break the API
  2. Things that will get people excited enough about Hypothesis to use it

2 is pure marketing, but marketing is important here. My goal with Hypothesis is to make the world of Python a better tested environment, and if people aren’t actually using Hypothesis it will have failed at that goal, regardless of what an amazing piece of software it is.

And the 1.0 release is the big marketing opportunity. It’s not quite go big or go home, but it’s the point at which I’m throwing Hypothesis out to the world and saying “Here is my testing library. It’s awesome. You should use it”, and at that point it had better have a pretty compelling story for getting people to use it.

So those are the things I need for 1.0: excitement and stability.

And… well the coverage driven stuff is exciting to me, and it’s exciting to other quickcheck nerds, but honestly we’re not the target demographic here. Any quickcheck nerds who are writing Python should be just taking one look at Hypothesis and going “oh yes. It must be mine. Yes”, because honestly… the other quickcheck libraries for Python are terrible. I don’t like to trash talk (OK, I love to trash talk, but it’s uncouth), but if your quickcheck library doesn’t do example minimization then it’s a toy. Hypothesis started better than all the other quickcheck libraries, and it’s gone from strength to strength.

So coverage doesn’t break the API and doesn’t excite people, so it’s out. I’d like to do it at some point, but for now it’s just not important.

Reporting is useful, but it’s both API non-breaking and boring, so that’s out too.

So what’s in instead?

It’s a short list:

  1. Django integration

OK. That’s a slight lie, because it turns out there’s a feature I need for the Django integration (details why to come in another post). So the actual list:

  1. Bringing in the ideas from testmachine
  2. Django integration

And this list is actually very compatible with both requirements, because the testmachine ideas are among the ones most likely to completely break the API. I… am not actually sure what I was thinking when trying to relegate them to a post-1.0 release – I have some ideas for how they could have been integrated without changing the API much, but I think the result would have been really painfully limited in comparison to what I could do.

So what will the Django integration look like?

It will look like, if you will excuse me briefly violating my blog’s editorial guidelines, fucking magic is what it will look like.

You see, Django models come with everything Hypothesis needs to know how to generate them. They’re basically giant typed records of values. Hypothesis loves typed records of values. When this works (and I pretty much know how to make all the details work), the Django integration is going to look like:

from hypothesis.extra.django import TestCase
from hypothesis import given
from myproject.mymodels import SomeModel
from myproject.myforms import SomeForm
 
class TestSomeStuff(TestCase):
    @given(SomeModel)
    def test_some_stuff(self, model_instance):
        ...
 
    @given(SomeForm)
    def test_some_other_stuff(self, form_instance):
        ...

That’s it. No fixtures, no annoying specifications of different edge cases for your model to be in, just ask Hypothesis for a model instance and it will say “sure. Here you go”. Given the upcoming fake factory integration it will even do it with nice data.

And, well, this may not be research level innovation on the theme of Quickcheck, but as far as compelling marketing messages for bringing property based testing to the masses I can’t really think of a better one than “Hey, all of you people using that popular web framework, would you like to do more testing with not only zero extra effort but you can also throw away a bunch of laborious boring stuff you’re already doing?”

So that’s the plan. I suspect this is going to cause a bit of slippage in the timeline I’d originally intended but, well, I was also planning to be looking for jobs this month and I’ve mostly thrown myself into Hypothesis instead, so it’s not like I don’t have the time. I’m going to try to still be on track for an end of February release, but it’s no big deal if it ends up creeping into March.

Some thoughts on open source

Historically my attitude to open source has been one of avoiding the GPL like the plague and sticking everything under a permissive license.

There are a couple of reasons for this:

  1. Generally speaking I want people to use my stuff. The GPL acts as an obstacle to that.
  2. I want to spread my ideas. The GPL acts as an obstacle to that too, although less so.
  3. Most of my open source code is not stuff I’m emotionally attached to and frankly if someone wants to take it off my hands that’s great – it means I don’t have to maintain it
  4. I’ve used a lot of open source code in my commercial work that I wouldn’t have been able to use if it was GPLed. It seems churlish to not return the favour.

I’m… starting to rethink some of this attitude.

Here are some links:

  1. How to capture an open source project
  2. Github is not your CV
  3. The Ethics of Unpaid Labor and the OSS Community

Fundamentally, open source may claim to be about freedom, and most people who do it may even have that intent, but when judging systems, intent is irrelevant: what matters is the structural effect.

And the structural effect of the current relation between the vast quantity of permissively licensed open source work and the industry is one of overvaluing people who are capable of putting in work for free, devaluing everyone’s labour, and allowing the industrial scale parasite of VC backed startups to leech yet more value from the commons without putting anything back at all.

And it’s at this point that I go have an existential crisis and go gibber in the corner for a bit.

Because this all ends up falling under what I regard as the fundamental question of my life: How can I make the world better while not also making the world worse and make enough money to eat at the same time?

Answers on a postcard. Or a blog comment. Or anything really because if you’ve got the answer can you please let me know? Because I’ve no idea.

Because the stuff I give away for free is the closest I’ve come to achieving the first even if it falls down hard on the last. It sounds terribly megalomaniacal, but the point of this blog is at least partly to make the world a better place, and the point of Hypothesis is explicitly to raise the software quality bar for software written in Python (This is part of why I’m putting in quite so much work on the edge cases and things like cross-platform compatibility. I don’t want people to be able to say they can’t use Hypothesis).

But… I increasingly feel like giving things away for free is also failing pretty hard on the second criterion: I may be making the world a better place, but I’m also making it a worse one by furthering an exploitative system.

And I’m not really sure what to do about it, but I think at the bare minimum it’s time to start pushing back a little.

And I think in the spirit of that bare minimum I’m going to be moving away from licenses that are quite as permissive as I’ve historically chosen. Hypothesis is under the MPLv2, as will be all future open source work I do until I decide I have a better idea. It manages to strike a good balance between the desire to get other people to use it and taking a bit of a stronger stance on things. I’m also requiring a CLA assigning all copyright to me, though I’m less convinced that that’s a good general principle.

I’m also going to try to figure out how to get paid for this. It feels… embarrassingly self-serving to think in terms of being morally obligated to get paid for my work, but I think it’s valid. I’m only able to do this free work because I’m in the privileged class of people who have enough time and money to do so. For everyone else it’s not a moral issue, it’s just a flat out issue. It may not feel like it, but providing push back on doing free work is at least partly an attempt to abdicate some of that privilege.

(To be clear: I also want to be paid to work on Hypothesis for purely self serving reasons)

This doesn’t mean I’m not going to work on Hypothesis if I can’t get paid to do so, and it doesn’t mean that the wellspring of random half-baked ideas that is my Github account is going to dry up, so maybe this is a promise without teeth. I don’t know. Getting paid for open source work outside of a few big projects (there are a lot of paid Linux devs) seems to be one of the great unsolved problems of our industry, and I don’t really want to predicate something I regard as as important as Hypothesis on solving that problem, and until Hypothesis starts to see the sort of traction I’d like it to, getting people to pay me for it might be something of a hard sell.

Which basically puts me back where I started, just with a slightly less permissive license that if I’m being honest doesn’t really affect much of anything. It’s not a solution, but maybe it’s at least the start of one.

 

Changes to the Hypothesis example database

One of the most notable features of Hypothesis is the example database. Unlike classic Quickcheck, when Hypothesis finds a falsifying example it saves it in its database. The next time it needs an example of that type it tries the previously saved example before moving on to random generation.

The initial version of this was really limited though. The problem is that this happens at the level of the whole falsification call, so for example a failure from @given(int, int) would be saved as a pair of ints under the descriptor (int, int), and attempts to falsify other things with @given(int, int) would reuse it. However if you asked for @given((int, int), str), say, it would not reuse the previous example.

The result is that you don’t get to share any interesting features between examples – only exact matches. This seems a shame.

This turns out to be fixable, and the fix is fairly straightforward but also quite interesting, in that it only really works because of the ridiculously dynamic nature of Hypothesis.

Conceptually all that happens is that whenever you save something you also save all the subcomponents, and whenever you generate something you have a chance of generating any previously saved examples for it. It’s easy to say, but the details are a little complicated. Maybe there’s a better way to do it, but the approach I ended up taking is very specific to Hypothesis.

The main component is that Hypothesis’s weird little home-brewed version of type classes consists of first-class values in the language. Both the “types” that it maps from and the values it produces are just Python values, and can be manipulated as such.

We use this in two ways: First, Hypothesis strategies grew a decompose method. This iterates over (descriptor, value) pairs. So for example when saving (1, “foo”) as an (int, str), this would yield (int, 1) and then (str, “foo”), and we would then save those values for those types.

(There’s a slight wart here. Sometimes these values decompose as non-serializable types, so we need to check for that case and not save when that happens).
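
Here’s a rough sketch of the idea, purely for illustration (the function names and the database.save call are stand-ins invented for this example, not the real Hypothesis internals):

def decompose_tuple(descriptor, value):
    """Yield (descriptor, value) pairs for each component of a tuple value.

    e.g. decompose_tuple((int, str), (1, "foo")) yields (int, 1) then (str, "foo").
    """
    for sub_descriptor, sub_value in zip(descriptor, value):
        yield sub_descriptor, sub_value


def save_with_subcomponents(database, descriptor, value, decompose):
    """Save a value under its descriptor, along with each of its subcomponents."""
    for sub_descriptor, sub_value in decompose(descriptor, value):
        try:
            database.save(sub_descriptor, sub_value)
        except TypeError:
            # Some components decompose as non-serializable types; as noted
            # above we just skip saving those.
            continue
    database.save(descriptor, value)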

So that’s the saving, but what next? How do we get these values back out? This is particularly important because the arguments to @given are generated as a single strategy, so on their own these saved pieces will never actually help – a @given(int) isn’t using the strategy for int, it’s using the strategy for ((int,), {}) (one positional arg, no keyword args).

The next step takes advantage of the fact that the strategies are first class values, with a little bit of an assist from the parametrized data generation. We introduce the idea of an ExampleAugmentedStrategy. This is a strategy that wraps another strategy with some examples that could have come from it. It’s essentially biasing a strategy towards some favoured examples. Sometimes it picks from those examples, sometimes it generates some fresh ones (which ones, and the probability of generating them, are decided by the parameter value).

The strategy table is then modified to look up previous examples and if any are found return an augmented strategy.
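
Sketched out (with made-up names throughout: the produce method, strategy_table.strategy and database.fetch are invented for this example, and the real thing drives the reuse probability from the parameter value rather than a constant), the augmentation looks something like this:

class ExampleAugmentedStrategy(object):
    """Wraps a strategy and biases it towards some previously saved examples."""

    def __init__(self, base_strategy, examples, reuse_probability=0.5):
        self.base_strategy = base_strategy
        self.examples = list(examples)
        self.reuse_probability = reuse_probability

    def produce(self, random):
        # Sometimes pick a favoured example, sometimes generate a fresh one.
        if self.examples and random.random() < self.reuse_probability:
            return random.choice(self.examples)
        return self.base_strategy.produce(random)


def strategy_for(descriptor, strategy_table, database):
    """Look up a strategy and, if there are saved examples for this
    descriptor, wrap it in an augmented strategy."""
    base = strategy_table.strategy(descriptor)
    saved = database.fetch(descriptor)
    if saved:
        return ExampleAugmentedStrategy(base, saved)
    return base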

Which brings us to the desired end point: The examples we previously extracted can now be reassembled into other examples, allowing Hypothesis to essentially learn pieces of the problem from other tests.

In theory this should allow the construction of much harder-to-find examples. In practice I haven’t seen all that much benefit from it yet, because I’m still figuring out what a good workflow for the database is (I don’t really have a way to save it between Travis runs, which makes it of limited use), but I think it’s starting to show a lot of promise.

Better prediction of probabilities through auctions

I bet you think this is another post about Hypothesis, don’t you?

Nope! This is about something completely different. This is about prediction markets and proper scoring rules.

Which aren’t actually something I know that much about, so it’s very likely everything in this post is either obvious or a bad idea. This should just be considered a “here’s some stuff I thought of while reading about this and I couldn’t find anyone who wrote this down before so I’ll write it down now” post.

A scoring rule is a way of rewarding predictions. You have some possibilities A1, …, An which are disjoint and cover everything. So e.g. you could have A1 = “there will be a Labour majority in the UK 2015 elections”, A2 = “there will be a Conservative majority”, A3 = “none of the above”. You want people to assign probabilities to these events. They give you a probability distribution and, depending on which outcome occurs, you give them a reward (or a penalty; a lot of scoring rules are negative).

A scoring rule is proper if predicting the correct probabilities is an optimal solution, and strictly proper if it’s the only optimal solution. For example, a scoring rule that gives you a score of \(\log(r_i)\), where \(r_k\) is the probability you assigned to outcome \(k\) and \(i\) is the outcome that occurs, is a strictly proper scoring rule. A scoring rule that assigns you a score of 0 regardless of what happens is proper but not strictly proper (you can’t do better than 0, but no prediction does worse than that either).
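
As a purely illustrative example of the log rule, scoring a single prediction looks like:

import math

def log_score(prediction, outcome):
    # Score is the log of the probability assigned to the outcome that occurred.
    return math.log(prediction[outcome])

# Reporting [0.6, 0.3, 0.1]: outcome 0 scores log(0.6), about -0.51, while
# outcome 2 scores log(0.1), about -2.30. Reporting your true beliefs
# maximises your expected score, which is what makes the rule strictly proper.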

When thinking about these things I noticed a particularly simple sort of proper scoring rule. It’s not strictly proper, but under a certain natural extension of the definitions can be made into a strictly proper extended scoring rule.

The rule is this:

Let \(q_1, \ldots, q_n\) be any sequence of numbers in \([0, 1]\). They do not have to sum to \(1\). Let \(r_1, \ldots, r_n\) be your prediction. Your total cost is \(C = \sum\limits_{i : q_i < r_i} q_i\). Let \(i\) be the winning outcome. Then you score \(1 - C\) if \(r_i > q_i\) and \(-C\) otherwise.

You might notice that you are basically conducting a simultaneous Vickrey auction on each coordinate.
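
To make that concrete, here’s a small illustrative calculation of the score (the numbers are made up):

def auction_score(q, r, winning_outcome):
    # Pay q_i on every coordinate you "win" (r_i > q_i), and collect 1 if you
    # won the coordinate of the outcome that actually occurred.
    cost = sum(q_i for q_i, r_i in zip(q, r) if q_i < r_i)
    won = r[winning_outcome] > q[winning_outcome]
    return (1 if won else 0) - cost

# With q = [0.2, 0.5, 0.4] and prediction r = [0.5, 0.3, 0.2] you win only
# coordinate 0 and pay 0.2 for it, so outcome 0 scores 0.8 and any other
# outcome scores -0.2.
print(auction_score([0.2, 0.5, 0.4], [0.5, 0.3, 0.2], 0))  # 0.8
print(auction_score([0.2, 0.5, 0.4], [0.5, 0.3, 0.2], 1))  # -0.2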

Theorem: Under this scoring rule (where \(p_i\) is the true probability of outcome \(i\)), a distribution \(r\) obtains the optimal expected score if and only if \(r_i \leq q_i\) whenever \(p_i < q_i\) and \(r_i > q_i\) whenever \(p_i > q_i\).

Proof:

In order to simplify this proof we will show that this is true without the constraint that \(\sum r_i = 1\). Showing that these are the optimal solutions in the general case also implies that they are the optimal solutions in the specific case.

Let \(t_i = 1\) if \(r_i > q_i\) and \(t_i = 0\) otherwise. Then your expected score is \(\sum t_i (p_i - q_i)\), because you pay a cost of \(q_i\) for each case where \(t_i = 1\) and a cost of \(0\) for each case where \(t_i = 0\), and you get a benefit of \(1\) with probability \(p_i\) if \(t_i = 1\) and \(0\) otherwise.

This means that the coefficient of \(t_i\) is negative if \(p_i < q_i\), positive if \(p_i > q_i\), and \(0\) when \(p_i = q_i\). So the only sensible choices to achieve optimality are to set \(t_i = 0\) when the coefficient is negative and \(1\) when it is positive. It doesn’t matter what you set it to when the coefficient is \(0\) because the expected value is the same either way.

QED

Note that this uses very little about your chosen distribution: all you’re actually doing is picking a set of values and saying “I think your current choice of estimates undervalues these and I’m prepared to stake money on it”. This is why it’s not a strictly proper scoring rule: any distribution which correctly identifies which values are underconfident will do.

There are two things you can do with this. One of them turns it into a proper scoring rule of sorts, the other just uses that feature as is. I’ll now explain both:

Randomizing for fun and expected profit

The solution which makes it a strictly proper scoring rule is as follows: let \(Q\) be any random vector taking values in \([0, 1]^n\) with a smooth and strictly positive density.

Now play exactly the above scoring rule, but you make your prediction and then \(Q\) is drawn and you play the scoring rule against the drawn value.

Let’s see that the only optimal choice of \(r_i\) is \(p_i\):

Consider again the profit you make from component \(i\). Call this profit \(T_i\). Then \(E(T_i \mid Q_i = q_i) = p_i - q_i\) if \(r_i > q_i\), and \(0\) if \(r_i \leq q_i\).

So \(E(T_i) = (p_i - E(Q_i \mid Q_i < r_i)) P(Q_i < r_i) = p_i P(Q_i < r_i) - \int\limits_0^{r_i} x f_i(x) \, dx = \int\limits_0^{r_i} (p_i - x) f_i(x) \, dx\)

Where \(f_i\) is the marginal density of \(Q_i\). Now, as assumed, \(f_i\) is strictly positive. That means the integrand is strictly positive while \(x < p_i\) and strictly negative once \(x > p_i\), so as a function of \(r_i\) the integral is strictly increasing up to \(p_i\) and strictly decreasing after it, and the optimal value has to be at \(r_i = p_i\).

QED

We can also derandomize this given any explicit choice of \(Q\): you make your prediction, and then when the results come in we give you what would have been the expected value of your score over the draw of \(Q\).

Let’s consider how that would work for the simple case where each \(q_i\) is chosen uniformly at random. Splitting it up once more into benefit and cost, your expected reward is the probability that you would have won on the winning index \(i\), which is \(P(Q_i < r_i) = r_i\).

Meanwhile your total cost is \(\sum\limits_j E(Q_j \mid Q_j \leq r_j) P(Q_j \leq r_j) = \sum\limits_j \frac{1}{2} r_j^2\). So your expected score is \(r_i - \frac{1}{2} \sum\limits_j r_j^2\). And this is just our old friend the quadratic scoring rule! So it’s possible that by picking other interesting probability distributions for \(Q\) we could produce other interesting proper scoring rules. I don’t know. I haven’t really investigated.

Prediction markets

As Robin Hanson famously observed, you can turn any proper scoring rule into a prediction market. I don’t think what I’m about to describe is actually an instance of his scheme, but I don’t feel like I’ve sufficiently digested said scheme to be sure that it’s not. It’s certainly very close even if it’s not exactly the same thing.

You can use the original non-random form of the scoring rule to create what’s basically the world’s simplest prediction market:

You have n disjoint events as above that you want to predict, and you want to solicit the wisdom of the crowds in predicting it.

You do so by placing an initial non-zero bid on each option, \(b_i\). The market is now open for bidding.

A bid consists of simply picking one of the outcomes. When bidding on outcome \(i\) you pay the current market prediction, which is \(\frac{b_i}{\sum b_j}\), and increment \(b_i\) by one. If this is the outcome that occurs then you will get a reward of \(1\) for your bid. People may bid as many times as they want on as many options as they want.

This scheme should converge to something corresponding to the true beliefs of the market. When the market believes that the current prediction for an outcome is underconfident, people will buy shares in it, raising its value. When they believe it is overconfident they must also believe that some other value is underconfident (because the probabilities sum to 1, so if \(q_i > p_i\) for some \(i\) then \(q_j < p_j\) for some \(j\)), so they will bid on that one, causing all the other probabilities to fall.
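
Here’s a toy simulation of that dynamic. It’s an illustration only: a single greedy bidder, who always buys the outcome whose price is furthest below their believed probabilities, stands in for the whole market.

def run_market(beliefs, initial_bids, rounds):
    bids = list(initial_bids)
    for _ in range(rounds):
        total = sum(bids)
        prices = [b / total for b in bids]
        # Bid on the outcome the current prices most undervalue.
        i = max(range(len(beliefs)), key=lambda j: beliefs[j] - prices[j])
        bids[i] += 1  # buying a share raises that outcome's price
    total = sum(bids)
    return [b / total for b in bids]

# With beliefs [0.5, 0.3, 0.2] and one initial bid on each outcome, the
# prices drift towards the beliefs as bids accumulate.
print(run_market([0.5, 0.3, 0.2], [1, 1, 1], rounds=1000))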

Convergence in theory is quite slow, basically oscillating like \(\frac{1}{n}\) in the number of bids on either side of the true probabilities, but I think in practice that still quickly takes it to a finer precision than people’s predictions are going to have anyway.

I’ve no idea if this is a good prediction mechanism in general, but it’s appealingly simple and might be worth consideration in some contexts for that alone.

A plan for Hypothesis 1.0

So I’ve just released Hypothesis 0.4. It implements the example database I discussed in The Pain of Randomized Testing. Every time Hypothesis finds a failure it will save it for later use, and every property you test in Hypothesis will use the saved database to reproduce any failures it’s seen before (and sometimes find new ones, because it allows test cases to cross between different properties). This is pretty cool.

It has been almost exactly two weeks since I wrote my short term roadmap. In that time I’ve knocked out two of the features I wanted to get done: The example database and the improved data generation. I’ve also seriously improved the quality of the testing of Hypothesis itself and found a whole bunch of bugs as a result. So things are going pretty well.

But I’m away next week, and I’m basically giving myself until the end of February to get 1.0 out. So it’s time to get serious about a concrete plan for the roadmap.

So the following are what I’m declaring requirements for 1.0:

  1. A stable API
  2. Solid documentation
  3. The ability to get reports about a run out – number of examples tried, number rejected, etc.
  4. Coverage driven example discovery

I could maaaaybe drop one of the latter two but not both, but I’m very reluctant to do so. The first two are hard requirements. If it doesn’t have a stable well documented API then it’s not 1.0. This isn’t one of those clownshoes open source hack projects where the specified API is whatever the code happens to implement and the documentation is the source code.

From the original list I’m explicitly taking testmachine off the table for now. I want to do it, and I will do it, but I’m officially relegating it to the giant pile of “Oooh wouldn’t it be neat if- gah, no David, you can do that after 1.0” along with “generating strings matching a regular expression”, “automatically generating Django model object graphs” and “driving Selenium from Hypothesis” and a whole bunch of other ideas. It will most likely be the 1.1 flagship feature.

A top level driver for Hypothesis will probably shake out of the wash as part of the reporting work. We’ll see how that goes. It’s neither a requirement nor an explicit not-for-release.

This leaves what should be an achievable amount of work. I have prototypes demonstrating the feasibility of the coverage driven example discovery, so based on past experience it will probably take me about two hours to actually get it hooked in and most of a week to get it solid enough that I’m happy with it. Reports should be a bit less than that.

Which gives me about two weeks to stabilise the API and write documentation. I think I can do that. I’ve been doing some of it as I go, and will do more of it as I implement the other two. The main difficulty I have here is basically deciding how much of Hypothesis’s internal weirdness should be exposed as API and how much I should hide behind a more pythonic surface. I suspect I’m going to go full weird and just tidy it up so that it’s as nice to use as possible.

So that’s the plan. I’m not getting everything I wanted to get done, but if I had then the end result wouldn’t be nearly as solid. I’m opting to strike a balance between production worthy and interestingly novel, and I think this is pretty much the right one.

And it’s a pretty exciting place to be. Right now it’s teetering there already, but if all this goes according to plan I’m basically prepared to state that by the end of February randomized property based testing will have three categories: Quviq’s Quickcheck, Hypothesis, and everything else. Because the former is proprietary (even the documentation!) I don’t really know what it’s capable of, but Hypothesis’s capabilities are going to be significantly beyond anything that seems to exist in the open source world.
