
On taking the ideas in Hypothesis

Last night on Twitter, Eiríkr Åsheim asked me about the plausibility of taking some of the ideas from Hypothesis and putting them into Scalacheck. I thought I’d summarize and elaborate on my response.

Basically, I’m incredibly in favour of people taking ideas from Hypothesis. The new developments on Hypothesis are mostly about taking other people’s good ideas from other areas, cross-pollinating them, and putting in a ton of work to make them as effective as possible, so it would be entirely churlish for me to object to other people taking the ideas.

I think Eiríkr was mostly asking about the possibility of doing so though, to which again my response is basically “yes, this is 100% possible and if you want to do this I will help you”.

Some of the work on Hypothesis is basically about how to make the best possible Quickcheck port in a dynamic language. If you’re working in Scala you probably don’t care about that very much, but maybe it would be interesting for Clojure, I don’t know. To be honest I feel like this part of Hypothesis is a bit weird and I’m not really that interested in spreading it. If you want to know how it works we can talk about it, but for now I’ll continue to just look faintly embarrassed about the subject.

The rest of Hypothesis, though, is just about how to make the best possible version of Quickcheck, and the fact that I’m doing it in Python is practically a coincidence. The parametrization, derandomization, multi-armed bandits, example database and coverage driven discovery are all stuff you could write in any language you cared to name (though coverage driven discovery depends on how good your access to profiling/coverage APIs is, and the parametrization may be kinda annoying in most statically typed languages). There’s nothing Python specific about them, and if you’re prepared to put in the work your favourite Quickcheck port could easily grow them. If you want to put in the work, I will help you do so.

One of my requirements for a 1.0 release of Hypothesis is a vast amount of documentation. In particular there will be design documents for all these features, which should make this a ton easier. In the meantime, if there’s something I haven’t written about and you want to know how it works, just ask me. Maybe prepare to be there for a while though, as I probably won’t shut up about it.


Using multi-armed bandits to satisfy property based tests

This is a post about Hypothesis’s data generation. It’s much more sophisticated than that of any other property based testing system whose implementation I’m aware of. Quviq’s Erlang Quickcheck may be more advanced, I don’t know – it’s proprietary so I haven’t looked at the source code – but as far as I know the way this works is entirely novel (with inspiration heavily drawn from other people’s ideas – the novelty here is about 70% that it’s applied to property based testing and 30% some interesting twists on how it works).

Hypothesis’s approach to data generation produces a wide variety of things that are really hard to produce in a traditional property based testing model without a lot of careful hand tuning. More interestingly, it adaptively seeks out novel examples and takes user feedback to prune bad examples without you having to write custom strategies to support it. If you ask for a list of integers but really only want long lists of positive integers, you can write something like assume(len(xs) >= 100 and all(x >= 0 for x in xs)) and Hypothesis will rapidly learn the shape of the data that you want and avoid wasting your time by giving you too many examples that don’t satisfy it. Even without using assume it will try giving you lists of all positive integers, lists of all negative integers, and a mix in between.
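For instance, using the old-style @given([int]) specification mentioned later on this page, a test like that might look roughly as follows (a sketch only – the property itself is just a stand-in):

    from hypothesis import given, assume

    @given([int])  # old-style specification: "a list of integers"
    def test_long_lists_of_non_negative_integers(xs):
        # Hypothesis learns to favour parameter values whose examples tend
        # to pass this assume, rather than discarding examples blindly.
        assume(len(xs) >= 100 and all(x >= 0 for x in xs))
        assert sum(xs) >= 0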

Let’s see how this works.

The general approach of data generation in property based testing is that you have a random number generator and some parameter that can control the shape of the distribution. For traditional quickcheck this parameter is an integer called “size” and controls the shape of the distribution by just generating larger examples when size is larger. The first version of Hypothesis did this (I tried to do a thing where size corresponded to the entropy of the distribution but this didn’t really work out that well), and it worked OK.

The problem is that there’s very little you can do to tune the shape of this distribution beyond trying larger and smaller examples, and that can miss out on a bunch of interesting things. There’s a concept called Swarm Testing (full disclosure: I learned about it from a blog post rather than the paper) which is basically as follows: in any given run of a random test, turn off some of the features of what you can generate. This lets you get a much higher degree of coverage of interaction between the features that you chose to leave in.

I originally took this idea verbatim and it worked OK, but I’m now going to skip the rest of the history lesson and tell you what Hypothesis does now which is essentially a unification and significant generalisation of both the size parameter and the feature flags.

Any given strategy for data generation defines two probability distributions. One of these is a parameter distribution. You simply give it a random number generator and you get back a parameter. This is an opaque value which you do not know anything about. It could be an integer, it could be a giant opaque binary blob. It’s probably a tuple of numbers but you’re not supposed to know that, shh. The only valid operation on a parameter value is to feed it back to the strategy it came from (for statically typed fans tuning in from home, imagine that a strategy for producing type a is parameterised as (Strategy b a) and you only have access to an (exists b. Strategy b a). For non-statically-typed fans, ignore this parenthetical). The second distribution is the conditional data distribution. You feed it a random number generator and a previously generated value of the parameter and you get a value out.
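To make the shape of that interface concrete, here is a toy sketch. The class and method names are mine and the distributions are arbitrary – this is not Hypothesis’s actual API, just an illustration of the two-distribution idea:

    import random

    class ToyIntStrategy:
        # Toy integer strategy with a parameter distribution and a
        # conditional data distribution. Names and distributions are made up.

        def draw_parameter(self, rng):
            # Opaque to callers. Here it happens to be a
            # (probability of negative, scale) pair, but users never look inside.
            return (rng.random(), rng.expovariate(0.1) + 1)

        def draw_value(self, rng, parameter):
            # The conditional data distribution: given a parameter, draw a value.
            p_negative, scale = parameter
            magnitude = int(rng.expovariate(1.0 / scale))
            return -magnitude if rng.random() <= p_negative else magnitude

    rng = random.Random(0)
    strategy = ToyIntStrategy()
    parameter = strategy.draw_parameter(rng)
    print([strategy.draw_value(rng, parameter) for _ in range(5)])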

Why this distinction? You can just draw a value by drawing from the parameter and feeding it straight back in. Why not just do that?

The reason is that you can reuse the same parameter multiple times in a given run. So for example when generating a list of integers, you draw the parameter for the integer once, then generate all elements of the list according to that single parameter value. This lets you produce things like “long lists of positive integers” and “long lists of negative integers” which would have previously happened with probability \(\approx 0\) when you chose all of the elements completely independently (they’re still independent conditional on the parameter in this model).
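Continuing the toy sketch above (again, invented names rather than Hypothesis’s API), a list strategy can draw the element parameter once and share it across the whole list:

    class ToyListStrategy:
        def __init__(self, elements):
            self.elements = elements  # e.g. the ToyIntStrategy from above

        def draw_parameter(self, rng):
            # An average length for this run, plus a single element parameter
            # that every element of the list will share.
            return (rng.expovariate(0.05), self.elements.draw_parameter(rng))

        def draw_value(self, rng, parameter):
            average_length, element_parameter = parameter
            length = int(rng.expovariate(1.0 / (average_length + 1)))
            # Elements are independent only conditional on the shared
            # parameter, so "all positive" or "all negative" lists are common.
            return [self.elements.draw_value(rng, element_parameter)
                    for _ in range(length)]

    lists = ToyListStrategy(ToyIntStrategy())  # ToyIntStrategy as sketched above
    print(lists.draw_value(rng, lists.draw_parameter(rng)))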

This on its own is powerful enough. Doing our top level data generation as “draw a parameter, pass it back in” produces a much lumpier distribution which can get at those hard to reach corners of the search space.

But what’s really powerful is what this then enables: Despite the completely opaque nature of the parameter object you can use it to guide the search space. This is because although you can’t look at the parameter you can look at the generated data.

This means that if you have some way of determining whether a particular example is good or bad you can use a multi-armed bandit algorithm to select parameter values which tend to produce better results.

So we keep track of how many examples from each parameter we’ve generated and how many of those turned out to be bad and then we do Thompson Sampling to select the next parameter.

That’s pretty much it! There are a few important details though:

  1. The Thompson sampling implementation I use has a special arm which is “draw a new parameter value”. Its score is drawn from a beta distribution whose alpha + beta is the number of parameters drawn so far and whose beta is the number of times the first value drawn from a fresh parameter was bad. If this arm is selected we pick a fresh parameter value and add it to the list.
  2. We start with a specified minimum number of parameters and just draw fresh parameters until we reach that number (it’s currently set to 25, but that’s pretty arbitrary and will likely change).
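Putting those pieces together, the selection step looks roughly like the following sketch. The priors, counters and names are my guesses at a reasonable implementation under the description above, not Hypothesis’s exact code:

    import random

    def choose_parameter(rng, strategy, parameters, good, bad,
                         first_draw_bad_count, min_parameters=25):
        # Detail 2: grow the pool until we have the minimum number of parameters.
        if len(parameters) < min_parameters:
            parameters.append(strategy.draw_parameter(rng))
            good.append(0)
            bad.append(0)
            return len(parameters) - 1

        # Detail 1: a special arm for "draw a new parameter value", scored by
        # how often the first value drawn from a fresh parameter was bad.
        drawn_so_far = len(parameters)
        beta = first_draw_bad_count
        alpha = drawn_so_far - beta
        best_index = -1
        best_score = rng.betavariate(max(alpha, 1), max(beta, 1))

        # Ordinary arms: one Beta sample per existing parameter, based on how
        # many good and bad examples it has produced so far.
        for i in range(len(parameters)):
            score = rng.betavariate(good[i] + 1, bad[i] + 1)
            if score > best_score:
                best_score, best_index = score, i

        if best_index == -1:
            parameters.append(strategy.draw_parameter(rng))
            good.append(0)
            bad.append(0)
            return len(parameters) - 1
        return best_index

The caller then generates a value from the chosen parameter and increments the good or bad count for that arm, depending on whether the example turned out to be bad in the sense described below.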

What also matters is what counts as bad. Hypothesis uses two criteria to determine if an example is bad:

  1. If it fails an assume check
  2. If we’ve seen this exact example before

The second one is… slightly dubious, in that it means the assumptions of the multi-armed bandit aren’t actually holding true – the success of an arm depends on what’s come before, and so the probability changes over time. I can only offer in defence that it seems to work pretty well in practice.

Anyway, the result of failing examples on the first criterion is that Hypothesis will automatically be more likely to pick values that pass your assume checks – it will pick regions of the parameter space that have a higher probability of doing so. It isn’t able to magically understand the shape of your assume and pick perfectly, but assuming there is any relation at all between the structure of the example as it understands it and the thing you’re assuming, it will tend to do better than a uniform draw across all the values.

The second criterion essentially forces it to seek out novelty. It’s the analogue of the size parameter – when things start getting too similar to what it’s seen before, it is drawn to pick a new parameter and add some more variety to the mix.

This approach is still quite new and will doubtless see more refinement, but I’m pretty happy with it so far. It’s also going to be very useful in the 0.5 release when I start adding coverage driven example discovery because it will allow rewarding branches that produce new and interesting examples as well as penalising ones that are actively bad.

At the moment it’s hard for me to tell how much better this is in practice. It’s certainly the case that the improved quality of the distribution is required to make some of Hypothesis’s own tests pass, but that may be artificial enough that in real live usage it’s a lot less of an advantage. However, experiments with testing Hypothesis itself suggest that with the right sorts of tests there’s a lot of mileage you can get out of covering obscure corners of the test space, and that anything that can raise the probability of hitting those corners is worth doing. This is at the very least a very big step in the right direction.

 


Anatomy of a bug hunt

TLDR, this is another Hypothesis post. If you’re bored of those you can press back now.

I discovered a bug today that has been in Hypothesis from the beginning. I’m really really glad that I found it the way I did: Partly because I feel all smug and validated about my development practices as a result, but mostly because finding and figuring this one out in the wild would have been a nightmare, and it probably would have pissed off a ton of users who just wrote the library off as inherently flaky before I found out about it.

In Hypothesis you can define strategies for instances. This is how things like @given([int]) and so on work – there’s a rule that says how to build strategies for lists given strategies for their elements.

But in Python you have subtyping. There’s a similar strategy definition for tuples, but you might want more specific ones for your custom namedtuples (e.g. insisting on some relation between the fields).

So you need to be able to define instance strategy mappings for a subtype of something that already has a mapping defined and have it work. But you also want to have it inherit the supertype’s definition if a subtype one has not been defined (e.g. if you have a specific namedtuple you want it to inherit the generic namedtuple logic).

This is basically a method dispatch problem. You have to find the most specific type that has a definition and is a supertype of the current type (note: there’s potentially an ambiguity in “most specific” which is currently not well handled. I plan to look into this but it’s not high on my priority list).
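In outline, the lookup it has to do is something like the following (a toy sketch for illustration, not Hypothesis’s actual dispatch code, with the ambiguous case resolved by an arbitrary name-based tie break):

    def lookup_handler(handlers, cls):
        # All registered types that cls could inherit a definition from.
        candidates = [t for t in handlers if issubclass(cls, t)]
        if not candidates:
            raise KeyError(cls)
        # Most specific: a candidate that is a subclass of every other candidate.
        for t in candidates:
            if all(issubclass(t, u) for u in candidates):
                return handlers[t]
        # Ambiguous: no single most specific candidate, so break ties by name.
        return handlers[min(candidates, key=lambda t: t.__name__)]

With handlers registered for both tuple and a particular namedtuple, looking up that namedtuple finds its own handler, while any other namedtuple falls back to the generic tuple one.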

The Hypothesis implementation of this is ridiculously inefficient, but it shouldn’t be used as a general purpose object system so that’s basically not a huge problem. I did however think that it was correct – I’d frequently said things along the lines of “Yes it’s ridiculous that Hypothesis has its own object system but honestly it’s been super useful and never given me any problems”.

Ah, hubris.

So yeah, there turns out to be a bug in this logic which can cause things to go really confusingly wrong in surprising and hard to debug ways.

The bug is this:

Due to reasons that are honestly a bit silly and probably entirely unnecessary (I need to revisit this code properly at some point), the core of the algorithm Hypothesis uses for this requires doing a topological sort of the instance handlers with respect to the subset relation.

So the way I did this was to do a normal sort by the relation that x < y if x is a subtype of y, x > y if y is a subtype of x and otherwise sort by comparing names.

And this is so very much not how you do a topological sort. I don’t know what past David was thinking but apparently I’m just really bad at extending partial orders. This produces a non-transitive order, and as a result sorting it can basically do anything it wants.
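To make the non-transitivity concrete, here’s a small illustration with made-up classes (not Hypothesis’s actual code):

    import functools

    class A: pass
    class Z(A): pass   # Z is a subtype of A
    class M: pass      # unrelated to both

    def compare(x, y):
        # The buggy ordering: subtype relation first, then fall back to names.
        if x is not y and issubclass(x, y):
            return -1
        if x is not y and issubclass(y, x):
            return 1
        return (x.__name__ > y.__name__) - (x.__name__ < y.__name__)

    # Z < A (subtype), A < M (by name), but M < Z (by name): a cycle.
    print(compare(Z, A), compare(A, M), compare(M, Z))  # -1 -1 -1

    # With a cyclic comparator there is no consistent answer, so the result
    # of sorting depends on the order the classes happen to arrive in.
    print([c.__name__ for c in sorted([A, M, Z], key=functools.cmp_to_key(compare))])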

What it wants in practice seems to be to almost work. Most of the time it produces the right results, but depending on the order you started with it can be almost arbitrarily wrong.

I spotted this because my builds are designed to fail at less than 100% branch coverage. The tests I had around this were flaky in the sense that sometimes they would randomly fail to produce enough coverage and I couldn’t figure out why. Investigating caused me to find this dodgy code and do a double take.

I then really struggled to find an example that actually demonstrated it was dodgy. So I just wrote some code to throw lots of examples at it and see what happened, and indeed got lots of super complicated examples demonstrating that it really was dodgy but not providing any great deal of insight.

Then I realised that I was literally sitting in the middle of a library that was about example generation and minimization, so I once more turned Hypothesis’s terrible gaze upon itself and it immediately started giving me small examples of 3 or 4 classes.

Which, on trying to reproduce it in a deterministic test or in IPython, didn’t work.

And this is where the bug gets super funny and where I spent a good half an hour finding test cases then trying and failing to reproduce them and getting increasingly convinced that I’d found a new and exciting different bug in hypothesis before I finally figured out what was going on.

You see, as I mentioned above the behaviour depends on the starting order. In this case, the items we start with are the entries in a dict. This means that given a specific order of operations to build up the mapping, the starting order is entirely deterministic if basically arbitrary – it’s just the dict iteration order.

Entirely deterministic, that is, within any given process. As of Python 3.3, hash randomization is on by default. Each new Python process will have a different hash secret and will thus have its own iteration order.

This means that a lovingly hand-crafted example of this will probably only reproduce the problem some fraction of the time, and that no matter how many times you run it in a given test run it will always produce the same answer.

Hypothesis on the other hand? No problem. It just throws examples at the problem until it finds one that works with the current hash iteration order.

This is really interesting to me because it’s an example where property based testing is not just normal testing where you don’t have to explicitly provide the examples. In this case the system is actively fighting against your ability to produce examples, but Hypothesis just shrugs, does its thing and gives you one anyway.

All told, despite the ridiculously terrible code that was the source of this, I think this is pretty much a success story for Hypothesis: the policy of failing the build if there’s not 100% coverage is working out really well, because it makes me find areas of the codebase which are merely erratically covered and give them some serious attention, and the ability of Hypothesis to find and reduce interesting examples continues to impress.

If you’re interested, here is the commit that fixes this. Somewhat embarrassingly, it only works because of a typo, but it does still work rather well.


The things you forget to test

As mentioned recently, Hypothesis has 100% branch coverage.

There’s a line which I often attribute to one of the people who develops SQLite, but now can’t find a reference for:

100% coverage tells you nothing, but less than 100% coverage tells you something.

(The implication being that the something it tells you is “Your code isn’t well tested enough”. Thanks Joke Explainer).

I’m somewhat on the fence as to whether 100% coverage is necessary. I think there are a lot of cases where getting to it is going to be more pain than it’s really worth. A self-contained stand-alone library like Hypothesis is very much not in that category – 100% coverage is both achievable and helpful there – but testing e.g. recommendation algorithms that do complex data processing on external services (err, hypothetically) is something where there are going to be edge cases which are a real nuisance to test properly, and you’d basically have to mock out everything interesting in order to get to the code. For cases like that I suspect 100% coverage is still achievable but probably not really worth achieving.

But.

There are two areas that 100% branch coverage forces me to deal with, which I inevitably find are the last bits to remain untested when I think a code base is well enough tested. Even once I’ve got 100% branch coverage, when I write a bunch of new code that I think is properly tested, these are usually the bits that earn me an angry email from Travis telling me the build has failed because I’ve forgotten to test them. Further, I think you should be testing them even if 100% branch coverage is not a goal.

These are:

  1. Negative testing: Testing that if you do the wrong thing, it throws an exception or fails in some other way.
  2. Testing your object pretty printing.

Somehow these live in blind spots where I have to consciously remind myself that I should be testing them and will otherwise forget entirely.

Are they important to test though?

Well, let me ask you two more questions before I answer that one:

  1. How do you feel about silent data corruption?
  2. Do you like being able to debug errors in your program when it goes wrong?

I’m going to assume the answers are “not positively” and “Yes”. If instead you think that silent data corruption errors where you have no useful debugging information sounds like a super fun game then you can stop reading now.

The result of not properly validating your arguments and getting passed the wrong value will often be silent data corruption. Some of the most annoying things I’ve had to debug have been when something was passed a value that it couldn’t actually handle but just blithely assumed that it could and did the wrong thing rather than raising an error. This is a lot easier to do in dynamic languages of course (examples include when someone has passed a single element list of strings instead of a string, or when someone has passed a dict when there was supposed to be an ORM model), but it’s not hard to imagine cases that can happen in a statically typed language either (e.g. nullness checks in Java, or expecting a non-empty list in real statically typed languages, etc.).

Now, in general, when coverage complains about your data validation not being covered, that’s a sign that you actually are checking your arguments. But it can still be worth testing that: it makes sure the code doesn’t go away, and it makes sure that the errors you get when it fires make sense (exceptions thrown during error handling are the most embarrassing exceptions). And even when coverage isn’t complaining, it’s worth remembering that you should probably be testing the negative path as well as the positive one.
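As a concrete (and entirely made-up) example of the first kind of test, in pytest style:

    import pytest

    def set_retries(count):
        # A made-up function that validates its argument rather than
        # silently accepting garbage.
        if not isinstance(count, int) or count < 0:
            raise ValueError("count must be a non-negative integer, got %r" % (count,))
        return count

    def test_rejects_negative_retries():
        with pytest.raises(ValueError):
            set_retries(-1)

    def test_rejects_list_instead_of_int():
        with pytest.raises(ValueError):
            set_retries(["3"])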

The pretty printing one I’m more ambivalent about. Whenever I write tests for my object representations I always feel embarrassingly pedantic.

And then my tests for my object representations find bugs in them and I get to bask in the smug glow of having done the right thing.

(I have a test that uses hypothesis to generate random descriptors, get the strategy for them and test that the repr for that strategy does not match a certain regexp. You would not believe how smug I was when that test found a bug that I would not have otherwise caught in testing).

Fun fact: just after I wrote the above I thought “Oh, there’s an aspect of my reprs I haven’t tested. I should test that” and added another check to the recursive self-test for the reprs (basically that the repr evaled to something equal to the object). I found bugs.

So smug.

Uh, but anyway, am I still being pedantic? Does getting your repr right actually matter that much? Who cares if there are bugs in unimportant code?

Well, it matters in most of the programs I write. I write essentially three types of programs:

  1. Libraries and similar tooling
  2. Long running server programs
  3. Hacky scripts that are mostly for my personal consumption

For the third they’re not a big deal, which is great because I’m not really writing tests for those anyway.

For the first two they’re actually pretty important.

For libraries, your object representation is part of your user interface. It’s hopefully uncontroversial (though probably isn’t) that your user interface should be tested. Moreover it’s easy to test your user interface here, so you really should.

For long running server processes, this is where my “Do you like being able to debug your programs?” question comes in. It’s really annoying when your error message says something like “Error calling method foo on #<Project 1234>”. It’s important to have good string representations of your objects because those are what’s going to show up in your logs.

So it’s important to have good string representations. But your string representation is code, and because it’s code it’s probably going to have bugs in it. You can find those bugs in one of two ways: you can find them when you test your code, or you can find them when you’re looking through your server logs and you find that the thing you’re getting an error on is showing up as “Project(name=%s)” and that’s why you’re here shaving this yak.

So I think both of these things are things worth testing, and what’s nice is that they’re things that are easy to test. They mostly don’t suffer the problems that make some things hard to test – they don’t generally involve complicated set up or have a lot of external dependencies. You just create some objects or call some functions and inspect the output.
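For instance, a repr round-trip check in the spirit of the one described above (made-up class; Hypothesis’s own self-tests are fancier than this):

    class Project:
        def __init__(self, name):
            self.name = name

        def __repr__(self):
            return "Project(name=%r)" % (self.name,)

        def __eq__(self, other):
            return isinstance(other, Project) and self.name == other.name

    def test_repr_round_trips():
        # The repr should evaluate back to something equal to the object.
        project = Project(name="hypothesis")
        assert eval(repr(project)) == project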

More generally, I still stand by the idea that 100% branch coverage isn’t always going to be worth it, but I think what is worth it is working on at least one non-trivial project with 100% branch coverage. It forces you to take a hard look at testing and think about what you can and can’t test, and what you can test but rarely do. This is a really valuable experience that I think will improve your development skills a lot.


The pain of randomized testing

Some people really don’t like randomized testing. At the large company I’ve just left this was definitely a common opinion, and it made sense that it would be – in a large code base any test that can fail randomly will fail randomly.

You’d think as the author of a randomized testing library I’d disagree with this sentiment.

If that’s the case, you probably haven’t been paying attention to what fraction of Hypothesis’s builds are red. About a quarter of those are spurious failures from randomized testing.

Now, testing Hypothesis itself is a very different beast from testing things with Hypothesis. The sort of random failures that are terrible are false positives: a test that fails because it has a non-zero probability of failing even when the code is correct, not because it has a non-1 chance of finding a real bug. Hypothesis as a general rule will give you a lot of possible false negatives (it can only cover so much of the search space) and generally will not give you false positives. Unfortunately this means that in testing Hypothesis itself the situation is inverted: because it’s testing that it can find counter-examples, the false negatives in finding counter-examples become false positives in Hypothesis’s test suite.

I’m doing my best all the time to try to make Hypothesis’s test suite less flaky while still keeping it as thorough as it currently is (ideally more thorough, given that every time I figure out a new method for testing Hypothesis I find new bugs). Sometimes this involves making the tests slightly less strict (sad face), sometimes it involves improving the search algorithm, sometimes it involves allowing the tests to run for longer to make sure they can find an example eventually. Sometimes it involves crying and clicking rerun build on Travis and hoping that it passes this time.

But ultimately this is also a problem that’s going to affect you, the user. Sometimes Hypothesis will fail because it can’t find enough examples – either because you have a hard to satisfy assumption in your tests or because it couldn’t find enough distinct examples. These currently cause errors (I plan to let you downgrade them to warnings so you can run in a no-false-positives mode, but this has its own dangers – your entire CI could have accidentally become useless and just isn’t telling you).

So the false positives problem is annoying but mostly solvable. Unfortunately it’s not like false negatives are that much better. There are two major problems with false negatives (I mean, other than the fact that there’s a bug in your code that you don’t know about):

  1. They can cause the build to fail when it’s not your fault
  2. When a test breaks it should stay broken until you’ve fixed the problem. If you know there’s a test with a false negative there you’re basically forced to manually create a test reproducing its problem in order to start work on it. This is a nuisance.

The obvious solution here is to specify a fixed seed for your builds. This needs to be specified per test, but is otherwise a fairly viable solution. Indeed I’m planning to make this an option in the next point release.

But there’s one major problem with this solution: it solves the false negative problem by basically just going “la la la there is no false negative problem”. You solve it by ensuring that you will never discover the counter-example that you’re missing.

It also doesn’t interact brilliantly with the problem of ending up in a bad region of the search space purely by chance – if your test is failing because it’s not finding enough examples then your test will stay failing. This then requires some manual intervention to tinker with the seed. None of this is terrible or insoluble, but it’s all a bit unsatisfying.

However, you can combine two of the features I have planned for Hypothesis into a much better solution. You can run Hypothesis in what I think of as “fuzzer mode”.

Basically, if Hypothesis were billed as a fuzzer instead of a property based testing library, the very idea of running it as part of your CI would seem ludicrous. That’s just not how you use a fuzzer! If you look at something like AFL, the way you run it is basically that you give it a program to try to break and then you walk away, and every now and then it goes “oh, cool, I found something new” and saves a new interesting example. This isn’t something you run as part of your CI; this is something you spend as much cycle time on as you’re willing to give it, and then add its output to your CI. The examples it produces are the CI tests, not the fuzzer itself.

In testmachine (which is dead but will live again inside of Hypothesis) I solved this by having it generate code which you could copy and paste into your test suite. I don’t actually like this solution very much – it’s annoyingly manual and requires a lot of jumping through hoops to make it work properly. It’s one of those solutions that is more clever than good.

But the idea is right: When you find a failure you should get a test case.

So the plan is this: give Hypothesis a “test discovery” mode. This is basically a long running process that builds up a database of test examples. Whenever it finds something new and interesting, it adds it to the database. (You could easily set up some automation around this to e.g. check the database in to your repository and automatically issue a pull request every time a new version is available.)

You can then run your test suite just against that database. It looks at the examples in there and runs all matching examples through your test, then if none of them fail it declares it good. No concern about search space exhaustion, etc.
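None of this exists yet, but the replay half is easy to picture. As a purely hypothetical sketch (the file name, format and property are invented for illustration):

    import json

    import pytest

    # A saved database of previously-discovered interesting examples.
    with open("examples/sorting.json") as f:
        SAVED_EXAMPLES = json.load(f)

    @pytest.mark.parametrize("xs", SAVED_EXAMPLES)
    def test_sorting_is_idempotent(xs):
        # Deterministic: only ever runs the examples already in the database.
        assert sorted(sorted(xs)) == sorted(xs)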

This solves myriad problems: your CI is no longer random, with neither false positives nor false negatives; you get much better coverage because you can spend as many cycles as you like on finding new examples; and as a bonus your tests become a lot faster because they’re no longer spending all that time on example generation and minimization. It’s a win/win situation.

It doesn’t solve my problem of course, but that’s because Hypothesis has to test a whole bunch of things that are intrinsically random. But I guess I’ll just have to suffer that so you don’t have to.
