The pain of randomized testing

Some people really don’t like randomized testing. At the large company I’ve just left this was definitely a common opinion, and it made sense that it would be – in a large code base any test that can fail randomly will fail randomly.

You’d think as the author of a randomized testing library I’d disagree with this sentiment.

If that’s the case, you probably haven’t been paying attention to what fraction of Hypothesis’s builds are red. About a quarter of those are spurious failures from randomized testing.

Now, testing Hypothesis itself is a very different beast from testing things with Hypothesis. The sort of random failures that are terrible are false positives: a test that fails because it has a non-zero probability of failing even when the code is correct, not because it has a less-than-certain chance of finding a real bug. As a general rule, Hypothesis will give you plenty of possible false negatives (it can only cover so much of the search space) but will generally not give you false positives. Unfortunately, in testing Hypothesis itself this is inverted: because the test suite checks that Hypothesis can find counter-examples, false negatives in finding counter-examples become false positives in Hypothesis’s test suite.
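To put rough numbers on the false-negative side (my own back-of-the-envelope illustration, not figures from Hypothesis): if a bug is triggered by 1% of the input space and a run draws 200 independent uniform examples, the run misses the bug about 13% of the time. When the test suite is checking that the bug *gets found*, every one of those misses is a spurious red build.

```python
# Back-of-the-envelope false-negative probability (illustrative numbers;
# real example generators are neither uniform nor independent).
p = 0.01   # fraction of inputs that trigger the bug
n = 200    # random examples drawn per test run

miss_prob = (1 - p) ** n  # probability a whole run finds no counter-example
print(f"chance a single run misses the bug: {miss_prob:.3f}")  # roughly 0.13
```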

I’m doing my best all the time to make Hypothesis’s test suite less flaky while still keeping it as thorough as it currently is (ideally more thorough, given that every time I figure out a new method for testing Hypothesis I find new bugs). Sometimes this involves making the tests slightly less strict (sad face), sometimes it involves improving the search algorithm, sometimes it involves allowing the tests to run for longer to make sure they can find an example eventually. And sometimes it involves crying, clicking “rerun build” on Travis, and hoping that it passes this time.

But ultimately this is also a problem that’s going to affect you, the user. Sometimes Hypothesis will fail because it can’t find enough examples – either because you have a hard-to-satisfy assumption in your tests or because it couldn’t generate enough distinct examples. These currently cause errors (I plan to let you downgrade them to warnings so you can run in a no-false-positives mode, but this has its own dangers: your entire CI could have quietly become useless and just isn’t telling you).
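To make that failure mode concrete, here’s a plain-Python sketch of the generate-then-filter mechanics behind a hard-to-satisfy assumption. This is my own illustration, not Hypothesis’s actual API or implementation; `Unsatisfiable` and `run_property` are hypothetical names.

```python
import random

class Unsatisfiable(Exception):
    """Raised when too few examples survive the assumption (hypothetical)."""

def run_property(test, assumption, n_examples=200, min_satisfying=5, seed=0):
    # Simplified model of generate -> filter -> check. Hypothesis's real
    # machinery is much smarter; this only demonstrates the failure mode.
    rng = random.Random(seed)
    satisfied = 0
    for _ in range(n_examples):
        x = rng.randint(0, 10**6)
        if not assumption(x):
            continue  # like a failed assume(): the example is thrown away
        satisfied += 1
        test(x)
    if satisfied < min_satisfying:
        raise Unsatisfiable(
            f"only {satisfied} of {n_examples} examples met the assumption")

# An assumption this narrow rejects almost every generated example,
# so the run errors out instead of silently testing nothing:
try:
    run_property(lambda x: None, assumption=lambda x: x % 100000 == 7)
except Unsatisfiable as e:
    print("error:", e)
```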

So the false-positive problem is annoying but mostly solvable. Unfortunately, false negatives aren’t that much better. There are two major problems with them (I mean, other than the fact that there’s a bug in your code that you don’t know about):

  1. Because discovery is random, a bug can surface on a later, unrelated commit – the build fails when it’s not your fault.
  2. When a test breaks it should stay broken until you’ve fixed the problem. If a test fails only intermittently, you’re basically forced to manually write a test reproducing the problem before you can even start work on it. This is a nuisance.

The obvious solution here is to specify a fixed seed for your builds. This needs to be specified per test, but is otherwise a fairly viable solution. Indeed I’m planning to make this an option in the next point release.
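Mechanically, “fixed seed per test” just means deriving a stable seed from each test’s identity, so every build generates exactly the same examples for a given test. A minimal sketch, assuming a name-based derivation (my illustration, not how Hypothesis actually derives seeds):

```python
import random
import zlib

def seed_for(test_name, build_seed=0):
    # Derive a stable per-test seed from the test's name, so every CI run
    # generates exactly the same examples for that test. (Illustrative;
    # not Hypothesis's actual seeding scheme.)
    return zlib.crc32(test_name.encode()) ^ build_seed

def examples_for(test_name, n=5):
    rng = random.Random(seed_for(test_name))
    return [rng.randint(0, 100) for _ in range(n)]

# Two runs of the same build agree exactly, so the build is deterministic.
assert examples_for("test_sorting") == examples_for("test_sorting")
print(examples_for("test_sorting"))
```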

But there’s one major problem with this solution: it solves the false negative problem by basically just going “la la la there is no false negative problem”. You “solve” it by ensuring that you will never discover the counter-example that you’re missing.

It also doesn’t interact brilliantly with the problem of landing in a bad part of the search space purely by chance – if your test is failing because it’s not finding enough examples, then it will stay failing, and it takes manual intervention to tinker with the seed. None of this is terrible or insoluble, but it’s all a bit unsatisfying.

However, you can combine two of the features I have planned for Hypothesis into a much better solution: running Hypothesis in what I think of as “fuzzer mode”.

Basically, if Hypothesis were billed as a fuzzer instead of a property-based testing library, the very idea of running it as part of your CI would seem ludicrous. That’s just not how you use a fuzzer! If you look at something like AFL, the way you run it is basically: you give it a program to try to break, you walk away, and every now and then it goes “oh, cool, I found something new” and saves a new interesting example. This isn’t something you run as part of your CI; it’s something you give as much cycle time as you’re willing to spare, and whose output you then add to your CI. The examples it produces are the CI tests, not the fuzzer itself.
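Here’s that workflow in miniature, with a trivial branch signature standing in for real coverage instrumentation. Everything here is illustrative – a real fuzzer like AFL runs indefinitely and persists its corpus to disk.

```python
import random

def coverage_signature(x):
    # Stand-in for real coverage instrumentation: which branch did we hit?
    if x < 0:
        return "negative"
    if x % 2 == 0:
        return "even"
    return "odd"

def discover(n_iterations=1000, seed=0):
    # Fuzzer-style discovery loop: keep any input that exhibits behaviour
    # we haven't seen before. (A bounded loop here; a real fuzzer would
    # run for as long as you're willing to give it.)
    rng = random.Random(seed)
    corpus = {}
    for _ in range(n_iterations):
        x = rng.randint(-100, 100)
        sig = coverage_signature(x)
        if sig not in corpus:
            corpus[sig] = x  # "oh, cool, I found something new"
    return corpus

print(discover())
```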

In testmachine (which is dead but will live again inside of Hypothesis) I solved this by having it generate code which you could copy and paste into your test suite. I don’t actually like this solution very much – it’s annoyingly manual and requires a lot of jumping through hoops to make it work properly. It’s one of those solutions that is more clever than good.

But the idea is right: When you find a failure you should get a test case.

So the plan is this: give Hypothesis a “test discovery” mode. This is basically a long-running process that builds up a database of test examples. Whenever it finds something new and interesting, it adds it to the database. (You could easily set up some automation around this to e.g. check the database in to your repository and automatically issue a pull request every time a new version is available.)

You can then run your test suite against just that database: it looks up the stored examples, runs every matching one through your test, and if none of them fail it declares the test good. No concern about search-space exhaustion, etc.

This solves a myriad of problems: your CI is no longer random, with neither false positives nor false negatives; you get much better coverage, because you can spend as many cycles as you like on finding new examples; and as a bonus your tests become a lot faster, because they’re no longer spending all that time on example generation and minimization. It’s a win/win situation.

It doesn’t solve my problem of course, but that’s because Hypothesis has to test a whole bunch of things that are intrinsically random. But I guess I’ll just have to suffer that so you don’t have to.

This entry was posted in Hypothesis, Uncategorized.
