Notes on Hypothesis performance tuning

A common question I get asked is “How fast is Hypothesis?”

The answer I give them, and the answer I’ll give now, is “N times as slow as your code, where N is the number of examples you run”.

It’s relatively simple maths: the time taken to run your tests is the number of times Hypothesis runs your test * the amount of time it takes to run a single test + the overhead Hypothesis adds. The overhead is usually much smaller than the first part and thus ignorable (and when it’s not, it’s usually because you’ve found a performance bug in Hypothesis rather than an intrinsic limitation, and I’d appreciate a report).

Unfortunately this answer is in the category of “true but unhelpful”: a lot of people, when they run Hypothesis for the first time, find that N * the run time of their individual tests is really quite slow. So I thought I’d share some tips for fixing this.

The first ones are the easy ones: Make N smaller.

The Hypothesis settings mechanism allows you to easily configure the number of examples. You can either pass a Settings object to @given as the settings keyword argument, or you can set values on Settings.default.

The two settings of interest here are max_examples and timeout. max_examples causes Hypothesis to stop after it’s tried this many examples which satisfied the assumptions of your test. It defaults to 200, which is perfectly reasonable if your tests are testing a relatively small bit of pure code that doesn’t talk to any external systems, but if that’s not representative of what you’re testing you might want to turn it down a notch. In particular if you’re using the Django integration, which has to set up and roll back a test database on each run, you might want to cut this back as far as 50.
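
To make that concrete, here’s a minimal sketch of both approaches (the strategy imports and the example property are my own illustration, not from anything above; only the settings themselves are as described):

from hypothesis import given, Settings
from hypothesis import strategies as st

# Option 1: change the global default, which affects every test that
# doesn't override it.
Settings.default.max_examples = 50

# Option 2: pass a Settings object to @given via the settings keyword
# argument, which affects just this one test.
@given(st.lists(st.integers()), settings=Settings(max_examples=50))
def test_sorting_twice_is_the_same_as_sorting_once(xs):
    assert sorted(sorted(xs)) == sorted(xs)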

You can also set a timeout, which is a number of seconds after which Hypothesis will stop running (this is a soft rather than hard limit – it won’t interrupt you mid-test). In fact there already is one – it defaults to 60 seconds, which is designed to be well outside the range of normal operation but is useful if something goes a bit awry.

A caveat if you’re using timeout: Hypothesis considers not having been able to try enough examples an error, because it tends to indicate a flaw in your testing that it’s better not to hide. If you’re setting the timeout really low compared to the length of your test runs and it’s resulting in only a couple of examples running, you may want to lower the min_satisfying_examples parameter. However the default is 5, which is already quite low, so if you feel the need to drop it below that you might want to try the following tips first. If those don’t help, consider that this test might just be intrinsically slow, or that Hypothesis might not be the right tool for it.
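
As a rough sketch (the numbers and the test here are purely illustrative), such a configuration might look like this:

from hypothesis import given, Settings
from hypothesis import strategies as st

# A soft budget of about 10 seconds per test, and tolerate as few as 3
# satisfying examples before Hypothesis complains it couldn't try enough.
slow_settings = Settings(timeout=10, min_satisfying_examples=3)

@given(st.lists(st.integers()), settings=slow_settings)
def test_reversing_twice_is_a_no_op(xs):
    assert list(reversed(list(reversed(xs)))) == xs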

Finally, one thing to watch out for is that using assume() calls can increase N, because examples which fail the assumption are not counted towards max_examples. If you’re using assume, try to make assumptions as early in the test as possible so that rejected examples abort quickly. This is a particular problem in the Django integration, because there’s an intrinsic cost to running each test even if it doesn’t do anything. filter() uses the same mechanism as assume() under the hood, so it has the same problems.
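
To illustrate (the mean property here is just an example of mine), this is roughly what “assume early” and the filter() equivalent look like:

from hypothesis import assume, given
from hypothesis import strategies as st

@given(st.lists(st.integers()))
def test_mean_lies_between_min_and_max(xs):
    # Check the assumption before doing anything expensive, so rejected
    # examples are abandoned as cheaply as possible.
    assume(len(xs) > 0)
    assert min(xs) <= sum(xs) // len(xs) <= max(xs)

# Roughly the same rejection behaviour, expressed on the strategy instead:
@given(st.lists(st.integers()).filter(lambda xs: len(xs) > 0))
def test_mean_lies_between_min_and_max_again(xs):
    assert min(xs) <= sum(xs) // len(xs) <= max(xs)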

Now, on to making your tests faster, but first an aside on debugging.

You can get a much better idea of what Hypothesis is doing by setting the verbosity level. You can do this programmatically:

from hypothesis import Settings, Verbosity
Settings.default.verbosity = Verbosity.verbose

Or you can do this by setting the HYPOTHESIS_VERBOSITY_LEVEL environment variable to verbose (see the documentation for more details on verbosity). This will cause Hypothesis to print out each example it tries. I often find this useful because you can see noticeable pauses if a particular example is causing your code to be slow. I know better diagnostics would also be useful here. Sorry.

Anyway, faster tests.

Your tests might just be slow because your code is intrinsically slow. I can’t help you there, sorry. But they might also be slow because Hypothesis is giving you much bigger examples than you wanted.

Most of the strategies used for data generation come with knobs you can twiddle to change the shape of the examples you get. In particular, strings and all the collections come with max_size and average_size parameters you can use to control the size of what you get. The defaults are an average_size of Settings.default.average_list_length (the name is slightly historical), which is 25 by default, and an unbounded max_size.
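
For example (the sizes here are arbitrary, just to show the knobs):

from hypothesis import given
from hypothesis import strategies as st

# Aim for around 5 elements on average and never more than 20.
small_lists = st.lists(st.integers(), average_size=5, max_size=20)
short_text = st.text(average_size=10, max_size=50)

@given(small_lists, short_text)
def test_with_smaller_examples(xs, s):
    assert len(xs) <= 20 and len(s) <= 50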

In general I wouldn’t recommend turning the collection size down for simple data, but if you’re e.g. generating lists of Django models you might want to turn it down, and if you’re generating lists of text, lists of lists, or similar you might want to turn down the size of the inner collections.
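
For nested collections that might look something like this (again, the exact numbers are just illustrative):

from hypothesis import given
from hypothesis import strategies as st

# Leave the outer list at its default size but cap the inner ones.
lists_of_short_lists = st.lists(st.lists(st.integers(), average_size=3, max_size=10))

@given(lists_of_short_lists)
def test_flattening_preserves_length(xss):
    flattened = [x for xs in xss for x in xs]
    assert len(flattened) == sum(len(xs) for xs in xss)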

You might also want to investigate whether the reason Hypothesis’s default sizes are a problem for you is unintentionally quadratic behaviour or something similar. If turning the list size down helps a lot, that might point to a problem in your code.

Another thing you can try is to make your tests smaller: QuickCheck, the project on which Hypothesis is originally based, is designed for ultra-small, incredibly fast tests. Hypothesis is a bit more liberal in this regard, but it still works best with small, fast tests, and the larger integrationish tests might be better reserved for either not using Hypothesis at all, or for running with a custom set of settings that uses a much smaller number of examples than the rest of your tests (as per above).
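
One way to do that last part (a sketch only – the pipeline function is a trivial stand-in for whatever slow thing you’re actually testing) is to keep a dedicated Settings object for the heavier tests:

from hypothesis import given, Settings
from hypothesis import strategies as st

# A much smaller example budget, reserved for the slower integration-style tests.
integration_settings = Settings(max_examples=20)

def run_expensive_pipeline(s):
    # Stand-in for the real, slow thing under test.
    return s + s

@given(st.text(), settings=integration_settings)
def test_pipeline_output_starts_with_its_input(s):
    assert run_expensive_pipeline(s).startswith(s)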

Weirdly, another thing you can try is to make your tests bigger. It may be that you’re duplicating a lot of work between tests. For example if you find yourself with two tests that look like:

@given(foos())
def test1(foo):
   do_some_expensive_thing(foo)
   if a_thing_happened():
      check_some_stuff()
 
@given(foos())
def test2(foo):
   do_some_expensive_thing(foo)
   if not a_thing_happened():
      check_some_other_stuff()

Then you might want to consider bundling these together as

@given(foos())
def test1(foo):
   do_some_expensive_thing(foo)
   if a_thing_happened():
      check_some_stuff()
   else:
      check_some_other_stuff()

(A concrete example is the one I cover in my Hypothesis for Django talk, where the end result is probably a bit more complicated than what you would naturally write in a test.)

If none of the above works, there’s one final thing you can try (you can even try this earlier in the process if you like): asking for help. There’s a mailing list and an IRC channel. I’m usually pretty good about responding to requests for help on the mailing list, and I’m normally around on the IRC channel during sensible hours in UK time (although I’m going to be on EST and intermittently connected to the internet for the next few weeks). Even if I’m not around, there are plenty of people who are and who may also be able to help.
