Hypothesis 1.9.0 is out

Codename: The great bundling

This is my favourite type of release: One which means I stop having to look embarrassed in response to common questions.

Here’s the changelog entry:

Codename: The great bundling.

This release contains two fairly major changes.

The first is the deprecation of the hypothesis-extra mechanism. From now on, all the packages that were previously bundled under it, other than hypothesis-pytest (which is a different beast and will remain separate), are merged into Hypothesis itself. The functionality remains unchanged and you can still import them from exactly the same locations; they just are no longer separate packages.

The second is that this release introduces a new way of building strategies which lets you build them up recursively from other strategies.

It also contains the minor change that calling .example() on a strategy object will give you examples that are more representative of the actual data you’ll get. There used to be some logic in there to make the examples artificially simple but this proved to be a bad idea.

“How do I do recursive data?” has always been a question which I’ve had to look embarrassed about whenever anyone asked me. Because Hypothesis has a, uh, slightly unique attitude to data generation I’ve not been able to use the standard techniques that people use to make this work in Quickcheck, so this was a weak point where Hypothesis was simply worse than the alternatives.

I got away with it because Python is terrible at recursion anyway so people mostly don’t use recursive data. But it was still a bit of an embarrassment.

Part of the problem here is that you could do recursive data well enough using the internal API (not great – it’s definitely a bit of a weak point even there), but the internal API is not part of the public API and is decidedly harder to work with than the main public API.

The solution I ended up settling on is to provide a new function that lets you build up recursive data by specifying a base case and an expansion function and then getting what is just a fixed point combinator over strategies. In the traditional Hypothesis manner it uses a bizarre mix of side effects and exceptions internally to expose a lovely clean functional API which doesn’t let you see any of that.
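
For a sense of what this looks like in practice, here’s a minimal sketch using the recursive() function in hypothesis.strategies to build JSON-like data (the shape follows the description above; check the current documentation for the exact details):

from hypothesis import strategies as st

# Base case: simple JSON-like leaf values.
json_leaves = st.none() | st.booleans() | st.floats() | st.text()

# Expansion function: given a strategy for children, wrap them in lists
# and dicts. recursive() is effectively a fixed point over this expansion.
json_values = st.recursive(
    json_leaves,
    lambda children: st.lists(children) | st.dictionaries(st.text(), children),
)

print(json_values.example())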

The other embarrassing aspect of Hypothesis is how all the extra packages work. There are all these additional packages which have to be upgraded in lockstep with Hypothesis and behave like second class citizens – e.g. there have never been good changelogs and release announcements for them. It’s a common problem that people fail to upgrade one and get confusing error messages when things try to load and don’t work.

This release merges these all into Hypothesis core and leaves installing their dependencies up to you. You can install those dependencies using a setuptools extra, so e.g. installing hypothesis[django] will install Hypothesis + compatible versions of all the dependencies – but there’s no checking of versions etc. when you use them. I may add that checking if it turns out to be a major problem that people try to use these with the wrong version of dependencies, but I also may not. We’ll see.


A terrible/genius idea

In a conversation with Mark Jason Dominus and Pozorvlak on Twitter about GHC compile error messages I realised there was a common pattern of problem:

Compile errors are confusing, but they are confusing in predictable ways: the same error message that is completely unintuitive to a human tends to be emitted for the same underlying pattern of mistake.

Writing good error messages is hard, but this is the sort of thing which can be debugged better by a computer than by a human. A simple Bayesian text classifier could probably map these error messages to diagnostic suggestions extremely well, and sometimes those suggestions are all you need to put you on the right path.
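
As a rough illustration of how little machinery that needs, here’s a toy sketch with scikit-learn (the error messages and diagnoses are made up, and the classifier choice is just one obvious option):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: (stderr text, human-friendly diagnosis) pairs.
messages = [
    "No instance for (Show a) arising from a use of 'print'",
    "Couldn't match expected type 'Int' with actual type '[Char]'",
]
diagnoses = [
    "Add a type annotation so GHC can pick a concrete Show instance",
    "You passed a String where an Int was expected",
]

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(messages, diagnoses)

print(classifier.predict(["Couldn't match expected type 'Bool' with actual type '[Char]'"]))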

Moreover we can crowdsource gathering the data.

Here is a plan: a simple wrapper script which you can alias to any program you like. It executes that program, passing all the arguments through.

If it ever gets a non-zero exit code, it looks at the stderr output and attempts to run it through a classifier. It then says “Hey, I think this could be one of these problems”.

If soon after it sees you run the program again with exactly the same arguments and it now gets a success, it says “Great, you fixed it! Was it one of the errors we suggested? If not, could you provide a short diagnostic?” and submits your answer back to a central service. The service then regularly builds machine learning models (one per base command) which it ships back to you on demand / semi-regularly.
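
Here’s a minimal sketch of the wrapper half of that plan (everything here is hypothetical – the suggestion function is a stub where a trained model shipped from the central service would plug in, and the “did you fix it?” follow-up is left out):

import subprocess
import sys

def suggest_diagnoses(stderr_text):
    # Stub: a trained classifier (e.g. the one sketched above) would go here.
    return []

def main():
    # Run the wrapped command, passing all arguments through unchanged.
    result = subprocess.run(sys.argv[1:], capture_output=True, text=True)
    sys.stdout.write(result.stdout)
    sys.stderr.write(result.stderr)
    if result.returncode != 0:
        suggestions = suggest_diagnoses(result.stderr)
        if suggestions:
            print("Hey, I think this could be one of these problems:")
            for suggestion in suggestions:
                print(" -", suggestion)
        # Recording the command line here is what would let a later
        # successful run with the same arguments trigger the feedback prompt.
    sys.exit(result.returncode)

if __name__ == "__main__":
    main()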

You need some additional functionality to prevent well-poisoning and similar, but I think the basic concept wouldn’t be too hard to get something up and running with.

I’m not going to do it though – I’ve more than enough to do, and this doesn’t actually help with any problems I currently have regularly. If anyone wants to take the idea and run with it, please do so with my blessing.


Two new Hypothesis releases and continuous deployment of libraries

I forgot to mention it because it was so small, but I shipped a Hypothesis 1.8.1 release yesterday. It’s a workaround for Python 2 being terrible at unicode (please upgrade to Python 3) and a common way of falling victim to that. Today I’ve shipped 1.8.2. It was also going to be a single bug fix release, but the Travis build didn’t finish before I got on the plane, so I wrote another fix mid-Atlantic. So it’s a two bug-fix release:

  • When using ForkingTestCase you would usually not get the falsifying example printed if the subprocess exited abnormally (e.g. due to os._exit).
  • Improvements to the distribution of characters when using text() with the default alphabet. In particular it now produces a better distribution of ASCII characters and whitespace.

If either of those things sound like a big deal to you, feel free to upgrade.

One thing that’s interesting is that the fact that I do this is apparently really weird. This sort of “I fixed a bug. Let’s ship a patch release!” behaviour just doesn’t seem to happen, and I don’t understand why. I’ve often found myself incredibly frustrated because a library I was using hasn’t shipped a new version for months and there’s a one line patch that fixes my problem sitting in master.

If you’re using semantic versioning or similar, shipping a patch release is cheap. It doesn’t break anything (and if you accidentally introduced a new bug, people can just not upgrade), it makes people’s lives better, so why not do it? As long as you’ve got a reasonably good test suite, all the usual arguments for continuous deployment apply – you’re less likely to break things when shipping small changes, and if something does go wrong it’s much easier to figure out what. Moreover your users will love you to bits – people are so happy when I go “Oh yeah, that’s a bug. Here’s a new version that fixes it”.

Usual arguments about free labour of course apply. You shouldn’t feel obligated to do this, especially if you can’t free up the time or brain for working on something right now, but if you can do this I think you’ll find it will make your life easier in the long run and users will love you for it.


Notes on Hypothesis performance tuning

A common question I get asked is “How fast is Hypothesis?”

The answer I give then and the answer I’ll give now is “N times as slow as your code where N is the number of examples you run”.

It’s relatively simple maths: the time taken to run your tests is the number of times Hypothesis runs your test * the amount of time it takes to run a single test + the overhead Hypothesis adds. The overhead is usually much smaller than the first part and thus ignorable (and when it’s not, it’s usually because you’ve found a performance bug in Hypothesis rather than an intrinsic limitation, and I’d appreciate a report).

Unfortunately this answer is in the category of “true but unhelpful”, as a lot of people when they run Hypothesis for the first time find that N * the speed of their individual tests is really quite slow. So I thought I’d share some tips for fixing this.

The first ones are the easy ones: Make N smaller.

The Hypothesis settings mechanism allows you to easily configure the number of examples. You can either pass a Settings object to @given as the settings keyword argument, or you can set values on Settings.default.

The two settings of interest here are max_examples and timeout. max_examples causes Hypothesis to stop after it’s tried this many examples which satisfied the assumptions of your test. It defaults to 200, which is perfectly reasonable if your tests are testing a relatively small bit of pure code that doesn’t talk to any external systems, but if that’s not representative of what you’re testing you might want to turn it down a notch. In particular if you’re using the Django integration, which has to set up and roll back a test database on each run, you might want to cut this back as far as 50.
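
For example, with the 1.x API described above (passing a Settings object to @given as a keyword argument), cutting a test back to 50 examples looks roughly like this:

from hypothesis import Settings, given
from hypothesis import strategies as st

# Run at most 50 examples for this (hypothetical) test.
@given(st.lists(st.integers()), settings=Settings(max_examples=50))
def test_sorting_is_idempotent(xs):
    assert sorted(sorted(xs)) == sorted(xs)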

You can also set a timeout, which is a number of seconds after which Hypothesis will stop running (this is a soft rather than hard limit – it won’t interrupt you mid-test). In fact there already is one – it’s 60 seconds, which is designed to be well outside the range of normal operation but is useful if something goes a bit awry.

A caveat if you’re using timeout: Hypothesis considers not having been able to try enough examples an error, because it tends to indicate a flaw in your testing that it’s better not to hide. If you’re setting the timeout really low compared to the length of your test runs and it’s resulting in only a couple of examples running, you may want to lower the min_satisfying_examples parameter. However the default is 5, which is already quite low, so if you feel the need to drop it below that you might want to try the following tips first, and if those don’t help, consider that this test might just be intrinsically slow, or that Hypothesis may not be the right tool for it.
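
Adjusting these globally works the same way as any other setting – roughly, assuming the attribute names above:

from hypothesis import Settings

# Stop generating new examples after about 10 seconds per test,
# and tolerate as few as 3 satisfying examples before complaining.
Settings.default.timeout = 10
Settings.default.min_satisfying_examples = 3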

Finally, one thing to watch out for is that using assume() calls can increase N, because examples rejected by assume are not counted towards max_examples. If you’re using assume, make your assumptions as early in the test as possible so that rejected examples abort quickly. Again this is more of a problem in the Django integration, because there’s an intrinsic cost to running each test even if it does nothing. Note that filter() uses the same mechanism as assume() under the hood (I’ve got some work on changing that but it’s proven slightly problematic), so the same advice applies there.
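
As a small illustration of putting assume() first, so that rejected examples bail out before doing anything expensive:

from hypothesis import assume, given
from hypothesis import strategies as st

@given(st.lists(st.integers()))
def test_mean_is_within_bounds(xs):
    assume(xs)  # reject the empty list up front, before any real work
    mean = sum(xs) / len(xs)
    assert min(xs) <= mean <= max(xs)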

Now, on to making your tests faster, but first an aside on debugging.

You can get a much better idea of what Hypothesis is doing by setting the verbosity level. You can do this programmatically:

from hypothesis import Settings, Verbosity
Settings.default.verbosity = Verbosity.verbose

Or you can do this by setting the HYPOTHESIS_VERBOSITY_LEVEL environment variable to verbose (see the documentation for more details on verbosity). This will cause Hypothesis to print out each example it tries. I often find this useful because you can see noticeable pauses if a particular example is causing your code to be slow. Better diagnostics would also be useful here, I know. Sorry.

Anyway, faster tests.

Your tests might just be slow because your code is intrinsically slow. I can’t help you there, sorry. But they might also be slow because Hypothesis is giving you much bigger examples than you wanted.

Most of the strategies used for data generation come with knobs you can twiddle to change the shape of the examples you get. In particular, strings and all the collections come with max_size and average_size parameters you can use to control the size of what you get. The defaults are an average_size of Settings.default.average_list_length (the name is slightly historical), which is 25, and an unbounded max_size.

In general I wouldn’t recommend turning the collection size down for simple data, but if you’re e.g. generating lists of Django models you might want to turn it down, and if you’re generating lists of text or lists of lists or similar you might want to turn down the size of the inner collections.
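
For example, using the max_size and average_size parameters mentioned above to keep the inner collections small when generating lists of text:

from hypothesis import strategies as st

# Short strings on the inside, modestly sized lists on the outside.
small_lists_of_text = st.lists(
    st.text(average_size=5, max_size=20),
    average_size=10,
    max_size=30,
)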

You might also want to investigate if the reason Hypothesis’s default sizes are a problem for you is unintentionally quadratic behaviour or similar. If turning the list size down helps a lot this might be a problem with your code.

Another thing you can try is to make your tests smaller: Quickcheck, the project on which Hypothesis was originally based, is designed for ultra-small, incredibly fast tests. Hypothesis is a bit more liberal in this regard, but it still works best with small, fast tests, and the larger integration-ish tests might be better reserved for either not using Hypothesis at all, or for running with a custom set of settings that uses a much smaller number of examples than the rest of your tests (as per above).

Weirdly, another thing you can try is to make your tests bigger. It may be that you’re duplicating a lot of work between tests. For example if you find yourself with two tests that look like:

@given(foos())
def test1(foo):
   do_some_expensive_thing(foo)
   if a_thing_happened():
      check_some_stuff()
 
@given(foos())
def test2(foo):
   do_some_expensive_thing(foo)
   if not a_thing_happened():
      check_some_other_stuff()

Then you might want to consider bundling these together as

@given(foos())
def test1(foo):
   do_some_expensive_thing(foo)
   if a_thing_happened():
      check_some_stuff()
   else:
      check_some_other_stuff()

(A concrete example is the one I cover in my Hypothesis for Django talk, where the end result is probably a bit more complicated than what you would naturally write in a test.)

If none of the above works there’s one final thing you can try (you can even try this earlier in the process if you like): asking for help. There’s a mailing list and an IRC channel. I’m usually pretty good about responding to requests for help on the mailing list, and I’m normally around on the IRC channel during sensible hours in UK time (although I’m going to be on EST and intermittently connected to the internet for the next few weeks). Also, if I’m not around there are plenty of people who are, and they may also be able to help.


Hypothesis 1.8.0 is out

This release is mostly focused on internal refactoring, but has some nice polishing and a few bug fixes.

New features:

  • Much more sensible reprs for strategies, especially ones that come from hypothesis.strategies. Their reprs should now be Python code that would produce the same strategy.
  • lists() accepts a unique_by argument which forces the generated lists to contain only elements that are unique according to some key function (which must return a hashable value); there’s a quick sketch just after this list.
  • Better error messages from flaky tests to help you debug things.
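
A quick sketch of unique_by in use (the key function just needs to return something hashable):

from hypothesis import strategies as st

# Lists of pairs that are unique by their first element.
unique_pairs = st.lists(
    st.tuples(st.integers(), st.text()),
    unique_by=lambda pair: pair[0],
)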

Mostly invisible implementation details that may result in finding new bugs in your code:

  • Sets and dictionary generation should now produce a better range of results.
  • floats with bounds now focus more on ‘critical values’, trying to produce values at edge cases.
  • flatmap should now have better simplification for complicated cases, as well as generally being (I hope) more reliable.

Bug fixes:

  • You could not previously use assume() if you were using the forking executor (because the relevant exception wasn’t pickleable).