As some of you might be aware, I authored a Python testing library called hypothesis.
It’s basically QuickCheck for Python. I first wrote it within about a month of learning Python, so it has some… eccentricities (I still maintain that the fact that it has its own object model with multiple dispatch is totally legit, but I admit that the prototype-based inheritance feature in it was probably misguided), but I flatter myself into thinking that it’s not only pretty good but is, by some small margin, possibly the most advanced library of its ilk.
It’s also pretty neglected. I haven’t actually released a version of it in just over a year. Part of this is because I was pretty excited about testmachine and was considering it as the successor to hypothesis, but then I didn’t do any work on testmachine either. Basically, for a variety of reasons 2014 was a pretty rough year and I didn’t get much done on my open source work during it.
Despite this neglect, people seem to have started using hypothesis. I’ve got a bunch of issues opened, a few pull requests, and PyPI stats are telling me it’s had more than 500 downloads in the last month.
And so, I appear to have failed to avoid success at all costs, and will instead embrace success and resume development on hypothesis.
So here’s the plan.
In the next few days there will be a point release of hypothesis. It will not be large – it will clear up a few bugs, maybe add one or two minor enhancements, improve Python 3 support, and generally demonstrate that things are moving again.
In the next month or two there are a variety of larger features I want to work on. Some of them are pretty damn exciting, and if I get all of them working then I’m going to remove the weasel words from the above and simply say flat out that hypothesis will be the most advanced library of its kind.
In rough order of least to most exciting, I’m going to be working on:
Logging
This is super unsexy and to be honest I will probably not be inspired enough to work on it immediately, but I very much want it to get done, as it’s both important and something people have actually asked for: hypothesis needs to report information on its progress and what it has tried. You want to know what sort of things hypothesis has actually tried on your code – e.g. whether it’s only managed to generate a very small number of examples.
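To give a flavour of what I mean, here’s a rough sketch of the kind of summary I’d want surfaced. This is purely illustrative: the function and its arguments are made up for this post, not part of hypothesis’s API.

```python
import logging

logger = logging.getLogger("hypothesis.progress")

def report_run(test_name, examples_tried, examples_rejected, elapsed):
    # Log a one-line summary of what a run actually did.
    logger.info(
        "%s: tried %d examples (%d rejected as invalid) in %.2fs",
        test_name, examples_tried, examples_rejected, elapsed,
    )
    # Warn loudly when very little data was generated, since a "pass"
    # in that situation doesn't tell you much.
    if examples_tried - examples_rejected < 10:
        logger.warning(
            "%s: only %d valid examples were run",
            test_name, examples_tried - examples_rejected,
        )
```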
Some sort of top level driver
Hypothesis is right now fundamentally a library. If you want to actually use it to write tests you need to run it from within pytest or similar.
Usage like this will 100% continue to be supported, encouraged and generally considered a first-class citizen, but I would like there to be some sort of top-level hypothesis program as well, so you can get a more tailored view of what’s going on and have a better mechanism for controlling things like timeouts, etc.
Improved data generation
The data generation code is currently both ropy and not very intelligent. It has two discrete concepts – flags and size – which interact to control how things are generated. I want to introduce a more general and unifying notion of a parameter which gives you much more fine-grained control over the shape of the distribution. This should also improve coverage a lot, make data generation more easily user-configurable, and it may even improve performance, because data generation can currently be a bit of a bottleneck in some cases.
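To illustrate roughly what I mean by a parameter (this is a sketch of the idea in plain Python, not the actual implementation, and the function names are made up): you first draw a parameter that fixes the shape of the distribution, then draw concrete values conditioned on it. Values drawn under the same parameter look alike, so varying the parameter between runs covers much more of the space than drawing everything from one fixed distribution.

```python
import random

def draw_parameter(rnd):
    # The parameter fixes the "shape" of the examples: roughly how long
    # lists should be and how large their elements tend to be.
    return {
        "average_length": rnd.expovariate(1.0 / 10.0),
        "max_element": rnd.choice([1, 10, 1000, 10 ** 6]),
    }

def draw_list_of_ints(rnd, parameter):
    # A concrete value conditioned on the parameter. Lists drawn from the
    # same parameter share a character (short lists of tiny ints, long
    # lists of huge ones, etc.), which is hard to get out of a single
    # fixed distribution.
    length = int(rnd.expovariate(1.0 / (parameter["average_length"] + 1)))
    return [rnd.randint(0, parameter["max_element"]) for _ in range(length)]

rnd = random.Random(0)
parameter = draw_parameter(rnd)
examples = [draw_list_of_ints(rnd, parameter) for _ in range(5)]
```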
Merging the ideas from TestMachine
Testmachine is pretty awesome as a concept and I don’t want it to die. It turns out I have this popularish testing library that has a lot in common with it. Let’s merge the two.
One thing that I discovered with TestMachine is that making this sort of thing work well with mutable data is actually pretty hard, so this will probably necessitate some improved support around that. I suspect this will involve a bunch of fixes and improvements to the stateful testing feature.
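For anyone who hasn’t seen testmachine, the rough idea is sketched below. This is my own toy illustration in plain Python, not testmachine’s or hypothesis’s actual API: a test case is a randomly generated program of operations acting on a stack of values, and the property under test is checked as the program runs.

```python
import random

def run_random_program(rnd, steps=50):
    # A "program" is a random sequence of operations on a stack of values.
    stack = []
    for _ in range(steps):
        op = rnd.choice(["push", "reverse", "concat"])
        if op == "push":
            stack.append([rnd.randint(0, 10) for _ in range(rnd.randint(0, 5))])
        elif op == "reverse" and stack:
            stack.append(list(reversed(stack.pop())))
        elif op == "concat" and len(stack) >= 2:
            stack.append(stack.pop() + stack.pop())
        # The property under test, checked after every operation:
        # reversing a list twice gives the original list back.
        for value in stack:
            assert list(reversed(list(reversed(value)))) == value

run_random_program(random.Random(0))
```

Minimising a failing program of this sort is where things get interesting, and it’s exactly where mutable values make life hard.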
Remembering failing test cases
One of the problems with randomized testing is that tests are inherently flaky – sometimes a passing test will become a failing test, and sometimes vice versa, without any changes to the underlying code.
A passing test becoming a failing test is generally fine in the sense that it means that the library has just found an exciting new bug for you to fix and you should be grateful.
A failing test becoming a passing test on the other hand is super annoying because it makes it much harder to reproduce and fix.
One way to mitigate this is to have your library generate test cases that can be copied and pasted into your non-randomized test suite. This is the approach I took in testmachine and it’s a pretty good one.
Another approach that I’d like to explore instead is the idea of a test database which remembers failing test cases. Whenever a small example produces a failure, that example should be saved and tried first next time. Over time you build up a great database of examples to test your code with.
This also opens up the possibility of giving hypothesis two run modes: one in which it just runs for an extended amount of time looking for bugs, and another in which it runs super quickly and basically only runs on previously discovered examples. I would be very interested in such an approach.
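As a deliberately naive sketch of what I have in mind (the class name, file name and helper functions are all made up for illustration, and a JSON file keyed by test name only handles JSON-serialisable examples; the real thing would need to be rather more clever):

```python
import json
import os

class ExampleDatabase(object):
    # Failing examples are saved keyed by test name and replayed before
    # any new random generation on the next run.

    def __init__(self, path=".hypothesis-examples.json"):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def failing_examples(self, test_name):
        return list(self.data.get(test_name, []))

    def save_failure(self, test_name, example):
        examples = self.data.setdefault(test_name, [])
        if example not in examples:
            examples.append(example)
        with open(self.path, "w") as f:
            json.dump(self.data, f)

def run_test(database, test_name, test, generate_example, max_examples=200):
    # Previously failing examples run first; the "quick" mode described
    # above would simply stop after them instead of generating new data.
    candidates = database.failing_examples(test_name)
    candidates += [generate_example() for _ in range(max_examples)]
    for example in candidates:
        try:
            test(example)
        except AssertionError:
            database.save_failure(test_name, example)
            raise
```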
Support for glass-box testing
Randomized property-based testing is intrinsically black box. It knows nothing about the code it’s testing except how to feed it examples.
But what if it didn’t have to be?
American Fuzzy Lop is a fuzz tester, mostly designed for testing things that handle binary file formats (it works for things that are text formats too but the verbosity tends to work against it). It executes roughly the following algorithm:
- Take a seed example.
- Mutate it a bit.
- Run the example through the program in question.
- If this produces bad behaviour, output it as a test case.
- If this produces a new interesting state transition in the program (where a state transition is a pair of positions in the code, one immediately following the other in this execution), add it to the list of seed examples.
- Run ad infinitum, outputting bad examples as you go.
This produces a remarkably good set of examples, and it’s 100% something that hypothesis could be doing. We can detect state transitions using coverage and generate new data until we stop getting new interesting examples. Then we can mutate existing data until we stop getting new interesting examples from that.
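Here’s a rough sketch of what such a loop could look like in pure Python. This is hypothetical code with made-up helper names, not how hypothesis works today: a real version would use the coverage module rather than sys.settrace, and would also need sensible mutation, shrinking and failure reporting, all of which this elides.

```python
import random
import sys

def transitions_for(test, example):
    # Run test(example) and record every executed line transition, i.e.
    # each pair of (previous location, current location), where a
    # location is a (filename, line number) pair.
    seen = set()
    previous = [None]

    def tracer(frame, event, arg):
        if event == "line":
            location = (frame.f_code.co_filename, frame.f_lineno)
            seen.add((previous[0], location))
            previous[0] = location
        return tracer

    sys.settrace(tracer)
    try:
        test(example)
    finally:
        sys.settrace(None)
    return seen

def coverage_guided_corpus(test, seeds, mutate, budget=1000, rnd=None):
    # AFL-flavoured loop: keep any mutated example that triggers a line
    # transition we haven't seen before and use it as a future seed.
    rnd = rnd or random.Random()
    corpus = list(seeds)
    seen = set()
    for example in corpus:
        seen |= transitions_for(test, example)
    for _ in range(budget):
        candidate = mutate(rnd, rnd.choice(corpus))
        new = transitions_for(test, candidate) - seen
        if new:
            corpus.append(candidate)
            seen |= new
    return corpus
```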
This sort of looking inside the box would give you much greater confidence in hypothesis’s ability to find interesting failures. Currently it just executes examples until it has run a fixed number or runs out of time, which is fine but may mean that it stops too early.
The existing behaviour will remain as an option – initially it will stay the default, but once the glass-box approach is robust enough it will become the default instead. The old behaviour will remain useful for cases where you want to e.g. test C code and thus can’t get reliable coverage information.
This would also work well with the test database idea, because you could prepopulate the test database with minimal interesting examples.
Probably some other stuff too
For example, in the above I keep getting the nagging feeling that hypothesis needs a more general notion of settings to support some of these features, so I will likely be doing something around that. There’s also some code cleanup that could really use doing.
It’s currently unclear to me how long all of this is going to take and whether I will get all of it done. Chances are also pretty high that some of these will turn out to be bad ideas.
If any of you are using or are considering using hypothesis, do let me know if any of these seem particularly exciting and you’d like me to work on them. I’m also open to suggestions of other features you’d like to see included.