I just want to brag a bit about the Hypothesis test suite

There’s not much content to this post other than “look at how awesome my test suite is”. But it’s pretty awesome. It:

  • Has 100% branch coverage
  • Fails the build if it ever doesn’t have 100% branch coverage
  • Has about a 3:2 ratio of lines of test code to library code
  • Also runs the tests against a version installed from a package. This both tests the installer and tests without coverage, in case coverage changes the semantics (which it probably won’t now but definitely will in future)
  • Runs across the three major python versions the library supports (2.7, 3.3, 3.4)
  • Has statistical tests about the properties of the distributions it produces
  • Runs a linter over the code to make sure that it’s pyflakes and pep8 compliant
  • Runs a linter over the README to make sure that it parses correctly as RST
  • Has a self test where it uses hypothesis to test hypothesis. This found real bugs in the library, despite the aforementioned coverage and code ratio.

This is probably the best tested code I’ve ever written, and I’m still morally certain it has at least one major bug in it.

I don’t know what that bug is, of course, or I’d have fixed it, but every single time I’ve gone “Wow, this code is really well tested. I bet it’s super solid” it has been followed by “Hmm. I could do this thing to prove it’s super solid. Oh look, there’s a bug”.

OK, so this started out as me bragging about how great my test suite was but now I’m super depressed about the state of software all over again. I think of myself as being atypically good at producing correct code – not in some mythical 10x sense where I’m vastly better than everyone else, but I feel pretty comfortable in saying I’m better than average. This is probably the most well tested code I’ve ever written, and it’s still buggy. Where does that leave the majority of code found in the wild?


Hypothesis now has a mailing list

As part of my ongoing attempts to figure out who is using Hypothesis and how, I thought it would be worth having a place for people to discuss that, so there is now a Google Group.

If you use Hypothesis or are interested in using Hypothesis, do sign up! I’d love to hear from you.


Testing the installer for your python library

I’m quite proud of Hypothesis’s test suite. I make zero claims of it being perfect, and it could obviously be better (though it does have 100% branch coverage, which is nice), but it’s good enough that it gives me a lot of confidence to make changes and believe it will be detected if they break something.

So, uh, it was pretty embarrassing that I shipped a version of hypothesis that couldn’t actually run.

I’d manually tested the installer for the 0.2.0 release, but when pushing the 0.2.1 release I made a bit of a screw up and included some changes I hadn’t meant to. These broke the installer, because they added some new packages without adding them to the setup.py. I forgot to manually test the installer because I thought the change was trivial, and voilà: one broken package on pypi.

This is of course the general problem with manual testing: You need to remember to do it.

So on the general philosophy that bugs are things that you should respond to by writing automated tests to ensure they don’t happen again, I’ve now got a test for Hypothesis’s installer.

It’s not especially sophisticated. All it does is create a clean virtualenv, build an sdist, install the resulting package into the virtualenv and then run some code to make sure the library is working correctly. You can see it here (or here for the latest version). This is then run on Travis, so any build without a working installer is a failed build and will not be shipped.
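
If you want to do something similar, the shape of it is roughly as follows. This is only a minimal sketch in Python rather than the actual script from the Hypothesis repository, and the final import check is a placeholder for whatever smoke test makes sense for your library:

    import os
    import subprocess
    import sys
    import tempfile

    def test_can_install_and_import_from_sdist():
        tmpdir = tempfile.mkdtemp()
        venv = os.path.join(tmpdir, "venv")
        # Build an sdist of the current checkout into the temporary directory.
        subprocess.check_call(
            [sys.executable, "setup.py", "sdist", "--dist-dir", tmpdir])
        sdist = [f for f in os.listdir(tmpdir) if f.endswith(".tar.gz")][0]
        # Create a completely clean virtualenv and install the sdist into it.
        subprocess.check_call(["virtualenv", venv])
        pip = os.path.join(venv, "bin", "pip")
        subprocess.check_call([pip, "install", os.path.join(tmpdir, sdist)])
        # Run a trivial bit of code against the installed library. A real
        # version would exercise something more representative than an import.
        python = os.path.join(venv, "bin", "python")
        subprocess.check_call([python, "-c", "import hypothesis"])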

It’s entirely possible that this is actually something everyone does already and I just didn’t know about it (some Googling suggests not), but if it’s not then I highly recommend you do it for any python libraries you maintain. It was super easy to do and saves both time and potentially embarrassing mistakes.


On the new data generation in hypothesis

I mentioned in the Hypothesis short term road map that one of the things I wanted to do was improve the data generation.

What’s wrong with the data generation as it currently stands?

Well, in classic quickcheck style testing basically what you have is a size parameter that controls the shape of the distribution. A large value for the size parameter results in a more variable distribution – larger elements, longer lists, etc.

How, given this, would you generate a list of more than 100 floating point numbers in which every element had the same sign?

Assume your floating point generator has probability p of producing a non-negative number, with 0 < p < 1. The probability of a list of at least 100 elements having every element the same sign is then at most p^100 + (1-p)^100 (with p = 0.5 that’s about 1.6 × 10^-30). Or, as I like to call it, as close to zero as makes no difference.

Hypothesis has historically solved this by introducing into the mix a notion of flags, which are boolean variables that can be turned on and off for a whole data generation run. For example, whether negative floating point numbers are allowed is a flag. This means that with a 50/50 chance your list will have all non-negative numbers. This idea was inspired by John Regehr’s post “Better Random Testing By Leaving Features Out”.

But it’s not quite enough. What, for example, if you want to generate a long list with a low average value? Or a high one? Depending on how you use the size parameter, one of these will be hard to do (in Hypothesis it was the latter which was hard – the size was split amongst the elements, so the elements of a long list would tend to be much smaller).

It’s also very non-orthogonal. There are these two different notions of parameter which are similar but not really and don’t share any implementation.

So what I wanted to do (and now have done! It proved surprisingly easy) is to introduce a common notion of parameter which unified these two.

A parameter is just any sort of distribution you can draw from. It can contain any sort of values – booleans, floats, integers, composite values, etc. Each strategy has a parameter you can draw from, and data generation then occurs by first drawing a parameter value and then drawing data conditional on that parameter value.

A list then has a composite parameter: one component is its average length (lengths are drawn from a geometric distribution with that average), and the other is whatever parameter its elements take.

The reason this is importantly different from just drawing from the marginal distribution is that all of the draws then use the same parameter value. So for floating point numbers we have one parameter which determines the sign of the values (always positive, always negative, or either) and parameters that control the shape of the distribution given that sign. This means that when generating a list, the chances of all elements being high or all elements being low are much better (it does mean that the chances of getting a mix are lower, but you can fix that with a more complicated parameter shape).
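
To make that concrete, here’s a rough sketch of the idea in plain Python. None of this is Hypothesis’s actual API – the function names and the particular distributions are made up purely for illustration:

    import random

    def draw_float_parameter():
        # Drawn once per run: a sign ("positive", "negative" or "either")
        # plus a scale controlling how large the values tend to be.
        return random.choice(["positive", "negative", "either"]), random.expovariate(1.0)

    def draw_float(parameter):
        sign, scale = parameter
        x = random.expovariate(1.0) * scale
        if sign == "negative" or (sign == "either" and random.random() < 0.5):
            x = -x
        return x

    def draw_list_parameter():
        # Composite parameter: an average length plus the element parameter.
        return random.expovariate(0.05), draw_float_parameter()

    def draw_list(parameter):
        average_length, element_parameter = parameter
        # Geometric length with roughly the chosen average.
        p = 1.0 / (1.0 + average_length)
        length = 0
        while random.random() > p:
            length += 1
        # Every element shares the same element parameter, so a long list of
        # all-positive (or all-negative) floats is now perfectly plausible.
        return [draw_float(element_parameter) for _ in range(length)]

Drawing an example is then always the two step process draw_list(draw_list_parameter()), and it’s the shared parameter value that produces the correlations between elements.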

The overall effect is that this allows you to generate much more interesting large structured examples. With the classic approach you end up with a sort of curse of dimensionality where everything looks pretty flat because of the differences averaging out.

Another nice perk of this is that you can use the parameters to guide rejection better. A test in hypothesis can essentially fail for two reasons: One is that it’s a legitimate failure and the other is that a precondition has not been met. Generally speaking when the latter happens you don’t really want to generate a whole bunch of data like that.

To continue with our sign example, suppose I want to generate non-negative lists of numbers. I could do this by defining a special strategy for non-negative floats, but that’s an annoyingly large amount of work (I’d like to make it less work and have some ideas around that, but there will always be cases like this). It would be nice if I could do things like assume(all(x >= 0 for x in xs)) and have any chance of succeeding for large lists.

Well, with the new system it will succeed for about half of all parameter values. So the logic becomes: generate a parameter, try some number of draws of data from that parameter, and eventually draw a new parameter – but if we see a high rejection rate, draw a new parameter much sooner.
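
In pseudo-Python the loop looks something like this. Again, this is a sketch rather than Hypothesis’s actual implementation – draw_parameter and draw_data are stand-ins for whatever the strategy exposes, and a real version would also give up entirely after too many total failures:

    def generate_satisfying_examples(strategy, precondition, count, draws_per_parameter=10):
        examples = []
        while len(examples) < count:
            parameter = strategy.draw_parameter()
            accepted = rejected = 0
            while accepted + rejected < draws_per_parameter:
                example = strategy.draw_data(parameter)
                if precondition(example):
                    examples.append(example)
                    accepted += 1
                else:
                    rejected += 1
                    # A parameter value that mostly produces rejected data is
                    # probably a bad one, so abandon it early rather than
                    # spending the whole budget on it.
                    if rejected >= 3 and rejected > 2 * accepted:
                        break
        return examples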

This has all landed in master as of yesterday evening, and I’m pretty pleased with it so far. I’m going to do a bit more work on hypothesis over the next few days, but expect to see it in a released version near you quite soon.


Hypothesis short term road map

As some of you might be aware, I authored a python testing library called hypothesis.

It’s basically quickcheck for Python. I first wrote it within about a month of my learning python, so it has some… eccentricities (I still maintain that the fact that it has its own object model with multiple dispatch is totally legit, but I admit that the prototype based inheritance feature in it was probably misguided), but I flatter myself into thinking it’s not only pretty good but is by some small margin possibly the most advanced library of its ilk.

It’s also pretty neglected. I haven’t actually released a version of it in just over a year. Part of this is because I was pretty excited about testmachine and was considering it as the successor to hypothesis, but then I didn’t do any work on testmachine either. Basically, for a variety of reasons 2014 was a pretty rough year and I didn’t get much done on my open source work during it.

Despite this neglect, people seem to have started using hypothesis. I’ve got a bunch of issues opened, a few pull requests, and pypi stats are telling me it’s had more than 500 downloads in the last month.

And so, I appear to have failed to avoid success at all costs, and will instead embrace success and resume development on hypothesis.

So here’s the plan.

In the next few days there will be a point release of hypothesis. It will not be large – it will clear up a few bugs, maybe have one or two minor enhancements, improve python 3 support, and generally demonstrate that things are moving again.

In the next month or two there are a variety of larger features I want to work on. Some of them are pretty damn exciting, and if I get all of them working then I’m going to remove the weasel words from the above and simply say flat out that Hypothesis will be the most advanced library of its kind.

In rough order of least to most exciting, I’m going to be working on:

Logging

This is super unsexy and to be honest I will probably not be inspired enough to work on it immediately, but I very much want it to get done as it’s both important and something people have actually asked for: Hypothesis needs to report information on its progress and what it’s tried. It’s important to know what sort of things hypothesis has tried on your code – e.g. if it’s only managed to generate a very small number of examples.

Some sort of top level driver

Hypothesis is right now fundamentally a library. If you want to actually use it to write tests you need to use it within pytest or similar.

Usage like this will 100% continue to be supported, encouraged and generally considered a first class citizen, but I would like there to be some sort of top level hypothesis program as well, so you can get a more tailored view of what’s going on and have a better mechanism for controlling things like timeouts, etc.

Improved data generation

The data generation code is currently both ropy and not very intelligent. It has two discrete concepts – flags and size – which interact to control how things are generated. I want to introduce a more general and unifying notion of a parameter which gives you much more fine grained control over the shape of the distribution. This should also improve the coverage a lot, make the distributions more easily user configurable, and it may even improve performance, because currently data generation can be a bit of a bottleneck in some cases.

Merging the ideas from TestMachine

Testmachine is pretty awesome as a concept and I don’t want it to die. It turns out I have this popularish testing library that shares a lot in common with it. Let’s merge the two.

One thing that I discovered with TestMachine is that making this sort of thing work well with mutable data is actually pretty hard, so this will probably necessitate some improved support around that. I suspect this will involve a bunch of fixes and improvements to the stateful testing feature.

Remembering failing test cases

One of the problems with randomized testing is that tests are inherently flaky – sometimes a passing test will become a failing test, and sometimes vice versa, without any changes to the underlying code.

A passing test becoming a failing test is generally fine in the sense that it means that the library has just found an exciting new bug for you to fix and you should be grateful.

A failing test becoming a passing test, on the other hand, is super annoying, because it makes the original failure much harder to reproduce and fix.

One way to handle this is to have your library generate test cases that can be copied and pasted into your non-randomized test suite. This is the approach I took in testmachine and it’s a pretty good one.

Another approach that I’d like to explore instead is the idea of a test database which remembers failing test cases. Whenever a small example produces a failure, that example should be saved and tried first next time. Over time you build up a great database of examples to test your code with.
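
A very simple version of that idea could look something like the following. This is only a sketch to illustrate the concept, not the design Hypothesis will actually use (in particular, storing examples as JSON obviously limits what kinds of example you can remember):

    import json
    import os

    class ExampleDatabase(object):
        def __init__(self, path=".hypothesis-examples.json"):
            self.path = path
            if os.path.exists(path):
                with open(path) as f:
                    self.examples = json.load(f)
            else:
                self.examples = {}

        def save_failure(self, test_name, example):
            # Remember an example that made this test fail.
            known = self.examples.setdefault(test_name, [])
            if example not in known:
                known.append(example)
            with open(self.path, "w") as f:
                json.dump(self.examples, f)

        def previous_failures(self, test_name):
            # These get tried before any newly generated data.
            return self.examples.get(test_name, [])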

This also opens up the possibility of giving hypothesis two run modes: one in which it just runs for an extended amount of time looking for bugs, and another in which it runs super quickly and basically only runs previously discovered examples. I would be very interested in such an approach.

Support for glass-box testing

Randomized property based testing is intrinsically black box. It knows nothing about the code it’s testing except for how to feed it examples.

But what if it didn’t have to be?

American Fuzzy Lop is a fuzz tester, mostly designed for testing things that handle binary file formats (it works for things that are text formats too but the verbosity tends to work against it). It executes roughly the following algorithm:

    1. Take a seed example.
    2. Mutate it a bit.
    3. Run the example through the program in question.
      1. If this produces bad behaviour, output it as a test case.
      2. If this produces a new interesting state transition in the program, where a state transition is a pair of positions in the code with one immediately following the other in this execution, add it to the list of seed examples.
    4. Run ad infinitum, outputting bad examples as you go.

This produces a remarkably good set of examples, and it’s 100% something that hypothesis could be doing. We can detect state transitions using coverage and generate new data until we stop getting new interesting examples. Then we can mutate existing data until we stop getting new interesting examples from that.
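
As a rough illustration of what that loop could look like, here is a sketch using the coverage module’s branch arcs as the “state transitions”. None of these names are real Hypothesis APIs – draw_example and mutate in particular are just placeholders:

    import random
    import coverage

    def glass_box_search(test, draw_example, mutate, budget=1000):
        seen_arcs = set()
        seeds = []
        failures = []
        for _ in range(budget):
            # Either mutate a known interesting example or draw a fresh one.
            candidate = mutate(random.choice(seeds)) if seeds else draw_example()
            cov = coverage.Coverage(branch=True)
            cov.start()
            try:
                test(candidate)
            except Exception:
                failures.append(candidate)
            finally:
                cov.stop()
            data = cov.get_data()
            arcs = set()
            for filename in data.measured_files():
                arcs.update((filename, arc) for arc in (data.arcs(filename) or []))
            if arcs - seen_arcs:
                # This example exercised a branch we hadn't seen before, so
                # keep it around as a seed for future mutation.
                seen_arcs.update(arcs)
                seeds.append(candidate)
        return failures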

This sort of looking inside the box will allow one to have much greater confidence in hypothesis’s ability to find interesting failures. Currently it just executes examples until it has executed a fixed number or runs out of time, which is fine but may mean that it stops too early.

The existing behaviour will remain available as an option – initially it will stay the default, but once the glass-box approach is robust enough that will become the default instead. The old behaviour will remain useful for cases where you want to e.g. test C code and thus can’t get reliable coverage information.

This would also work well with the test database idea, because you could prepopulate the test database with minimal interesting examples.

Probably some other stuff too

For example, in the above I keep getting the nagging feeling that hypothesis needs a more general notion of settings to support some of these, so I will likely be doing something around that. There’s also some code cleanup that really could use doing.

It’s currently unclear to me how long all of this is going to take and whether I will get all of it done. Chances are also pretty high that some of these will turn out to be bad ideas.

If any of you are using or are considering using hypothesis, do let me know if any of these seem particularly exciting and you’d like me to work on them. I’m also open to suggestions of other features you’d like to see included.
