Category Archives: Hypothesis

Testing the installer for your python library

I’m quite proud of Hypothesis’s test suite. I make zero claims of it being perfect, and it could obviously be better (though it does have 100% branch coverage, which is nice), but it’s good enough that it gives me a lot of confidence to make changes and trust that it will be detected if they break something.

So, uh, it was pretty embarrassing that I shipped a version of hypothesis that couldn’t actually run.

I’d manually tested the installer for the 0.2.0 release, but when pushing the 0.2.1 release I made a bit of a screw-up and included some changes I hadn’t meant to. These broke the installer because they added some new packages without adding them to setup.py. I forgot to manually test the installer because I thought the change was trivial, and voilà: one broken package on PyPI.

This is of course the general problem with manual testing: You need to remember to do it.

So on the general philosophy that bugs are things that you should respond to by writing automated tests to ensure they don’t happen again, I’ve now got a test for Hypothesis’s installer.

It’s not especially sophisticated. All it does is create a clean virtualenv, build an sdist, install the resulting package into the virtualenv, and then run some code to make sure the library is working correctly. You can see it here (or here for the latest version). This is then run on Travis, so any build without a working installer is a failed build and will not be shipped.
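
If you want to do something similar, the whole thing fits in a short script. Here’s a minimal sketch of the idea – this is not the actual script linked above, and the project layout, module name and POSIX paths are assumptions:

import glob
import os
import subprocess
import sys
import tempfile


def main():
    with tempfile.TemporaryDirectory() as tmp:
        # Build a source distribution of the current checkout.
        subprocess.check_call(
            [sys.executable, "setup.py", "sdist", "--dist-dir", tmp])
        sdist = glob.glob(os.path.join(tmp, "*.tar.gz"))[0]

        # Create a clean virtualenv and install the sdist into it.
        env = os.path.join(tmp, "venv")
        subprocess.check_call([sys.executable, "-m", "venv", env])
        subprocess.check_call(
            [os.path.join(env, "bin", "pip"), "install", sdist])

        # Run something that actually imports and uses the library.
        subprocess.check_call([
            os.path.join(env, "bin", "python"), "-c",
            "import hypothesis; print(hypothesis.__file__)",
        ])


if __name__ == '__main__':
    main()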

It’s entirely possible that this is actually something everyone does already and I just didn’t know about it (some Googling suggests not), but if it’s not then I highly recommend you do it for any python libraries you maintain. It was super easy to do and saves both time and potentially embarrassing mistakes.


On the new data generation in hypothesis

I mentioned in the Hypothesis short term road map that one of the things I wanted to do was improve the data generation.

What’s wrong with the data generation as it currently stands?

Well, in classic quickcheck style testing basically what you have is a size parameter that controls the shape of the distribution. A large value for the size parameter results in a more variable distribution – larger elements, longer lists, etc.

How, given this, would you generate a list of more than 100 floating point numbers in which every element had the same sign?

Assume your floating point generator has probability p of producing a non-negative number, with 0 < p < 1. The probability of a list of at least 100 elements having every element the same sign is then about max(p^100, (1-p)^100). Or, as I like to call it, as close to zero as makes no difference.
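
To put a number on it, for p = 0.5 a quick check in the Python shell:

>>> 0.5 ** 100
7.888609052210118e-31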

Hypothesis has historically solved this by introducing into the mix a notion of flags, which are boolean variables that can be turned on and off for a whole data generation run. For example, whether negative floating point numbers are allowed is a flag. This means that with a 50/50 chance your list will have all non-negative numbers. This idea was inspired by John Regehr’s post “Better Random Testing By Leaving Features Out”.

But it’s not quite enough. What if, for example, you want to generate a long list with a low average value? Or a high one? Depending on how you use the size parameter, one of these will be hard to do (in Hypothesis it was the latter which was hard – the size was split amongst the elements, so the elements of a long list tended to be much smaller).

It’s also very non-orthogonal. There are these two different notions of parameter which are similar but not really the same, and which don’t share any implementation.

So what I wanted to do (and now have done! It proved surprisingly easy) is to introduce a common notion of parameter which unifies these two.

A parameter is just any sort of distribution you can draw from. It can contain any sort of values – booleans, floats, integers, composite values, etc. Each strategy has a parameter you can draw from, and data generation then occurs by first drawing a parameter value and then drawing data conditional on that parameter value.

A list then has a composite parameter: one component is its average length (lengths are drawn from a geometric distribution with that average), and the other component is whatever parameter its elements take.

The reason this is importantly different from just drawing from the marginal distribution is that all of the draws then use the same parameter value. So for floats we have one parameter which says the sign of the value (it can be always positive, always negative or either) and parameters that control the shape given that sign. This means that when generating a list, the chances of all elements being high or all elements being low are much better (it does mean that the chances of getting a mix are lower, but you can fix that with a more complicated parameter shape).
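
To make that concrete, here’s a toy sketch of the idea in plain Python. The names and distributions are invented for illustration; this is not hypothesis’s actual API:

import random


def draw_float_parameter():
    # The parameter fixes a sign policy and a scale for the whole run.
    sign = random.choice([1.0, -1.0, None])  # None means either sign is allowed
    scale = random.expovariate(1.0)
    return sign, scale


def draw_float(parameter):
    sign, scale = parameter
    magnitude = random.expovariate(1.0) * scale
    if sign is None:
        sign = random.choice([1.0, -1.0])
    return sign * magnitude


def draw_float_list():
    # Composite parameter: an average length plus a single element parameter.
    average_length = 1.0 + random.expovariate(0.1)
    element_parameter = draw_float_parameter()
    length = int(random.expovariate(1.0 / average_length))  # roughly geometric
    # Every element is drawn conditional on the *same* parameter value, so
    # "all positive" or "all large" lists come up often rather than almost never.
    return [draw_float(element_parameter) for _ in range(length)]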

The overall effect is that this allows you to generate much more interesting large structured examples. With the classic approach you end up with a sort of curse of dimensionality where everything looks pretty flat because of the differences averaging out.

Another nice perk of this is that you can use the parameters to guide rejection better. A test in hypothesis can essentially fail for two reasons: One is that it’s a legitimate failure and the other is that a precondition has not been met. Generally speaking when the latter happens you don’t really want to generate a whole bunch of data like that.

To continue with our sign example, suppose I want to generate non-negative lists of numbers. I could do this by defining a special strategy for non-negative floats, but that’s an annoyingly large amount of work (I’d like to make it less work and have some ideas around that, but there will always be cases like this). It would be nice if I could do things like assume(all(x >= 0 for x in xs)) and have any chance of succeeding for large lists.

Well, with the new system it will succeed for about half of all parameter values. So the logic becomes: generate a parameter, then try some number of data draws from it. Eventually draw a new parameter anyway, but if the rejection rate is high, draw a new parameter much sooner.
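
In pseudocode-ish Python the loop looks something like this – the names, thresholds and exception type are invented, not the real implementation:

class UnsatisfiedAssumption(Exception):
    """Raised when a test's precondition (an assume() call) is not met."""


def falsify(draw_parameter, draw_data, test, max_examples=200):
    examples_run = 0
    while examples_run < max_examples:
        parameter = draw_parameter()
        accepted = rejected = 0
        # Stick with this parameter until it has had a fair chance, but move
        # on quickly if it mostly produces data that fails the preconditions.
        while accepted < 10 and rejected < 20 and examples_run < max_examples:
            examples_run += 1
            value = draw_data(parameter)
            try:
                test(value)
            except UnsatisfiedAssumption:
                rejected += 1
            except AssertionError:
                return value  # a genuine counterexample
            else:
                accepted += 1
    return None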

This has all landed in master as of yesterday evening, and I’m pretty pleased with it so far. I’m going to do a bit more work on hypothesis over the next few days, but expect to see it in a released version near you quite soon.


Hypothesis short term road map

As some of you might be aware, I authored a python testing library called hypothesis.

It’s basically quickcheck for Python. I first wrote it within about a month of my learning python, so it has some… eccentricities (I still maintain that the fact that it has its own object model with multiple dispatch is totally legit, but I admit that the prototype based inheritance feature in it was probably misguided), but I flatter myself into thinking it’s not only pretty good but is by some small margin possibly the most advanced library of its ilk.

It’s also pretty neglected. I haven’t actually released a version of it in just over a year. Part of this is because I was pretty excited about testmachine and was considering it as the successor to hypothesis, but then I didn’t do any work on testmachine either. Basically, for a variety of reasons 2014 was a pretty rough year and I didn’t get much done on my open source work during it.

Despite this neglect, people seem to have started using hypothesis. I’ve got a bunch of issues opened, a few pull requests, and pypi stats are telling me it’s had more than 500 downloads in the last month.

And so, I appear to have failed to avoid success at all costs, and will instead embrace success and resume development on hypothesis.

So here’s the plan.

In the next few days there will be a point release of hypothesis. It will not be large – it will clear up a few bugs, maybe have one or two minor enhancements, improve python 3 support, and generally demonstrate that things are moving again.

In the next month or two there are a variety of larger features I want to work on. Some of them are pretty damn exciting, and if I get all of them working then I’m going to remove the weasel words from the above and simply say flat out that hypothesis will be the most advanced library of its kind.

In rough order of least to most exciting, I’m going to be working on:

Logging

This is super unsexy and to be honest I will probably not be inspired enough to work on it immediately, but I very much want it to get done as it’s both important and something people have actually asked for: Hypothesis needs to report information on its progress and what it’s tried. It’s important to know what sort of things hypothesis has tried on your code – e.g. if it’s only managed to generate a very small number of examples.

Some sort of top level driver

Hypothesis is right now fundamentally a library. If you want to actually use it to write tests you need to use it within pytest or similar.

Usage like this will 100% continue to be supported, encouraged and generally considered a first class citizen, but I would like there to be some sort of top level hypothesis program as well, so you can get a more tailored view of what’s going on and have a better mechanism for controlling things like timeouts, etc.

Improved data generation

The data generation code is currently both ropy and not very intelligent. It has two discrete concepts – flags and size – which interact to control how things are generated. I want to introduce a more general and unifying notion of a parameter which gives you much more fine grained control over the shape of the distribution. This should also improve the coverage a lot, make this more easily user configurable, and it may even improve performance because currently data generation can be a bit of a bottleneck in some cases.

Merging the ideas from TestMachine

Testmachine is pretty awesome as a concept and I don’t want it to die. It turns out I have this popularish testing library that shares a lot in common with it. Let’s merge the two.

One thing that I discovered with TestMachine is that making this sort of thing work well with mutable data is actually pretty hard, so this will probably necessitate some improved support around that. I suspect this will involve a bunch of fixes and improvements to the stateful testing feature.

Remembering failing test cases

One of the problems with randomized testing is that tests are inherently flaky – sometimes a passing test will become a failing test, and sometimes the reverse happens, without any changes to the underlying code.

A passing test becoming a failing test is generally fine in the sense that it means that the library has just found an exciting new bug for you to fix and you should be grateful.

A failing test becoming a passing test on the other hand is super annoying because it makes it much harder to reproduce and fix.

One way to deal with this is to have your library generate test cases that can be copied and pasted into your non-randomized test suite. This is the approach I took in testmachine and it’s a pretty good one.

Another approach that I’d like to explore instead is the idea of a test database which remembers failing test cases. Whenever a small example produces a failure, that example should be saved and tried first next time. Over time you build up a great database of examples to test your code with.

This also opens the possibility of giving hypothesis two possible run modes: One in which it just runs for an extended amount of time looking for bugs and the other in which it runs super quickly and basically only runs on previously discovered examples. I would be very interested in such an approach.
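
As a very rough sketch of what I mean – nothing like this exists in hypothesis yet, and the storage format and names here are invented:

import json
import os


class ExampleDatabase:
    """Remembers failing examples per test so they can be replayed first."""

    def __init__(self, path=".hypothesis-examples.json"):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def failing_examples(self, test_name):
        return self.data.get(test_name, [])

    def record_failure(self, test_name, example):
        examples = self.data.setdefault(test_name, [])
        if example not in examples:
            examples.append(example)
        with open(self.path, "w") as f:
            json.dump(self.data, f)

A runner would replay failing_examples for a test before generating any fresh random data, and the quick mode would run only those.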

Support for glass-box testing

Randomized property based testing is intrinsically black box. It knows nothing about the code it’s testing except for how to feed it examples.

But what if it didn’t have to be?

American Fuzzy Lop is a fuzz tester, mostly designed for testing things that handle binary file formats (it works for things that are text formats too but the verbosity tends to work against it). It executes roughly the following algorithm:

    1. Take a seed example.
    2. Mutate it a bit.
    3. Run the example through the program in question.
      1. If this produces bad behaviour, output it as a test case.
      2. If this produces a new interesting state transition in the program, where a state transition is a pair of positions in the code with one immediately following the other in this execution, add it to the list of seed examples.
    4. Run ad infinitum, outputting bad examples as you go.

This produces a remarkably good set of examples, and it’s 100% something that hypothesis could be doing. We can detect state transitions using coverage and generate new data until we stop getting new interesting examples. Then we can mutate existing data until we stop getting new interesting examples from that.
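
The core loop is small enough to sketch. Here run is assumed to execute the program under test and report both the set of state transitions it hit (e.g. collected via the coverage module) and whether it misbehaved – this is a sketch of the AFL-style idea, not anything hypothesis currently does:

import random


def fuzz(seed_examples, mutate, run, budget=10000):
    """run(example) -> (set of state transitions, misbehaved flag)."""
    seen_transitions = set()
    corpus = list(seed_examples)
    failures = []
    for _ in range(budget):
        candidate = mutate(random.choice(corpus))
        transitions, misbehaved = run(candidate)
        if misbehaved:
            # Bad behaviour: output it as a test case.
            failures.append(candidate)
        elif transitions - seen_transitions:
            # The candidate exercised a new pair of consecutive code
            # positions, so keep it as a seed for future mutation.
            seen_transitions |= transitions
            corpus.append(candidate)
    return failures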

This sort of looking inside the box will allow one to have much greater confidence in hypothesis’s ability to find interesting failures. Currently it just executes examples until it has executed a fixed number or runs out of time, which is fine but may mean that it stops too early.

The existing behaviour will remain available as an option – initially it will stay the default, but once the glass-box approach is robust enough that will become the default instead. The old behaviour will remain useful for cases where you want to e.g. test C code and thus can’t get reliable coverage information.

This would also work well with the test database idea, because you could prepopulate the test database with minimal interesting examples.

Probably some other stuff too

For example, in the above I keep getting the nagging feeling that hypothesis needs a more general notion of settings to support some of these, so I will likely be doing something around that. There’s also some code clean-up that really could use doing.

It’s currently unclear to me how long all of this is going to take and whether I will get all of it done. Chances are also pretty high that some of these will turn out to be bad ideas.

If any of you are using or are considering using hypothesis, do let me know if any of these seem particularly exciting and you’d like me to work on them. I’m also open to suggestions of other features you’d like to see included.


The horror lurking at the heart of the new hypothesis

I’ve just merged a branch with a redesign of the hypothesis core into master. All the tests are passing, and I’m much happier with the new design than I was with the old one. It takes the idea of the search strategy I mentioned in my previous post and makes it the fundamental building block of the library. This seems to work quite well.

But in doing this rewrite I’ve gone and taken something which was always at the core of hypothesis and made it explicit, and it’s pretty… well actually I don’t want to say it’s bad per se. Using it has actually been very pleasant, and it’s made some very nice things possible that would otherwise have been quite awkward to do.

So while it’s not exactly bad, it’s sure as hell unpythonic.

What is it?

Well, it’s essentially a highly dynamic multiple dispatch object model with prototype based inheritance.

Err, yeah.

Why, David? Why do you do this?

Well, I arrived at it through a sequence of individually logical steps.

Here’s what drove me down this path:

The first requirement:

We need to be able to go from a type to a strategy. e.g. we want to say “give me a strategy for ints”. Or “give me a strategy for strings”. Or “give me a strategy for tuples of (int, str)”.

However, we absolutely don’t want to put the logic for this on the types themselves. We’re not going to monkey patch things to add in a strategy method or anything uncouth like that.

There are two reasons we’re not going to do that:

  1. We want to be able to configure specific strategies per run, e.g. biasing for short lists or only producing alphanumeric strings
  2. It would be gross

Note also that we need to be able to pass instances, not just types here. For example we can ask for a strategy for the tuple (int, str) which will give us a strategy returning pairs where the first is an int and the second a string.

So that’s where the multiple dispatch came from. It’s not implemented in any very clever way, and it’s currently closer to type based overriding than true multiple dispatch because it doesn’t support subtyping (if you define a handler for instances of Foo it will not trigger for instances of subclasses of Foo; I may fix this at some point but I currently have no need to).

Where does the prototype based inheritance come in?

Well, it started at just level one: There was a single default object which all other lookups would delegate to if needed. The reason for this was that it let you define new strategy handlers wherever you wanted, because you could define them on the default object, but then override them on a fresh local copy.

Then I realised that having full prototype based inheritance was no more work and was actually quite useful. So I provided that.
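
Concretely, the mechanism is roughly this – a toy sketch with invented names, not hypothesis’s actual classes:

class SearchStrategies:
    def __init__(self, prototype=None):
        self.prototype = prototype
        self.handlers = {}

    def define_specification_for(self, descriptor, handler):
        self.handlers[descriptor] = handler

    def handler_for(self, descriptor):
        # Exact lookup only: no subtype handling, as described above.
        if descriptor in self.handlers:
            return self.handlers[descriptor]
        if self.prototype is not None:
            # Fall back up the prototype chain.
            return self.prototype.handler_for(descriptor)
        raise KeyError(descriptor)

    def strategy(self, descriptor):
        # Descriptors can be instances as well as types: a tuple of
        # descriptors stands for a tuple-valued strategy, built from the
        # strategies for its components (a real implementation would wrap
        # these up in a proper tuple strategy).
        if isinstance(descriptor, tuple):
            return tuple(self.strategy(d) for d in descriptor)
        return self.handler_for(descriptor)(self, descriptor)

    def new_child_mapper(self):
        # A fresh object that inherits every definition from self, but whose
        # own definitions don't affect self.
        return SearchStrategies(prototype=self)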

Why? Let me show you an example of its usage in practice. Here’s how we define the strategy for strings:

class StringStrategy(MappedSearchStrategy):
    def __init__(self, strategies, descriptor, **kwargs):
        SearchStrategy.__init__(self, strategies, descriptor, **kwargs)
        self.length_strategy = strategies.strategy(int)
        char_strategy = kwargs.get("char_strategy",
                                   OneCharStringStrategy)

        cs = strategies.new_child_mapper()
        cs.define_specification_for(str, char_strategy)
        self.mapped_strategy = cs.strategy([str])

    def pack(self, ls):
        return ''.join(ls)

    def unpack(self, s):
        return list(s)

What are we doing here?

Well, MappedSearchStrategy lets you define a search strategy in terms of another one. You then just need to define methods for converting to and from instances of that other type. In this case we are defining the strategy for strings in terms of the strategy for lists of strings.

Which makes it sound like it’s recursive, but it’s not. Let me draw your attention to the following bit:

        char_strategy = kwargs.get("char_strategy",
                                   OneCharStringStrategy)        
        cs = strategies.new_child_mapper()
        cs.define_specification_for(str, char_strategy)
        self.mapped_strategy = cs.strategy([str])

What do we have here?

We have another strategy for strings which only generates one character strings (python doesn’t have a Character data type). This is not normally installed anywhere.

We now create a new child mapper. This is a fresh SearchStrategies object whose prototype is the SearchStrategies object we started with. We can now define a strategy for strings on it, which overrides strings for that mapper but leaves the original untouched. When we ask it for a strategy for lists of strings, it uses the list strategy it’s inherited and the string strategy that’s just been defined for it to give a strategy for lists of strings that uses the new one character string strategy.

This isn’t the only place it’s used. We also use it in stateful testing in a broadly similar way for generating the individual test steps without interfering with the larger strategies namespace.

Is this all a terrible idea? I dunno. It works pretty well, and it’s allowed me to write some quite interestingly powerful code, but I do have the feeling that it’s probably a bit too clever and that there might be a better solution.


A possible more principled approach to generation and simplification in hypothesis

Currently the way generation and simplification work in hypothesis is very ad hoc and undisciplined. The API is spread across various places, and the actual behaviour is very under-specified.

There are two operations exposed, produce and simplify. produce takes a size parameter and produces values of the specified type of about the provided size, whereas simplify takes a value of the specified type and produces a generator over simplified variants of it.

The meaning of these terms is deliberately undefined: there is no specified meaning for “about that size” or “simplified variants”.

This post is an attempt to sketch out some ideas about how this could become better specified. I’m just going to use maths rather than Python here – some of the beginnings of this has made it into code, but it’s no more than a starting point right now.

Let \(S\) be a set of values we’ll call our state space. We will call a search tactic for \(S\) a triple:

\[\begin{align*}
\mathrm{complexity} & : S \to [0, \infty) \\
\mathrm{simplify} & : S \to [S]^{< \omega} \\
\mathrm{produce} & : [0, \infty) \to \mathrm{RV}(S)
\end{align*}\]

Where \(\mathrm{RV}(S)\) is the set of random variables taking values in \(S\) and \([S]^{<\omega}\) is the set of finite sequences taking values in \(S\). These operations should satisfy the following properties:

  1. \(\mathrm{h}(\mathrm{produce}(x)) = \min(x, \log(|S|))\), where \(\mathrm{h}\) is the entropy function
  2. \(x \to \mathbb{E}(\mathrm{complexity}(\mathrm{produce}(x)))\) should be monotone increasing in \(x\)
  3. \(\mathrm{complexity}(y) \leq \mathrm{complexity}(x)\) for \(y \in \mathrm{simplify}(x)\)

In general it would be nice if the distribution of \(\mathrm{produce}(x)\) minimized the expected complexity, but I think that would be too restrictive.

The idea here is that the entropy of the distribution is a good measure of how spread out the search space is – a low entropy distribution will be very concentrated, whereas a high entropy distribution will be very spread out. This makes it a good “size” function. The requirement that the expected complexity be monotone increasing captures the idea of the search space spreading out, and the requirement that simplification not increase the complexity captures the idea of moving towards values more like the ones you generate at low size.

Here are some examples of how you might produce search strategies:

A search strategy for positive real numbers could be:

\[\begin{align*}
\mathrm{complexity}(x) & = x \\
\mathrm{simplify}(x) & = x, \ldots, x - n \mbox{ for } n < x \\
\mathrm{produce}(h) & = \mathrm{Exp}(e^h - 1)
\end{align*}\]

The exponential distribution seems to be a nice choice because it’s a maximum entropy distribution for a given expectation, but I don’t really know if it’s a minimal choice of expectation for a fixed entropy.

Another example: given search strategies for \(S\) and \(T\) we can produce a search strategy for \(S \times T\) as follows:

\[\begin{align*}
\mathrm{complexity}(x, y) & = \mathrm{complexity}_S(x) + \mathrm{complexity}_T(y) \\
\mathrm{simplify}(x, y) & = [(a, b) : a \in \mathrm{simplify}_S(x), b \in \mathrm{simplify}_T(y)] \\
\mathrm{produce}(h) & = (\mathrm{produce}_S(\tfrac{1}{2}h), \mathrm{produce}_T(\tfrac{1}{2}h))
\end{align*}\]

The first two should be self-explanatory. The produce function works because the entropy of a pair of independent variables is the sum of the entropies of its components. It might also be interesting to distribute the entropy less uniformly between the components, but this is probably the simplest choice.

You can also generate a search strategy for sequences: given a search strategy for the natural numbers \(\mathbb{N}\) and a search strategy for \(S\), we can generate a search strategy for \([S]^{< \omega}\). If we define the complexity of a sequence as the sum of the complexities of its components, we can define its simplifications as coordinatewise simplifications of subsequences. The produce function is the only hard one to define. It goes as follows: we generate the length as a random variable \(N = \mathrm{produce}_{\mathbb{N}}(i)\) for some entropy \(i\), then allocate a total entropy \(j\) to the coordinates, so each coordinate is generated as \(x_k = \mathrm{produce}_S(\frac{j}{N})\). Let \(T\) be the random variable of the sequence produced. The value of \(N\) is completely determined by \(T\), so \(h(T) = h(T, N)\). We can then use conditional entropy to calculate this: we have \(h(T \mid N = n) = j\) because we allocated the entropy equally between the coordinates.

So \(h(T) = h(N) + h(T \mid N) = i + j\).

So we can allocate the entropy between length and coordinates as we wish – either an equal split, or biasing in favour of longer sequences with simpler coordinates, or shorter sequences with more complex coordinates.
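
To ground the first of these in code, here is a small Python sketch of the positive-reals tactic. It is only illustrative – in particular it reads \(\mathrm{Exp}(e^h - 1)\) as an exponential with that mean, which is just one plausible reading of the notation:

import math
import random


def complexity(x):
    return x


def simplify(x):
    # x - 1, x - 2, ... stepping down towards zero, never increasing complexity.
    return [x - n for n in range(1, int(x) + 1)]


def produce(h):
    # An exponential distribution whose mean e^h - 1 grows with the requested
    # entropy h (h must be positive here).
    mean = math.expm1(h)
    return random.expovariate(1.0 / mean)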

Anyway, those are some worked examples. It seems to work reasonably well, and is more pleasantly principled than the current ad hoc approach. It remains to be seen how well it works in practice.
