Category Archives: Hypothesis

Mighty morphing power strategies

The Hypothesis API is a bit of a bizarre sleight of hand. It pretends to be very clean and simple, but that’s mostly a distraction to stop you thinking too hard about it. Everything looks easy, so you aren’t surprised when it just works, and you don’t think too hard and realise that what it just did should actually have been impossible.

Take for example this piece of code using the Django integration (this comes straight from the docs):

from hypothesis.extra.django.models import models
from hypothesis.strategies import just, lists

# Company and Shop are Django model classes defined in your own application.

def generate_with_shops(company):
    # For a company that has already been saved, generate a list of shops
    # belonging to it, then return the company itself.
    return lists(models(Shop, company=just(company))).map(lambda _: company)

company_with_shops_strategy = models(Company).flatmap(generate_with_shops)

We take a strategy that generates a model inserted into the database, then use flatmap to create a strategy for children of that and bind that in. Everything just works neatly and everything will simplify just fine: the list of children can be simplified, both by throwing away children and by simplifying individual children; the original element can be simplified; everything lives in the database; all is good with the world. Examples will be persisted into the Hypothesis example database as normal. Everything works great, no reason to think twice about it.

Except it is completely ridiculous that this works, and it’s certainly unique to Hypothesis. No other Quickcheck has anything like it.

Some of this is just the magic of Hypothesis templating at work. There’s a lot more information available than is present in the end result, and this also explains how you can mutate the value by adding children to it and have simplification still work, etc.

But there’s something that should make you very suspicious going on here: We cannot create the strategy we need to generate the children until we have already performed some side effects (i.e. put some things in the database). What could the template for this possibly be?

The answer to this is quite bad. But at least the badness is hidden behind another abstraction layer!

The answer is that we have a type called Morpher. As far as we are concerned for now, a Morpher has one method called “become”. You call my_morpher.become(my_strategy) and you get a value that could have been drawn from my_strategy.

You can think of Morphers as starting out as a reproducible way of getting examples from strategies, but there’s more to it than that, for one very important reason: A Morpher can be simplified and serialized.

This gives us a very easy implementation of flatmap:

def flatmap(self, f):
    # Draw a value from self and a fresh Morpher, then tell the Morpher to
    # become an instance of whatever strategy f maps that value to.
    return builds(lambda s, m: m.become(f(s)), self, morphers())

i.e. we generate an element of the strategy, apply f to it to get a new strategy, and then tell the morpher to become an instance of that. Easy!

Easy, that is, except for the fact that it still looks completely impossible, because we’re no closer to understanding how morphers work.

I’m not going to explain how they work in too much detail, because the details are still in flux, but I’ll sketch out how the magic works and if you want the gory details you can check the code.

As it starts out, a Morpher is simple: It contains a random seed for a parameter value and a template value, and become() just draws parameters and templates with those standard seeds and then reifies the result.
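A minimal sketch of that seed-only version, purely to make the description concrete: the real class lives in Hypothesis's internals, and the strategy methods used here (draw_parameter, draw_template, reify) are my stand-ins for whatever the internal API actually looks like.

from random import Random

class SeedOnlyMorpher(object):
    def __init__(self, parameter_seed, template_seed):
        self.parameter_seed = parameter_seed
        self.template_seed = template_seed

    def become(self, strategy):
        # Because the seeds are fixed, calling become() twice with the same
        # strategy reproduces the same value.
        parameter = strategy.draw_parameter(Random(self.parameter_seed))
        template = strategy.draw_template(Random(self.template_seed), parameter)
        return strategy.reify(template)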

This would achieve all of the desired results except simplification: you can save the seeds, and from them you can regenerate the same values.

So how do we simplify, given that each time we may have to become a different strategy, and that templates are not compatible between strategies?

There is a multi-part trick to this (roughly sketched in code after the list):

  1. The template for a Morpher (which is actually the Morpher itself) is secretly mutable (actually quite a lot of Hypothesis template strategies are mutable). When we call become() on a Morpher, the strategy used is stored on the template for later so we have access to it when we want to simplify, as is the template that was produced in the course of the become() call.
  2. As well as storing a random seed we also store a list of serialized representations of possible templates. These are the representations that would be used when saving in the database. The reason for this is that the Hypothesis database has the following really helpful invariant: Any serialized representation can either be turned into a valid template for the strategy or rejected as invalid. Moreover the representations are quite straightforward, so usually similar strategies will have compatible representations.
  3. When we wish to become a strategy, we first try our serialized representations in order to see if one of them produces a valid template. If it does, we use that template, otherwise if we reach the end we generate a fresh one using the random seed method mentioned above.
  4. When we simplify, we try to simplify the last template generated with the last strategy used, and then replace the representation that generated that strategy with the simplified form of it, thus generating a new morpher with the same seed and parameter but a simplified serialized representation.
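Continuing the earlier sketch, this is roughly how become() changes once serialized representations are stored. As before, from_basic, draw_parameter, draw_template and reify are stand-ins for the internals rather than the real API, and the details are illustrative only.

from random import Random

class MorpherSketch(object):
    def __init__(self, parameter_seed, template_seed):
        self.parameter_seed = parameter_seed
        self.template_seed = template_seed
        self.data = []              # serialized representations of past templates
        self.last_strategy = None   # remembered so simplification can use them
        self.last_template = None

    def become(self, strategy):
        template = None
        # Step 3: try each stored representation in order; it either converts
        # into a valid template for this strategy or is rejected as invalid.
        for serialized in self.data:
            try:
                template = strategy.from_basic(serialized)
                break
            except Exception:
                continue
        if template is None:
            # Fall back to fresh, seed-based generation as in the earlier sketch.
            parameter = strategy.draw_parameter(Random(self.parameter_seed))
            template = strategy.draw_template(Random(self.template_seed), parameter)
        # Step 1: remember the strategy and template so that simplification can
        # later simplify this template and replace its stored representation.
        self.last_strategy = strategy
        self.last_template = template
        return strategy.reify(template)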

If you’re reading that thinking that it’s horrifying, you’re not wrong. It’s also quite fragile – there are definitely some corner cases of it that I haven’t quite shaken out yet, and it’s why flatmap is somewhat slower and buggier than things using more normal methods of generation.

But it’s also really powerful, because you can use this technology for things other than flatmap. I originally intended it as a shared implementation between flatmap and the stateful testing, although for various reasons I haven’t got around to rebuilding the stateful testing on top of it yet. It’s also what powers a really cool new feature I released today (I know I said I wasn’t doing new features, but I couldn’t resist and it only took me an hour), which is essentially a form of do notation for Hypothesis strategies (only more pythonic).

from hypothesis.strategies import composite, lists, sampled_from

@composite
def list_and_sample(draw, elements):
    values = draw(lists(elements, min_size=1))
    redraw = draw(lists(sampled_from(values)))
    return (values, redraw)

list_and_sample(integers()) now gives you a strategy which draws a list of at least one integer, then draws a second list by sampling (with replacement) from the first.

What this does is give you a magic “draw” function that produces examples from a strategy, and composes these all together into a single strategy built on repeatedly calling that draw function as many times as you like.

This is also black magic, but it’s not novel black magic: It’s just morphers again. We generate an infinite stream of morphers, then map a function over it. draw maintains a counter, and the nth time you call it, it gets the nth element from the stream and tells it to become the strategy that you’ve just passed in. There’s a bit more fiddling and details to work out in terms of making everything line up right, but that’s more or less it. This vastly simplifies the definition of strategies that you would previously have built up with an ugly chain of lambdas and flatmaps.
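For contrast, here is roughly what the same strategy looks like written with flatmap, i.e. the kind of chain it replaces. The name list_and_sample_via_flatmap is mine, and I’ve specialised it to integers for brevity; only public strategies are used.

from hypothesis.strategies import integers, just, lists, sampled_from, tuples

# Draw a non-empty list of integers, then pair it with a list of values
# sampled (with replacement) from it.
list_and_sample_via_flatmap = lists(integers(), min_size=1).flatmap(
    lambda values: tuples(just(values), lists(sampled_from(values))))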

If you want to read more about this feature, I commend you to the documentation. It’s available in the latest release (1.11.0), so have fun playing with it.


Hypothesis 1.10.2 and 1.10.3 are out

Two small bug fix releases. I released one and then immediately found and fixed some unrelated problems, so two patch releases for the price of one.

Changelog entries:

  • lists(elements, unique_by=some_function, min_size=n) would have raised a ValidationError if n > Settings.default.average_list_length because it would have wanted to use an average list length shorter than the minimum size of the list, which is impossible. Now it instead defaults to twice the minimum size in these circumstances.
  • basic() strategy would have only ever produced at most ten distinct values per run of the test (which is bad if you e.g. have it inside a list). This was obviously silly. It will now produce a much better distribution of data, both duplicated and non duplicated.
  • star imports from hypothesis should now work correctly.
  • example quality for examples using flatmap will be better, as the way it had previously been implemented was causing problems where Hypothesis was erroneously labelling some examples as being duplicates.

Fuzzing examples from the ground up

I’ve been doing some experimenting with how I might use the AFL approach to using coverage information for glass box testing in Hypothesis (apparently even when not working on it I still can’t stop experimenting). In particular I’m interested in the question of how to use this sort of information with purely generative approaches to data where you don’t have a starting pool of examples.

After some experimentation, the following seems to work pretty well with byte strings at least:

We have two operations, expand and shrink. Each of these takes a value and either produces larger or smaller versions of it. We also have some total ordering over examples where smaller is better.

We can run an example and get a set of labels out. Labels are:

  1. The same labels that AFL uses – i.e. hashed branches with bucketed counts of how many times they were taken.
  2. The type of any exception thrown by running the example

We maintain a mapping for each label to the best example which exhibits that label, and a set of previously seen examples (this should probably be some sort of bloom filter really).

We define an operation consider which works as follows:

# labels(), simpler() and shrink() here are the operations described above.
seen = set()
best = {}

def consider(example):
    if example in seen:
        return False
    seen.add(example)
    changed = False
    for label in labels(example):
        if label not in best or simpler(example, best[label]):
            changed = True
            best[label] = example
    if changed:
        for smaller in shrink(example):
            if consider(smaller):
                break
    return changed

(except that the real version is written non-recursively, to work around Python’s recursion limits).

We then first consider a trivial example (e.g. the empty string) to start us off and then iterate the following loop indefinitely (a code sketch follows the list):

  1. Pick a label at random and choose the current best example for that label
  2. For each expansion of that example (including the example itself), consider that until one of them returns True or we run out of expansions.
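A minimal sketch of that driver loop, reusing consider() and best from the snippet above. expand() stands for the expansion operation described earlier, and the trivial starting example here is the empty byte string.

import itertools
import random

def fuzz_forever(expand):
    consider(b'')  # seed the process with the trivial example
    while True:
        # 1. Pick a label at random and take its current best example.
        label = random.choice(list(best))
        example = best[label]
        # 2. Consider the example itself and then each of its expansions,
        #    stopping as soon as one of them improves something.
        for candidate in itertools.chain([example], expand(example)):
            if consider(candidate):
                break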

…and that’s it. The use of shrinking inside consider gives us a constant set of small seeds to build off, and then expanding them lets us find them at the end. Note that we deliberately don’t fully minimize for each label – a lot of the time our labels will end up being quite large because we discovered several at once and got distracted by one of them. This appears to be OK in practice because over time labels get minimized as a byproduct of the constant generate and replace process.

To be concrete, the following are the operations for binary strings (sketched in code after the list):

  1. The trivial starting element is the empty string
  2. Expansion first tries appending up to 1024 bytes of random nonsense to the end, then tries adding one byte to the end for each valid byte
  3. Shrinking does a bunch of sliding deletes. In particular there is no bytewise shrinking, only deletions.
  4. A value is simpler if it has a smaller length or equal length but is lexicographically smaller.
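Sketches of what those operations might look like for byte strings. The chunk widths, how “up to 1024 bytes” is sampled, and the iteration order are my own guesses; the list above only pins down the general shape.

import os
import random

def expand(example):
    # 2. First try appending up to 1024 bytes of random nonsense...
    yield example + os.urandom(random.randint(1, 1024))
    # ...then try adding each possible single byte to the end.
    for b in range(256):
        yield example + bytes([b])

def shrink(example):
    # 3. Sliding deletes only: drop a chunk of each width at each position.
    #    There is no bytewise shrinking, just deletions.
    for width in (8, 4, 2, 1):
        for i in range(len(example) - width + 1):
            yield example[:i] + example[i + width:]

def simpler(x, y):
    # 4. Shorter is simpler; equal lengths compare lexicographically.
    return (len(x), x) < (len(y), y)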

And this seems to work pretty well. I don’t have any empirical quantification, but it’s found two bugs: One in chardet and the other in binaryornot. I’ve tried applying it to cryptography, but it didn’t find anything. That’s not totally surprising as it’s already been extensively fuzzed with python-AFL, which I don’t expect this approach to be much better than – there’s less overhead, so it seems to be faster, but it’s otherwise a very similar approach which should produce similar results. Also, especially in this case, python AFL has a distinct advantage by not trying to synthesize complex message formats out of whole cloth.


How to generate a list plus an element of it

A common problem when using Hypothesis is that you want to generate two arguments: One is a list, the other is some element of that list (or some index into it, or some subset of it, etc).

This is pretty easy to do, but there are different ways of doing it that work variably well, so I thought I would unpack some of the options. You should absolutely feel free to ignore this and do whatever is easiest for you: Almost everything that works will work reasonably well, this is really just if you’re interested in The Best way to do it.

That being said, first a caveat. Don’t use this one:

from hypothesis.strategies import lists, tuples

def list_and_element(elements):
    return tuples(lists(elements), elements).filter(
        lambda p: p[1] in p[0])

The main reason not to use this one is that it will appear to work fine but will actually be working really badly: It almost only ever generates pairs where the second element is one which has a very high probability of being drawn from the example distribution. e.g. if you try using this with string elements you’ll find that the second element of the pair is almost always the empty string and basically never a string with more than one character.

Basically as long as you don’t do something like that, which I don’t think I’ve ever seen someone do in the wild anyway, you will probably be fine. That caveat aside, let’s look at some of the alternatives.

The easiest thing to do, and the one people most commonly reach for, is to use flatmap:

from hypothesis.strategies import just, lists, sampled_from, tuples

def list_and_element(elements):
    return lists(elements, min_size=1).flatmap(
        lambda xs: tuples(just(xs), sampled_from(xs)))

This is pretty self-explanatory if you understand what flatmap does: We’re doing literally exactly what the problem says, but we have to break it into two stages. First we draw a list (with at least one element, so that there is something to draw from), then we draw an element from that list and return both the list and the drawn element.

There’s nothing wrong with this solution and you should feel free to use it. However there are two ways to improve upon it:

The first is that whenever you use flatmap you should have a think about whether there’s a reasonable way to instead not do that. It’s not that flatmap is especially dangerous per se, it’s just that because it has extremely general purpose semantics Hypothesis has to work a lot harder every time you use it, and performance and example quality may suffer a bit (if you don’t have a timeout it should never hurt the ability to find an example, only the quality of the end result).

map and filter do not have this problem (if you want a technical explanation why: Hypothesis operates on an intermediate representation which is where it does most of its work. map and filter have the same intermediate representation as the underlying strategy, but flatmap cannot and instead has a really complicated one which is quite hard to work with), and it’s often possible to replace flatmap usage with them, so sometimes it’s worth thinking about whether you can. In this case we can, quite easily. The following does this:

from hypothesis.strategies import integers, lists, tuples

def list_and_element(elements):
    return tuples(integers(min_value=0), lists(elements, min_size=1)).filter(
        lambda p: p[0] < len(p[1])
    ).map(lambda p: (p[1], p[1][p[0]]))

What we’re doing here is we’re simultaneously generating the list and an index into it, then we’re filtering to the range where that index is valid, then we’re taking the element at that index.

There’s a complication though: Did you notice how we’re swapping the order of the pair halfway through?

The reason for this is that it makes Hypothesis better able to simplify the example in some interesting cases. What will happen is that Hypothesis will first simplify the first element of the tuple as far as it can, then the second element, then back to the first, etc., until it hits a point where it can’t simplify anything further.

And what this means is that by putting the index first, we automatically seek to the first element of the list that makes the example pass. This both gives us more freedom to trim down the array by deleting later elements and also stops us getting stuck in the case where we have an easy to satisfy property that happens not to be satisfied by the trivial element but can’t shrink the array. For example, when doing it the other way I found that Hypothesis gave me (['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '0'], '0') as a minimal example of a pair where the element chosen was non-empty: It had first shrunk all the elements of the array leading up to the one we’d chosen, then it couldn’t shrink the index because all of the earlier elements were empty.

This is also potentially a problem with the sampled_from + flatmap approach, but as it happens in this particular case you get lucky: What will end up happening is that if you try to shrink the length of the array past the index you’d chosen, you’ll end up redrawing a new index. This will usually get you past being stuck in this case, although no guarantees.

Then there’s the final option, which is my favourite: Don’t do this at all.

A thing that I have often found to be a very useful reframing is: instead of asserting “this example is not a counter-example to this property”, assert “a counter-example to this property does not exist within this example”. As long as the set of possible counter-examples isn’t too large a function of the size of the example (e.g. if it’s O(n) you should be fine) this often leads to much better tests: you both vastly increase the chances of finding a counter-example and improve the end result, because you gain the ability to shrink past a bunch of accidental blockers. So instead of writing something like:

@given(list_and_element(text()))
def test_some_stuff(p):
    xs, x = p
    ...

You write:

@given(lists(text(), min_size=1))
def test_some_stuff(xs):
    for x in xs:
        ...

This is a technique I first noticed in my properties for testing optimization work. It’s not without its problems: It can be awkward if your tests need to mutate your arguments, and it of course makes your tests slower. For a lot of cases it seems to be worth trying though, as it can improve matters significantly.


A vague roadmap for Hypothesis 2.0

As mentioned, there are a whole bunch of things I’d like to work on in Hypothesis still, but it’s a bit of a full time job and I’ve had enough of doing that on Hypothesis for now.

But I’d like to get the details down while they’re still fresh in my head. Unfortunately that makes this a bit of a “neener neener, these are the features you’re not going to get” post. It’s likely that in about 6 months time when Hypothesis has even more traction than it currently does I will do some sort of Kickstarter to fund working on this, or I might just do it for free the next time I feel like taking a big pile of money and setting fire to it.

This is sorted by the order in which I would likely do them, rather than by how interesting they are.

Flaws in the current API

The Hypothesis API is pretty good in my not very humble opinion, but it definitely has some weird eccentricities and things I’d like to do better. Some of these are holdovers from the very early days of Hypothesis back in 2013; others are more modern and are just things I couldn’t figure out a nice way to express when I wrote them.

Weird parameters to @given

Arguments to @given are confusing. They can be positional or keyword based, and some of them are special: in particular you can explicitly pass in a Settings object or a Random object to use. These currently live in the same namespace as the arguments to be passed to your underlying test function. As a result, fun fact: your test function can’t have arguments called ‘random’ or ‘settings’ unless you pass their strategies to @given positionally.

New API:

  1. Like @example, @given may take its arguments either positionally or as keywords, but not both.
  2. All arguments to @given will be passed to the underlying test as values. If you want to configure it with a custom random, settings, or any of various other things the API would be free to grow to, the syntax would be something like @given(…).with_configuration(random=my_random) def my_test_function(…)

This opens up a lot of possibilities, simplifies the slightly horrendous edge cases of given, and generally smooths out a lot of difficulties people sometimes run into with using it. It also means there are much fewer confusing edge cases using it with internal decorators because the interactions with arguments and keyword arguments become much simpler.

Status: I know exactly how to do this. It’s a bit fiddly to do well, but I don’t think there are any surprises here.

Life cycle API

One of the things you need for testing is a bunch of hooks into the life-cycle – setup and teardown for example. Hypothesis currently has the executors API which nobody including me likes. It’s a bit weird, very limited, and ties you to class based testing.

I’d like to expose a detailed lifecycle API that lets you hook into a number of events. In particular I need to be able to insert code around just the test body (executors currently treat value creation and test execution identically). Combining this with the better @given configuration above makes it easy to ditch the dependence on class based testing.

I’d still like to be able to support some sort of auto derivation of life cycle events for class based tests (unittest setup and tear down for example).

I’d also like to integrate this with py.test function level fixtures, but that’s currently impossible.

Status: The basic life cycle API I mostly know how to do. Beyond that things start to get a bit shaky.

Better stateful testing

Hypothesis’s stateful testing is extremely powerful and if you’re not using it you should be.

But… I’ll be the first to admit it’s a little hard to use. The generic API is fine. It’s quite low level but it’s easy to use. The rule based stuff should be easy to use, but there’s a bit too much boilerplate and bundles of variables are a weird second-class citizen.

What I’d like to be able to do is make them just behave like strategies, where the strategy’s evaluation is deferred until execution time, so you can use them as if they were any other strategy and everything should Just Work.

I would also like the syntax for using it to be more unified with the normal @given syntax. Ideally it would be nice if every @given invocation had an implicit state machine associated with it so as to unify the two approaches.

Status: I know how to do every part of this except the last bit. I suspect the last bit will get punted on.

Better process isolation

Currently if you want to have process isolation for your tests you can use the forking executor.

But you probably shouldn’t. It has a number of weird limitations: it may (usually will) interact poorly with how test runners capture output, and, for reasons that will be completely opaque to you (but make sense, I promise), some examples will not minimize correctly.

I’d like to make this better and integrate it more thoroughly into Hypothesis, so it’s just a config item to get process isolation and it should transparently work with things. Ideally this would give built in parallel test execution too.

Status: Given the lifecycle API I’m pretty sure I know how to do this on POSIX platforms. I’m unclear on the details for Windows but think I can probably manage something. This may require breaking my zero dependencies rule to do well, but I’m not against doing that. I would strongly consider just adding execnet as a dependency for implementing this.

Size bounding

A common problem in Hypothesis is that you ask it to generate examples that are too large. e.g. an example from lists(lists(lists(booleans()))) will typically contain more than a thousand elements. This is unfortunate. A lot of this problem comes from the fact that Hypothesis does not use the traditional sizing mechanism from Quickcheck.
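To make that concrete, here is the strategy in question; each level of nesting multiplies the expected number of elements, which is why the leaf count in a single example balloons. The helper name is mine and purely illustrative.

from hypothesis.strategies import booleans, lists

deeply_nested = lists(lists(lists(booleans())))

def count_leaf_booleans(example):
    # Counts the booleans at the bottom of one generated example; with the
    # current defaults this routinely exceeds a thousand.
    return sum(len(inner) for outer in example for inner in outer)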

The way to fix this is basically to draw from the conditional distribution of values which are <= some maximum size. The mechanisms for doing most of this are already in place from the work on recursive strategies, but it would be nice to be able to generalize it everywhere as it would solve a common cause of slowness.

Status: I’ve got about 80% of a design sketched out. There are a few things I’m worried might not work as well as I’d hope, but I don’t think this is too hard.

New functionality

Profiles

This is a very simple feature, but it would be nice to have easily configurable “profiles” for Hypothesis: You want to run Hypothesis very differently on your CI server than in local development for example.

Status: Trivial. I’m almost embarrassed I haven’t done this already.

Missing Quickcheck features

There are two Quickcheck features which Hypothesis doesn’t implement. One of these is an “eh I could and probably will but I doubt Python programmers care much”. The other one is one that is genuinely important and I would like to support but haven’t yet got around to.

Function Strategies

One of the nice things Quickcheck can do is generate random functions. I’m pretty sure I know how to do this in Hypothesis, there’s just not really been much demand for it so I’ve never got around to doing it.

Status: Not that hard, but I need to work out some details and figure out the specific requirements for this. Python functions are vastly nastier beasts than Haskell functions.

Example classification

Quickcheck lets you label examples and then at the end reports a breakdown of the statistics for different labels.

I’ve never got around to implementing this in Hypothesis despite the fact that I’ve been intending to do so since even before 1.0. Part of why is that it matters less for Hypothesis – one of the classic reasons you want this is because Quickcheck can easily generate a lot of duplicate examples if you’re not careful. Hypothesis generally doesn’t do that because of its built in deduplication.

Still, it would be a useful feature which I would like to implement at some point.

Status: I know how this should work. I’ve had chunks of a working prototype before and everything was straightforward, but it got deprioritised.

Coverage based example discovery

The big thing that means that I can’t currently say “Hypothesis is great and you should use it for everything” is that actually for a lot of things you should be using python-AFL instead.

This shouldn’t be the case. Really everything python AFL can do, Hypothesis should be able to do better. It just currently can’t because it’s entirely black box.

Hypothesis should be able to use essentially the AFL algorithm to do this: Instead of just generating examples it generates examples, sees the coverage profile of those, then minimizes down to a seed set for that profile and starts mutating from there.

This would be particularly great in combination with the stateful testing, which has exactly the sort of enormous state space that this sort of concept is great for exploring.

Status: I have prototyped this and it works great, but my previous attempts were unacceptably slow. I’m about 80% sure I now know how to fix that.

Grammar based strategy definition

I’d like to be able to generate strings matching some grammar. I’ve previously implemented this in a very early version of Hypothesis for regular grammars but never finished it off.

Status: I’ve done enough of this before that I know how to do it again, but it would require starting from scratch because too much has changed since then.

Strategies for common string formats

I’d like to have built in strategies for URIs, emails, etc. that do not depend on fake factory. Fake factory is fine, but it’s basically insufficiently devious for the level of making your code suffer that Hypothesis users have come to expect.

Status: A bunch of work to get good, but not that hard to do. Especially given grammar based strategy definitions.

Discovery mode

Currently Hypothesis is intended to be run as part of a small test suite of reasonably fast tests. This is great and all, but particularly given the coverage based discovery features what you really want is to be able to park Hypothesis on a server somewhere and just have it sitting there 24/7 looking for bugs in your code.

Status: Requires some fairly hefty designing. A lot of known unknowns, probably some unknown unknowns. I don’t think there’s anything fundamentally difficult about this, but there are a lot of “How should it handle X?” questions that would need answering.
