
The repr thing

In case you haven’t noticed, some parts of Hypothesis are designed with a lot of attention to detail. Some parts (particularly internals or anything that’s been around since the beginning) are a bit sloppy, some are quite well polished, and some of them are pedantic beyond the ken of mortal man and you would be forgiven for wondering what on earth I was on when I was writing them.

The repr you get from standard strategies is one of those sections of which I am really quite proud, in an also slightly embarrassed sort of way.

>>> import hypothesis.strategies as st
>>> st.integers()
integers()
>>> st.integers(min_value=1)
integers(min_value=1)
>>> st.integers(min_value=1).map(lambda x: x * 2)
integers(min_value=1).map(lambda x: <unknown>)
>>> st.integers(min_value=1) | st.booleans()
integers(min_value=1) | booleans()
>>> st.lists(st.integers(min_value=1) | st.booleans(), min_size=3)
lists(elements=integers(min_value=1) | booleans(), min_size=3)

Aren’t those reprs nice?

The lambda one bugs me a bit. If this had been in a file you’d have actually got the body of the lambda, but I can’t currently make that work in the python console. It works in ipython, and fixing it to work in the normal console would require me to write or vendor a decompiler in order to get good reprs and… well I’d be lying if I said I hadn’t considered it but so far a combination of laziness and judgement have prevailed.

This becomes more interesting when you realise that depending on the arguments you pass in a strategies function may return radically different implementations. e.g. if you do floats(min_value=-0.0, max_value=5e-324) then there are only three floating point numbers in that range, and you get back something that is more or less equivalent to sampled_from((-0.0, 0.0, 5e-324)).
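The "only three floats" claim is easy to check by hand. This is a small sketch, assuming Python 3.9 or later for math.nextafter; it is not anything Hypothesis itself does.

```python
import math

# 5e-324 is the smallest positive (subnormal) double, so no float lies
# strictly between 0.0 and it.
smallest = math.nextafter(0.0, 1.0)
assert smallest == 5e-324

# -0.0 and 0.0 compare equal but differ in their sign bit, which is why
# they count as two separate values in the range.
values_in_range = (-0.0, 0.0, smallest)
signs = [math.copysign(1.0, v) for v in values_in_range]
```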

How does all this work?

Well, most of this is done with a single decorator and a bunch of pain:

def defines_strategy(strategy_definition):
    # getargspec, hrange and ReprWrapperStrategy are assumed to be in scope
    # in the surrounding module.
    from hypothesis.internal.reflection import proxies, arg_string, \
        convert_positional_arguments
    argspec = getargspec(strategy_definition)
    # Map each argument that has a default to that default value.
    defaults = {}
    if argspec.defaults is not None:
        for k in hrange(1, len(argspec.defaults) + 1):
            defaults[argspec.args[-k]] = argspec.defaults[-k]

    @proxies(strategy_definition)
    def accept(*args, **kwargs):
        result = strategy_definition(*args, **kwargs)
        args, kwargs = convert_positional_arguments(
            strategy_definition, args, kwargs)
        # Drop anything left at its default so the repr stays short.
        kwargs_for_repr = dict(kwargs)
        for k, v in defaults.items():
            if k in kwargs_for_repr and kwargs_for_repr[k] is v:
                del kwargs_for_repr[k]
        representation = u'%s(%s)' % (
            strategy_definition.__name__,
            arg_string(strategy_definition, args, kwargs_for_repr))
        return ReprWrapperStrategy(result, representation)
    return accept

What’s this doing?

Well, ReprWrapperStrategy is more or less what it sounds like: it wraps a strategy and provides it with a custom repr string. proxies is basically functools.wraps but with a bit more attention given to getting the argspec exactly right.

So in this what we’re doing is:

  1. Converting all positional arguments to their kwargs equivalent where possible
  2. Removing any keyword arguments that are exactly the default
  3. Producing an argument string that, when invoked with the remaining args (from varargs) and any keyword args, would be equivalent to the ones that were actually passed in. (Special note: named arguments appear in the order of the argument list, and anything collected by **kwargs comes after them in alphabetical order. This ensures that we have a stable repr that doesn’t depend on hash iteration order. Why are kwargs not an OrderedDict?)
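To make those three steps concrete, here is a minimal sketch built on the standard library's inspect.signature rather than on the Hypothesis reflection module. The helper name call_repr and the stand-in lists function are hypothetical, and the sketch ignores positional-only parameters.

```python
import inspect


def call_repr(func, *args, **kwargs):
    """Render a call to func as a normalised string (hypothetical helper)."""
    sig = inspect.signature(func)
    # Step 1: binding gives every positional argument its parameter name.
    bound = sig.bind(*args, **kwargs)
    # If *args received values, the named parameters before it cannot be
    # written as keywords, so they have to stay positional.
    keep_positional = any(
        p.kind is inspect.Parameter.VAR_POSITIONAL and bound.arguments.get(name)
        for name, p in sig.parameters.items())
    parts = []
    for name, p in sig.parameters.items():
        if name not in bound.arguments:
            continue
        value = bound.arguments[name]
        if p.kind is inspect.Parameter.VAR_POSITIONAL:
            parts.extend(repr(v) for v in value)
        elif p.kind is inspect.Parameter.VAR_KEYWORD:
            # Step 3: extra keyword arguments go last, sorted by name.
            parts.extend('%s=%r' % (k, value[k]) for k in sorted(value))
        elif keep_positional and p.kind is not inspect.Parameter.KEYWORD_ONLY:
            parts.append(repr(value))
        elif value is not p.default:
            # Step 2: anything that is exactly the default is dropped.
            parts.append('%s=%r' % (name, value))
    return '%s(%s)' % (func.__name__, ', '.join(parts))


def lists(elements=None, min_size=None, max_size=None):
    """Stand-in with a strategy-like signature, for illustration only."""


print(call_repr(lists, 'x', 3))  # lists(elements='x', min_size=3)
```

Parameters are visited in signature order, so the output does not depend on the order in which keyword arguments were passed.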

Most of the heavy lifting in here is done in the reflection module, which is named such mostly because myhateforthepythonobjectmodelburnswiththefireoftenthousandsuns was too long a module name.

Then we have the bit with map().

Here is the definition of repr for map:

    def __repr__(self):
        if not hasattr(self, u'_cached_repr'):
            # self.pack is the function that was given to map().
            self._cached_repr = u'%r.map(%s)' % (
                self.mapped_strategy,
                get_pretty_function_description(self.pack))
        return self._cached_repr

We cache the repr on first evaluation because get_pretty_function_description is quite slow (not outrageously slow, but quite slow), so we neither want to call it lots of times nor want to calculate it if you don’t need it.

For non-lambda functions, get_pretty_function_description returns their __name__. For lambdas, it tries to figure out their source code through a mix of inspect.getsource (which doesn’t actually work, and the fact that it doesn’t work is considered notabugwontfix) and some terrible terrible hacks. In the event of something going wrong here it returns the “lambda arg, names: <unknown>” we saw above. If you pass something that isn’t a function (e.g. a functools.partial) it just returns the repr so you see things like:

>>> from hypothesis.strategies import integers
>>> from functools import partial
>>> def add(x, y):
...     return x + y
...
>>> integers().map(partial(add, 1))
integers().map(functools.partial(<function add at 0x...>, 1))

I may at some point add a special case for functools.partial because I am that pedantic.
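The three cases can be sketched roughly as follows. This is a simplified stand-in, not the real get_pretty_function_description: the name describe_function is hypothetical, and it skips the source-recovery hacks entirely, so every lambda gets the <unknown> fallback.

```python
import functools
import inspect


def describe_function(f):
    """Simplified, hypothetical stand-in for the description logic."""
    if not inspect.isfunction(f):
        # Not a plain function (e.g. a functools.partial): use its repr.
        return repr(f)
    if f.__name__ != '<lambda>':
        # Ordinary named functions are described by their name.
        return f.__name__
    # A lambda: this sketch does not try to recover the body, so it always
    # produces the fallback form with just the argument names.
    arg_names = ', '.join(inspect.signature(f).parameters)
    return 'lambda %s: <unknown>' % (arg_names,)


def add(x, y):
    return x + y


print(describe_function(add))                # add
print(describe_function(lambda x: x * 2))    # lambda x: <unknown>
```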

The union repr is much more straightforward in implementation but still worth having:

    def __repr__(self):
        return u' | '.join(map(repr, self.element_strategies))

Is all this worth it? I don’t know. Almost nobody has commented on it, but it makes me feel better. Examples in documentation look a bit prettier, it renders some error messages and reporting better, and generally makes it a lot more transparent what’s actually going on when you’re looking at a repr.

It probably isn’t worth the amount of effort I’ve put into the functionality it’s built on top of, but most of the functionality was already there – I don’t think I added any new functions to reflection to write this, it’s all code I’ve repurposed from other things.

Should you copy me? No, probably not. Nobody actually cares about repr quality as much as I do, but it’s a nice little touch that makes interactive usage of the library a little bit easier, so it’s at least worth thinking about.

This entry was posted in Hypothesis, Python.

Hypothesis: Staying on brand

You know that person who is always insisting that you use the right capitalisation for your company, you have to use the official font, it’s really important whether you use a dash or a space and you absolutely must use the right terminology in all public communication, etc, etc? Aren’t they annoying? Don’t they seem to be ridiculously focused on silly details that absolutely don’t matter?

I used to feel that way too. I mean I generally humoured them because it always feels like other people’s important details don’t matter and that’s usually a sign that you don’t understand their job, but I didn’t really believe that it mattered.

Then I made Hypothesis and now I’m that person.

There’s a long list of guidelines on how I communicate Hypothesis that I have literally never communicated to anyone so I really have no right to get even slightly annoyed when people don’t follow it. To be honest, I have no real right to get annoyed when people don’t follow it even if they have read this post. So consider the intent of this post more “here is what I do, I would appreciate if you do the same when talking about Hypothesis and it will annoy me slightly if you don’t but you are entirely welcome to do your own thing and I appreciate you talking about Hypothesis however you do it”.

Without further ado, here are the Hypothesis brand guidelines.

The little things

  1. You’re probably going to pronounce my name wrong unless you’ve read the pronunciation guide.
  2. Hypothesis is a testing library, not a testing framework. This is actually important, because a thing that people never seem to realise (possibly because I don’t communicate it clearly enough, but it does say so in the documentation) is that Hypothesis does not have its own test runner, it just uses your normal test runners.
  3. I try not to use the phrase “Hypothesis tests” (I slip up on this all the time when speaking) because that’s a statistical concept. I generally use “Tests using Hypothesis”. It’s more awkward but less ambiguous.
  4. Hypothesis isn’t really a Quickcheck. I describe it as “Inspired by Quickcheck” rather than “Based on Quickcheck” or “A Quickcheck port”. It started out life as a Quickcheck port, but modern Hypothesis is a very different beast, both internally and stylistically.

The big thing

All of the classic Quickcheck examples are terrible. Please don’t use them. Pretty please?

In particular I hate the reversing a list example. It’s a toy, there’s no easy way to get it wrong, and it’s doing this style of testing a great injustice by failing to showcase all the genuinely nasty edge cases it can find.

In general I try never to show an example using Hypothesis that does not expose a real bug that I did not deliberately introduce. Usually, to produce this it is enough to write a medium complexity example which you know can be well tested with Hypothesis, then add Hypothesis based tests. You can write it TDD if you like as long as you don’t use Hypothesis to do so.

The Tone

The tone I generally try to go for is “Computers are terrible and you can never devote enough time to testing, so Hypothesis is a tool you can add to your arsenal to make the time you have more effective”.

Disclaimers and safety warnings

  1. Hypothesis will not find all your bugs.
  2. Hypothesis will not replace all your example based tests.
  3. Hypothesis isn’t magic.
  4. Hypothesis will not grant you the ability to write correct code, only help you understand ways in which your code might be incorrect.

The Rest

There are probably a bunch of things I’ve forgotten on this, and I will update the list as I think of them.

This entry was posted in Hypothesis, Python.

Soliciting advice: Bindings, Conjecture, and error handling

Edit: I think I have been talked into a significantly simpler system than the one described here that simply uses error codes plus some custom hooks to make this work. I’m leaving this up for posterity and am still interested in advice, but don’t worry I’ve already been talked out of using setjmp or implementing my own exception handling system.

I’m working on some rudimentary Python bindings to Conjecture and running into a bit of a problem: I’d like it to be possible to run Conjecture without forking, but I’m really struggling to come up with an error handling interface that works for this.

In Conjecture’s current design, any of the data generation functions can abort the process, and are run in a subprocess to guard against that. For testing C this makes absolute sense: It’s clean, easy to use, and there are so many things that can go wrong in a C program that will crash the process that your C testing really has to be resilient against the process crashing anyway so you might as well take advantage of that.

For Python, this is a bit sub-optimal. It would be really nice to be able to run Conjecture tests purely in process just looking for exceptions. os.fork() has to do a bunch of things which makes it much slower than just using C forking straight off (and the program behaves really weirdly when you send it a signal if you try to use the native fork function), and it’s also just a bit unnecessary for 90% of what you do with Python testing.

It would also be good to support a fork free mode so that Conjecture can eventually work on Windows (right now it’s very unixy).

Note: I don’t need forkless mode to handle crashes that are not caused by an explicit call into the conjecture API. conjecture_reject and conjecture_fail (which doesn’t exist right now but could) will explicitly abort the test, but other things that cause a crash are allowed to just crash the process in forkless mode.

So the problem is basically how to combine these interfaces, and everything I come up with seems to be “Now we design an exception system…”

Here is the least objectionable plan I have so far. It requires a lot of drudge work on my part, but this should mostly be invisible to the end user (“Doing the drudge work so you don’t have to” is practically my motto of good library design).

Step 1: For each draw_* function in the API, add a second draw_*_checked function which has exactly the same signature. This does a setjmp, followed by a call to the underlying draw_* function. If that function aborts, it does a longjmp back to the setjmp, sets an is_aborted flag and returns some default value. Bindings must always call the _checked version of the function, then check conjecture_is_aborted() and convert it into a language appropriate error condition.

Note: It is a usage error to call one checked function from another and this will result in your crashing the process. Don’t do that. These are intended to be entry points to the API, not something that you should use in defining data generators.

Step 2: Define a “test runner” interface. This takes a test, some associated data, and runs it and returns one of three states: Passing test, failing test, rejected test. The forking based interface then becomes a single test runner. Another one using techniques similar to the checked interface is possible. Bindings libraries should write their own – e.g. a Python one would catch all exceptions and convert them into an appropriate response.

Step 3: Define a cleanup API. This lets you register a void (*cleanup)(void *data) function and some data to pass to it which may get called right before aborting. In “crash the process” model it is not required to be called, and it will not get called if your process otherwise exits abnormally. Note: This changes the memory ownership model of all data generation. Data returned to you from generators is no longer owned by you and you may not free it.
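From the Python side, the in-process test runner of Step 2 might look something like the sketch below. Every name in it (Status, Rejected, run_in_process) is hypothetical; none of them is part of Conjecture's API.

```python
import enum


class Status(enum.Enum):
    """The three outcomes a test runner can report."""
    PASSED = 'passed'
    FAILED = 'failed'
    REJECTED = 'rejected'


class Rejected(Exception):
    """Hypothetical exception a binding raises when the data is rejected."""


def run_in_process(test, data):
    """Run test(data) without forking and classify what happened."""
    try:
        test(data)
    except Rejected:
        return Status.REJECTED
    except Exception:
        # Any other exception counts as a failing test. A hard crash of the
        # interpreter is not caught here and still takes the process down.
        return Status.FAILED
    return Status.PASSED
```

A forking runner would expose the same three-state result, which is what lets the two be swapped behind one interface.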

I think this satisfies the requirements of being easy to use from both C and other languages, but I’m a little worried that I’m not so much reinventing the wheel as trying to get from point A to point B without even having heard of these wheel things and so I invented the pogo stick instead. Can anyone who has more familiarity with writing C libraries designed to be usable from both C and other languages offer me some advice and/or (heh) pointers?

This entry was posted in Hypothesis, programming, Python.

A new approach to property based testing

Edit: I’ve put together a prototype of these ideas called Conjecture. The design document may be a better place to start. I’ve had multiple people say “I had no idea what you were talking about when I read that blog post but it’s suddenly incredibly obvious now I’ve read the design document” or similar.

Do you ever have one of those thought processes where you go:

  1. Hmm. That’s an interesting idea.
  2. Ooh. And then I can do this.
  3. And that.
  4. Holy shit this solves so many problems.
  5. …by invalidating almost all the work I’ve done on this project.
  6. …and a lot of work other people have done over the last few years.
  7. I don’t know whether to be happy or sad.
  8. I think this might change literally everything.

Well, I’m having one right now.

Through a collision of multiple ideas – some mine, some other people’s – I’ve come up with an entirely new backend and API for property based testing. It’s drastically simpler to use, drastically simpler to implement, and works more or less in any language because it requires basically no advanced features. I’m pretty sure I can rebuild Hypothesis on top of it and that in the course of doing so I will be able to throw away approximately half the code. I’m also strongly considering doing a port to C and rebuilding Hypothesis as a set of bindings to that.

Here’s how it works:

We introduce a type TestContext. A TestContext is three pieces of data:

  1. An immutable sequence of bytes.
  2. An index into that sequence.
  3. A file handle, which may be None/NULL/whatever.

Given a test case, we work with testing functions. A testing function is any function which takes a TestContext as its first argument and then does one of three things:

  1. Returns a value
  2. Rejects the TestContext
  3. Fails

And in the course of doing so writes some data to the test context’s file handle.

A test case is then just a test function which takes no extra arguments and returns an ignored value. We generate a fresh test context, feed it to the test function and see what happens. If we get a failure, the test fails. Otherwise we try again until we get some fixed number of examples that are neither failures nor rejections.

The primitive functions we build everything on top of are:

  1. draw_bytes(context, n): If there are at least n bytes left in the buffer starting from the current index, return the next n bytes and then increment the index by n. Otherwise, reject the TestContext.
  2. fail(context): Essentially an ‘assert False’. Mark the current context as a failure.
  3. report(context, message): Print a message to the context’s file handle.
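As a rough Python sketch of the pieces so far: signalling rejection and failure with exceptions, the find_failure loop and its parameters are all assumptions made for illustration, not the prototype's actual design. This version rejects only when fewer than n bytes remain.

```python
import os


class Rejected(Exception):
    """Raised when the context cannot supply the requested data."""


class Failed(Exception):
    """Raised when the test marks the context as a failure."""


class TestContext:
    def __init__(self, data, output=None):
        self.data = bytes(data)   # immutable sequence of bytes
        self.index = 0            # current position in that sequence
        self.output = output      # file handle, or None for silent runs


def draw_bytes(context, n):
    if len(context.data) - context.index < n:
        raise Rejected()
    result = context.data[context.index:context.index + n]
    context.index += n
    return result


def fail(context):
    raise Failed()


def report(context, message):
    if context.output is not None:
        context.output.write(message + '\n')


def find_failure(test, valid_examples=100, buffer_size=64, max_runs=10000):
    """Return a failing buffer, or None if enough valid runs all passed."""
    valid = 0
    for _ in range(max_runs):
        if valid >= valid_examples:
            break
        buffer = os.urandom(buffer_size)
        try:
            test(TestContext(buffer))
        except Rejected:
            continue              # rejections do not count as examples
        except Failed:
            return buffer
        valid += 1
    return None
```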

Reporting is used to show intermediate values so you can track what your test case does. Some additional state for controlling quality of reporting may also be useful, as are some helper functions (e.g. for my prototype I ended up defining declare(context, name, value) which prints name=value and then returns value).

We can build everything on top of this in the usual way that you would build a random number generator: e.g. you can generate an n-byte integer by drawing n bytes and combining them.
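For instance, an unsigned integer generator could be written like this. The Context class, the draw_uint name and the use of ValueError for rejection are stand-ins chosen to keep the sketch self-contained.

```python
class Context:
    """Minimal stand-in for a test context: a buffer plus an index."""

    def __init__(self, data):
        self.data = bytes(data)
        self.index = 0


def draw_bytes(context, n):
    # Rejection is modelled with ValueError here purely for brevity.
    if context.index + n > len(context.data):
        raise ValueError('rejected: ran out of bytes')
    result = context.data[context.index:context.index + n]
    context.index += n
    return result


def draw_uint(context, n):
    """Draw an n-byte unsigned integer, most significant byte first."""
    return int.from_bytes(draw_bytes(context, n), 'big')


context = Context(b'\x01\x00\xff')
print(draw_uint(context, 2))  # 256
print(draw_uint(context, 1))  # 255
```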

The important realisation is that this interface supports minimization! Once we have generated a sequence of bytes that produces a failure, we can start to perform standard quickcheck style shrinking on it, but because we’re only interested in sequences of bytes we can do a lot of clever things that are normally not open to us that are specialised to shrinking byte sequences.

So we generate, find a failure, then shrink the failure, then we run one final time with a real file handle passed into the test context this time so the log of our minimized test run is printed.

And this seems to work really rather well. It unifies the concepts of strategy and test case, and handles the problem you can have when defining your own generators of how to display data.

You need to arrange your generators a little carefully, and some special shrinks are useful: For example, in my experiments I’ve found that generating integers in big-endian order is important, but you then also need a shrinking operation that lets you swap adjacent bytes x, y when you have x > y. Given that, integer shrinking appears to work quite well. I haven’t yet got a floating point generator working properly (there’s the obvious one where you generate a word of the appropriate size and then reinterpret it as a float, but this is a terrible distribution which is unable to find interesting bugs).
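A toy version of byte-level shrinking, including the adjacent swap, might look like the following. The particular passes and the greedy restart loop are assumptions for illustration and are far cruder than a real shrinker.

```python
def candidates(buffer):
    """Yield simpler variants of buffer: shorter, or smaller byte values."""
    n = len(buffer)
    # Delete a single byte.
    for i in range(n):
        yield buffer[:i] + buffer[i + 1:]
    # Lower a single byte, first to zero and then by one.
    for i in range(n):
        if buffer[i] > 0:
            yield buffer[:i] + b'\x00' + buffer[i + 1:]
            yield buffer[:i] + bytes([buffer[i] - 1]) + buffer[i + 1:]
    # Swap adjacent bytes x, y when x > y.
    for i in range(n - 1):
        if buffer[i] > buffer[i + 1]:
            yield buffer[:i] + bytes([buffer[i + 1], buffer[i]]) + buffer[i + 2:]


def shrink(buffer, still_fails):
    """Greedily apply the first candidate that still fails, until none do."""
    current = bytes(buffer)
    improved = True
    while improved:
        improved = False
        for candidate in candidates(current):
            if still_fails(candidate):
                current = candidate
                improved = True
                break
    return current


def still_fails(buffer):
    # Pretend the test fails whenever a big-endian 2-byte integer drawn from
    # the start of the buffer is at least 1000.
    return len(buffer) >= 2 and int.from_bytes(buffer[:2], 'big') >= 1000


start = b'\xff\x10\x07\x80'
result = shrink(start, still_fails)
```

Every candidate is either shorter than the current buffer or the same length and lexicographically smaller, so the loop always terminates. It stops at a local minimum, which is not guaranteed to be the smallest failing buffer overall.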

I suspect in general the shrinks this produces will often be a bit worse than those in classic quickcheck because they don’t have access to the structure, but in some cases that may actually improve matters – there are a lot of things where classic quickcheck shrinking can get stuck in a local optimum which this essentially allows you to redraw for.

This supports example saving like the normal Hypothesis interface does, and if anything it supports it better because your examples are literally a sequence of bytes and there’s almost nothing to do.

It supports composition of strategies with side effects in the middle in a way that is currently incredibly complicated.

Another nice feature is that it means you can drive your testing with american fuzzy lop if you like, because you can just write a program that reads in a buffer from standard in, feeds it to your test case and sees what happens.

In general, I’m really very excited by this whole concept and think it is the future of both the Hypothesis internals and the API. I don’t yet know what the path to implementing that is, and I’m currently considering doing the C implementation first, but I’ve put together a prototype which seems to work pretty well. Watch this space to see what comes next.

This entry was posted in Hypothesis.