
What if we had more finished libraries?

Mark Jason Dominus has a piece from a while ago in which he says the following:

I released the Text::Template module several years ago, and it was immediately very successful. It’s small, simple, fast, and it does a lot of things well. At the time, there were not yet 29 templating systems available on CPAN.

Anyway, the module quickly stabilized. I would get bug reports, and they would turn out to be bugs in the module’s users, not in the module; I would get feature requests, and usually it turned out that what the requester wanted was possible, or even easy, without any changes to the module. Since the module was perfect, there was no need to upload new versions of it to CPAN.

But then I started to get disturbing emails. “Hi, I notice you have not updated Text::Template for nine months. Are you still maintaining it?” “Hi, I notice you seem to have stopped work on Text::Template. Have you decided to abandon this approach?” “Hi, I was thinking of using Text::Template, but I saw it wasn’t being maintained any more, so I decided to use Junk::CrappyTemplate, because I need to be sure of getting support for it.”

I started wondering if maybe the thing to do was to release a new version of Text::Template every month, with no changes, but with an incremented version number. Of course, that’s ridiculous. But it seems that people assume that if you don’t update the module every month, it must be dead. People seem to think that all software requires new features or frequent bug fixes. Apparently, the idea of software that doesn’t get updated because it’s finished is inconceivable.

I blame Microsoft.

In my post about really boring programming language fantasies, I expressed a desire for more things in the standard library. A lot of people objected that the standard library is where software goes to die.

And… this is I guess somewhat true, but really the standard library is where software goes when it’s stable, and programmers seem to have lost their taste for stable software. After all, you’re not really living unless your dependencies are constantly shifting under you in backwards incompatible ways, right?

But… I share the fear. Because basically I don’t believe you can finish software. I mean sure it’s possible that MJD wrote bug free software. For such a constrained problem I’m definitely not ruling it out (but, on the other hand, text is terrible, and I’ve certainly written incredibly buggy solutions to highly constrained problems). But also he was writing in Perl.

And I don’t mean that pejoratively. As Avdi put it in another piece, we’re still catching up to Perl. Perl is amazing for backwards compatibility of language releases. In comparison, Python is terrible. For example, Hypothesis doesn’t currently run on Python 3.5 because they decided to remove a function that I depended on from the standard library, because it’s been deprecated in favour of a function that doesn’t exist on Python 2.7. Thanks. (I haven’t fixed this yet, not because I’m not going to, but because Python 3.5 isn’t even out yet, so this is not high on my priority list.)

And from the outside it’s often very hard to tell the difference between a finished and an abandoned library, but the balance of probability tends to be towards the latter. There are a lot of abandoned libraries and very few that haven’t had any bugs in them that needed fixing in the last four years.

Part of what motivates this is that I’ve run into a bunch of libraries that from an API point of view “should” be finished: They have a very small surface area, are widely used, and haven’t been updated in a while. Unfortunately they’re instead either abandoned or their maintainers are struggling to find enough time to actually put out a new release. For example, bitarray is a very nice library that sadly you should never use, because it’s abandoned and has some known and serious bugs that have not been fixed and probably never will be. Naturally it’s also widely deployed.

This is frustrating. The end of life for software should be stability, not bitrot, and I’d like a way to fix that. Here’s the way I’m currently speculating about:

It sure would be nice if there were an organisation which basically existed to take on maintainership of finished libraries. A library maintained by that organisation basically gives you the guarantee that bugs will be fixed, it will be ported to new versions of the language, the documentation will be improved, etc. but no new features will be added and the existing API will remain backwards compatible for basically all time. Any incompatible changes to the library are purely functional: They create a new library, not mutate the old one.

Sketch rules for how it could work (I’m imagining this being for Python, but it’s pretty portable to other languages):

  1. There is a corresponding Github organisation containing all libraries the organisation maintains.
  2. There is a shared email address which has ownership of the libraries on PyPI (or whatever your package repository is).
  3. Anyone can and should feel free to work on any given project. Members of the organisation are actively encouraged to do so. A pull request requires sign-off from at least one, ideally two, members of the organisation.
  4. Any library maintainer can ask for the organisation to take ownership of their library. The organisation puts it to an internal vote, and if the vote passes the project is transferred to them. This should usually come with an invitation for the maintainer to join the organisation too.
  5. People who want to join the organisation can ask to do so and it is again put to a vote.

Minimum criteria for libraries to be included:

  1. Must be under an OSI approved license.
  2. Must be at version 1.0.0 or later.
  3. Must have some threshold level of usage (a library nobody is using is probably not finished because you’re almost certainly wrong about the requirements).
  4. Must run on all of some required set of versions of the language (I’d pick CPython 2.7 and 3.4, and PyPy, for Python).
  5. Must not have any dependencies on libraries that are not maintained by the organisation (testing dependencies are totally fine).
  6. The latest release must not have introduced any new features or breaking changes.

This is not really a fantasy idea. I think it’s quite practical, and I would be prepared to be a founding member of this organisation if other people would be interested in joining it. However, I don’t currently actually have any finished libraries to contribute (Hypothesis might be finished at some point, but it probably won’t be – e.g. I’ll likely be adding new and novel strategy combinators forever). Do you?


Fuzzing examples from the ground up

I’ve been doing some experimenting with how I might use the AFL approach to using coverage information for glass box testing in Hypothesis (apparently even when not working on it I still can’t stop experimenting). In particular I’m interested in the question of how to use this sort of information with purely generative approaches to data where you don’t have a starting pool of examples.

After some experimentation, the following seems to work pretty well with byte strings at least:

We have two operations, expand and shrink. Each of these takes a value and either produces larger or smaller versions of it. We also have some total ordering over examples where smaller is better.

We can run an example and get a set of labels out (a rough sketch of one possible labels function follows the list). Labels are:

  1. The same labels that AFL uses – i.e. hashed branches with bucketed counts of how many times they were taken.
  2. The type of any exception thrown by running the example.
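Purely to illustrate the idea, here’s a rough sketch of what such a labels function might look like in plain Python, using sys.settrace line transitions as a crude stand-in for AFL’s instrumented branch counters. The run_example function is a placeholder for whatever actually runs the test; none of this is the real implementation.

import sys
from collections import Counter

def bucket(n):
    # Roughly AFL-style bucketing of hit counts.
    for b in (1, 2, 3, 4, 8, 16, 32, 128):
        if n <= b:
            return b
    return 256

def labels(example):
    counts = Counter()
    prev = [None]

    def tracer(frame, event, arg):
        if event == 'line':
            here = (frame.f_code.co_filename, frame.f_lineno)
            # Treat each pair of successive lines as a "branch".
            counts[hash((prev[0], here))] += 1
            prev[0] = here
        return tracer

    error = None
    sys.settrace(tracer)
    try:
        run_example(example)  # placeholder for running the test
    except Exception as e:
        error = type(e)
    finally:
        sys.settrace(None)

    result = set((branch, bucket(n)) for branch, n in counts.items())
    if error is not None:
        result.add(error)
    return result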

We maintain a mapping for each label to the best example which exhibits that label, and a set of previously seen examples (this should probably be some sort of bloom filter really).

We define an operation consider which works as follows:

seen = set()   # examples we have already tried (could be a Bloom filter)
best = {}      # maps each label to the best example exhibiting it

def consider(example):
    if example in seen:
        return False
    seen.add(example)
    changed = False
    for label in labels(example):
        if label not in best or simpler(example, best[label]):
            changed = True
            best[label] = example
    if changed:
        # Try to replace our new entries with something smaller.
        for smaller in shrink(example):
            if consider(smaller):
                break
    return changed

(The real version is only written non-recursively to work around Python’s recursion limits; the recursive form above is clearer.)

We then first consider a trivial example (e.g. the empty string) to start us off and then iterate the following loop indefinitely (there’s a code sketch after the list):

  1. Pick a label at random and choose the current best example for that label
  2. For each expansion of that example (including the example itself), consider each in turn until one of them returns True or we run out of expansions.
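Concretely, the driver loop might look something like this. It’s only a sketch: it reuses consider and best from the snippet above, expand is the expansion operation described below, and fuzz is just a name I’ve made up.

import itertools
import random

def fuzz(trivial):
    consider(trivial)  # e.g. the empty string
    while True:  # iterate indefinitely
        # Pick a label at random and take the current best example for it.
        label = random.choice(list(best))
        example = best[label]
        # Consider the example itself and then each of its expansions,
        # stopping as soon as one of them changes the best mapping.
        for candidate in itertools.chain([example], expand(example)):
            if consider(candidate):
                break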

…and that’s it. The shrinking inside consider gives us a constant supply of small seeds to build off, and then expanding those seeds is what finds new behaviour. Note that we make no deliberate effort to minimize each label – a lot of the time our labels will end up attached to quite large examples because we discovered several at once and got distracted by one of them. This appears to be OK in practice, because over time labels get minimized anyway as a byproduct of the constant generate-and-replace process.

To be concrete, the following are the operations for binary strings (sketched in code after the list):

  1. The trivial starting element is the empty string.
  2. Expansion first tries appending up to 1024 bytes of random nonsense to the end, then tries adding one byte to the end for each valid byte.
  3. Shrinking does a bunch of sliding deletes. In particular there is no bytewise shrinking, only deletions.
  4. A value is simpler if it has a smaller length or equal length but is lexicographically smaller.
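To give a flavour of it, here’s a rough sketch of what those operations could look like for byte strings in Python 3. The exact block sizes and deletion pattern for the “sliding deletes” are my guesses rather than the real implementation.

import random

def expand(s):
    # First try bolting up to 1024 bytes of random nonsense onto the end...
    yield s + bytes(random.randrange(256) for _ in range(random.randint(1, 1024)))
    # ...then try each possible single-byte extension.
    for b in range(256):
        yield s + bytes([b])

def shrink(s):
    # Sliding deletes: delete a block of each (halving) size from every position.
    # No bytewise shrinking, only deletions.
    block = len(s)
    while block > 0:
        for i in range(len(s) - block + 1):
            yield s[:i] + s[i + block:]
        block //= 2

def simpler(x, y):
    # Shorter is simpler; equal lengths are compared lexicographically.
    return (len(x), x) < (len(y), y)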

And this seems to work pretty well. I don’t have any empirical quantification, but it’s found two bugs: One in chardet and the other in binaryornot. I’ve tried applying it to cryptography, but it didn’t find anything. That’s not totally surprising, as it’s already been extensively fuzzed with python-AFL, which I don’t expect this approach to be much better than – there’s less overhead, so it seems to be faster, but it’s otherwise a very similar approach which should produce similar results. Also, especially in this case, python-AFL has a distinct advantage in that it doesn’t have to synthesize complex message formats out of whole cloth.


How to generate a list plus an element of it

A common problem when using Hypothesis is that you want to generate two arguments: One is a list, the other is some element of that list (or some index into it, or some subset of it, etc).

This is pretty easy to do, but there are different ways of doing it that work variably well, so I thought I would unpack some of the options. You should absolutely feel free to ignore this and do whatever is easiest for you: Almost everything that works will work reasonably well; this is really just if you’re interested in The Best way to do it.

That being said, first a caveat. Don’t use this one:

def list_and_element(elements):
    return tuples(lists(elements), elements).filter(
        lambda p: p[1] in p[0])

The main reason not to use this one is that it will appear to work fine but will actually be working really badly: It almost only ever generates pairs where the second element is one which has a very high probability of being drawn from the example distribution. For example, if you try using this with string elements you’ll find that the second element of the pair is almost always the empty string and basically never a string with more than one character.

Basically as long as you don’t do something like that, which I don’t think I’ve ever seen someone do in the wild anyway, you will probably be fine. That caveat aside, let’s look at some of the alternatives.

The easiest thing to do, and the one people most commonly reach for, is to use flatmap:

def list_and_element(elements):
    return lists(elements, min_size=1).flatmap(
        lambda xs: tuples(just(xs), sampled_from(xs)))

This is pretty self-explanatory if you understand what flatmap does: We’re doing literally exactly what the problem says, but we have to break it into two stages. First we draw a list (with at least one element, so that there is something to draw from), then we draw an element from that list and return both the list and that drawn element.

There’s nothing wrong with this solution and you should feel free to use it. However there are two ways to improve upon it:

The first is that whenever you use flatmap you should have a think about whether there’s a reasonable way to instead not do that. It’s not that flatmap is especially dangerous per se, it’s just that because it has extremely general purpose semantics Hypothesis has to work a lot harder every time you use it, and performance and example quality may suffer a bit (if you don’t have a timeout it should never hurt the ability to find an example, only the quality of the end result).

map and filter do not have this problem (if you want a technical explanation why: Hypothesis operates on an intermediate representation which is where it does most of its work. map and filter have the same intermediate representation as the underlying strategy, but flatmap cannot and instead has a really complicated one which is quite hard to work with), and it’s often possible to replace flatmap usage with them, so sometimes it’s worth thinking about whether you can. In this case we can, quite easily. The following does this:

def list_and_element(elements):
    return tuples(integers(min_value=0), lists(elements, min_size=1)).filter(
        lambda p: p[0] < len(p[1])).map(
        lambda p: (p[1], p[1][p[0]]))

What we’re doing here is we’re simultaneously generating the list and an index into it, then we’re filtering to the range where that index is valid, then we’re taking the element at that index.

There’s a complication though: Did you notice how we’re swapping the order of the pair halfway through?

The reason for this is that it makes Hypothesis better able to simplify the example in some interesting cases. What will happen is that Hypothesis will first simplify the first element of the tuple as far as it can, then the second element, then back to the first, etc. until it hits a point where it can’t simplify any further.

And what this means is that by putting the index first, we automatically seek to the first element of the list that makes the example pass. This both gives us more freedom to trim down the array by deleting later elements and stops us getting stuck in the case where we have an easy to satisfy property that happens not to be satisfied by the trivial element but where we can’t shrink the array. For example, when doing it the other way I found that Hypothesis gave me (['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '0'], '0') as a minimal example of a pair where the element chosen was non-empty: It had first shrunk all the elements of the array leading up to the one we’d chosen, then it couldn’t shrink the index because all of the earlier elements were empty.
This is also potentially a problem with the sampled_from + flatmap approach, but as it happens in this particular case you get lucky: What will end up happening is that if you try to shrink the length of the array past the index you’d chosen, you’ll end up redrawing a new index. This will usually get you past being stuck in this case, although no guarantees.

Then there’s the final option, which is my favourite: Don’t do this at all.

A thing that I have often found to be a very useful reframing is instead of asserting “this example is not a counter-example to this property”, assert “a counter-example to this property does not exist within this example”. As long as the set of possible counter-examples isn’t too large a function of the size of the example (e.g. if it’s O(n) you should be fine) this often leads to much better tests: You both vastly increase the chances of finding a counter-example and improve the end result, because you gain the ability to shrink past a bunch of accidental blockers. So instead of writing something like:

@given(list_and_element(text()))
def test_some_stuff(p):
    xs, x = p
    ...

You write:

@given(lists(text(), min_size=1))
def test_some_stuff(xs):
    for x in xs:
        ...

This is a technique I first noticed in my properties for testing optimization work. It’s not without its problems: It can be awkward if your tests need to mutate your arguments, and it of course makes your tests slower. For a lot of cases it seems to be worth trying though, as it can improve matters significantly.


A vague roadmap for Hypothesis 2.0

As mentioned, there are a whole bunch of things I’d like to work on in Hypothesis still, but it’s a bit of a full time job and I’ve had enough of doing that on Hypothesis for now.

But I’d like to get the details down while they’re still fresh in my head. Unfortunately that makes this a bit of a “neener neener, these are the features you’re not going to get” post. It’s likely that in about 6 months time when Hypothesis has even more traction than it currently does I will do some sort of Kickstarter to fund working on this, or I might just do it for free the next time I feel like taking a big pile of money and setting fire to it.

This is sorted by order in which I would likely do them in rather than any sort of interest order.

Flaws in the current API

The Hypothesis API is pretty good in my not very humble opinion, but it definitely has some weird eccentricities and things I’d like to do better. Some of these are holdovers from the early days of Hypothesis back in 2013, and some are more recent and exist just because I couldn’t figure out a nice way to express them when I wrote them.

Weird parameters to @given

Arguments to @given are confusing. They can be positional or they can be keyword based, and there are some special arguments there. In particular you can explicitly pass in a Settings object or a Random object to use. These currently live in the same namespace as the arguments to be passed to your underlying function. As a result, fun fact: You can’t have arguments to your test function called ‘random’ or ‘settings’ unless you pass them positionally.

New API:

  1. Like @example, @given may take its arguments either positionally or as keywords, but not both.
  2. All arguments to @given will be passed to the underlying test as values. If you want to configure it with a custom random, settings, or any of various other things the API would be free to grow to, the syntax would be something like @given(…).with_configuration(random=my_random) def my_test_function(…)

This opens up a lot of possibilities, simplifies the slightly horrendous edge cases of given, and generally smooths out a lot of difficulties people sometimes run into with using it. It also means there are far fewer confusing edge cases when using it with internal decorators, because the interactions between arguments and keyword arguments become much simpler.

Status: I know exactly how to do this. It’s a bit fiddly to do well, but I don’t think there are any surprises here.

Life cycle API

One of the things you need for testing is a bunch of hooks into the life-cycle – setup and teardown for example. Hypothesis currently has the executors API which nobody including me likes. It’s a bit weird, very limited, and ties you to class based testing.

I’d like to expose a detailed lifecycle API that lets you hook into a number of events. In particular I need to be able to insert code around just the test body (executors currently treat value creation and test execution identically). Combining this with the better @given configuration above makes it easy to ditch the dependence on class based testing.

I’d still like to be able to support some sort of auto derivation of life cycle events for class based tests (unittest setup and tear down for example).

I’d also like to integrate this with py.test function level fixtures, but that’s currently impossible.

Status: The basic life cycle API I mostly know how to do. Beyond that things start to get a bit shaky.

Better stateful testing

Hypothesis’s stateful testing is extremely powerful and if you’re not using it you should be.

But… I’ll be the first to admit it’s a little hard to use. The generic API is fine. It’s quite low level but it’s easy to use. The rule-based stuff should be easy to use, but there’s a bit too much boilerplate and bundles of variables are a weird second-class citizen.

What I’d like to be able to do is make them just behave like strategies, where the strategy’s evaluation is deferred until execution time, so you can use them as if they were any other strategy and everything should Just Work.

I would also like the syntax for using it to be more unified with the normal @given syntax. Ideally it would be nice if every @given invocation had an implicit state machine associated with it so as to unify the two approaches.

Status: I know how to do every part of this except the last bit. I suspect the last bit will get punted on.

Better process isolation

Currently if you want to have process isolation for your tests you can use the forking executor.

But you probably shouldn’t. It has a number of weird limitations: It may (usually will) interact poorly with how test runners capture output, and, for reasons that will be completely opaque to you but make sense, I promise, some examples will not minimize correctly.

I’d like to make this better and integrate it more thoroughly into Hypothesis, so it’s just a config item to get process isolation and it should transparently work with things. Ideally this would give built in parallel test execution too.

Status: Given the lifecycle API I’m pretty sure I know how to do this on POSIX platforms. I’m unclear on the details for Windows but think I can probably manage something. This may require breaking my zero-dependencies rule to do well, but I’m not against doing that. I would strongly consider just adding execnet as a dependency for implementing this.

Size bounding

A common problem in Hypothesis is that you ask it to generate examples that are too large. For example, a lists(lists(lists(booleans))) will typically contain more than a thousand elements. This is unfortunate. A lot of this problem comes from the fact that Hypothesis does not use the traditional sizing mechanism from Quickcheck.

The way to fix this is basically to draw from the conditional distribution of values which are <= some maximum size. The mechanisms for doing most of this are already in place from the work on recursive strategies, but it would be nice to generalize it everywhere, as it would solve a common cause of slowness.

Status: I’ve got about 80% of a design sketched out. There are a few things I’m worried might not work as well as I’d hope, but I don’t think this is too hard.

New functionality

Profiles

This is a very simple feature, but it would be nice to have easily configurable “profiles” for Hypothesis: You want to run Hypothesis very differently on your CI server than in local development for example.

Status: Trivial. I’m almost embarrassed I haven’t done this already.

Missing Quickcheck features

There are two Quickcheck features which Hypothesis doesn’t implement. One of these is an “eh I could and probably will but I doubt Python programmers care much”. The other one is one that is genuinely important and I would like to support but haven’t yet got around to.

Function Strategies

One of the nice things Quickcheck can do is generate random functions. I’m pretty sure I know how to do this in Hypothesis, there’s just not really been much demand for it so I’ve never got around to doing it.

Status: Not that hard, but I need to work out some details and figure out the specific requirements for this. Python functions are vastly nastier beasts than Haskell functions.

Example classification

Quickcheck lets you label examples and then at the end reports a breakdown of the statistics for different labels.

I’ve never got around to implementing this in Hypothesis despite the fact that I’ve been intending to do so since even before 1.0. Part of why is that it matters less for Hypothesis – one of the classic reasons you want this is because Quickcheck can easily generate a lot of duplicate examples if you’re not careful. Hypothesis generally doesn’t do that because of its built in deduplication.

Still, it would be a useful feature which I would like to implement at some point.

Status: I know how this should work. I’ve had chunks of a working prototype before and everything was straightforward, but it got deprioritised.

Coverage based example discovery

The big thing that means that I can’t currently say “Hypothesis is great and you should use it for everything” is that actually for a lot of things you should be using python-AFL instead.

This shouldn’t be the case. Really, everything python-AFL can do, Hypothesis should be able to do better. It just currently can’t because it’s entirely black box.

Hypothesis should be able to use essentially the AFL algorithm to do this: Instead of just generating examples it generates examples, sees the coverage profile of those, then minimizes down to a seed set for that profile and starts mutating from there.

This would be particularly great in combination with the stateful testing, which has exactly the sort of enormous state space that this sort of concept is great for exploring.

Status: I have prototyped this and it works great, but my previous attempts were unacceptably slow. I’m about 80% sure I now know how to fix that.

Grammar based strategy definition

I’d like to be able to generate strings matching some grammar. I’ve previously implemented this in a very early version of Hypothesis for regular grammars but never finished it off.

Status: I’ve done enough of this before that I know how to do it again, but it would require starting from scratch because too much has changed since then.

Strategies for common string formats

I’d like to have built in strategies for URIs, emails, etc. that do not depend on fake factory. Fake factory is fine, but it’s basically insufficiently devious for the level of making your code suffer that Hypothesis users have come to expect.

Status: A bunch of work to get good, but not that hard to do. Especially given grammar based strategy definitions.

Discovery mode

Currently Hypothesis is intended to be run as part of a small test suite of reasonably fast tests. This is great and all, but particularly given the coverage based discovery features, what you really want is to be able to park Hypothesis on a server somewhere and just have it sitting there 24/7 looking for bugs in your code.

Status: Requires some fairly hefty designing. A lot of known unknowns, probably some unknown unknowns. I don’t think there’s anything fundamentally difficult about this, but there are a lot of “How should it handle X?” questions that would need answering.


New release of hypothesis-pytest

Version 0.16.0 of the Hypothesis pytest plugin is out. This contains two new features kindly contributed by others:

  • Florian Bruhin added a ‘hypothesis’ marker to all tests that use @given, making it easy to select or exclude tests based on whether they use Hypothesis.
  • Mikhail Stepura changed the default behaviour so that when ‘--capture=no’ is passed, the Hypothesis reporting integration is disabled and messages from Hypothesis are just printed directly.