Soliciting advice: Bindings, Conjecture, and error handling

Edit: I think I have been talked into a significantly simpler system than the one described here that simply uses error codes plus some custom hooks to make this work. I’m leaving this up for posterity and am still interested in advice, but don’t worry I’ve already been talked out of using setjmp or implementing my own exception handling system.

I’m working on some rudimentary Python bindings to Conjecture and running into a bit of a problem: I’d like it to be possible to run Conjecture without forking, but I’m really struggling to come up with an error handling interface that works for this.

In Conjecture’s current design, any of the data generation functions can abort the process, so tests are run in a subprocess to guard against that. For testing C this makes absolute sense: it’s clean, easy to use, and there are so many ways a C program can crash the process that your C testing has to be resilient against crashes anyway, so you might as well take advantage of that.

For Python, this is a bit sub-optimal. It would be really nice to be able to run Conjecture tests purely in process, just looking for exceptions. os.fork() has to do a bunch of things that make it much slower than calling fork() directly from C (and the program behaves really weirdly when you send it a signal if you try to use the native fork function), and it’s also just a bit unnecessary for 90% of what you do with Python testing.

It would also be good to support a fork-free mode so that Conjecture can eventually work on Windows (right now it’s very unixy).

Note: I don’t need forkless mode to handle crashes that are not caused by an explicit call into the conjecture API. conjecture_reject and conjecture_fail (which doesn’t exist right now but could) will explicitly abort the test, but other things that cause a crash are allowed to just crash the process in forkless mode.

So the problem is basically how to combine these interfaces, and everything I come up with seems to be “Now we design an exception system…”

Here is the least objectionable plan I have so far. It requires a lot of drudge work on my part, but this should mostly be invisible to the end user (“Doing the drudge work so you don’t have to” is practically my motto of good library design).

Step 1: For each draw_* function in the API, add a second draw_*_checked function which has exactly the same signature. This does a setjmp, followed by a call to the underlying draw_* function. If that function aborts, it longjmps back to the setjmp, sets an is_aborted flag, and returns some default value. Bindings must always call the _checked version of the function, then check conjecture_is_aborted() and convert it into a language appropriate error condition.

Note: It is a usage error to call one checked function from another, and doing so will crash the process. Don’t do that. These are intended to be entry points to the API, not something you should use when defining data generators.
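For illustration, here is roughly what the binding side of this pattern might look like in Python. Everything here is hypothetical: in reality draw_uint64_checked and is_aborted would be C functions exposed via ctypes or cffi, and the pure-Python stand-in below only simulates the setjmp/flag behaviour described above.

```python
class Rejected(Exception):
    """Language-appropriate error condition raised when a draw aborts."""

_aborted = False  # simulates the C-side is_aborted flag

def draw_uint64_checked(data, index):
    """Stand-in for a draw_*_checked entry point: where the raw draw_*
    would have aborted the process, it sets the flag and returns a
    default value instead."""
    global _aborted
    if index + 8 > len(data):  # not enough bytes left: the raw call aborts
        _aborted = True
        return 0               # default value, meaningless when aborted
    return int.from_bytes(data[index:index + 8], "big")

def is_aborted():
    return _aborted

def draw_uint64(data, index):
    """What the Python binding does: call the checked entry point, then
    turn the aborted flag into an exception."""
    global _aborted
    result = draw_uint64_checked(data, index)
    if is_aborted():
        _aborted = False
        raise Rejected()
    return result
```

The important part is the discipline, not the names: every binding call goes through a checked entry point, then immediately converts the flag into the host language’s native error mechanism.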

Step 2: Define a “test runner” interface. This takes a test and some associated data, runs it, and returns one of three states: passing test, failing test, or rejected test. The forking based interface then becomes a single test runner. Another one using techniques similar to the checked interface is possible. Binding libraries should write their own – e.g. a Python one would catch all exceptions and convert them into the appropriate response.
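A Python runner of this shape might look like the following sketch (the names Outcome, Rejected and run_in_process are my own, not part of any actual API; a forking runner would instead fork, run the test in the child, and inspect the exit status):

```python
from enum import Enum

class Outcome(Enum):
    PASSED = "passing test"
    FAILED = "failing test"
    REJECTED = "rejected test"

class Rejected(Exception):
    """Raised by generators when the test case should be rejected."""

def run_in_process(test, data):
    """Run the test in-process and map exceptions onto the three states
    the test runner interface has to report."""
    try:
        test(data)
    except Rejected:
        return Outcome.REJECTED
    except Exception:
        return Outcome.FAILED
    return Outcome.PASSED
```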

Step 3: Define a cleanup API. This lets you register a void (*cleanup)(void *data) function and some data to pass to it, which may get called right before aborting. In the “crash the process” model it is not guaranteed to be called, and it will not be called if your process otherwise exits abnormally. Note: this changes the memory ownership model of all data generation. Data returned to you from generators is no longer owned by you and you may not free it.

I think this satisfies the requirements of being easy to use from both C and other languages, but I’m a little worried that I’m not so much reinventing the wheel as trying to get from point A to point B without even having heard of these wheel things and so I invented the pogo stick instead. Can anyone who has more familiarity with writing C libraries designed to be usable from both C and other languages offer me some advice and/or (heh) pointers?

This entry was posted in Hypothesis, programming, Python.

Structure aware simplification for Conjecture

I’m continuing to work on some of the issues I described in my last post about simplification for Conjecture. I’ve got something that needs further work but is pretty good.

The idea is to let the data generation provide structure hints. As such it adds two additional functions to the conjecture API: One starts an example, another ends it. Examples can be and indeed routinely are nested.

We use this API to produce a series of intervals as [start, end) pairs. These then get to be the main focus of where we try to shrink.
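The start/end calls can be turned into these intervals with a simple stack. The class below is an illustrative pure-Python sketch of that bookkeeping, not Conjecture’s actual implementation:

```python
class IntervalRecorder:
    """Turns nested example_start/example_end calls into a list of
    [start, end) intervals over the byte stream."""
    def __init__(self):
        self.index = 0        # current read position in the buffer
        self.stack = []       # start positions of currently open examples
        self.intervals = []   # completed [start, end) pairs

    def draw(self, n):
        self.index += n       # pretend n bytes were just consumed

    def example_start(self):
        self.stack.append(self.index)

    def example_end(self):
        start = self.stack.pop()
        self.intervals.append((start, self.index))
```

Because examples nest, inner intervals complete (and are recorded) before the outer ones that contain them.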

So here’s how the shrinking works given that:

Every time we run a test, if the test fails we replace the current best buffer with the buffer that triggered the failure, trimming it to only the initial segment that was read (we only ever consider buffers that are simpler than the current best, so we don’t need to check that it is).

When we do this we also record all the intervals in the buffer that correspond to examples. We sort these from most complicated to simplest, in the same order that we use for buffers – that is, an interval is simpler than another if it is shorter, or of the same length but lexicographically earlier.
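That ordering is easy to express as a sort key. A small sketch (the function names are mine, chosen for illustration):

```python
def interval_key(buffer, interval):
    """Key matching the buffer ordering: shorter is simpler; equal
    lengths compare lexicographically on their contents."""
    start, end = interval
    return (end - start, buffer[start:end])

def most_complicated_first(buffer, intervals):
    """Sort intervals from most complicated to simplest."""
    return sorted(intervals,
                  key=lambda iv: interval_key(buffer, iv),
                  reverse=True)
```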

The first step is to try to prune the buffer down as much as possible by throwing away parts of it that don’t matter. We greedily delete intervals, starting from most to least complicated. If a deletion succeeds, we repeat this until no intervals can be deleted.

Some details:

  1. A trick: If we start from the most complicated each time then we’re going to end up doing a lot of work that never succeeds. What we actually do is maintain an index variable and always take the interval at that index in the list. When we hit the end, we either stop the process or, if we’ve made any changes, start again from the beginning. This works despite the fact that the list of intervals may be changing under us.
  2. In my test cases there end up being a lot of small examples and it’s not worth trying to delete them all in this crude trimming pass. The current magic number here is that we ignore any intervals with fewer than 8 bytes.
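Putting the pruning pass and both details together, a sketch of the whole trimming loop might look like this (a simplified illustration under my own assumptions, not the real code; still_fails stands in for re-running the test, and the interval remapping after a deletion is deliberately crude, dropping intervals that partially overlap the deleted range):

```python
def prune(buffer, intervals, still_fails, min_size=8):
    """Greedily delete example intervals, most complicated first, using an
    index into the sorted list so a successful deletion doesn't restart
    the scan. min_size is the magic number below which intervals aren't
    worth trying."""
    def key(iv):
        s, e = iv
        return (e - s, buffer[s:e])

    changed = True
    while changed:
        changed = False
        intervals = sorted(intervals, key=key, reverse=True)
        i = 0
        while i < len(intervals):
            start, end = intervals[i]
            if end - start >= min_size:
                candidate = buffer[:start] + buffer[end:]
                if still_fails(candidate):
                    buffer = candidate
                    shift = end - start
                    remapped = []
                    for (s, e) in intervals:
                        if e <= start:                 # entirely before
                            remapped.append((s, e))
                        elif s >= end:                 # entirely after
                            remapped.append((s - shift, e - shift))
                        elif s <= start and e >= end:  # encloses: shrink
                            remapped.append((s, e - shift))
                        # intervals partially overlapping the deletion
                        # are simply dropped in this sketch
                    intervals = remapped
                    changed = True
                    continue  # same index, now the next surviving entry
            i += 1
    return buffer
```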

In my test cases (which are not very representative, it must be said) this typically cuts the buffer size by a factor of ten. It’s also a lot faster than structure-ignorant shrinking of this sort was – if the structure gave us nothing but this, it would already be a win.

Starting from that buffer, for each interval (from least complicated to most complicated) we then perform a simplification I’m calling improve and clone.

We first try to improve the interval as much as possible. This involves repeatedly applying bytewise simplification operations, basically:

  1. For each index, try replacing the byte with a smaller one and see if that helps (a linear probe is actually OK here, though I use something slightly smarter to cut down on constant factors).
  2. Try a replacement chosen uniformly at random that sorts lexicographically before the current value.
  3. Interpreting the interval as an unsigned big-endian integer, try subtracting one.

You should normally avoid operations that look like the third one (actually the second one isn’t great either), but note that we only try it once and then cycle back to bytewise lowering, which will tend to feed on its results to produce much more of a shrink.
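The three operations above can be sketched as follows. This is illustrative only (the real implementation is smarter about the linear probe, and these names are mine):

```python
def lower_bytes(interval, still_fails):
    """Op 1: for each index, probe smaller values for that byte,
    keeping the first one for which the test still fails."""
    buf = bytearray(interval)
    for i in range(len(buf)):
        for smaller in range(buf[i]):  # linear probe from 0 upward
            candidate = bytes(buf[:i]) + bytes([smaller]) + bytes(buf[i + 1:])
            if still_fails(candidate):
                buf = bytearray(candidate)
                break
    return bytes(buf)

def random_lexicographic_shrink(interval, rng):
    """Op 2: a replacement chosen uniformly at random among values that
    sort lexicographically before the current one."""
    as_int = int.from_bytes(interval, "big")
    if as_int == 0:
        return interval
    return rng.randrange(as_int).to_bytes(len(interval), "big")

def subtract_one(interval):
    """Op 3: treat the bytes as one big-endian unsigned integer and
    subtract one."""
    as_int = int.from_bytes(interval, "big")
    if as_int == 0:
        return interval
    return (as_int - 1).to_bytes(len(interval), "big")
```

Note how subtract_one on b"\x01\x00" yields b"\x00\xff": it makes the buffer lexicographically smaller while raising later bytes, which is exactly what the bytewise lowering pass then feeds on.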

Once that’s done we try cloning the interval: we take the segment currently in that interval and try replacing other intervals with it. We only choose intervals which are at least the same length (so we don’t make the buffer longer) and “not too much larger” (and which, if the intervals are the same size, sort lexicographically larger than the current one). The current upper bound on “too much larger” is twice the source’s length, or 16, whichever is greater. We do this regardless of whether any changes were made in the improve phase.
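The candidate selection for cloning can be sketched like this (an illustration under the rules just described, with a hypothetical function name):

```python
def clone_targets(buffer, intervals, source):
    """Pick the intervals we may overwrite with a copy of the just-shrunk
    source interval: at least as long (so the buffer never grows), not
    too much longer, and lexicographically larger if the same length."""
    s0, e0 = source
    src = buffer[s0:e0]
    n = e0 - s0
    limit = max(2 * n, 16)  # "no more than twice as large, or 16"
    targets = []
    for (s, e) in intervals:
        if (s, e) == source:
            continue
        length = e - s
        if length < n or length > limit:
            continue
        if length == n and buffer[s:e] <= src:
            continue  # same size: only replace strictly larger contents
        targets.append((s, e))
    return targets
```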

The way the cloning phase is mixed in is important: As well as being a generally good strategy, by doing the example cloning on the element we just shrank we can avoid duplicating a lot of our hard work when it comes to shrinking other things.

We use the same index trick to do this for every interval until there are no more improvements or we run out of time / hit other configuration limits.

This still needs more work, but for the examples I’ve tried it on (from Hypothesis’s example test suite) it appears to be nearly as good as Hypothesis’s simplification now (i.e. probably better than most other Quickchecks). This is only for some of the structurally simpler examples (they’re large but not very complicated), and I suspect that as I flesh it out further I’ll find limitations to this approach, but they’re still examples that the initial attempts struggled massively on, so it’s pretty encouraging.

This entry was posted in Uncategorized.

Conjecture WIP: Simplification isn’t simple

Advance note: Read the Conjecture design document first or this post won’t make sense.

I’ve been doing some prototyping of Conjecture in a private fork of Hypothesis, mostly to take advantage of Hypothesis’s fairly brutal test suite and comprehensive range of examples. I figure if I can make the Conjecture approach pass the Hypothesis test suite then that’s a pretty good indication that the idea works.

Spoiler alert: I can’t currently make Conjecture pass the Hypothesis test suite.

Fortunately, the way I can’t make Conjecture pass the Hypothesis test suite isn’t the end of the world. The problem is basically one of performance. For small examples in a lot of cases Conjecture’s approach performs as well as or better than Hypothesis. For large examples it’s just too slow: It gets the right answer eventually, but it takes an order of magnitude more time than Hypothesis does.

The reason for this is pretty simple: the underlying data buffers that Conjecture ends up generating for an example of any reasonable size are quite large (tens of kilobytes), and this is going to remain the case even after simplification. So any structure-unaware binary simplification process is going to be a bit stuck about what to do. It’ll find the right answer eventually, but there are a lot of things to try.

I think the answer is going to have to be to make it structure aware. I haven’t yet hit on exactly the right approach for this, but the minimal abstraction that I think will work is to give it an idea of nesting: examples can call conjecture_example_start() when they begin and conjecture_example_end() when they end, and this gives us nested ranges of where examples live in our data stream, which should be useful information for hinting at where to start. E.g. one of the best simplifications is to try deleting whole ranges of data, and another good one is to try zeroing whole ranges. Unfortunately there are a lot of ranges in a buffer (O(n^2) in fact) and it’s very hard to pick the good ones. Focusing on ranges that snap to example boundaries would help a lot here.

It also potentially gives us a way to deal with the accidental dependency problem, as we can look for pieces that have moved from one example to another in the course of simplification as likely candidates for deletion. This might be too hard to keep track of to be worth it, though, as this hasn’t proven to be a major problem in practice.

In the course of thinking about these ideas I also came up with another idea that I initially thought was ridiculous, but now I’m not so sure and think it might be a good one: prioritizing examples that print less data, as a mitigation for cases where simplicity of the underlying buffer isn’t reflected by simplicity of the example.

The way this would work is that we would still maintain our example search exactly as normal, but we would also maintain a “favoured” example, which is the example with the smallest number of bytes printed, with ties broken by simplicity of the underlying buffer (i.e. if we shrink to an example which prints the same number of bytes, we always replace).
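The favoured-example rule amounts to a lexicographic comparison. A minimal sketch of the key (the name is mine, for illustration):

```python
def favoured_key(printed_length, buffer):
    """Ordering for the favoured example: fewest printed bytes first,
    ties broken by the usual buffer ordering (shorter, then
    lexicographic). Taking min() over candidates with this key picks
    the favoured one; a shrink that prints the same number of bytes
    always replaces, since its buffer is simpler by construction."""
    return (printed_length, len(buffer), buffer)
```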

The reason I think this is a good idea is that most of what we’re focusing on in examples is readability, and size of example is a pretty good proxy for that – large examples are hard to read, small examples are easy to read. It’s not a perfect proxy for it, which is why we still want the underlying simplification process, but I think it’s a good tool for mitigating that process’s potential flaws.

Another thing that would be useful to investigate is whether a smarter search algorithm, e.g. simulated annealing, would help here. The local search approach was a great idea when you had a fairly abstract data representation and needed a simple API that was easy to extend to new things, but when what you’ve got is a concrete binary representation it’s probably not the best search algorithm.

Basically, all of this is harder than I thought. It’s not intractably hard, and for Conjecture itself, which is in C, as long as you’re running it on pretty fast tests you can basically brute force through the problem (although I need to update its simplification). I still think the idea is a great one, but great ideas also need hard work, and that’s still to come.

This entry was posted in programming.

A new approach to property based testing

Edit: I’ve put together a prototype of these ideas called Conjecture. The design document may be a better place to start. I’ve had multiple people say “I had no idea what you were talking about when I read that blog post but it’s suddenly incredibly obvious now I’ve read the design document” or similar.

Do you ever have one of those thought processes where you go:

  1. Hmm. That’s an interesting idea.
  2. Ooh. And then I can do this.
  3. And that.
  4. Holy shit this solves so many problems.
  5. …by invalidating almost all the work I’ve done on this project.
  6. …and a lot of work other people have done over the last few years.
  7. I don’t know whether to be happy or sad.
  8. I think this might change literally everything.

Well, I’m having one right now.

Through a collision of multiple ideas – some mine, some other people’s – I’ve come up with an entirely new backend and API for property based testing. It’s drastically simpler to use, drastically simpler to implement, and works more or less in any language because it requires basically no advanced features. I’m pretty sure I can rebuild Hypothesis on top of it and that in the course of doing so I will be able to throw away approximately half the code. I’m also strongly considering doing a port to C and rebuilding Hypothesis as a set of bindings to that.

Here’s how it works:

We introduce a type TestContext. A TestContext is three pieces of data:

  1. An immutable sequence of bytes.
  2. An index into that sequence.
  3. A file handle, which may be None/NULL/whatever.

Given a test case, we work with testing functions. A testing function is any function which takes a TestContext as its first argument and then does one of three things:

  1. Returns a value
  2. Rejects the TestContext
  3. Fails

And in the course of doing so writes some data to the test context’s file handle.

A test case is then just a testing function which takes no extra arguments and whose return value is ignored. We generate a fresh test context, feed it to the test function, and see what happens. If we get a failure, the test fails. Otherwise we try again until we get some fixed number of examples that are neither failures nor rejections.

The primitive functions we build everything on top of are:

  1. draw_bytes(context, n): If there are at least n bytes left in the buffer starting from the current index, return the next n bytes and then increment the index by n. Otherwise, reject the TestContext.
  2. fail(context): Essentially an ‘assert False’. Mark the current context as a failure.
  3. report(context, message): Print a message to the context’s file handle.
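For concreteness, here is a pure-Python sketch of a TestContext with those three primitives. The exception names are my own choice for how "reject" and "fail" might surface in Python:

```python
class Rejected(Exception):
    pass

class Failed(Exception):
    pass

class TestContext:
    """Minimal sketch of the three primitives described above."""
    def __init__(self, data, handle=None):
        self.data = data      # immutable sequence of bytes
        self.index = 0        # index into that sequence
        self.handle = handle  # file handle, may be None

    def draw_bytes(self, n):
        if self.index + n > len(self.data):
            raise Rejected()  # out of data: reject this context
        result = self.data[self.index:self.index + n]
        self.index += n
        return result

    def fail(self):
        raise Failed()        # essentially an `assert False`

    def report(self, message):
        if self.handle is not None:
            self.handle.write(message + "\n")
```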

Reporting is used to show intermediate values so you can track what your test case does. Some additional state for controlling quality of reporting may also be useful, as are some helper functions (e.g. for my prototype I ended up defining declare(context, name, value), which prints name=value and then returns value).

We can build everything on top of this in the usual way that you would build a random number generator: e.g. you can generate an n-byte integer by drawing n bytes and combining them.
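For example, combining drawn bytes into an integer big-endian (the ordering that turns out to matter for shrinking, as discussed below) might look like this sketch, with a minimal stand-in context for self-containment:

```python
class Context:
    """Minimal stand-in for a test context that serves bytes in order."""
    def __init__(self, data):
        self.data, self.index = data, 0

    def draw_bytes(self, n):
        result = self.data[self.index:self.index + n]
        self.index += n
        return result

def draw_uint(context, n):
    """Combine n drawn bytes into an unsigned integer, big-endian."""
    result = 0
    for byte in context.draw_bytes(n):
        result = (result << 8) | byte
    return result
```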

The important realisation is that this interface supports minimization! Once we have generated a sequence of bytes that produces a failure, we can start to perform standard quickcheck style shrinking on it, but because we’re only interested in sequences of bytes we can do a lot of clever things, specialised to shrinking byte sequences, that are normally not open to us.

So we generate, find a failure, then shrink the failure, then run one final time, this time with a real file handle passed into the test context, so that the log of our minimized test run is printed.

And this seems to work really rather well. It unifies the concepts of strategy and test case, and handles the problem of how to display data that you otherwise have when defining your own generators.

You need to arrange your generators a little carefully, and some special shrinks are useful: For example, in my experiments I’ve found that generating integers in big-endian order is important, but you then also need a shrinking operation that lets you swap adjacent bytes x, y when you have x > y. Given that, integer shrinking appears to work quite well. I haven’t yet got a floating point generator working properly (there’s the obvious one where you generate a word of the appropriate size and then reinterpret it as a float, but this is a terrible distribution which is unable to find interesting bugs).
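The adjacent-byte swap mentioned above is a simple operation to enumerate. A sketch (illustrative naming; a real shrinker would try these candidates against the test rather than just generating them):

```python
def swap_adjacent_descents(buffer):
    """For each adjacent pair x, y with x > y, produce the buffer with
    them swapped. Each candidate is lexicographically smaller, which is
    what lets big-endian integer draws shrink past plateaus."""
    candidates = []
    for i in range(len(buffer) - 1):
        if buffer[i] > buffer[i + 1]:
            swapped = (buffer[:i]
                       + bytes([buffer[i + 1], buffer[i]])
                       + buffer[i + 2:])
            candidates.append(swapped)
    return candidates
```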

I suspect that in general the shrinks this produces will often be a bit worse than those of classic quickcheck, because they don’t have access to the structure, but in some cases it may actually improve matters: there are a lot of places where classic quickcheck shrinking can get stuck in a local optimum, which this essentially allows you to redraw your way out of.

This supports example saving like the normal Hypothesis interface does, and if anything it supports it better because your examples are literally a sequence of bytes and there’s almost nothing to do.

It supports composition of strategies with side effects in the middle, something that is currently incredibly complicated in Hypothesis.

Another nice feature is that it means you can drive your testing with american fuzzy lop if you like, because you can just write a program that reads in a buffer from standard in, feeds it to your test case and sees what happens.
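Such a driver can be very small indeed. A sketch, where run_test is a hypothetical stand-in for feeding the buffer through your test case, and a failure is turned into a genuine crash because that is what american fuzzy lop watches for:

```python
import os
import sys

class Failed(Exception):
    pass

def run_test(data):
    # Hypothetical test case: fails on any buffer starting with a zero byte.
    if data[:1] == b"\x00":
        raise Failed()

def main():
    """Read a buffer from standard in, feed it to the test, and turn a
    failure into abnormal termination for afl to detect."""
    data = sys.stdin.buffer.read()
    try:
        run_test(data)
    except Failed:
        os.abort()
```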

In general, I’m really very excited by this whole concept and think it is the future of both the Hypothesis internals and the API. I don’t yet know what the path to implementing that is, and I’m currently considering doing the C implementation first, but I’ve put together a prototype which seems to work pretty well. Watch this space to see what comes next.

This entry was posted in Hypothesis.