Category Archives: programming

Buy me books!

A thing people sometimes ask me is how they can donate to Hypothesis. Historically the answer has been that they can’t – I’d rather people pay for services, support contracts, etc. If companies want to donate I can definitely figure something out for you, but individual donations aren’t enough to make the difference between sustainability and not, and honestly I feel a little weird about them.

I probably will arrange some better funding options like this, but in general I’d much rather have companies pay me than individuals.

But it occurred to me yesterday that a) There are a lot of books that it would be useful for me to have and b) For no obviously sensible reason, I do not feel at all weird about people buying me books. People buying me books is great.

So, to the point! I now have a public amazon wish list for people who want to say thanks for Hypothesis. You can buy me things off it if you’d like to do so. You shouldn’t feel obligated to – it’s totally fine if you can’t afford to or don’t want to – but obviously it would be super appreciated if you do.

Things to note:

  • For now at least it will be entirely books. I may add non-book things later but I’ve no real plans to.
  • I’m always up for more book recommendations to add to the list.
  • It’ll give me your name and a note you can add or not as you prefer, but not share either of our addresses with the other.
  • The general theme of the list is mostly going to be my research interests and other things I can reasonably claim are work related. These are quite Hypothesis centric, so buying me books from it will tend to have useful consequences for Hypothesis development, but no promises.
  • Everyone who buys me something will of course get a public thanks (unless you ask me not to).
  • Naturally, you’re welcome to buy me things off this list as thanks for things other than Hypothesis.

Somewhat to my surprise, so far it’s gone really well! Rob Smallshire has bought me Debug It. I’ve also been bought Combinatorial Optimization and Growing Object Oriented Software, but I won’t know by whom until they arrive (I only know this by process of elimination. Amazon doesn’t tell me anything until I get the book).


Augmenting shrinkers by inducing regular languages

This is an idea I’m playing with at the moment. I haven’t yet determined if it’s a good idea. My current belief is that it’s one of those ideas in the category of “Clever idea, could work, but too slow to be usable in practice”.

The problem I am trying to solve: I have some predicate \(f\) that takes a binary string and returns true or false. I am trying to find a string \(x\) such that \(f(x)\) is true and \(x\) is of minimal length, and minimal lexicographically amongst the strings of minimal length (i.e. is minimal when comparing (len(x), x)). Although really we’re not interested in the true minimum and only care about smallish.

The traditional approach is to use a greedy algorithm. For example, one of the classic approaches in this field is delta debugging, which deletes sub-intervals to attempt to find a minimal length example (specifically it guarantees it finds an example such that you can’t delete any single character from it, though it usually does better in practice). This works and works well, but has the typical problem of greedy algorithms that it gets stuck in local minima. For example, a problem Hypothesis has in its shrinking right now is that there are often cases where a successful shrink would require both deleting one part of the string and changing the value of another part at the same time.
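For concreteness, here is a minimal sketch of that kind of greedy interval deletion, assuming f is the predicate from above. It is illustrative only – not how delta debugging or Hypothesis’s shrinker is actually implemented:

def greedy_shrink(x, f):
    # Keep deleting sub-intervals of x, halving the chunk size on each
    # pass, for as long as f still returns True on the result.
    improved = True
    while improved:
        improved = False
        k = len(x)
        while k > 0:
            i = 0
            while i + k <= len(x):
                candidate = x[:i] + x[i + k:]
                if f(candidate):
                    x = candidate
                    improved = True
                else:
                    i += 1
            k //= 2
    return x

At termination no single character (nor any larger interval of the sizes tried) can be deleted without making f false, which matches the guarantee described above – but like any greedy approach it can get stuck in a local minimum.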

So there’s a question of how to get out of these local minima.

I’ve spotted one approach recently which I think might be interesting.

It starts from two key observations:

  1. Regardless of the structure of the predicate, the set of strings that satisfy the predicate and are dominated by some fixed string (such as our current best example) forms a regular language, because it is a finite language and every finite language is regular. This regular language may in practice be prohibitively complicated, but we can always take a guess that it’s simple and bail out early if it turns out not to be.
  2. Given a regular language represented as a deterministic finite automaton, it is extremely fast to get the minimal element of it – annotate each node with its distance to a terminal node (e.g. using a Floyd-Warshall variant) then walk the graph, at each point picking the next node as the one with the smallest value, breaking ties by picking lower bytes (there’s a sketch of this just below).

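As a sketch of observation 2, here is roughly what that walk looks like. The DFA representation is an assumption made up for illustration – a dict mapping each state to a dict of byte to next state, plus a set of accepting states, with every state appearing as a key – and a plain reverse BFS is enough for the single-target distances, though any shortest-path algorithm would do:

from collections import deque

def minimal_element(dfa, start, accepting):
    # Distance from each state to the nearest accepting state.
    dist = {s: None for s in dfa}
    reverse = {}
    for s, edges in dfa.items():
        for b, t in edges.items():
            reverse.setdefault(t, []).append(s)
    queue = deque(accepting)
    for s in accepting:
        dist[s] = 0
    while queue:
        t = queue.popleft()
        for s in reverse.get(t, []):
            if dist[s] is None:
                dist[s] = dist[t] + 1
                queue.append(s)

    # Walk from the start state, always moving to a successor with the
    # smallest remaining distance, breaking ties by taking the lowest byte.
    result = bytearray()
    state = start
    while state not in accepting:
        best = min(dist[t] for t in dfa[state].values() if dist[t] is not None)
        byte = min(b for b, t in dfa[state].items() if dist[t] == best)
        result.append(byte)
        state = dfa[state][byte]
    return bytes(result)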
So the idea is this: we use our existing shrinker to give us a corpus of positive and negative examples, then we attempt to find a regular language that corresponds to those examples. We do this using a light variant on L* search where we offer only examples from our corpus that we’ve already calculated. As an optimization, we can also use our greedy shrinker to produce a minimized counter-example here (L* search works better the smaller your counter-examples are, and doing a lexicographic minimization keeps the number of prefixes under control).

Once we’ve done that, we use the DFA we’ve built to get its minimal element. There are now three possibilities (the resulting loop is sketched after the list):

  1. The minimal element is also our current best example. Stop.
  2. The minimal element does not in fact satisfy the predicate. Update our L* search with a new counterexample.
  3. The minimal element does satisfy the predicate and is a strict improvement on our current best example. Start again from here.

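Putting the pieces together, the loop looks roughly like this. The learner here stands in for an L*-style implementation, and its methods (hypothesis_dfa, add_counterexample) are invented names rather than any real library’s API; minimal_element is the sketch from earlier:

def regular_language_shrink(f, best, learner):
    while True:
        dfa, start, accepting = learner.hypothesis_dfa()
        candidate = minimal_element(dfa, start, accepting)
        if candidate == best:
            # Case 1: the inferred language's minimum is our current best. Stop.
            return best
        if not f(candidate):
            # Case 2: the DFA was wrong about this string; use it to refine.
            learner.add_counterexample(candidate)
        else:
            # Case 3: a genuine, strictly smaller interesting example.
            # Continue (or restart the search) from here.
            best = candidate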
You can also randomly walk the DFA and check if the result is interesting as another useful source of counter-examples, or indeed rerun the generator that got you the interesting example in the first place. Any way of randomly generating examples can be helpful here, although in the typical case of interest the set of true values for \(f\) is very sparse, so I expect the first form to be more useful because it has a high probability of generating something that is in our regular language but does not satisfy our predicate. I haven’t really experimented with this part yet.

At some point the number of states in the DFA will probably get prohibitively large. This can be used to terminate the shrink too. If we’ve made progress since we started the L* search we can restart the search from a new shrink, otherwise we stop there.

Does this work?

Empirically, based on my prototyping, the answer seems to be “Kinda”. My L* implementation appears to be quite slow in practice, but I’ve gotten a lot wrong in the course of writing it so I’m not sure that this isn’t just a bug. Additionally, there are definitely algorithmic improvements you can make to L* search. I’ve seen one that apparently improves the number of tests you have to perform from \(n^2\) to \(n \log(n)\) (where \(n\) is the size of the final state machine), but I don’t understand it yet (mostly because I don’t quite get what a homing sequence is). Additionally there’s a paper I’ve yet to read on how to infer a non-deterministic finite automaton instead, which I think may be more effective in practice (because converting NFAs to DFAs can cause an exponential blowup in the number of states, which I think may be happening here).

So at the moment the answer is “Ask again later”. But it’s an interesting idea that I think has some promise, so I thought I’d share my preliminary thoughts.


The easy way to get started with property based testing

There seems to be a common impression that property based testing is really for when you have code that manipulates perfect platonic objects with elegant algebraic relationships between them.

And, well, don’t get me wrong. When you do have those, property-based testing really shines. It’s an incredibly powerful way to test them, and it works really well.

But… that’s mostly because this is a realm that is incredibly easy to test compared to the complexity of getting it right, so almost anything you can do to improve your testing will help it. It’s less that that’s what property based testing is good for, and more that it’s easy to try new things when you’re playing testing on easy mode.

Here’s a recipe for getting started when you’re not playing testing on easy mode:

  1. Pick a function in your code base that you want to be better tested
  2. Call it with random data

This will possibly require you to figure out how to generate your domain objects. Hypothesis has a pretty extensive library of tools for generating custom types, but if you can, try to start somewhere where the types you need aren’t too complicated to generate.
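For example, if your function takes a domain object rather than primitives, something like builds() will often do. Order here is a made-up domain class, purely for illustration:

from hypothesis import strategies as st

class Order:
    def __init__(self, quantity, note):
        self.quantity = quantity
        self.note = note

# builds() draws values from the given strategies and calls the
# constructor with them.
orders = st.builds(Order, quantity=st.integers(min_value=1), note=st.text())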

Chances are actually pretty good that you’ll find something wrong this way if you pick a sufficiently interesting entry point. For example, there’s a long track record of people trying to test interesting properties with their text handling and getting unicode errors when text() gives them something that their code didn’t know how to handle (The astral plane: It’s not just for D&D).

You’ll probably get exceptions here you don’t care about. e.g. some arguments to functions may not be valid. Set up your test to ignore those.

So at this point you’ll have something like this:

from hypothesis import given, reject

# some(), strategies(), my_function, Ignorable and Exceptions are all
# placeholders: use the strategies that generate your function's
# arguments and the exception types you've decided to ignore.
@given(some(), strategies())
def test_some_stuff(x, y):
    try:
        my_function(x, y)
    except (Ignorable, Exceptions):
        reject()

(reject simply filters out the example – you’re trying to find a large number of examples that don’t raise any of those exceptions).

This is already a pretty good starting point and does have a decent tendency to flush out bugs. You’ll often find cases where you forgot some boundary condition and your code misbehaves as a result. But there’s still plenty of room to improve.

There are now two directions you can go in from here:

  1. Try to assert some things about the function’s result. Anything at all. What type is it? Can it be None? Does it have any relation at all to the input values that you can check? It doesn’t have to be clever – even very trivial properties are useful here (there’s a sketch just after this list).
  2. Start making your code more defensive

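To make the first direction concrete, the earlier template might grow into something like this – the strategies, exception types, my_function and ExpectedType are still placeholders for your own code:

from hypothesis import given, reject

@given(some(), strategies())
def test_some_stuff(x, y):
    try:
        result = my_function(x, y)
    except (Ignorable, Exceptions):
        reject()
    else:
        # Even very trivial properties are worth asserting.
        assert result is not None
        assert isinstance(result, ExpectedType)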
The second part is probably the most productive one.

The goal is to turn faulty assumptions in your code into crashes instead of silent corruption of your application state. You can do this in a couple ways:

  1. Add argument checking to your functions (Hypothesis uses a dedicated InvalidArgument exception for this case, but you can raise whatever errors you find appropriate)
  2. Add a whole bunch of assertions into your code itself

Even when it’s hard to reason about formal properties of your code, it’s usually relatively easy to add local properties, and assertions are a great way to encode them. I like this post by John Regehr on the subject.
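As a hedged sketch of what that looks like in practice – the function, its checks and the error type are all made up for illustration; use whatever error type suits your code base:

import datetime

class InvalidArgument(Exception):
    """Raised when a caller passes something nonsensical."""

def schedule_meeting(start, duration_minutes):
    # 1. Argument checking: faulty inputs crash loudly and immediately.
    if not isinstance(start, datetime.datetime):
        raise InvalidArgument("start must be a datetime")
    if duration_minutes <= 0:
        raise InvalidArgument("duration_minutes must be positive")
    end = start + datetime.timedelta(minutes=duration_minutes)
    # 2. Assertions: encode the local properties the rest of the code
    #    relies on, so broken assumptions fail here rather than later.
    assert end > start, "a meeting must end after it starts"
    return end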

This approach will make your code more robust even if you don’t find any bugs in it during testing (and you’ll probably find bugs in it during testing), and it gives you a nice easy route into property based testing by letting you focus on only one half of the problem at a time.

And, of course, if you’re still having trouble getting started with property-based testing, the other easy way is to persuade your company to hire me for a training course. Drop me an email if you’re interested.


Hypothesis progress is alarming

I had a brilliant idea just now:

08:54 <DRMacIver> shapr: So I’ve been thinking about your point that you have to save pre-shrinking state because you might have found multiple bugs, and I think it’s wrong.
08:54 <DRMacIver> Because I think you’re much more likely to have found those other bugs in the course of shrinking than you are with the original value
08:54 <DRMacIver> So what will happen if you save the pre-shrinking state is that it’ll rerun and go “Yup, that’s fine”
08:54 <DRMacIver> My solution of saving the post-shrinking state is *also* wrong in this case mind you.
08:55 <DRMacIver> I think I have a solution, but it relies pretty heavily on glassbox testing in order to be any good
08:57 <DRMacIver> I have a really frustrating level of future designs for Hypothesis in my head.
08:58 <DRMacIver> (frustrating because I can tell how far I am from implementing it, and every time I get closer I come up with new ones so the goal recedes into the distance)
09:22 <DRMacIver> Ooh. I’ve just had a *great* idea
09:23 <DRMacIver> It not only solves this problem it solves a bunch of historic problems too
09:24 <DRMacIver> Basically 1) save every interesting example in the database. 2) examples loaded from the database which are valid but uninteresting “radioactively decay”. I.e. they get deleted with some probability once they’ve been run

The context: Both Haskell’s QuickCheck and Hypothesis save the last failing example. QuickCheck saves it prior to minimization, Hypothesis saves it post minimization. Because of the issues pointed out in Reducers are Fuzzers, both are the wrong thing to do.

Additionally, Hypothesis has historically had the following problems:

  • What do you do when the database fills up with more examples than you know what to do with in a given run and none of them are interesting?
  • What do you do if the shrinker gets stuck or the process crashes before you’ve saved the example?

This solves both problems: We save every intermediate shrink as we find it and let the garbage collection process deal with the debris that subsequently proves uninteresting.
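A minimal sketch of that scheme, assuming a made-up database API (fetch_all, delete, save) rather than Hypothesis’s real one:

import random

def run_saved_examples(db, is_interesting, decay_probability=0.5):
    # Replay everything in the database; examples that are no longer
    # interesting "radioactively decay" with some probability.
    for example in list(db.fetch_all()):
        if is_interesting(example):
            continue
        if random.random() < decay_probability:
            db.delete(example)

def on_shrink_improvement(db, example):
    # Save every intermediate interesting example as soon as it's found,
    # not just the final shrunk one.
    db.save(example)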

The above is a clever idea but is not the point of this post: The point of this post is that I have a growing sense that I am missing something.

Basically there is no reasonable way that I should be making as much progress on Hypothesis, Conjecture, etc as I am. This is not the first brilliant idea – I’ve had dozens. I’ve had so many brilliant ideas that I haven’t had time to implement all of them yet.

This is a bad sign.

There are three ways to make an astonishing amount of progress:

  1. Be wrong about the amount of progress you are making
  2. Be an unparalleled genius
  3. Be where the low hanging fruit is

I do not think I am wrong about the amount of progress I’m making. I will grant that some of my ideas turn out to be bad ones, or modifiable to make them good ones, but I’ve had enough practical validation that I’m reasonably sure that a lot of these concepts work and are useful. I’m probably off by a factor of two or three, but I doubt I’m out by an order of magnitude.

I am not an unparalleled genius. I am smart, but I’m one in a thousand, not one in a million. There are plenty of other people smarter than me, many of them working in similar fields.

So the only real option is that I’m hanging out in an orchard full of low hanging fruit, and I don’t really understand how that could be. Right now, life feels a bit like James Mickens describes in The Slow Winter:

I think that it used to be fun to be a hardware architect. Anything that you invented would be amazing, and the laws of physics were actively trying to help you succeed. Your friend would say, “I wish that we could predict branches more accurately,” and you’d think, “maybe we can leverage three bits of state per branch to implement a simple saturating counter,” and you’d laugh and declare that such a stupid scheme would never work, but then you’d test it and it would be 94% accurate, and the branches would wake up the next morning and read their newspapers and the headlines would say OUR WORLD HAS BEEN SET ON FIRE

I keep coming up with genuinely useful ideas with practical significance that work really well, and it all feels too easy.

There are a number of reasonable hypotheses about how this could be the case:

  1. I had one clever idea that unlocked everything else.
  2. Nobody cares about this problem, so they haven’t bothered to pick the fruit.
  3. I am ignoring giant swathes of prior art that I just don’t know exist because I’m an outsider to this field and couldn’t find them in a Google search, whether through my own lack of search competence or the academic firewall of doom.
  4. I am ignoring huge swathes of prior art because nobody has ever bothered to write it down or it’s locked up in proprietary tools.

Obviously the first is the one that I hope it is. To some extent I even have a plausible candidate for it: The idea that having data generators work on an intermediate representation of the final result rather than the final result directly would be useful. But this isn’t actually a very clever idea. It’s proven very fruitful, but it’s sufficiently obvious that it should have occurred to someone else before.

Two seems plausible. e.g. the idea I started out with is only really interesting if you want to integrate this sort of testing with normal testing workflows, which I do and some other people do, but not many people who get to do full-time research on this do. The Conjecture concept only really matters if you’re trying to make these ideas work in imperative languages, etc. I’m in enough intersections that it’s at least plausible that nobody has cared about this intersection before. Additionally, my impression is that random testing isn’t very interesting to academics, and most of the people who have put a lot of work into it are security researchers, who have rather different focuses than I do.

So those are the optimistic scenarios. Both are a little depressing because it suggests there just hasn’t been enough interest in something that massively improves software quality to do even the level of research I’ve put into it, but at least they suggest I’m not wasting my time.

But I really can’t rule out 3 and 4, both of which are worrying.

The prior art I am mostly aware of and have based my work on is:

And I’m semi-aware of, but have consciously decided not to use, the work on concolic testing, etc.

But I can’t help but feel there’s a lot more out there.

So, um, halp. Any prior art you can send me on the general subject of random testing, etc. would be very appreciated. Papers, working software, or random blog posts where someone thought of a thing – it’s all good.


Superposition values for testing

Cory was talking about using AFL to fuzz test hyper-h2 the other day.

We talked about the difficulty of building a good starter corpus, and I thought I’d pull out some old code I had for using glassbox to do this.

It proceeded to fail utterly.

The problem is that H2 is a complex protocol which is very hard to probe blindly, due to the number of different interactions. Simply throwing bytes at it just to see what happens is unlikely to do anything useful. This is similar to why, historically, that approach worked well with binaryornot and chardet, but was pretty rubbish on pyasn1.

Yesterday evening I thought about this problem a bit more and then started giggling. This is usually a bad sign.

The idea I’d come up with was this: what if we used a custom type to hold off on deciding the values until the last possible minute? That way we can get values that do interesting things to the internals of complex protocols, by looking at the questions that the parser asks about the values and deciding what the answers are then, rather than immediately.

The way this works is that you have a mock object that internally is saying “I am one of these values but I don’t know which”. Every time you perform a comparison it picks a possible answer at random and uses that to narrow down the list of possible values. At the end, your program should have behaved as if there had just been a really well chosen initial set of integers.

It turns out to work pretty well based on brief experimentation. My initial prototype didn’t support anything more than comparison operators, but after some further pondering I got it to support arbitrary arithmetic expressions. And thus, I present schroedinteger.

>>> x = schroedinteger({1, 3, 7})
>>> y = schroedinteger({-1, 1})
>>> z = x * y
>>> z
indeterminate: {-7, -3, -1, 1, 3, 7}
>>> x == 1
False
>>> x == 3
False
>>> z
indeterminate: {-7, 7}
>>> z == -7
True
>>> z
-7
>>> y
-1

The way this works is to separate out two concepts: you have observables, which are basically just a list of possible integers that get whittled down over time, and you have schroedintegers, which consist of:

  1. A set of observables they are interested in
  2. A function which maps an assignment of those observables to integers to a concrete integer

So when you perform arithmetic operations on schroedintegers it just creates a new one that shares the observables of the two sides and whose function evaluates both sides and combines their results.

Every time you observe something about the system it looks at the set of possible answers, picks an answer uniformly at random from the results, and then collapses the state of possibilities to only those that would have produced that answer, and then returns it.
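As a very stripped-down sketch of that design – this is not schroedinteger’s actual code or API (the real constructor just takes a set of values), and for simplicity shared observables aren’t deduplicated here:

import itertools
import random

class Observable:
    def __init__(self, values):
        self.possible = set(values)   # whittled down as answers are observed

class SchroedInt:
    def __init__(self, observables, evaluate):
        self.observables = observables   # observables this value depends on
        self.evaluate = evaluate         # assignment tuple -> concrete integer

    def __mul__(self, other):
        n = len(self.observables)
        return SchroedInt(
            self.observables + other.observables,
            lambda a: self.evaluate(a[:n]) * other.evaluate(a[n:]),
        )

    def __eq__(self, value):
        assignments = list(itertools.product(
            *(sorted(o.possible) for o in self.observables)))
        # Pick one of the possible answers uniformly at random...
        answer = random.choice(sorted({self.evaluate(a) == value
                                       for a in assignments}))
        # ...then collapse each observable to the values consistent with it.
        consistent = [a for a in assignments
                      if (self.evaluate(a) == value) == answer]
        for i, obs in enumerate(self.observables):
            obs.possible &= {a[i] for a in consistent}
        return answer

def schroedint(values):
    # e.g. x = schroedint({1, 3, 7}); y = schroedint({-1, 1}); z = x * y
    obs = Observable(values)
    return SchroedInt([obs], lambda a: a[0])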

Performance is… OK. Where by OK I mean “not very good”. The set of heuristics used keep it manageable, but no better than that.

It could be improved if I really cared, but right now this project is a bit of a toy. In particular, most operations are currently O(m * n). Many of these could quite readily be fixed with a little more work – currently the implementation is very generic, and many of the operations admit a nicely specific implementation that I haven’t used. e.g. a thing that would be pretty easy to do is to track upper and lower bounds for every schroedinteger and use those to exclude many possibilities.

I also investigated using Z3 to do this. It would be an interesting backend that would remove most of the limitations. Unfortunately the results were really slow and kinda crashy (I had at least one hard-to-minimize segfault), so I’ve given up on that for now.

All told, an interesting result for about a day of hacking. Right now I plan to leave it where it is unless I come up with a particularly interesting use case. Let me know if you have one.

