Augmenting shrinkers by inducing regular languages

This is an idea I’m playing with at the moment. I haven’t yet determined if it’s a good idea. My current belief is that it’s one of those ideas in the category of “Clever idea, could work, but too slow to be usable in practice”.

The problem I am trying to solve: I have some predicate \(f\) that takes a binary string and returns true or false. I am trying to find a string \(x\) such that \(f(x)\) is true and \(x\) is of minimal length, and minimal lexicographically amongst the strings of that length (i.e. minimal when comparing (len(x), x) – the shortlex order). Although really we’re not interested in the true minimum and only care about finding something smallish.
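
In Python terms the order being minimized is just a tuple key (candidates here is a stand-in for any collection of strings satisfying \(f\)):

# Shortlex order: shorter strings first, ties broken lexicographically.
def shortlex_key(x):
    return (len(x), x)

best = min(candidates, key=shortlex_key)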

The traditional approach is to use a greedy algorithm. For example, one of the classic approaches in this field is delta debugging, which deletes sub-intervals to try to find a minimal length example (specifically, it guarantees to find an example from which you can’t delete any single character, though it usually does better in practice). This works, and works well, but it has the typical problem of greedy algorithms: it gets stuck in local minima. For example, a problem Hypothesis has in its shrinking right now is that there are often cases where a successful shrink would require both deleting one part of the string and changing the value of another part at the same time.

So there’s a question of how to get out of these local minima.

I’ve spotted one approach recently which I think might be interesting.

It starts from two key observations:

  1. Regardless of the structure of the predicate, the set of strings that satisfy the predicate and are dominated by some fixed string (e.g. our current best example) forms a regular language, because it is a finite language: only finitely many strings are dominated by any given string in this order, and every finite language is regular. This regular language may in practice be prohibitively complicated, but we can always take a guess that it’s simple and bail out early if it turns out not to be.
  2. Given a regular language represented as a deterministic finite automaton, it is extremely fast to get its minimal element: annotate each node with its distance to an accepting node (all edges have weight one, so a breadth-first search backwards from the accepting states is enough – nothing as heavy as Floyd-Warshall is needed), then walk the graph from the start node, at each point picking as the next node the one with the smallest distance, breaking ties by picking the lowest byte. (This is sketched in code below.)
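
To make observation 2 concrete, here is a minimal sketch, assuming a DFA represented as nested dicts mapping state → byte → state (the representation and names here are made up for illustration):

from collections import deque

def minimal_accepted(transitions, accepting, start):
    """Return the shortlex-minimal string accepted by a DFA, or None.

    transitions: {state: {byte: next_state}}, accepting: set of states.
    """
    # Annotate each state with its distance to acceptance via a
    # breadth-first search over the reversed transition graph.
    reverse = {}
    for state, edges in transitions.items():
        for byte, target in edges.items():
            reverse.setdefault(target, []).append(state)
    distance = {state: 0 for state in accepting}
    queue = deque(accepting)
    while queue:
        target = queue.popleft()
        for state in reverse.get(target, ()):
            if state not in distance:
                distance[state] = distance[target] + 1
                queue.append(state)
    if start not in distance:
        return None  # no accepting state is reachable
    # Walk forwards, always moving strictly closer to acceptance,
    # breaking ties by taking the smallest byte.
    result, state = [], start
    while state not in accepting:
        byte, state = min(
            (b, t)
            for b, t in transitions[state].items()
            if distance.get(t) == distance[state] - 1
        )
        result.append(byte)
    return bytes(result)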

So the idea is this: we use our existing shrinker to give us a corpus of positive and negative examples, then we attempt to find a regular language that corresponds to those examples. We do this using a light variant on L* search where we offer only examples from our corpus that we’ve already calculated. As an optimization, we can also use our greedy shrinker to produce a minimized counter-example here (L* search works better the smaller your counter-examples are, and doing a lexicographic minimization keeps the number of prefixes under control).

Once we’ve done that, we get the minimal element of the inferred DFA as above. There are now three possibilities (a sketch of the resulting loop follows the list):

  1. The minimal element is also our current best example. Stop.
  2. The minimal element does not in fact satisfy the predicate. Update our L* search with a new counterexample.
  3. The minimal element does satisfy the predicate and is a strict improvement on our current best example. Start again from here.
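
Here is the sketch of that loop. The learner object stands in for an L*-style learner; its dfa() method (returning transitions, accepting states and start state as above) and add_counterexample() are hypothetical names for illustration, not a real API:

def dfa_assisted_shrink(best, f, learner):
    """Iterate the three cases above until the DFA's minimum is best."""
    while True:
        candidate = minimal_accepted(*learner.dfa())
        if candidate is None or candidate == best:
            return best                            # case 1: stop
        if not f(candidate):
            learner.add_counterexample(candidate)  # case 2: refine the DFA
        else:
            best = candidate                       # case 3: strict improvement;
                                                   # start again from here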

You can also randomly walk the DFA and check whether the result is interesting, as another useful source of counter-examples (sketched below), or indeed rerun the generator that got you the interesting example in the first place. Any way of randomly generating examples can be helpful here, although in the typical case of interest the set of values for which \(f\) is true is very sparse, so I expect the random walk to be the more useful of the two: it has a high probability of generating something that is in our inferred regular language but does not satisfy the predicate. I haven’t really experimented with this part yet.
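
A sketch of that random walk, using the same made-up DFA representation as before:

import random

def random_dfa_walk(transitions, accepting, start,
                    stop_probability=0.1, max_steps=1000):
    """Generate a random string the DFA accepts. If the real predicate
    rejects the result, that is exactly the kind of counter-example
    L* search wants."""
    state, out = start, []
    for _ in range(max_steps):
        if state in accepting and random.random() < stop_probability:
            return bytes(out)
        if not transitions[state]:
            break  # dead state with no outgoing edges
        byte, state = random.choice(sorted(transitions[state].items()))
        out.append(byte)
    return None  # give up rather than walk forever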

At some point the number of states in the DFA will probably get prohibitively large, and this can be used as a termination condition too: if we’ve made progress since we started the L* search, we restart the search from the new best example; otherwise we stop there.

Does this work?

Empirically, based on my prototyping, the answer seems to be “kinda”. My L* implementation appears to be quite slow in practice, but I’ve gotten a lot wrong in the course of writing it, so I’m not sure this isn’t just a bug. Additionally, there are definitely algorithmic improvements you can make to L* search. I’ve seen one that apparently improves the number of tests you have to perform from \(n^2\) to \(n \log n\) (where \(n\) is the size of the final state machine), but I don’t understand it yet (mostly because I don’t quite get what a homing sequence is). There’s also a paper I’ve yet to read on how to infer a non-deterministic finite automaton instead, which I think may be more effective in practice, because converting NFAs to DFAs can cause an exponential blowup in the number of states, and I suspect that may be happening here.

So at the moment the answer is “Ask again later”. But it’s an interesting idea that I think has some promise, so I thought I’d share my preliminary thoughts.


The easy way to get started with property based testing

There seems to be a common impression that property based testing is really for when you have code that manipulates perfect platonic objects with elegant algebraic relationships between them.

And, well, don’t get me wrong. When you do have those, property-based testing really shines. It’s an incredibly powerful way to test them, and it works really well.

But… that’s mostly because this is a realm that is incredibly easy to test compared to the complexity of getting it right, so almost anything you do to improve your testing will help. It’s less that that’s what property-based testing is good for, and more that it’s easy to try new things when you’re playing testing on easy mode.

Here’s a recipe for getting started when you’re not playing testing on easy mode:

  1. Pick a function in your code base that you want to be better tested
  2. Call it with random data

This will possibly require you to figure out how to generate your domain objects. Hypothesis has a pretty extensive library of tools for generating custom types but, if you can, try to start somewhere where the types you need aren’t too complicated to generate.

Chances are actually pretty good that you’ll find something wrong this way if you pick a sufficiently interesting entry point. For example, there’s a long track record of people trying to test interesting properties of their text handling and getting Unicode errors when text() gives them something their code didn’t know how to handle (the astral plane: it’s not just for D&D).

You’ll probably get exceptions here that you don’t care about, e.g. because some arguments to the function may not be valid. Set up your test to ignore those.

So at this point you’ll have something like this:

from hypothesis import given, reject

@given(some(), strategies())
def test_some_stuff(x, y):
    try:
        my_function(x, y)
    except (Ignorable, Exceptions):
        reject()

(reject simply filters out the example – you’re trying to find a large number of examples that don’t raise any of those exceptions).
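
To make that concrete, here is the same template filled in with a real (if toy) target: the standard library’s date constructor, which raises ValueError for arguments it considers invalid. The strategy bounds are arbitrary choices for illustration:

from datetime import date

from hypothesis import given, reject, strategies as st

@given(st.integers(0, 5000), st.integers(0, 13), st.integers(0, 32))
def test_date_construction(year, month, day):
    try:
        d = date(year, month, day)
    except ValueError:
        reject()  # not a valid date: filter this example out
    # Even a trivial property is better than none at all.
    assert (d.year, d.month, d.day) == (year, month, day)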

This is already a pretty good starting point and does have a decent tendency to flush out bugs. You’ll often find cases where you forgot some boundary condition and your code misbehaves as a result. But there’s still plenty of room to improve.

There are now two directions you can go in from here:

  1. Try to assert some things about the function’s result. Anything at all. What type is it? Can it be None? Does it have any relation at all to the input values that you can check? It doesn’t have to be clever – even very trivial properties are useful here.
  2. Start making your code more defensive

The second part is probably the most productive one.

The goal is to turn faulty assumptions in your code into crashes instead of silent corruption of your application state. You can do this in a couple ways:

  1. Add argument checking to your functions (Hypothesis uses a dedicated InvalidArgumentException for this case, but you can raise whatever errors you find appropriate)
  2. Add a whole bunch of assertions into your code itself

Even when it’s hard to reason about formal properties of your code, it’s usually relatively easy to add local properties, and assertions are a great way to encode them. I like this post by John Regehr on the subject.
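
As an illustration, with an entirely made-up function, the two techniques together might look like this:

def moving_average(values, window):
    """Hypothetical example: the mean of each window-length slice."""
    # 1. Argument checking: turn faulty assumptions into crashes at the
    #    boundary rather than silent corruption further in.
    if window <= 0:
        raise ValueError("window must be positive, got %r" % (window,))
    if window > len(values):
        raise ValueError("window is larger than the input")
    result = []
    for i in range(len(values) - window + 1):
        chunk = values[i:i + window]
        # 2. An assertion encoding a local property the loop relies on.
        assert len(chunk) == window
        result.append(sum(chunk) / window)
    return result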

This approach will make your code more robust even if you don’t find any bugs in it during testing (and you’ll probably find bugs in it during testing), and it gives you a nice easy route into property based testing by letting you focus on only one half of the problem at a time.

And, of course, if you’re still having trouble getting started with property-based testing, the other easy way is to persuade your company to hire me for a training course. Drop me an email if you’re interested.


Hypothesis progress is alarming

I had a brilliant idea just now:

08:54 <DRMacIver> shapr: So I’ve been thinking about your point that you have to save pre-shrinking state because you might have found multiple bugs, and I think it’s wrong.
08:54 <DRMacIver> Because I think you’re much more likely to have found those other bugs in the course of shrinking than you are with the original value
08:54 <DRMacIver> So what will happen if you save the pre-shrinking state is that it’ll rerun and go “Yup, that’s fine”
08:54 <DRMacIver> My solution of saving the post-shrinking state is *also* wrong in this case mind you.
08:55 <DRMacIver> I think I have a solution, but it relies pretty heavily on glassbox testing in order to be any good
08:57 <DRMacIver> I have a really frustrating level of future designs for Hypothesis in my head.
08:58 <DRMacIver> (frustrating because I can tell how far I am from implementing it, and every time I get closer I come up with new ones so the goal recedes into the distance)
09:22 <DRMacIver> Ooh. I’ve just had a *great* idea
09:23 <DRMacIver> It not only solves this problem it solves a bunch of historic problems too
09:24 <DRMacIver> Basically 1) save every interesting example in the database. 2) examples loaded from the database which are valid but uninteresting “radioactively decay”. I.e. they get deleted with some probability once they’ve been run

The context: Both Haskell’s QuickCheck and Hypothesis save the last failing example. QuickCheck saves it prior to minimization, Hypothesis saves it post minimization. Because of the issues pointed out in Reducers are Fuzzers, both are the wrong thing to do.

Additionally, Hypothesis has historically had the following problems:

  • What do you do when the database fills up with more examples than you know what to do with in a given run and none of them are interesting?
  • What do you do if the shrinker gets stuck or the process crashes before you’ve saved the example?

This solves both problems: We save every intermediate shrink as we find it and let the garbage collection process deal with the debris that subsequently proves uninteresting.
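
A sketch of what that garbage collection might look like. The database interface here (saved_examples, delete) is entirely hypothetical, as is the decay constant:

import random

DECAY_PROBABILITY = 0.25  # made-up number: tune to taste

def replay_and_decay(db, is_interesting):
    """Replay saved examples; ones that are valid but no longer
    interesting "radioactively decay" with some probability per run."""
    for example in db.saved_examples():
        if is_interesting(example):
            continue  # still triggers a bug: keep it
        if random.random() < DECAY_PROBABILITY:
            db.delete(example)  # decayed: garbage-collect the debris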

The above is a clever idea but is not the point of this post: The point of this post is that I have a growing sense that I am missing something.

Basically there is no reasonable way that I should be making as much progress on Hypothesis, Conjecture, etc as I am. This is not the first brilliant idea – I’ve had dozens. I’ve had so many brilliant ideas that I haven’t had time to implement all of them yet.

This is a bad sign.

There are three ways to make an astonishing amount of progress:

  1. Be wrong about the amount of progress you are making
  2. Be an unparalleled genius
  3. Be where the low hanging fruit is

I do not think I am wrong about the amount of progress I’m making. I will grant that some of my ideas turn out to be bad ones, or modifiable to make them good ones, but I’ve had enough practical validation that I’m reasonably sure that a lot of these concepts work and are useful. I’m probably off by a factor of two or three, but I doubt I’m out by an order of magnitude.

I am not an unparalleled genius. I am smart, but I’m one in a thousand, not one in a million. There are plenty of other people smarter than me, many of them working in similar fields.

So the only real option is that I’m hanging out in an orchard full of low hanging fruit, and I don’t really understand how that could be. Right now, life feels a bit like James Mickens describes in The Slow Winter:

I think that it used to be fun to be a hardware architect. Anything that you invented would be amazing, and the laws of physics were actively trying to help you succeed. Your friend would say, “I wish that we could predict branches more accurately,” and you’d think, “maybe we can leverage three bits of state per branch to implement a simple saturating counter,” and you’d laugh and declare that such a stupid scheme would never work, but then you’d test it and it would be 94% accurate, and the branches would wake up the next morning and read their newspapers and the headlines would say OUR WORLD HAS BEEN SET ON FIRE

I keep coming up with genuinely useful ideas with practical significance that work really well, and it all feels too easy.

There are a number of reasonable hypotheses about how this could be the case:

  1. I had one clever idea that unlocked everything else.
  2. Nobody cares about this problem, so they haven’t bothered to pick the fruit.
  3. I am ignoring giant swathes of prior art that I just don’t know exists, because I’m an outsider to this field and couldn’t find it in a Google search – whether through my own incompetence or the academic firewall of doom.
  4. I am ignoring huge swathes of prior art because nobody has ever bothered to write it down or it’s locked up in proprietary tools.

Obviously the first is the one that I hope it is. To some extent I even have a plausible candidate for it: The idea that having data generators work on an intermediate representation of the final result rather than the final result directly would be useful. But this isn’t actually a very clever idea. It’s proven very fruitful, but it’s sufficiently obvious that it should have occurred to someone else before.

Two seems plausible. E.g. the idea I started out with is only really interesting if you want to integrate this sort of testing with normal testing workflows – which I do, and some other people do, but not many people who get to do full-time research on this do. The Conjecture concept only really matters if you’re trying to make these ideas work in imperative languages. Etc. I’m in enough intersections that it’s at least plausible nobody has cared about this particular intersection before. Additionally, my impression is that random testing isn’t very interesting to academics, and most of the people who have put a lot of work into it are security researchers, who have rather different focuses than I do.

So those are the optimistic scenarios. Both are a little depressing because it suggests there just hasn’t been enough interest in something that massively improves software quality to do even the level of research I’ve put into it, but at least they suggest I’m not wasting my time.

But I really can’t rule out 3 and 4, both of which are worrying.

The prior art I am mostly aware of and have based my work on is:

And I’m semi-aware of, but have consciously decided not to use, the work on concolic testing, etc.

But I can’t help but feel there’s a lot more out there.

So, um, halp. Any prior art you can send me on the general subject of random testing etc. would be very much appreciated. Papers, working software, random blog posts where someone thought of a thing – it’s all good.


Superposition values for testing

Cory was talking about using AFL to fuzz test hyper-h2 the other day.

We talked about the difficulty of building a good starter corpus, and I thought I’d pull out some old code I had for using glassbox to do this.

It proceeded to fail utterly.

The problem is that H2 is a complex protocol that is very hard to probe blindly, due to the number of different interactions. Simply throwing bytes at it just to see what happens is unlikely to do anything useful. This is similar to why, historically, that approach worked well with binaryornot and chardet but was pretty rubbish on pyasn1.

Yesterday evening I thought about this problem a bit more and then started giggling. This is usually a bad sign.

The idea I’d come up with was this: what if we used a custom type to hold off on deciding the values until the last possible moment? That way we can get values that do interesting things to the internals of complex protocols, by looking at the questions the parser asks about the values and deciding what the answers are then, rather than immediately.

The way this works is that you have a mock object that internally is saying “I am one of these values but I don’t know which”. Every time you perform a comparison it picks a possible answer at random and uses that to narrow down the list of possible values. At the end, your program should have behaved as if there had just been a really well chosen initial set of integers.

It turns out to work pretty well based on brief experimentation. My initial prototype didn’t support anything more than comparison operators, but after some further pondering I got it to support arbitrary arithmetic expressions. And thus, I present schroedinteger.

>>> x = schroedinteger({1, 3, 7})
>>> y = schroedinteger({-1, 1})
>>> z = x * y
>>> z
indeterminate: {-7, -3, -1, 1, 3, 7}
>>> x == 1
False
>>> x == 3
False
>>> z
indeterminate: {-7, 7}
>>> z == -7
True
>>> z
-7
>>> y
-1

The way this works is to separate out two concepts: you have observables, which are basically just lists of possible integers that get whittled down over time, and you have schroedintegers, which consist of:

  1. A set of observables they are interested in
  2. A function which maps an assignment of those observables to integers to a concrete integer

So when you perform an arithmetic operation on schroedintegers, it just creates a new one that shares the observables of the two sides and combines their evaluation functions.

Every time you observe something about the system, it looks at the set of possible answers, picks one uniformly at random, collapses the state of possibilities to only those that would have produced that answer, and returns it.
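
A minimal sketch of that structure – emphatically not the real schroedinteger implementation, since this one naively materializes the full cross product of possible “worlds”, which is exactly the kind of cost the real library’s heuristics avoid:

import random

class Universe:
    """The joint set of still-possible assignments ("worlds") to every
    observable created so far."""

    def __init__(self):
        self.worlds = [()]

    def new_observable(self, possible):
        index = len(self.worlds[0])
        self.worlds = [
            w + (v,) for w in self.worlds for v in sorted(possible)]
        return index

    def observe(self, fn):
        # Pick an answer uniformly at random from those still possible,
        # then collapse to the worlds that would have produced it.
        answer = random.choice(sorted({fn(w) for w in self.worlds}))
        self.worlds = [w for w in self.worlds if fn(w) == answer]
        return answer

class SchroedInteger:
    """An integer that is "one of these values, not yet decided which"."""

    def __init__(self, universe, fn):
        self.universe, self.fn = universe, fn

    @classmethod
    def among(cls, universe, possible):
        index = universe.new_observable(possible)
        return cls(universe, lambda world: world[index])

    def __mul__(self, other):
        # Arithmetic stays deferred: just compose evaluation functions.
        return SchroedInteger(
            self.universe, lambda w: self.fn(w) * other.fn(w))

    def __eq__(self, value):
        return self.universe.observe(lambda w: self.fn(w) == value)

# Mirroring the transcript above:
u = Universe()
x = SchroedInteger.among(u, {1, 3, 7})
y = SchroedInteger.among(u, {-1, 1})
z = x * y      # one of {-7, -3, -1, 1, 3, 7}, not yet decided
print(x == 1)  # randomly True or False; either way the worlds shrink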

Performance is… OK. Where by OK I mean “not very good”. The set of heuristics used keep it manageable, but no better than that.

It could be improved if I really cared, but right now this project is a bit of a toy. In particular, most operations are currently O(m * n). Many of these could be fixed quite readily with a little more work – the current implementation is very generic, and many of the operations admit a nicely specific implementation that I haven’t used. E.g. a thing that would be pretty easy to do is to track upper and lower bounds for every schroedinteger and use those to exclude many possibilities.

I also investigated using Z3 to do this. It would be an interesting backend that would remove most of the limitations. Unfortunately the results were really slow and kinda crashy (I had at least one hard-to-minimize segfault), so I’ve given up on that for now.

All told, an interesting result for about a day of hacking. Right now I plan to leave it where it is unless I come up with a particularly interesting use case. Let me know if you have one.



Commons: A board game

I’m overdue for a blog post and half asleep, so you get one of my half-arsed board game design posts instead of something deeper.

The idea is that it’s a game about the tragedy of the commons. It works as follows (as per usual, numbers and details are essentially picked out of a hat, and this has never been play-tested, so it is probably not actually fun. It may however contain the seed of a fun game):

You have a deck of 60 event cards, a large collection of resource tokens, and a starting player crown.

Resource tokens are split into the bank (an essentially unlimited collection of tokens), one pool per player, and one central pool.

Each player starts with ten tokens and the central pool starts with five per player. The crown is randomly assigned to one player to start.

Each pool will go up and down over the course of the game. If a player’s pool is ever depleted (goes down to zero or lower) that player drops out of the game. If the central pool is ever depleted, the game is over and everyone loses.

Play proceeds until there is only one player left (that player wins) or until the deck of cards runs out, at which point the person with the most tokens wins (if there is a tie they are joint winner).

Play occurs in a series of rounds:

Starting from the player with the crown and proceeding clockwise, each player draws two cards. One of them they play in front of them, the other they place face down in the middle.

Once everybody has played, add one additional face-down card to the cards in the middle, shuffle them, and then draw them one at a time, applying their effects to the central pool.

Cards are as follows:

  • 10 x Lose 1 token
  • 10 x Gain 1 token
  • 6 x Lose 2 tokens
  • 6 x Gain 2 tokens
  • 3 x Lose 3 tokens
  • 3 x Gain 3 tokens
  • 1 x Gain 5 tokens
  • 1 x Lose 5 tokens
  • 1 x If you have > 10 tokens, reset to 10, else do nothing
  • 1 x If you have < 10 tokens, reset to 10, else do nothing
  • 2 x Gain one resource token for every 5 you have, rounded up
  • 2 x Lose one resource token for every 5 you have, rounded up [Note: Yes this will kill you if you’re down to 1]
  • 3 x Give each other pool one resource token. If you don’t have enough, deal out all you have remaining, starting with the central pool, then from the player with the crown going clockwise.
  • 3 x Take one resource token from each other pool.
  • 4 x Give 5 resource tokens to the pool with the fewest tokens that isn’t yours [Tie break: prefer the central pool if it’s joint; else prefer the pool clockwise closest to the starting player]
  • 4 x Take 5 resource tokens from the pool with the most tokens, or as many as they have if it’s fewer than 5 [Tie break as above]

Each of these effects comes in a variety of sizes, and the distribution is such that every card is paired with one of the opposite effect. The deck is quite heavily stacked with plain “Gain X / Lose X” cards, with the more complicated cards making up a bit under half the deck.

After a round has been completed, the crown moves one player clockwise and the next round begins.

Design notes

One of the interesting features of this game is that it is very hard to tell what other players are doing. You can observe the effects (“Hmm, your pile is suspiciously large for someone who hasn’t been fucking over the central pile”) but you can’t actually tell who has played what card. There is a huge amount of plausible deniability in every move, especially with the random extra card in the middle.

There’s also no way to target a specific player (there can’t be, because the central pool needs to be able to “play” without making choices).

Overall, the game has ended up as a sort of hilarious head on collision between the tragedy of the commons and the prisoner’s dilemma. I think I like that aspect of it.

Thanks to the other David for helping me think through some of the aspects of this.
