# Language reconstruction based fuzzing without reconstructing languages

Note: This idea has not been implemented yet so may turn out not to work. It all clicked together while I was walking around Paris this afternoon and I’m writing it down while it’s fresh in my mind. I think it will work though.

Secondary note: This post may be too inside baseball for you.

An idea that I’ve been banging my head against recently is that of reconstructing languages corresponding to tests based on some bounded data. I’ve posted a bit about its use for reducing, but I’ve also been interested in its use for fuzzing.

Here’s a problem Hypothesis experiences: The | operator is currently not associative. x | (y | z) will produce a different data distribution than (x | y) | z. This is suboptimal. I could special case it (and for a long time I thought I had, but apparently I dreamed that code), but it’s easy to come up with trivial ways to defeat that. e.g. (x | y).map(lambda x: x) | z should really have the same distribution as x | y | z but doesn’t if you special case |.

But under the hood Hypothesis just works with streams of bytes, so in theory you can represent its data distribution as a finite state machine. You could then randomly walk the state machine reweighting your choices by the number of reachable states and thus get much get better coverage of the set of available paths by reaching rare paths much more often.

Or, somewhat suggestively, here’s another algorithm that could work well and is a lot simpler than reweighting:

1. Pick a state uniformly at random
2. Take the shortest path to that state (or, indeed, any random path to that state).
3. Continue the random walk from there

Unfortunately we run into a major problem with either of these approaches: Reconstructing regular languages for unknown predicates is really hard.

The insight I had this afternoon is this: We don’t need to! We can use the same trick as I used to produce self-optimizing boolean expressions to just store exemplars of distinct states we’ve found. We then just pick a random state we’ve already discovered and run the generation from there (Hypothesis can extend any string to a complete but possibly invalid example).

How does this work?

We maintain three sets of data:

1. A corpus of examples such that we know that every element of this list definitely corresponds to a different state in the state machine for our predicate.
2. A set of previously seen strings (this doesn’t have to be precise and can e.g. be a bloom filter)
3. A binary search tree with branches labelled by strings and leaves labelled by indices into our list of examples.

We populate this as follows:

1. The corpus starts as the empty string
2. The set of previously seen strings contains just the empty string
3. The binary tree is a single leaf pointing to 0 (the index of the empty string).

We then fuzz as follows:

1. Pick a random element of our corpus.
2. Generate an example such that that random element is a strict prefix of the example (it’s easy to do this in Hypothesis).
3. For each proper prefix of that example, perform the incorporate operation.

The incorporate operation when run on a string s is:

1. If the string is in our set of seen strings, stop. Else, add it to the set.
2. Walk the binary tree until we hit a leaf. At each branch look at the label e. If the concatenation of s followed by e is valid, walk right. Else, walk left.
3. Once we are at a node, find the string in our corpus index by it. Call that t. By repeatedly fuzzing extending s and t, attempt to find e such that s + e and t + e produce different results. If we do find such an e, add s to the corpus as a new element and split the current tree node, creating a branch labelled by e and with the left and right children being leaves containing s and t (which one is which depending on which way around the difference occurred). Otherwise, if s is a better example than t (shorter, or same length and lexicographically smaller), replace t in the corpus with s.

The experiments we split on witness that two nodes must be in different states (cf. the Myhill-Nerode theorem), so we guarantee that our corpus consists of unique states. The fuzzing step is probabilistic, so we can sometimes discard a state inappropriately, but this shouldn’t be a major problem in practice – states that we can’t distinguish with high probability might as well be the same from a fuzzing point of view.

Anyway, like I said, it’s currently unclear how useful this is because I haven’t actually implemented it! In particular I suspect it’s more useful for long running fuzzing processes than for the sort of fast fuzzing Hypothesis tries to do. Fortunately I’m planning how to incorporate such workflows into Hypothesis. However I definitely still think it’s a good idea and I’m looking forward to exploring it further.

This entry was posted in Hypothesis, programming on by .