This is an experiment I’ve been running for the last couple of days (on and off and with a bunch of tinkering). It was intended as a prototype for using glassbox in a next-gen version of Hypothesis, but it’s proven interesting in its own right.
The idea is a specific automated way of using a test case reducer as a fuzzer using branch instrumentation (I’m using AFL’s instrumentation via the afl-showmap command): For every branch we ever observe the program taking, we try to construct a minimal example that hits that branch.
This will tend to produce interesting examples because you throw away a lot of extraneous detail that isn’t required to hit that branch. This is particularly true of “tolerant” parsers which try to recover from a lot of errors.
How it works
The core idea is that we take a normal test case reducer and repeatedly apply it in a way that automatically turns it into a multi-objective reducer.
Say we have a function, label, which takes a binary string and returns a set of labels. Labels can be anything, but in the case of using AFL’s instrumentation they’re essentially branches the program can take along with a rough count of how many times that branch was taken (essentially because the branches are hashes so some different branches may end up equated with each other).
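To make that concrete, here is a minimal sketch of what such a label function might look like in Python. It assumes afl-showmap is on the PATH, that the target reads its input from stdin, and that the map file contains one `edge:count` pair per line; the names `parse_showmap` and `label` are made up for illustration and this is not the code used in the experiment.

```python
import os
import subprocess
import tempfile


def parse_showmap(text):
    """Turn 'edge:count' lines into a frozenset of (edge, bucket) labels."""
    labels = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        edge, _, count = line.partition(":")
        labels.add((int(edge), int(count)))
    return frozenset(labels)


def label(data, target_cmd):
    """Run the instrumented target on `data` and return its set of labels.

    `target_cmd` is a list such as ["./parser"] (a hypothetical binary that
    reads from stdin). An empty set is returned if no map file is produced.
    """
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "map")
        subprocess.run(
            ["afl-showmap", "-q", "-o", out, "--", *target_cmd],
            input=data,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        if not os.path.exists(out):
            return frozenset()
        with open(out) as f:
            return parse_showmap(f.read())
```

Each label here is a pair of an edge identifier and a bucketed hit count, which matches the “branch plus rough count” description above.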
We replace the labelling function with a side-effectful version of it which returns the original results but also updates a table which maps each label to its “best” example we’ve seen so far. We consider a string better than another if it is either shorter or the same length but sorts lexicographically before the other (when viewed as a sequence of unsigned 8 bit integers).
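A small sketch of that wrapper, with names of my own choosing (`TrackingLabeller`, `sort_key`, and the toy `distinct_bytes` label function are all illustrative, not part of any real tool). Python compares bytes objects lexicographically by unsigned byte value, so a (length, string) tuple gives exactly the ordering described.

```python
def sort_key(s):
    # Shorter is better; ties are broken lexicographically on unsigned bytes.
    return (len(s), s)


class TrackingLabeller:
    """Wraps a label function and records the best example seen per label."""

    def __init__(self, label):
        self._label = label
        self.best = {}  # label -> best bytes string seen so far

    def __call__(self, s):
        labels = self._label(s)
        for l in labels:
            current = self.best.get(l)
            if current is None or sort_key(s) < sort_key(current):
                self.best[l] = s
        return labels


def distinct_bytes(s):
    """Toy label function: the set of byte values that occur in s."""
    return frozenset(s)


tracker = TrackingLabeller(distinct_bytes)
for example in (b"abc", b"b", b"ca"):
    tracker(example)
```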
We then repeatedly iterate the following process: Pick a label, take the best example for that label, and reduce that test case with respect to the condition that it has that label (updating the other test cases with every call).
There are various heuristics you could use to pick a label. The ones I’ve tried are:
- Pick one of the labels which currently has the best example
- Pick one of the labels which currently has the worst example
- Pick any label, uniformly at random
Uniformly at random seems to work the best: The others have a tendency to get stuck. In the case of ‘best’ there are a lot of small labels and it ends up spending a lot of time trying to shrink them all, not doing very interesting work in the process. In the case of ‘worst’ it tends to spend all its time trying to shrink very hard-to-shrink labels and not getting very far. Uniformly at random seems to consistently make progress and find interesting results.
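Here is a rough sketch of the loop with the uniformly random choice. The reducer is a deliberately naive greedy byte-deletion pass standing in for a real test case reducer, and the label function is a toy (the set of adjacent byte pairs in the string) standing in for real instrumentation; `reduce`, `bigrams` and `fuzz` are illustrative names only.

```python
import random


def sort_key(s):
    return (len(s), s)


def reduce(s, predicate):
    """Greedy single-byte deletion; a stand-in for a real reducer."""
    changed = True
    while changed:
        changed = False
        i = 0
        while i < len(s):
            candidate = s[:i] + s[i + 1:]
            if predicate(candidate):
                s = candidate
                changed = True
            else:
                i += 1
    return s


def bigrams(s):
    """Toy label function: every pair of adjacent bytes in s."""
    return frozenset(s[i:i + 2] for i in range(len(s) - 1))


def fuzz(label, seed, rounds, rng):
    best = {}

    def tracked(s):
        # Side-effectful labelling: update the table on every call.
        labels = label(s)
        for l in labels:
            if l not in best or sort_key(s) < sort_key(best[l]):
                best[l] = s
        return labels

    tracked(seed)
    for _ in range(rounds):
        # Pick any label uniformly at random (sorted for reproducibility).
        target = rng.choice(sorted(best))
        # Reduce its best example subject to "still has this label".
        reduce(best[target], lambda s: target in tracked(s))
    return best


best = fuzz(bigrams, b"abcd", rounds=200, rng=random.Random(0))
```

Even on this toy, reducing for one label turns up strings with labels the seed never had (deleting a byte creates a new adjacent pair), which is the effect the whole approach relies on.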
There are a couple of extra useful things you can do to speed up the process.
The first is that every time label is called you can mark the string as known. Then when shrinking instead of shrinking by whether the string has the label, you shrink by whether the string is either the current best for the label or is unknown.
This works because if the string were simpler than the current best and already known, then the current best would already have been updated to that string.
This is the equivalent of caching the predicate for delta-debugging, but you don’t want to cache the label function because its outputs are complex values (they’re sets of labels, so there are \(2^n\) distinct values even after interning) so they end up consuming a lot of memory if you cache them.
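One way to read this in code, as a sketch rather than the actual implementation: keep a set of known strings, and only run the program for strings that are not in it. For a known string the predicate can be answered from the table alone. `Tracker`, `run` and `predicate_for` are hypothetical names, and the `calls` counter is only there to show when the program would really have been executed.

```python
def sort_key(s):
    return (len(s), s)


class Tracker:
    def __init__(self, label):
        self.label = label
        self.best = {}      # label -> best string seen so far
        self.known = set()  # every string the label function has been run on
        self.calls = 0      # number of real executions

    def run(self, s):
        self.calls += 1
        self.known.add(s)
        labels = self.label(s)
        for l in labels:
            if l not in self.best or sort_key(s) < sort_key(self.best[l]):
                self.best[l] = s
        return labels

    def predicate_for(self, target):
        def predicate(s):
            if s in self.known:
                # No execution needed: a known string is only acceptable
                # if it is the current best for the target.
                return self.best.get(target) == s
            return target in self.run(s)
        return predicate


def bigrams(s):
    """Toy label function: every pair of adjacent bytes in s."""
    return frozenset(s[i:i + 2] for i in range(len(s) - 1))


tracker = Tracker(bigrams)
tracker.run(b"abcd")
has_ab = tracker.predicate_for(b"ab")
```

The only state kept per string is membership in a set, rather than the full set of labels it produced.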
The second is that you can often tell when a label is going to be useless to shrink and skip it. There are two things you can do here:
- If when you tried to shrink a label it made no changes, you can mark that label as ‘finished’. If another shrink later improves the label, you remove the finished mark. A finished label cannot be shrunk further and thus can be skipped.
- By maintaining a counter that is updated every time a label is improved or added to the table, you can tell if an attempt to shrink did anything at all by checking the counter before and after. If it did nothing, you can mark the string as finished. Any labels whose current best string is finished can also be skipped.
This also gives a way of terminating the fuzz when there’s nothing left that’s discoverable: If every label is skippable, you’re done.
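Putting the two skip rules and the termination check together, a sketch might look like the following. As before, the greedy deletion reducer and the `bigrams` label function are toy stand-ins, and every name (`Fuzzer`, `step`, `skippable`, `improvements`) is mine rather than anything from the real prototype.

```python
import random


def sort_key(s):
    return (len(s), s)


def reduce(s, predicate):
    """Greedy single-byte deletion; a stand-in for a real reducer."""
    changed = True
    while changed:
        changed = False
        i = 0
        while i < len(s):
            candidate = s[:i] + s[i + 1:]
            if predicate(candidate):
                s = candidate
                changed = True
            else:
                i += 1
    return s


def bigrams(s):
    """Toy label function: every pair of adjacent bytes in s."""
    return frozenset(s[i:i + 2] for i in range(len(s) - 1))


class Fuzzer:
    def __init__(self, label, seed, rng):
        self.label = label
        self.rng = rng
        self.best = {}
        self.known = set()
        self.improvements = 0          # bumped on every table change
        self.finished_labels = set()
        self.finished_strings = set()
        self.run(seed)

    def run(self, s):
        self.known.add(s)
        labels = self.label(s)
        for l in labels:
            if l not in self.best or sort_key(s) < sort_key(self.best[l]):
                self.best[l] = s
                self.improvements += 1
                # An improved label is no longer finished.
                self.finished_labels.discard(l)
        return labels

    def skippable(self, l):
        return (l in self.finished_labels
                or self.best[l] in self.finished_strings)

    def step(self):
        """Do one reduction. Returns False once every label is skippable."""
        todo = sorted(l for l in self.best if not self.skippable(l))
        if not todo:
            return False
        target = self.rng.choice(todo)
        start = self.best[target]
        before = self.improvements

        def predicate(s):
            if s in self.known:
                return self.best.get(target) == s
            return target in self.run(s)

        reduce(start, predicate)
        if self.best[target] == start:
            self.finished_labels.add(target)   # this label did not shrink
        if self.improvements == before:
            self.finished_strings.add(start)   # nothing at all changed
        return True


fuzzer = Fuzzer(bigrams, b"abcd", random.Random(0))
while fuzzer.step():
    pass
```

On the toy label function the loop stops by itself once every pair has been reduced to a two-byte example. With real instrumentation I would expect termination to take far longer, if it is reached at all.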
This seems to work quite well in practice. Starting from a relatively large initial example, it quickly increases the number of labels by about an order of magnitude (some of these are just differences in branch counts, as AFL counts not just whether the branch was hit but also a bucketed version of how many times).
It also works pretty well at finding bugs. I’ve been running it for about 48 hours total (a bit longer by clock time but I turned it off in the middle while I made some changes) and it’s found two bugs in a widely deployed file format parser that’s been stable for a couple of years (I’ve sent both to the author of the parser, and don’t want to say which one it is until I’ve got permission to do so. I don’t think either of them is a security issue, but it’s hard to say for sure). One of them is confirmed novel, and I haven’t heard back about the other one yet. It found the first one after about 10 hours, but that appears to have been mostly luck – rerunning with a bunch of changes that otherwise improved the process hasn’t refound that bug yet.
Anecdotally, almost all of the examples produced are not valid instances of the format (i.e. the tool under test exits with a non-zero status code). This isn’t very surprising: The expectation is that it will give you just enough of the file to get you to the point you’re looking for and then throw away the rest, which is unlikely to get you a valid file unless the branch you’re looking for is taken after the file validity has already been determined.
Comparison with AFL
In some ways this is obviously quite similar to AFL, given that it uses the same instrumentation, but in other ways it’s quite different. My suspicion is that overall this approach will work better as an approach to providing a corpus to AFL than it will just on its own, but it’s surprisingly competitive even without that.
In particular it seems like it hits an order of magnitude increase in the number of seen labels much faster than I would expect AFL to. I think it helps that it’s using AFL’s instrumentation much more extensively than AFL itself actually does – AFL just uses the instrumentation for novelty detection, whereas this approach actually treats each label as a target in its own right and thus can take much more advantage of it.
The AFL algorithm is roughly just to repeatedly iterate the following:
- Pick an example from the corpus and mutate it
- If the mutated example exhibits any labels that we’ve not previously seen, add it to the corpus
It’s not really interested in the labels beyond novelty detection, and it doesn’t ever prune the corpus down to smaller examples like this does.
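For contrast, the two-step loop above can be sketched as follows. This is only the simplified description given here, not AFL’s real implementation: the `mutate` function is a hypothetical single-byte mutator and `novelty_fuzz` is an illustrative name.

```python
import random


def mutate(s, rng):
    """Hypothetical mutator: replace, insert, or delete one random byte."""
    data = bytearray(s)
    op = rng.choice(["replace", "insert", "delete"])
    if op == "insert" or not data:
        data.insert(rng.randrange(len(data) + 1), rng.randrange(256))
    elif op == "replace":
        data[rng.randrange(len(data))] = rng.randrange(256)
    else:
        del data[rng.randrange(len(data))]
    return bytes(data)


def novelty_fuzz(label, seed, rounds, rng):
    corpus = [seed]
    seen = set(label(seed))
    for _ in range(rounds):
        candidate = mutate(rng.choice(corpus), rng)
        labels = label(candidate)
        if not labels <= seen:
            # Something new was observed: keep the example as it is.
            seen |= labels
            corpus.append(candidate)
    return corpus, seen


def bigrams(s):
    """Toy label function: every pair of adjacent bytes in s."""
    return frozenset(s[i:i + 2] for i in range(len(s) - 1))


corpus, seen = novelty_fuzz(bigrams, b"abcd", rounds=100, rng=random.Random(0))
```

Note that the corpus only ever grows, and an entry is never swapped for a smaller string with the same labels.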
This approach also has a “self-optimizing” character that AFL lacks: Because AFL never replaces examples in its corpus, if you start with large examples you’re stuck with large examples forever. Because of this, AFL encourages you to start with very small, fast examples. This approach on the other hand will take whatever large examples you throw at it and will generally turn them into small examples.
To be clear: This isn’t a better approach than AFL. Maybe if it were highly optimized, tuned and refined it would become at least as good, but even then they would both have strengths and weaknesses compared to each other. But it’s not obviously a worse approach either, and even now it has some interesting advantages over the approach that AFL takes.