Category Archives: Python

A sketch of a new simplify API for Hypothesis

One of the decisions I’m quite happy I made is that the SearchStrategy API in Hypothesis is not public for 1.0. This gives me some freedom to break it when I figure out ways to make it better.

Which is good, because I’ve already figured out a way to break it to make it better.

In my fuzzer for yapf, I realised an interesting thing: The minimization happens in two phases, first line based, then character based. Once character based simplification has started we stop bothering with line based simplification altogether: Although in principle character based simplification could get us back into a state where line based simplification would be useful again, in practice this happens so rarely that going back to line based simplification is basically never worth it.

Meanwhile, back in Hypothesis land, I was staring at the following quality test this morning:

# Hypothesis 1.x era API: strategy() builds a strategy from a
# descriptor such as [bytes].
from hypothesis import assume, given, strategy


def dedupe(xs):
    result = []
    for x in xs:
        if x not in result:
            result.append(x)
    return result


@given(strategy([bytes]).map(dedupe))
def is_sorted(xs):
    assume(len(xs) >= 10)
    assert sorted(xs) == xs

This is an example that is designed to fail and mostly exists to test the quality of the minimization: What’s the simplest list of byte strings it can find such that:

  • The list only has distinct elements
  • The list has at least 10 elements
  • The list is not sorted

It’s not hard to find such an example – generally speaking Hypothesis will find one within the first couple of throws of the dice – but producing a good example is harder, and that’s what this test looks for (in the actual test we make a bunch of assertions about the quality of the result we get at the end).

The problem is that sometimes this can get into a bad state where it takes an extraordinarily long time to minimize.

The cause is basically this: Every time it successfully minimizes an example, the simplification process starts again from scratch on the new example. That means it retries all the usual simplifications for deleting elements from the list. These are a complete waste of time, because they always shrink the list and so violate the size assumption. But every time we shrink an element (and there are 10 elements to shrink) we do it all again.

I was originally thinking of trying to fix this by making simplify skip simplifications which fail to produce examples too often, but that’s not really very satisfactory and requires a lot of delicate state tracking. It might be worth it in some cases, but I don’t think it would be here.

Then I realised that this problem fits entirely into the approach I mentioned using for yapf, and that the same approach can fix it.

Right now Hypothesis essentially follows the classic Quickcheck API, where there is a function shrink which takes a value and provides a generator over simpler versions of it. Simplified, the Hypothesis example finding algorithm looks like:

def find_minimal(x, f):
    for s in shrink(x):
        if f(s):
            return find_minimal(s, f)
    return x

(That’s not actual Hypothesis code. The real implementation is more complex in several ways and doesn’t use recursion because Python, but that’s the idea of it.)

The idea is to instead change this so that there are multiple shrinkers, each behaving like our shrink function above, and then we apply each shrinker in turn, never returning to an earlier one. This would be something akin to:

def minimize_with_shrinker(x, f, shrinker):
    for s in shrinker(x):
        if f(s):
            return minimize_with_shrinker(s, f, shrinker)
    return x


def find_minimal(x, f):
    for shrinker in shrinkers:
        x = minimize_with_shrinker(x, f, shrinker)
    return x

This lets us fix the list example by splitting up shrinking into two phases: First it tries to shrink by removing elements from the list. Then it tries to shrink by shrinking individual elements. Once it starts shrinking elements, the size of the list is fixed – it’s already found the smallest list it’s going to.
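To make the two phases concrete, here is a minimal sketch of what the two shrinkers for the list example might look like (hypothetical code of mine, not Hypothesis’s actual shrinkers):

def shrink_by_deletion(xs):
    # Phase 1: try removing one element at a time.
    for i in range(len(xs)):
        yield xs[:i] + xs[i + 1:]

def shrink_bytes(value):
    # A crude simplifier for a single byte string.
    if value:
        yield b''
        yield value[:len(value) // 2]
        yield value[1:]

def shrink_elements(xs):
    # Phase 2: the length is now fixed, so only simplify elements in place.
    for i, x in enumerate(xs):
        for s in shrink_bytes(x):
            yield xs[:i] + [s] + xs[i + 1:]

shrinkers = [shrink_by_deletion, shrink_elements]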

This does impact the quality of the examples a little. If necessary that could be fixed by iterating to a fixed point: we run the above algorithm repeatedly until it no longer changes the result (a simpler version would be to just run it twice). I don’t plan to do this unless the results from a single pass turn out to be much worse than I expect. The goal here isn’t always to find the single simplest example, only to find a simple enough example in a short amount of time. I think a single pass will give entirely acceptable quality, and its running time should compare very favourably to the existing implementation.
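For illustration, iterating to a fixed point would look something like this (a sketch, not code I’m committing to):

def find_minimal_to_fixpoint(x, f):
    # Rerun the whole shrinker pipeline until a full pass makes no progress.
    while True:
        shrunk = find_minimal(x, f)
        if shrunk == x:
            return x
        x = shrunk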


27 bugs in 24 hours

So yesterday I started investigating a project called yapf. It’s a code formatter for Python based on the clang-format algorithm. I love code formatters, and am a big fan of clang-format, so I’m really excited about this project. This is not sarcasm.

I started using it, and then I reported 25 bugs in it in the course of a day (I actually filed 31 issues, but that includes 2 feature requests and 4 where I’d made a mistake by using the wrong version of yapf or an unsupported version of Python). I also filed a few bugs this morning from ongoing running of my fuzzing software. I’m still ridiculously excited about it, but I think I’ll give them a little while to iron the kinks out before I switch over from pyformat.

This shouldn’t be taken as a serious indictment of yapf. It is buggy, but that’s because it’s an alpha release of something that is trying to solve a really hard problem. The majority of software which has to deal with something as complicated as a programming language is full of bugs (I’ve run into four bugs in PyPy and three in CPython in the course of my work on Hypothesis), and early releases like this are where you shake out the low-hanging fruit. The authors have been great, and have already fixed a number of the bugs I’ve filed.

But I’d like to talk a little bit about how I found so many bugs.

The answer is, of course: by using Hypothesis!

Well, not really. I actually used two things from Hypothesis to do this, but neither of them involved running it. The first was the code base; the second was some of the general ideas about how to write fuzzers and minimizers that I’ve learned while writing it. This sort of thing isn’t at all hard to do, but it’s a lot more obvious that it’s not hard when you’ve got some experience with the general category of solution.

I considered using Hypothesis proper, but it’s not currently very good for this sort of bug finding expedition where you want to keep going once you’ve found a bug. I have some plans for how to support that, but they’re way in the future.

I’ll go through what I did instead in roughly chronological order.

Using Hypothesis

This is the first thing I tried, and what got me interested in doing this in the first place.

Hypothesis is mostly auto-formatted, and is checked for PEP 8 compliance. The entire code base is a fixed point of pyformat (in non-aggressive mode), and runs flake8 clean (with a few exceptions for files which are designed for Python 2/3 compatibility and for testing some introspection on terrible source code).

I wanted to investigate whether I could replace pyformat with yapf in this pipeline, so I just swapped it out. flake8 promptly screamed at me about a thousand pep8 violation errors yapf had produced on the previously clean code.

So I did what any good open source citizen would do and patiently went through the list of errors reported by the pep8 checker and manually minimized a source code example for each violation. This gave me about a dozen bugs, which I duly reported.

Using Django and Pyformat

I figured, if I’d found a bunch of bugs running yapf on a small (a bit under 10kloc) project like Hypothesis, I’d probably find some interesting things running it on a much larger one. What’s a large project? I know! Django!

Here is the methodology I used:

  1. Run pyformat -ia on the whole django source tree. Go do something useful for the roughly half an hour this took to run.
  2. Commit the result locally so I don’t have to do that again oh god
  3. Run the pep8 checker on the entire source tree
  4. Go make a cup of tea
  5. Note any errors that are still reported by pep8. These are errors we’re going to suppress later.
  6. Run yapf on the entire source tree.
  7. Run pep8 again, suppressing any errors we found in step 5
  8. Delight in the smorgasbord of test cases with which we have been presented

Basically the invariant I am testing for here is that yapf should not introduce any new pep8 errors in source it operates on: It’s fine if it can’t fix a pep8 error (I’d like it to be able to fix all pep8 errors, but it’s not necessarily a bug if it doesn’t), but if it introduces new ones it’s not doing its job correctly.
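In code, the check looks something like this (a simplified sketch of mine rather than the actual script – the pep8 and yapf command lines are real, but comparing only error codes is a crude approximation):

import re
import subprocess

def pep8_error_codes(path):
    # Run the pep8 checker and collect the set of error codes it reports.
    result = subprocess.run(['pep8', path], capture_output=True, text=True)
    return set(re.findall(r': ([EW]\d+)', result.stdout))

def new_errors_after_yapf(path):
    before = pep8_error_codes(path)
    subprocess.run(['yapf', '-i', path])  # reformat the file in place
    after = pep8_error_codes(path)
    return after - before  # anything new here is a yapf bug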

So I went through all the error categories for each output produced and hand-minimized an example out of each one. This produced a bunch of interesting new errors. I also had to delete a few files and minimize examples out of them to debug some yapf crashes.

Writing a fuzzer

At this point I was having a lot of fun, but I was getting a bit bored of doing this by hand. Moreover, I literally work on a tool for generating and minimizing examples that break invariants as my day job. Doing all of this by hand was basically letting the side down.

As previously mentioned, Hypothesis isn’t that well suited to this problem, so I wrote a fuzzer by hand. It’s not hard to develop one that works quite well for a given problem very quickly.

I had no interest in attempting to generate random valid Python source, but fortunately, as demonstrated earlier, there’s more than enough interesting existing Python source code in the wild to trigger bugs. So the design of the fuzzer is to “generate” random Python simply by drawing from a collection of source files you give it and seeing what happens on them.

The invariant is the same as before, but more fine grained: For each file we run pep8 on the file, then we run yapf, then we check if there are any new pep8 errors that were not present in the original file. Additionally, if yapf crashes that’s interesting too.

There’s then the world’s shonkiest source code minimizer. It proceeds entirely string-wise: it generates candidate strings, runs ast.parse on each of them, and throws away any that don’t parse correctly.

It first minimizes line-wise, and then once it can no longer find any line-wise simplifications that work it minimizes character-wise. It stops when it’s found an example under some user-definable maximum (I mostly used 140 characters. Examples should fit in a tweet!).

Line-wise minimization:

  1. Try all strings that consist of some leading segment of the lines in the file, starting from the smallest
  2. Try all strings that consist of some trailing segment of the lines in the file, starting from the smallest.
  3. Try deleting interior blocks of lines, starting from blocks equal to about half the number of lines in the file and shrinking downwards

Once we can no longer shrink the file this way, we proceed character-wise, doing basically the same thing as step 3 only with blocks of characters instead of lines.
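To make that concrete, here is a rough sketch of the line-wise pass (hypothetical code in the spirit of the minimizer, not its actual implementation):

import ast

def line_shrinks(source):
    # Yield smaller variants of source, simplest first.
    lines = source.splitlines(True)
    n = len(lines)
    # 1. Leading segments of lines, smallest first.
    for k in range(1, n):
        yield ''.join(lines[:k])
    # 2. Trailing segments of lines, smallest first.
    for k in range(1, n):
        yield ''.join(lines[-k:])
    # 3. Delete interior blocks, largest blocks first.
    block = n // 2
    while block >= 1:
        for start in range(n - block + 1):
            yield ''.join(lines[:start] + lines[start + block:])
        block //= 2

def valid_shrinks(source):
    # Discard any candidate that is no longer valid Python.
    for candidate in line_shrinks(source):
        try:
            ast.parse(candidate)
        except SyntaxError:
            continue
        yield candidate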

Empirically, what I found was that we run out of line-based simplifications when the source is down to around the 500–1000 character mark, and the character-based minimization is then enough to take it the rest of the way to tweet-sized. For large files this was pretty slow, but only on the order of minutes at most, so it wasn’t a major problem.

I then put this all together into an algorithm which I’ve not seen elsewhere, but that’s probably more a function of my lack of familiarity with the literature than a sign of its novelty.

Basically the idea is this: For each pep8 error, we have three states it can be in:

  1. known and minimized, meaning we have a very small example of this error
  2. known but not minimized, meaning we have any example at all of this error
  3. unknown

Whenever we consider a file and find a pep8 error produced by yapf that was previously unknown, we change its state to “known but not minimized” and save it as an example of a file exhibiting that error.

Whenever we consider a file and find a pep8 error produced by yapf that was previously known but not minimized, we check if the file is smaller than our best example of that error. If it is, we replace the example with the current file being considered.
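The bookkeeping for this is tiny; something like the following sketch (the names here are mine, not the fuzzer’s):

best_example = {}  # pep8 error code -> smallest source seen exhibiting it
minimized = set()  # error codes in the known and minimized state

def record(code, source):
    # Update the state for one yapf-introduced error found in source.
    if code in minimized:
        return
    if code not in best_example or len(source) < len(best_example[code]):
        best_example[code] = source  # now known but not minimized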

Now, the loop is as follows: In a random order, consider each file we have been provided with.

After considering each file, check if we have any errors in the “known but not minimized” state.

If we do, we proceed to minimize them.

This works as follows: We take our current best example. If it is already smaller than the required size, we mark the error as known and minimized. If not, we consider all source code shrinks of it; as soon as we find one that also exhibits the error, we replace the best example with it and start again. If no valid shrink exhibits the error, we mark the example as known and minimized. In theory this can result in an example larger than the desired maximum size, but I’ve never actually seen that happen.
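As a sketch (again with hypothetical names – valid_shrinks is the generator from the earlier sketch, and exhibits_error would rerun the yapf/pep8 check):

def minimize_error(example, exhibits_error, max_size=140):
    # Shrink example until it is small enough or until no shrink works.
    while len(example) > max_size:
        for candidate in valid_shrinks(example):
            if exhibits_error(candidate):
                example = candidate  # smaller example found, start over
                break
        else:
            # No valid shrink exhibits the error, so stop here. This is
            # how the result can in theory stay above max_size.
            break
    return example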

An important note! While we are minimizing for one error, the sub-examples we consider may exhibit other errors: they may be smaller examples of known but not minimized errors (this is very common, because large source files are likely to exhibit multiple errors), or they may be entirely new errors (this usually doesn’t happen, but it can occur as an artifact of the minimizer producing some really weird but technically valid code; I’ve only actually seen it once). The main benefit of recording these is performance – if we’ve spent ages trying to minimize some huge source file, we might as well reap the benefits when we encounter the next error.

An additional detail: If we crash yapf while minimizing for a pep8 error, we mark that example as not exhibiting the error, but we also try to reduce it to a minimal example that produces the crash and note that. Unfortunately I don’t have any good deduplication for crashes like I do for pep8 errors, where there’s a discrete set of codes and each different one can be considered differently interesting, so when this happens I have to check the error manually and try to figure out what’s going on. I also have it set up to skip minimizing crashes where the program took more than 2 seconds to run, as these tend to be pathological memory usage issues (I have everything set up to run with a ulimit of 2GB after one too many runaway processes in my life crashing my laptop). I’ve reported one of these and that’s enough for now.

One final detail and then I’m done: A trick I figured out early on, which has been super useful for debugging issues, is that every valid source file I ever consider gets saved into a directory called trashcan (a better name would have been recyclingbin). Each file is saved with a hash of its contents as its name, so the collection is automatically deduplicated. This was very useful for, among other things, resuming halfway through a hung minimization. It’s also an interesting source of weird Python code. The trashcan is about 150MB on my system right now, but it’s so heavily redundant that a tar.bz2 of it is just over 2MB.
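The trashcan trick is only a few lines; roughly this (a sketch – the exact hash function doesn’t matter):

import hashlib
import os

def save_to_trashcan(source, directory='trashcan'):
    # Name each file by a hash of its contents so duplicates collapse.
    os.makedirs(directory, exist_ok=True)
    name = hashlib.sha1(source.encode('utf-8')).hexdigest() + '.py'
    with open(os.path.join(directory, name), 'w') as f:
        f.write(source)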

This is more or less it. About half of the bugs I’ve reported have been from fuzzer output, I think (I haven’t actually checked, but that’s about the right fraction). Most of those have been from this version – I briefly experimented with fuzzing yapf/pyformat interactions, but the results weren’t very interesting.

What’s next?

I think I’m mostly done, to be honest. I’ve given the people responsible for yapf more than enough issues to work through, and I should probably give them a chance to catch up. Also I have to actually get back to working on Hypothesis. I’ve got the 1.0 release to handle today (there’s not much to handle: I need to tag a release, update the readme and send some emails), and there’s a whole bunch of post-1.0 work to do.

One thing I probably will try as a follow-up to this, when I next have a free/bored moment, is to come up with a set of minimal pep8 violations: basically run this entire fuzzer stack without yapf in the middle and just minimize examples subject to the constraint that they generate a particular error. The result would be an interesting data set for testing various Python tools on, and it would probably also be fun to read through. But maybe I have an odd idea of fun.


Simplify starts from the wrong end

This is a thing I just noticed when working out some performance bugs in Hypothesis in prep for the 1.0 release: In a lot of cases, I’ve been doing simplification (shrinking in Quickcheck terms) entirely backwards.

It turns out that if I’d paid attention, this information was in Quickcheck all along – I haven’t found it in the paper (OK, I haven’t looked in the paper), but it’s what the API does.

Basically, if you write a function simplify which takes a value and gives you something to iterate over the simpler versions of that value, you will be tempted to start from things that are most similar to that value and work your way down. This is totally wrong. The correct order is to start from the simplest version and work your way up.

The reason for this is that it will result in far fewer calls to simplify in most cases: Most of the time your values will be a lot more complicated than you need them to be, and you will end up with a lot of recursive calls to simplify if you start from the top. If you start from the bottom you will very rapidly converge on the simplest thing that can possibly work.
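A toy integer shrinker makes the difference concrete (illustrative code of mine, not Hypothesis’s; assume a find_minimal style loop, as in the first post above, that recurses on the first candidate which still fails the test):

def shrink_simplest_first(n):
    # Correct order: 0 first, then progressively more similar values.
    if n > 0:
        yield 0
        yield n // 2
        yield n - 1

def shrink_similar_first(n):
    # Tempting but wrong order: most similar values first.
    if n > 0:
        yield n - 1
        yield n // 2
        yield 0

# Minimizing 10**6 against f = lambda k: k >= 1000: the simplest-first
# order bisects its way down in a few thousand calls, while the
# similar-first order succeeds at n - 1 every round and so decrements
# roughly a million times.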

I can’t guarantee I’ll catch all the instances of my doing this before the 1.0 release, but I should catch all the instances where it’s causing significant performance problems. In particular, simplifying strings is now much faster.
