Category Archives: Hypothesis

27 bugs in 24 hours

So yesterday I started investigating a project called yapf. It’s a code formatter for Python based on the clang-format algorithm. I love code formatters, and am a big fan of clang-format, so I’m really excited about this project. This is not sarcasm.

I started using it, and then I reported 25 bugs in it in the course of a day (I actually filed 31 issues, but that includes 2 feature requests and 4 where I’d made a mistake by using the wrong version of yapf or an unsupported version of Python). I also filed a few more bugs this morning from continuing to run my fuzzing software. I’m still ridiculously excited about it, but I think I’ll give them a little while to iron the kinks out before I switch over from pyformat.

This shouldn’t be taken as a serious indictment of yapf. It is buggy, but that’s because it’s an alpha release of something that is trying to solve a really hard problem. The majority of software which has to deal with something as complicated as a programming language is full of bugs (I’ve run into four bugs in pypy and three in cpython in the course of my work on Hypothesis), and early releases like this are where you shake out the low-hanging fruit. The authors have been great, and have already fixed a number of the bugs I’ve filed.

But I’d like to talk a little bit about how I found so many bugs.

The answer is, of course, by using Hypothesis!

Well, not really. I actually used two things from Hypothesis to do this, but neither of them involved running it. The first thing I used was the code base; the second was some of the general ideas about how to write fuzzers and minimizers that I’ve learned while writing it. This sort of thing isn’t at all hard to do, but it’s a lot more obvious that it’s not hard to do when you’ve got some experience with the general category of solution.

I considered using Hypothesis proper, but it’s not actually currently very good for this sort of bug finding expedition where you want to keep going once you’ve found a bug. I have some plans for how to support that, but they’re way in the future.

I’ll go through what I did instead in roughly chronological order.

Using Hypothesis

This is the first thing I tried, and what got me interested in doing this in the first place.

Hypothesis is mostly auto-formatted, and is checked for pep8 compliance. The entire code base is a fixed point of pyformat (in non-aggressive mode), and runs flake8 clean (with a few exceptions for files which are designed for Python 2/3 compatibility and testing some introspection on terrible source code).

I wanted to investigate whether I could replace pyformat with yapf in this pipeline, so I just swapped it out. flake8 promptly screamed at me about a thousand pep8 violation errors yapf had produced on the previously clean code.

So I did what any good open source citizen would do and patiently went through the list of errors reported by the pep8 checker and manually minimized a source code example for each violation. This gave me about a dozen bugs, which I duly reported.

Using Django and Pyformat

I figured, if I’d found a bunch of bugs running yapf on a small (a bit under 10kloc) project like Hypothesis, I’d probably find some interesting things running it on a much larger one. What’s a large project? I know! Django!

Here is the methodology I used:

  1. Run pyformat -ia on the whole django source tree. Go do something useful for the roughly half an hour this took to run.
  2. Commit the result locally so I don’t have to do that again oh god
  3. Run the pep8 checker on the entire source tree
  4. Go make a cup of tea
  5. Note any errors that are still reported by pep8. These are errors we’re going to suppress later.
  6. Run yapf on the entire source tree.
  7. Run pep8 again, suppressing any errors we found in step 5
  8. Delight in the smorgasbord of test cases with which we have been presented

Basically the invariant I am testing for here is that yapf should not introduce any new pep8 errors in source it operates on: It’s fine if it can’t fix a pep8 error (I’d like it to be able to fix all pep8 errors, but it’s not necessarily a bug if it doesn’t), but if it introduces new ones it’s not doing its job correctly.
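In code, the check boils down to something like the following sketch. This isn’t the script I actually ran; it assumes the pep8 and yapf command line tools are on your path, and it just diffs the sets of error codes before and after:

import os
import re
import subprocess
import tempfile

def pep8_error_codes(path):
    # pep8 prints one violation per line, e.g. "foo.py:1:80: E501 line too long".
    output = subprocess.run(['pep8', path], capture_output=True, text=True).stdout
    return set(re.findall(r': ([EW]\d+)', output))

def errors_introduced_by_yapf(path):
    # The invariant: yapf must not introduce pep8 errors the original file
    # didn't already have. Returns the set of newly introduced error codes.
    before = pep8_error_codes(path)
    formatted = subprocess.run(['yapf', path], capture_output=True, text=True).stdout
    with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as tmp:
        tmp.write(formatted)
    try:
        return pep8_error_codes(tmp.name) - before
    finally:
        os.unlink(tmp.name)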

So I went through all the error categories for each output produced and hand-minimized an example out of each one. This produced a bunch of interesting new errors. I also had to delete a few files and minimize examples out of them to debug some yapf crashes.

Writing a fuzzer

At this point I was having a lot of fun, but I was getting a bit bored of doing this by hand. Moreover, I literally work on a tool for generating and minimizing examples that break invariants as my day job. Doing all of this by hand was basically letting the side down.

As previously mentioned, Hypothesis isn’t that well suited to this problem, so I wrote a fuzzer by hand. It’s not hard to develop one that works quite well for a given problem very quickly.

I had no interest in attempting to generate random valid Python source, but fortunately, as demonstrated earlier, there’s more than enough interesting existing Python source code in the wild to trigger bugs. So the design of the fuzzer is to “generate” random Python simply by drawing from a collection of source files you give it and seeing what happens when yapf runs on them.

The invariant is the same as before, but more fine grained: For each file we run pep8 on the file, then we run yapf, then we check if there are any new pep8 errors that were not present in the original file. Additionally, if yapf crashes that’s interesting too.

There’s then the world’s shonkiest source code minimizer. It works entirely string-wise: it generates candidate shrinks as strings, then runs ast.parse on each of them and throws away any that don’t parse correctly.

It first minimizes line-wise and then once it can no longer find any line-wise simplifications that work it minimizes character-wise. It stops when it’s found an example under some user definable maximum (I mostly used 140 characters. Examples should fit in a tweet!).

Line-wise minimization:

  1. Try all strings that consist of some leading segment of the lines in the file, starting from the smallest
  2. Try all strings that consist of some trailing segment of the lines in the file, starting from the smallest.
  3. Try deleting interior blocks of lines, starting from blocks equal to about half the number of lines in the file and shrinking downwards

Once we can no longer shrink the file this way, we proceed character-wise, doing basically the same thing as step 3 only with blocks of characters instead of lines.
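Here’s a rough sketch of that minimizer. It’s not the actual code, and still_interesting is a stand-in for re-running the yapf/pep8 check and asking whether the same error is still introduced, but the shape of it is about right:

import ast

def is_valid_python(source):
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def line_shrinks(source):
    # Candidate shrinks, line-wise: leading segments and trailing segments
    # (smallest first), then interior block deletions (largest blocks first).
    lines = source.splitlines(True)
    n = len(lines)
    for i in range(1, n):
        yield ''.join(lines[:i])
    for i in range(1, n):
        yield ''.join(lines[n - i:])
    block = n // 2
    while block >= 1:
        for start in range(0, n - block):
            yield ''.join(lines[:start] + lines[start + block:])
        block //= 2

def minimize(source, still_interesting, max_size=140):
    # Greedily take the first valid shrink that still exhibits the bug.
    # A character-wise pass (same idea, with blocks of characters) would
    # follow once no line-wise shrink works any more.
    while len(source) > max_size:
        for candidate in line_shrinks(source):
            if is_valid_python(candidate) and still_interesting(candidate):
                source = candidate
                break
        else:
            break
    return source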

Empirically what I found was that we ran out of line-based simplifications when the source was down to around the 500-1000 character mark, and then the character-based minimization was enough to take it the rest of the way to tweet sized. For large files this was pretty slow, but only on the order of minutes at most, so it wasn’t a major problem.

I then put this all together into an algorithm which I’ve not seen elsewhere but that’s probably more a function of my lack of familiarity with the literature than it is a sign of its novelty.

Basically the idea is this: For each pep8 error, we have three states it can be in:

  1. known and minimized, meaning we have a very small example of this error
  2. known but not minimized, meaning we have any example at all of this error
  3. unknown

Whenever we consider a file and find a pep8 error produced by yapf that was previously unknown, we change its state to “known but not minimized” and save it as an example of a file exhibiting that error.

Whenever we consider a file and find a pep8 error produced by yapf that was previously known but not minimized, we check if the file is smaller than our best example of that error. If it is, we replace the example with the current file being considered.

Now, the loop is as follows: In a random order, consider each file we have been provided with.

After considering each file, check if we have any errors in the “known but not minimized” state.

If we do, we proceed to minimize them.

This works as follows: We take our current best example. If this is already smaller than the required size, we mark the error as known and minimized. If not, we consider all source code shrinks of it. As soon as we find one that also exhibits the error, we replace the best example with it and start again. If no valid shrinks exhibit the error, we mark the example as known and minimized. In theory this can result in an example larger than the desired maximum size, but I’ve never actually seen that happen.
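As pseudocode-ish Python, the bookkeeping looks roughly like this. find_errors and shrink stand in for the yapf/pep8 check and the minimizer sketched above, and none of the names are from the real script:

import random

UNKNOWN, KNOWN, MINIMIZED = 'unknown', 'known but not minimized', 'known and minimized'

class ErrorTracker(object):
    def __init__(self):
        self.state = {}          # pep8 error code -> one of the three states
        self.best_example = {}   # pep8 error code -> smallest source seen so far

    def record(self, code, source):
        # Called whenever yapf introduces `code` when run on `source`.
        if self.state.get(code, UNKNOWN) == UNKNOWN:
            self.state[code] = KNOWN
            self.best_example[code] = source
        elif self.state[code] == KNOWN and len(source) < len(self.best_example[code]):
            self.best_example[code] = source

    def fuzz(self, sources, find_errors, shrink, max_size=140):
        sources = list(sources)
        random.shuffle(sources)
        for source in sources:
            for code in find_errors(source):
                self.record(code, source)
            # After each file, minimize anything still "known but not minimized".
            for code, state in list(self.state.items()):
                if state == KNOWN:
                    self.best_example[code] = shrink(self.best_example[code], code, max_size)
                    self.state[code] = MINIMIZED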

An important note! While we are minimizing one error, the shrunk examples we consider may exhibit other errors: They may be smaller examples of known but not minimized errors (this is very common, because large source files are more likely to exhibit multiple errors), or they may be entirely new errors (this usually doesn’t happen, but can happen as an artifact of the fact that the minimizer produces some really weird but technically valid code. I’ve only actually seen it happen once). The main benefit of this is performance – if we’ve spent ages trying to minimize some huge source file we might as well reap the benefits when we encounter the next error.

An additional detail: If we crash yapf while minimizing to produce a PEP8 error, we mark this example as not exhibiting that error, but we try to reduce it to a minimal example that produces a crash and note that. Unfortunately I don’t have any good deduplication for crashes like I do for pep8 where there’s a discrete set of errors and every different one can be considered differently interesting, so if this happens I have to check the error manually and try to figure out what’s going on. I also have it set up to skip minimizing crashes where the program took more than 2 seconds to run, as these tend to be pathological memory usage issues (I have everything set up to run with a ulimit of 2GB after one too many runaway processes in my life crashing my laptop). I’ve reported one of these and that’s enough for now.

One final detail and then I’m done: A trick I figured out early on which has been super useful for debugging issues is that every valid source file I ever consider gets saved into a directory called trashcan (a better name would have been recyclingbin). It’s saved with a hash of its contents as its name, so it’s automatically deduped. This was very useful for, among other things, replaying a minimization that hung halfway through. It’s also an interesting source of weird Python code. The trashcan is about 150MB on my system right now, but it’s so heavily redundant that a tar.bz2 of it is just over 2MB.
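The trashcan trick is only a few lines. Roughly (a sketch, using the content-hash naming described above):

import hashlib
import os

TRASHCAN = 'trashcan'

def stash(source):
    # Every valid source file ever considered gets saved, named by a hash of
    # its contents so duplicates collapse to a single file on disk.
    os.makedirs(TRASHCAN, exist_ok=True)
    name = hashlib.sha1(source.encode('utf-8')).hexdigest() + '.py'
    with open(os.path.join(TRASHCAN, name), 'w') as f:
        f.write(source)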

This is more or less it. About half of the bugs I’ve reported have been from fuzzer output, I think (I haven’t actually checked, but that’s about the right fraction). Most of them have been from this version – I briefly experimented with fuzzing yapf/pyformat interactions, but the results weren’t very interesting.

What’s next?

I think I’m mostly done to be honest. I’ve given the people responsible for yapf more than enough issues to work through, and I should probably give them a chance to catch up. Also I have to actually get back to working on Hypothesis. I’ve got the 1.0 release to handle today (there’s not much to handle. I need to tag a release, update the readme and send some emails), and there’s a whole bunch of post 1.0 work to do.

One thing I probably will try as a follow up to this when I next have a free/bored moment is that I think it would be interesting to come up with a set of minimal pep8 violations: Basically run this entire fuzzer stack without yapf in the middle and just minimize examples subject to the constraint that they generate a particular error. The result would be an interesting data set for testing various Python tools on, and it would probably also be fun to read through. But maybe I have an odd idea of fun.


Simplify starts from the wrong end

This is a thing I just noticed when working out some performance bugs in Hypothesis in prep for the 1.0 release: In a lot of cases, I’ve been doing simplification (shrinking in Quickcheck terms) entirely backwards.

It turns out that if I’d paid attention this information was in Quickcheck all along – I haven’t found it in the paper (ok I haven’t looked in the paper) but this is what the API does.

Basically, if you write a function simplify which takes a value and gives you something to iterate over the simpler versions of that value, you will be tempted to start from things that are most similar to that value and work your way down. This is totally wrong. The correct order is to start from the simplest version and work your way up.

The reason for this is that it will result in far fewer calls to simplify in most cases: Most of the time your values will be a lot more complicated than you need them to be, and you will end up with a lot of recursive calls to simplify if you start from the top. If you start from the bottom you will very rapidly converge on the simplest thing that can possibly work.
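For a concrete (made up) illustration with integers, simplest-first looks something like this:

def simplify_integer(value):
    # A sketch of simplest-first shrinking, not Hypothesis's actual simplifier:
    # try 0, then the positive twin of a negative, then probe upwards towards
    # the original value, so the first candidate that still fails the test is
    # as simple as possible.
    if value == 0:
        return
    yield 0
    if value < 0:
        yield -value
    probe = 1
    while probe < abs(value):
        yield probe if value > 0 else -probe
        probe *= 2

If a test fails for every integer above some threshold, a shrinker consuming this generator lands within a factor of two of the threshold on its first pass, instead of crawling down one almost-as-complicated value at a time.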

I can’t guarantee I’ll catch all the instances of my doing this before the 1.0 release, but I should catch all the instances where this is causing significant performance problems. In particular, simplifying strings is now much faster.


Stable serialization and cryptographic hashing for tracking seen objects

This is a clever trick I figured out yesterday. It’s in retrospect moderately obvious, but I’ve never seen it before and it’s taken Hypothesis example tracking down from an expected space complexity of O(mn) and time complexity of O(god I don’t even know) to a deterministic O(n) space complexity and an O(mn) time complexity. It’s also vastly improved the constant factors by moving a bunch of stuff from pure python to native code.

(Actually there’s a bit of a wrinkle in that it’s not strictly deterministic because n is a random function of the number of examples you’re looking for based on the duplicate example rate, but in practice it’s rarely more than a few times larger than the number of examples).

First: The problem. Due to the way its data is generated, Hypothesis has a decent chance of producing the same example more than once (if it didn’t, there would be a lot of other interesting examples it struggled to produce). Rather than run the same example multiple times, Hypothesis keeps a record of which examples it’s already tried and doesn’t try them again. This is also used during simplification to avoid loops in the simplify graph (it’s a lot easier to write simplify methods that might occasionally have loops and let Hypothesis sort that out than it is for every simplify method to do a lot of careful bookkeeping).

Previously I was using the obvious data structure to track which examples we’ve already seen: A hash table. Stick the object in a hash table when it’s seen, check in the hash table to see if we’ve already seen it.

There are two problems with this. One is intrinsic, the other implementation specific.

The intrinsic one is that we’re keeping all these objects around! That uses up a lot of memory. Some of these objects may be large nested lists of things, and even if they aren’t Python is not exactly good at compact object representation.

The second problem is that Python’s default equality and hashing are completely unsuitable for this. We don’t want to consider {1} and frozenset({1.0}) equal for these purposes, and we do want to consider nan and nan equal for these purposes. Additionally we need to be able to track things like lists which aren’t hashable. So this means we need our own hashing and equality logic. This was written in pure python and thus was very slow when dealing with large objects.

This is very sub-optimal, but it’s not clear how I could have done it better when tracking arbitrary python objects.

…which is the clue. These days Hypothesis doesn’t need to track arbitrary Python objects. It tracks templates, which I can require to have a much more specific type than the objects tracked. In particular I can require them to be marshallable.

In the end I decided on a slightly laxer requirement because marshalling doesn’t handle certain types I wanted to be able to use (specifically namedtuple instances). Every template needs to be either:

  1. Marshallable
  2. An iterable type containing other templates

Combining these by flattening out collections into tuples plus a type tag and then marshalling the whole lot lets us generate a binary string for every template value such that if the two binary strings are equal then the templates are equivalent (and generally speaking the converse too, though there are some special cases where that might not quite hold if the iteration order is unstable, and we just accept the slight duplication).
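A minimal sketch of that serialization (this isn’t Hypothesis’s actual code, and it only knows about a handful of collection types, but it shows the shape of the trick):

import marshal

def flatten(template):
    # Collections (including namedtuples, which are tuple subclasses and not
    # directly marshallable) become a (type tag, flattened children) tuple;
    # everything else is assumed to be directly marshallable.
    if isinstance(template, (list, tuple, set, frozenset)):
        return (type(template).__name__, tuple(flatten(t) for t in template))
    return template

def template_key(template):
    # A binary string such that equal keys imply equivalent templates.
    return marshal.dumps(flatten(template))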

This is already a pretty great saving: We have native equality and hashing which does what we want it to do, and the representation is so much more compact than the Python object representation.

But we can do better! Sometimes (often) the strings we get here will be quite long. When a string is at least twenty bytes long we replace it with its sha1 hash. This gives us yet another space saving (in theory it also gives us a time saving, but the hash for binary strings is quite good and with random data equality will tend to return False early, so in practice I don’t expect equality comparisons between long strings would have been a noticeable time cost).
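And the hashing step on top of that, again as a sketch:

import hashlib

def tracking_key(serialized):
    # Short keys are kept as-is; anything twenty bytes or longer is replaced
    # by its sha1 digest, so the seen-set stays small however large the
    # templates get.
    if len(serialized) >= 20:
        return hashlib.sha1(serialized).digest()
    return serialized

The seen-set then just stores tracking_key(template_key(template)) for each template it has tried, rather than the templates themselves.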

The result of this change has been that some parts of my build are now nearly ten times faster than they used to be, and the memory usage has become much more predictable. It’s also meant that I could ditch a ton of code that I really hated – the custom hashing and equality code used to have other use cases, but this was the last remaining one. Over all, this has been one of my better ideas.


Monadic data generation strategies and why you should care

I posted this gist earlier. It’s a toy port to Haskell of part of the new templatized data generation from Hypothesis.

It doesn’t do most of the important things. Its purpose is to demonstrate one simple point: Hypothesis strategies are now monads.

I don’t want to get into the deep philosophical implications of this or how it means Hypothesis is now like a burrito in a space suit. What I want to do here is point out that this enables some super useful things.

Consider the following example from the Hypothesis README (at the time of this writing – this is going to change soon for various reasons, one of them being the stuff I’m about to get into):

from decimal import Decimal
from hypothesis.searchstrategy import MappedSearchStrategy

class DecimalStrategy(MappedSearchStrategy):
    def pack(self, x):
        return Decimal(x) / 100

    def unpack(self, x):
        return int(x * 100)

This is for defining a Decimal strategy in terms of an integer strategy – it has an operation to convert a decimal to an int and an int to a decimal.

The reason it needs to convert a decimal to an int is because of simplification. If it can’t convert back then it can’t simplify. This is also what stops strategies from being a functor (remember that all monads are functors): In order for something to be a functor we need to be able to define a new version by just mapping values. We can’t require the mapping to go both ways.

Which is why it’s pretty great that the new template API lets us throw away half of this! Now MappedSearchStrategy no longer requires you to implement unpack. As well as being half the work, this means you can use it in cases where you might sometimes need to throw away data – the mapping no longer has to be a bijection. The reason it can do this is that it just uses the templates for the original type, so there’s no need for you to convert back.
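Concretely, the README example above shrinks to something like this under the new API (a sketch; the exact class and module names may well differ by the time this ships):

from decimal import Decimal
from hypothesis.searchstrategy import MappedSearchStrategy

class DecimalStrategy(MappedSearchStrategy):
    def pack(self, x):
        # Only the outgoing direction is needed now: templates stay integers,
        # and simplification happens on those integers.
        return Decimal(x) / 100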

But that’s just an example where being a functor is useful. Why is being a monad useful?

Well, the operation that monads add over functors is bind. map lets us take a function a -> b and turn a Strategy a into a Strategy b. bind lets us take a function that maps a to a Strategy b and turn a Strategy a into a Strategy b. When would we want to do this?

Well, one example is when we want some sort of sharing constraint. Suppose for example we wanted to generate a list of dates, but we wanted them all to be in the same time zone. The bind operation would let us do this: We could do something like strategy(timezones).bind(lambda s: [dates_from(s)]) (this is a made up API; the details for this are not yet in place in actual Hypothesis). This would generate a timezone, then build a strategy for generating dates in that time zone, and from that a strategy for producing lists of them.
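To make the difference concrete, here’s a toy Strategy class – nothing like the real Hypothesis internals – that shows what map and bind each give you:

import random

class Strategy(object):
    # A toy strategy is just a function from a Random to a value.
    def __init__(self, draw):
        self.draw = draw

    def example(self, rnd=None):
        return self.draw(rnd or random.Random())

    def map(self, f):
        # Functor: apply f to every generated value.
        return Strategy(lambda rnd: f(self.draw(rnd)))

    def bind(self, f):
        # Monad: use a generated value to choose the next strategy to draw from.
        return Strategy(lambda rnd: f(self.draw(rnd)).draw(rnd))

def sampled_from(options):
    return Strategy(lambda rnd: rnd.choice(list(options)))

def lists_of(strategy, max_size=5):
    return Strategy(lambda rnd: [strategy.draw(rnd) for _ in range(rnd.randint(0, max_size))])

def fake_dates_in(tz):
    # Stand-in for the hypothetical dates_from(tz): (day of year, timezone) pairs.
    return Strategy(lambda rnd: (rnd.randint(1, 365), tz))

# The sharing constraint: pick one timezone, then every date in the list uses it.
dates_sharing_a_zone = sampled_from(['UTC', 'Europe/London', 'America/New_York']).bind(
    lambda tz: lists_of(fake_dates_in(tz)))

print(dates_sharing_a_zone.example())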

Given that this is Python, you don’t have the advantage you get in Haskell that being a monad gives you nice syntax and a rich set of support libraries, but that’s OK. The reason monads are a thing in the first place is that the monad operations are generally super useful in their own right, and that remains true in Python too.


The three stage value pipeline for Hypothesis generation

Note: This describes a work in progress rather than something that is in a released Hypothesis version. As such it’s liable to change quite a lot before it’s done. I’m partly writing this as a form of thinking through the design, partly as a way of procrastinating from the fact that I’ve literally broken the entire test suite in the course of moving to this and trying to fix it is really making me wish I’d written Hypothesis in Haskell.

I’ve previously talked about how generating data in Hypothesis is a two step process: Generate a random parameter, then for a given parameter value generate a random value.

I’ve introduced a third step to the process because I thought that wasn’t complicated enough. The process now goes:

  1. Generate a parameter
  2. From a parameter generate a template
  3. Reify that template into a value.

Why?

The idea is that rather than working with values which might have state we instead always work with merely the information that could be used to construct the values. The specific impetus for this change was the Django integration, but it also allows me to unify two features of Hypothesis that… honestly it would never have even occurred to me to unify.

In order to motivate this, allow me to present some examples:

  1. When generating say, a list of integers, we want to be able to pass it to user provided functions without worrying about it being mutated. How do we do this?
  2. If we have a strategy that produces things for one_of((integers_in_range(1, 10), integers_in_range(9, 20))) and we have found a counter-example of 10, how do we determine which strategy to pass it to for simplification?
  3. How does one simplify Django models we’ve previously generated once we’ve rolled back the database?

Previously my answers to these would have been respectively:

  1. We copy it
  2. We have a method which says whether a strategy could have produced a value and arbitrarily pick one that could have produced it
  3. Oh god I don’t know can I have testmachine back?

But the templates provide a solution to all three problems! The new answers are:

  1. Each time we reify the template it produces a fresh list
  2. A template for a one_of strategy encodes which strategy the value came from and we use that one
  3. We instantiate the template for the Django model outside the transaction and reify it inside the transaction.

A key point in order for part 3 to work is that simplification happens on templates, not values. In general most things that would previously have happened to values now happen to templates. The only point at which we actually need values is when we actually want to run the test.
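A sketch of what a strategy looks like under this pipeline (hypothetical names and signatures – as noted above, the real API is still very much in flux):

import random

class IntListStrategy(object):
    # Stage 1: draw a parameter (here, a typical length and a magnitude bound).
    def draw_parameter(self, rnd):
        return (rnd.randint(1, 10), rnd.randint(1, 100))

    # Stage 2: from the parameter, draw a template. Templates are immutable
    # (a tuple of ints), so they can be stored, compared and simplified safely.
    def draw_template(self, rnd, parameter):
        length, bound = parameter
        return tuple(rnd.randint(-bound, bound) for _ in range(rnd.randint(0, length)))

    # Stage 3: reify the template into the value handed to the test. Every call
    # builds a fresh list, so the test is free to mutate it.
    def reify(self, template):
        return list(template)

rnd = random.Random(0)
strategy = IntListStrategy()
parameter = strategy.draw_parameter(rnd)
template = strategy.draw_template(rnd, parameter)
value = strategy.reify(template)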

As I said, this is still all a bit in pieces on the floor, but I think it’s a big improvement in the design. I just wish I’d thought of it earlier so I didn’t have to fix all this code that’s now broken by the change.

(Users who do not have their own SearchStrategy implementations will be mostly unaffected. Users who do have their own SearchStrategy implementations will probably suffer quite a lot in this next release. Sorry. That’s why it says 0.x in the version)
