Hypothesis 1.0 is out

The following is the release announcement I sent to the testing-in-python mailing list and the main python mailing list:

Hypothesis is a Python library for turning unit tests into generative tests,
covering a far wider range of cases than you can manually. Rather than just
testing for the things you already know about, Hypothesis goes out and
actively hunts for bugs in your code. It usually finds them, and when it
does it gives you simple and easy to read examples to demonstrate.

Hypothesis is based on Quickcheck (
https://wiki.haskell.org/Introduction_to_QuickCheck2) but is designed to
have a naturally Pythonic API and integrate well with Python testing
libraries.

It’s easy to use, extremely solid, and probably more devious than you are at finding edge cases.

The 1.0 release of Hypothesis has a stable and well documented public API.
It’s more than ready for you to use and it’s easy to get started.

Full documentation is available at
http://hypothesis.readthedocs.org/en/latest/, or if you prefer you can skip
straight to the quick start guide:
http://hypothesis.readthedocs.org/en/latest/quickstart.html

It seems to be a hit – less on the mailing list, but the link to it is doing the rounds of the internet and has the dubious honour of being the first time in ages I’ve been on the front page of Hacker News.

This entry was posted in Hypothesis, Python, Uncategorized.

What I would do with the Stargate program

Advance warning: This post was written while not entirely sober and consists entirely of extremely nerdy overthinking of a weird 90s/early 2000s TV series. If you haven’t watched Stargate you don’t care. If you have you probably still don’t care.

People seemed to like my competent Stargate season 1 and there was some demand for me to write a season 2.

I’m not massively keen on the idea. The problem is that it just starts to diverge too radically for my tastes. Season 1 I could more or less do along the lines of “I wonder how this would happen if they didn’t mess it up completely?” and basically play the initial conditions forward.

The problem is that at this point earth is in a much stronger position, all the rules and players are different, and I’d have to actually start making serious decisions about the universe and the consequences in it.

And at this point you run into the fact that the universe doesn’t actually make much sense. The rules are inconsistent and the simple fact is that the Goa’uld level of technology and industry makes absolutely no sense. You simply cannot perform the level of construction that they routinely perform with the industrial base they have (maybe this is addressed in some of the RPG materials? I haven’t read them).

So basically this all requires a level of universe building in order to rationalise it that a) Takes us significantly away from the core concept of “What would happen if people were competent?” and b) If I were going to do I’d actually write my own fiction instead.

I also don’t like that the requirement of competence basically means that Sam and Teal’C can’t be on SG1. Aside from the fact that I’ve removed everyone from the team who isn’t a white man, I really like them as characters and it would be a shame to not have them on the team.

On the other hand, I really cannot justify a competent version of the universe that sends through the gate, on a frontline team, both the foremost expert on the Stargate and a person who is a hugely valuable source of intelligence and will routinely get you in trouble on every world you go to because he’s got “I am a bad guy” literally tattooed on his forehead.

So basically although I stand by the idea that I think this is how it would have played out, it doesn’t really put the story in a position I’d want to write about.

All that being said, I do have some interesting ideas…

This post is about one set of those ideas: Suppose you gave me an R&D budget and the ability to make strategic decisions. What would I do to the Stargate program?

All of this is stuff that could have been figured out fairly early on and mostly does not depend on detailed semantics of how the stargates work.

My plan basically consists of two complementary parts:

Opsec

OK. So “seeing the thing someone else is dialling” is a known security vulnerability where you can totally figure out where someone who has gone through the stargate went. This happens literally all the time in the series.

Even if we’re not in competentverse, where none of the Goa’uld know the address for earth, we still don’t want the knowledge to proliferate.

So, rule 1 is: You never, ever, dial earth from an insecure location.

An insecure location in this case meaning “Any Stargate not enclosed in a building surrounded by soldiers”.

i.e. we establish off world bases today. The stargate program does not operate out of Cheyenne mountain. The stargate program operates out of multiple off planet bases. At least two, ideally three or more. Ideally you’d follow the “Stick the stargate deep in a mountain” approach that both earth and the alpha site use in the classic series, but for starters you’re still better off just sending a bunch of marines through the stargate and having them stick a tent over it. We can improve from there.

Obviously you never gate directly to any of these worlds either. That would be silly. You’d be revealing your bases.

No, your route home is always that you gate to one of the dozen worlds that you have been specifically tasked with memorizing (you’re not allowed off world until you’ve passed a test proving you’ve memorized them). These are allocated to you at random from a set of worlds that have been explored and designated really really boring (ideally uninhabited, certainly with no permanent Goa’uld presence). You then gate to one of the established bases. If you need to get back to earth you gate to earth from there.

Obviously the gating to earth process is not automated and the coordinates are not stored anywhere.

Equally obviously, none of the people you regularly send on expeditions know the coordinates. I mean why would you do anything so stupid as sending someone who knows the address of earth into enemy territory?

As soon as is feasible, every offworld base is equipped with an ultrasound scanner. All incoming travellers are given an ultrasound scan to determine whether or not they carry a Goa’uld.

(Depending on whether we know about Cimmeria or not, and whether we’re in competentverse, anyone detected as infected is sent there. Obviously in competentverse we weren’t stupid enough to send Teal’C there and destroy Thor’s hammer as a result, so that’s still a viable solution)

As a result the chances of anyone discovering your off world bases are low (they’d need a watcher on the planet you were on) and the chances of getting from there to earth are even lower.

You also don’t gate out from earth to anywhere except one of these bases. This is less mandatory, but signals are bidirectional through a stargate, so if I were being security conscious I sure wouldn’t want to give anyone on the world I gated to the ability to fire a beacon back through.

As a bonus, the offworld bases act as a firebreak. If, for example, we bring back a bomb, a plague, gate into a black hole, etc. worst case scenario is you lose the base and the couple of hundred people you’ve stationed there. You don’t lose your entire world of billions of people and the war to save humanity from enslavement.

At this point we’ve been minimally paranoid. Earth is probably not going to get invaded any time soon because they don’t know where we are. Now it’s time to proceed to the technical solutions that will be used to help us take the galaxy from the occupying force that are literally too incompetent to be allowed to live.

Technology to develop

All technology I’m going to suggest is technology that I believe could be developed by 90s era engineering without having to figure out alien super science. This has the advantage that it does not require any knowledge of the details of gate mechanics. All technology I’m interested in here is dumb, mechanical, and is not useful unless we can mass produce it for cheap.

There are basically three pieces of kit I consider it essentially mandatory to develop.

How fast and cheap can we build and install an iris?

In the series the answer is always “super fast, very not cheap”. The super fast is ludicrously unrealistic and I’m going to assume that there’s lots of behind the scenes prep to which we are not privy because it makes for boring TV. The very not cheap is obvious when they’re talking about precision engineering and extremely expensive materials (some of them literally not available on earth). We don’t want to do that. We want to be making hundreds of irises. If we explore a planet and it has people and doesn’t currently have Goa’uld on it, we want to stick an iris there.

I’m not interested in “we have fancy dilating titanium shield which can withstand an arbitrary number of assaults”. I’m interested in “it might take us ten minutes to undo the iris but that’s OK because the gate stays open for half an hour, and if there’s a major assault we might have to replace it afterwards”. The point is not to be impervious to harm, the point is to have something we can easily install that will be vastly better than not having it there.

This is the sort of solution I’d like to do better than:

  1. Put the stargate on a hinged mount allowing it to rotate forward
  2. Place this arrangement on a concrete platform
  3. Tilt it slightly forward, suspended by cables on a winch
  4. When you get an incoming wormhole drop the stargate. It’s now flush with the concrete (if you’re feeling fancy you can even stick a concrete plinth there which perfectly matches the stargate position). Sure, people might half rematerialize on the other side but they won’t actually be able to get even fully out of the wormhole. When the wormhole disengages there will be bits of them to clean up.
  5. If you get a valid GDO signal, winch the stargate up and signal when it’s ready for incoming travellers.

This solution is OK but it is too slow to install. I’m sure a team of military engineers can do better.

Portable dialling devices

It is well established that you can manually dial a stargate if you just provide it with energy. I want a device I can clamp to the side of a stargate with an electric motor and a power source that can power a stargate for a couple of minutes.

We want this for three reasons:

  1. Emergency procedure for when an SG team is stranded due to an issue with the DHD (this happens several times)
  2. Earth needs a DHD (sure, we have the antarctic one, but in competent!SG1 we never find that because why the hell would you dial back to earth when under heavy fire?) and we want a way to take one
  3. The following complete dick move that will bring the Goa’uld war machine to a standstill

Once we have the dialler, people will practice the following operation:

  1. Send a lovely handful of flash bangs through the stargate
  2. Send an extremely grumpy set of marines through the stargate
  3. Kill or incapacitate every readily available guard
  4. Steal or destroy the DHD
  5. Take it back through the stargate with you
  6. Leave the autodialler behind to self destruct, leaving a pile of meaningless slag
  7. Giggle as you’ve just cut off a Goa’uld world from greater galactic civilization

How effective this is depends on the details. The Goa’uld may or may not be able to build DHDs, but they can certainly ship them in from other places. This will take time and ships, and basically have them running around wasting most of their ships on transit. In the meantime you have the ability to lock down a substantial proportion of their forces.

Automated DHD covers

The DHD is in almost all ways 100% better than the system that earth has built. It’s easier to use, faster to dial, draws less power, etc.

There’s one way in which it is much worse: It is entirely manual, with no capacity for automation.

Oh, in canon there’s all sorts of fancy computing going on inside it and the stargate network that people eventually figure out how to use. We don’t need any of that.

We’re going to stick a physical cover on top of it with motors that can push each button. We’ll use what we’ve already figured out about the stargate to tap into some basic diagnostics, but worst case scenario we can probably get by with “how much energy is currently flowing through the thing”.

There are a number of use cases for this, most notably that it lets us handle our own infrastructure much better. It also gives us much faster dialling, which is a significant tactical advantage.

It can also be used offensively. Oh so offensively.

Basically, the Goa’uld are going to figure out irises at some point. We can only steal so many DHDs in surprise raids before they realise that maybe they should take a leaf from our book and start blocking incoming travellers.

They’ll probably do something fancy and hard to mass produce like a force field.

Once this starts being widespread, we giggle at how annoyed the pompous little megalomaniacs are going to be and turn on the war dialler program.

You see, all you need to permanently shut down a stargate is two stargates and good automation. I think with decent scheduling you can shut down n stargates with n + 1.

Here’s how it works:

A stargate can’t dial out when there’s an incoming wormhole. The only defence against this is to dial out faster than your opponent can dial in.

So you dial in and set a timer. Meanwhile your second stargate dials the first 6 symbols. As close to exactly as you can make it, when the timer goes off the first stargate shuts down the wormhole and the second stargate presses the seventh symbol. Voila, new incoming wormhole faster than you can ever dial out.

And what do we do with all this?

Oh that’s policy. I don’t set policy.

But on this front I’m mostly in agreement with canon. Starting with the currently most annoying system lord and killing your way downwards seems like a pretty good strategy.

This entry was posted in Stargate, Uncategorized.

A sketch of a new simplify API for Hypothesis

One of the decisions I’m quite happy I made is that the SearchStrategy API in Hypothesis is not public for 1.0. This gives me some freedom to break it when I figure out ways to make it better.

Which is good, because I’ve already figured out a way to break it to make it better.

In my fuzzer for yapf, I realised an interesting thing: The minimization happens in two phases, first line based, then character based. Once character based simplification has started we stop bothering with line based simplification altogether: Although in principle character based simplification could get us back into a state where line based simplification would be useful again, in practice this happens so rarely that trying line based simplification is basically never worth going back to.

Meanwhile, back in Hypothesis land, I was staring at the following quality test this morning:

def dedupe(xs):
    result = []
    for x in xs:
        if x not in result:
            result.append(x)
    return result
 
@given(strategy([bytes]).map(dedupe))
def is_sorted(xs):
    assume(len(xs) >= 10)
    assert sorted(xs) == xs

This is an example that is designed to fail and mostly exists to test the quality of the minimization: What’s the simplest list of byte strings it can find such that:

  • The list only has distinct elements
  • The list has at least 10 elements
  • The list is not sorted

It’s not hard to find such an example – generally speaking Hypothesis will find an example within the first couple of throws of the dice – but producing a good example is harder, which is what this test looks for (In the actual test we make a bunch of assertions about the quality of the result we got at the end).

The problem is that sometimes this can get into a bad state where it takes an extraordinarily long time to minimize.

The problem is basically this: Every time it successfully minimizes an example, the simplification process starts again from scratch on the new example. This means that it tries all the usual simplifications for deleting elements from the list. These are a complete waste of time because they always shrink the list, violating the size assumption. But every time we shrink an element (and there are 10 elements to shrink) we do it all again.

I was originally thinking of trying to fix this in terms of making simplify skip simplifications which fail to produce examples too often, but that’s not really very satisfactory and requires a lot of delicate state tracking. It might be worth it in some cases, but I think it probably wouldn’t be.

Then I realised that this problem fits entirely into the approach I mentioned using for yapf, and we can use it to fix this problem.

Right now Hypothesis essentially follows the classic quickcheck API where there is a function shrink which takes values and provides a generator over simpler versions of the value. Simplified, the hypothesis example finding algorithm looks like:

def find_minimal(x, f):
    for s in shrink(x):
        if f(s):
            return find_minimal(s, f)
    return x

(that’s not actual Hypothesis code. The real implementation is more complex in several ways and doesn’t use recursion because Python, but it’s the idea of it).

The idea is to instead change this so that there are multiple shrinkers, each behaving like our shrink function above, and then we apply each shrinker in turn, never returning to an earlier one. This would be something akin to:

def minimize_with_shrinker(x, f, shrinker):
    for s in shrinker(x):
        if f(s):
            return minimize_with_shrinker(s, f, shrinker)
    return x

def find_minimal(x, f):
    for shrinker in shrinkers:
        x = minimize_with_shrinker(x, f, shrinker)
    return x

This lets us fix the list example by splitting up shrinking into two phases: First it tries to shrink by removing elements from the list. Then it tries to shrink by shrinking individual elements. Once it starts shrinking elements, the size of the list is fixed – it’s already found the smallest list it’s going to.
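As a toy illustration of the phased approach (this is not Hypothesis code: the integer shrinker, the predicates and the concrete shrink orderings are all made up for the example):

```python
def shrink_int(n):
    # Candidate simplifications of an integer, simplest first.
    if n != 0:
        yield 0
        if abs(n) > 1:
            yield n // 2

def shrink_by_removal(xs):
    # Phase 1: try deleting each element in turn.
    for i in range(len(xs)):
        yield xs[:i] + xs[i + 1:]

def shrink_by_elements(xs):
    # Phase 2: try simplifying each element, leaving the length alone.
    for i, x in enumerate(xs):
        for s in shrink_int(x):
            yield xs[:i] + [s] + xs[i + 1:]

def minimize_with_shrinker(x, f, shrinker):
    # Restart the shrinker on each successful simplification.
    improved = True
    while improved:
        improved = False
        for s in shrinker(x):
            if f(s):
                x, improved = s, True
                break
    return x

def find_minimal(x, f):
    # Apply each phase once, never returning to an earlier one.
    for shrinker in (shrink_by_removal, shrink_by_elements):
        x = minimize_with_shrinker(x, f, shrinker)
    return x

# e.g. a minimal list of length >= 3 summing to at least 10:
print(find_minimal([8, 1, 7, 2, 9],
                   lambda xs: len(xs) >= 3 and sum(xs) >= 10))  # prints [0, 1, 9]
```

Note how once phase 2 starts, no candidate ever changes the length of the list, so the removal simplifications that always violate the length assumption are never retried.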

This does impact the quality of the examples a little bit. If necessary that could be fixed by iterating to a fixed point: we run the above algorithm repeatedly until it no longer changes the result. A simpler version would just be to run it twice. I don’t plan to do this unless I find the results from a single pass are much worse than I expect them to be. The goal here isn’t always to find the single simplest example, only to find a simple enough example in a short amount of time. I think this will be of entirely acceptable quality and should be very acceptable in speed compared to the existing implementation.

This entry was posted in Hypothesis, Python, Uncategorized.

27 bugs in 24 hours

So yesterday I started investigating a project called yapf. It’s a code formatter for Python based on the clang-format algorithm. I love code formatters, and am a big fan of clang-format, so I’m really excited about this project. This is not sarcasm.

I started using it, and then I reported 25 bugs in it in the course of a day (I actually filed 31 issues, but that includes 2 feature requests and 4 where I’d made a mistake by using the wrong version of yapf or an unsupported version of Python). I also filed a few bugs this morning from ongoing running of my fuzzing software. I’m still ridiculously excited about it, but I think I’ll give them a little while to iron the kinks out before I switch over from pyformat.

This shouldn’t be taken as a serious indictment of yapf. It is buggy, but that’s because it’s an alpha release of something that is trying to solve a really hard problem. The majority of software which has to deal with something as complicated as a programming language is full of bugs (I’ve run into four bugs in pypy and three in cpython in the course of my work on Hypothesis), and the early releases like this are where you shake down the low hanging fruit. The authors have been great, and have already fixed a number of the bugs I’ve filed.

But I’d like to talk a little bit about how I found so many bugs.

The answer is, of course, by using Hypothesis!

Well, not really. I actually used two things from Hypothesis to do this, but neither of them involved running it. The first thing I used was the code base, the second thing I used was some of the general ideas about how to write fuzzers and minimizers that I’ve learned while writing it. This sort of thing isn’t at all hard to do, but it’s a lot more obvious that it’s not hard to do when you’ve got some experience with the general category of solution.

I considered using Hypothesis proper, but it’s not actually currently very good for this sort of bug finding expedition where you want to keep going once you’ve found a bug. I have some plans for how to support that, but they’re way in the future.

I’ll go through what I did instead in roughly chronological order.

Using Hypothesis

This is the first thing I tried, and what got me interested in doing this in the first place.

Hypothesis is mostly auto-formatted, and is checked for pep-8 compliance. The entire code base is a fixed point of pyformat (in non-aggressive mode), and runs flake8 clean (with a few exceptions for files which are designed for python 2/3 compatibility and testing some introspection on terrible source code).

I wanted to investigate whether I could replace pyformat with yapf in this pipeline, so I just swapped it out. flake8 promptly screamed at me about a thousand pep8 violation errors yapf had produced on the previously clean code.

So I did what any good open source citizen would do and patiently went through the list of errors reported by the pep8 checker and manually minimized a source code example for each violation. This gave me about a dozen bugs, which I duly reported.

Using Django and Pyformat

I figured, if I’d found a bunch of bugs running yapf on a small (a bit under 10kloc) project like Hypothesis, I’d probably find some interesting things running it on a much larger one. What’s a large project? I know! Django!

Here is the methodology I used:

  1. Run pyformat -ia on the whole django source tree. Go do something useful for the roughly half an hour this took to run.
  2. Commit the result locally so I don’t have to do that again oh god
  3. Run the pep8 checker on the entire source tree
  4. Go make a cup of tea
  5. Note any errors that are still reported by pep8. These are errors we’re going to suppress later.
  6. Run yapf on the entire source tree.
  7. Run pep8 again, suppressing any errors we found in step 5
  8. Delight in the smorgasbord of test cases with which we have been presented

Basically the invariant I am testing for here is that yapf should not introduce any new pep8 errors in source it operates on: It’s fine if it can’t fix a pep8 error (I’d like it to be able to fix all pep8 errors, but it’s not necessarily a bug if it doesn’t), but if it introduces new ones it’s not doing its job correctly.

So I went through all the error categories for each output produced and hand-minimized an example out of each one. This produced a bunch of interesting new errors. I also had to delete a few files and minimize examples out of them to debug some yapf crashes.

Writing a fuzzer

At this point I was having a lot of fun, but I was getting a bit bored of doing this by hand. Moreover, I literally work on a tool for generating and minimizing examples that break invariants as my day job. Doing all of this by hand was basically letting the side down.

As previously mentioned, Hypothesis isn’t that well suited to this problem, so I wrote a fuzzer by hand. It’s not hard to develop one that works quite well for a given problem very quickly.

I had no interest in attempting to generate random valid python source, but fortunately, as demonstrated earlier, there’s more than enough interesting existing python source code in the wild to trigger bugs. So the design of the fuzzer is to “generate” random python simply by drawing from a collection of source files you give it and seeing what happens to them.

The invariant is the same as before, but more fine grained: For each file we run pep8 on the file, then we run yapf, then we check if there are any new pep8 errors that were not present in the original file. Additionally, if yapf crashes that’s interesting too.
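The per-file check can be sketched roughly like this. This is not the actual fuzzer: the `pep8` and `yapf` command names, the `-i` in-place flag, and the report format are assumptions about the tools as they existed at the time:

```python
import subprocess

def codes_from_output(text):
    # Parse pep8-style report lines: "path:row:col: E501 message".
    codes = set()
    for line in text.splitlines():
        head, _, rest = line.partition(": ")
        if rest.split():
            codes.add(rest.split()[0])
    return codes

def pep8_codes(path):
    # Error codes reported for one file, e.g. {"E501", "W291"}.
    out = subprocess.run(["pep8", path], capture_output=True, text=True)
    return codes_from_output(out.stdout)

def check_file(path):
    # The invariant: yapf must not introduce pep8 errors that were not
    # already present. Returns "crash", a set of new codes, or None.
    before = pep8_codes(path)
    result = subprocess.run(["yapf", "-i", path], capture_output=True)
    if result.returncode != 0:
        return "crash"
    introduced = pep8_codes(path) - before
    return introduced or None
```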

There’s then the world’s shonkiest source code minimizer. It works entirely string-wise: it runs ast.parse on each candidate string and throws it away if it doesn’t parse correctly.

It first minimizes line-wise and then once it can no longer find any line-wise simplifications that work it minimizes character-wise. It stops when it’s found an example under some user definable maximum (I mostly used 140 characters. Examples should fit in a tweet!).

Line-wise minimization:

  1. Try all strings that consist of some leading segment of the lines in the file, starting from the smallest
  2. Try all strings that consist of some trailing segment of the lines in the file, starting from the smallest.
  3. Try deleting interior blocks of lines, starting from blocks equal to about half the number of lines in the file and shrinking downwards

Once we can no longer shrink the file this way, we proceed character-wise, doing basically the same thing as step 3 only with blocks of characters instead of lines.
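The candidate generator for the line-wise pass can be sketched as follows (a reconstruction of the ordering in the three steps above, not the actual minimizer; the ast.parse filter is the validity check mentioned earlier):

```python
import ast

def line_shrinks(src):
    lines = src.splitlines(keepends=True)
    n = len(lines)
    # 1. Leading segments, smallest first.
    for k in range(1, n):
        yield "".join(lines[:k])
    # 2. Trailing segments, smallest first.
    for k in range(1, n):
        yield "".join(lines[n - k:])
    # 3. Interior block deletions, from roughly half the file downwards.
    block = n // 2
    while block >= 1:
        for start in range(0, n - block + 1):
            yield "".join(lines[:start] + lines[start + block:])
        block //= 2

def valid_shrinks(src):
    # Throw away candidates that aren't syntactically valid Python.
    for s in line_shrinks(src):
        try:
            ast.parse(s)
        except SyntaxError:
            continue
        yield s
```

The character-wise pass would be the same step 3 with blocks of characters instead of lines.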

Empirically what I found was that we run out of line based simplifications when the source is down to around the 500-1000 character mark, then the character based minimization was enough to take it the rest of the way to tweet sized. For large files this was pretty slow, but only on the order of minutes at most so it wasn’t a major problem.

I then put this all together into an algorithm which I’ve not seen elsewhere but that’s probably more a function of my lack of familiarity with the literature than it is a sign of its novelty.

Basically the idea is this: For each pep8 error, we have three states it can be in:

  1. known and minimized, meaning we have a very small example of this error
  2. known but not minimized, meaning we have any example at all of this error
  3. unknown

Whenever we consider a file and find a pep8 error produced by yapf that was previously unknown, we change its state to “known but not minimized” and save it as an example of a file exhibiting that error.

Whenever we consider a file and find a pep8 error produced by yapf that was previously known but not minimized, we check if the file is smaller than our best example of that error. If it is, we replace the example with the current file being considered.

Now, the loop is as follows: In a random order, consider each file we have been provided with.

After considering each file, check if we have any errors in the “known but not minimized” state.

If we do, we proceed to minimize them.

This works as follows: We take our current best example. If this is already smaller than the required size, we mark the error as known and minimized. If not, we consider all source code shrinks of it. The first one of these we find that also exhibits the error, we replace the best example with that and start again. If no valid shrinks exhibit the error we mark the example as known and minimized. In theory this can result in an example larger than the desired maximum size, but I’ve never actually seen that happen.
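A condensed sketch of this bookkeeping, with the checker and the shrinker passed in as functions (`errors_of` and `shrinks` here stand in for the real yapf/pep8 pipeline and the source minimizer; none of this is the actual code):

```python
def record(src, errors_of, state, best):
    # Note every error the file exhibits; keep the smallest example seen.
    for err in errors_of(src):
        if err not in state:
            state[err] = "known"
            best[err] = src
        elif state[err] == "known" and len(src) < len(best[err]):
            best[err] = src

def minimize_known(errors_of, shrinks, state, best, max_size=140):
    # Drive every "known" error down to (at most) a tweet-sized example.
    for err in [e for e, s in state.items() if s == "known"]:
        while len(best[err]) > max_size:
            for s in shrinks(best[err]):
                if err in errors_of(s):
                    record(s, errors_of, state, best)  # may improve others too
                    best[err] = s
                    break
            else:
                break  # no shrink still exhibits the error
        state[err] = "minimized"
```

Note that `record` is also called on every shrink candidate that still exhibits the target error, which is what lets one minimization run improve the examples for other errors for free.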

An important note! While we are minimizing one error, the sub-examples we consider may exhibit other errors: they may be smaller examples of known but not minimized errors (this is very common because large source files are more likely to exhibit multiple errors) or they may be entirely new errors (this usually doesn’t happen, but can happen as an artifact of the fact that the minimizer produces some really weird but technically valid code. I’ve only actually seen it happen once). The main benefit of this is performance – if we’ve spent ages trying to minimize some huge source file we might as well reap the benefits when we encounter the next error.

An additional detail: If we crash yapf while minimizing to produce a PEP8 error, we mark this example as not exhibiting that error, but we try to reduce it to a minimal example that produces a crash and note that. Unfortunately I don’t have any good deduplication for crashes like I do for pep8 where there’s a discrete set of errors and every different one can be considered differently interesting, so if this happens I have to check the error manually and try to figure out what’s going on. I also have it set up to skip minimizing crashes where the program took more than 2 seconds to run, as these tend to be pathological memory usage issues (I have everything set up to run with a ulimit of 2GB after one too many runaway processes in my life crashing my laptop). I’ve reported one of these and that’s enough for now.

One final detail and then I’m done: A trick I figured out early on which has been super useful for debugging issues is that every valid source file I ever consider gets saved into a directory called trashcan (a better name would have been recyclingbin). It’s saved with a hash of contents as a name so it’s automatically deduped. This was very useful in replaying halfway through a hung minimization among other things. It’s also an interesting source of weird python code. The trashcan is about 150MB on my system right now, but it’s so heavily redundant that a tar.bz2 of it is just over 2MB.
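The trashcan trick is just content-addressed storage, something like this (a hypothetical sketch, not the actual code; the directory name and hash choice are illustrative):

```python
import hashlib
import os

def stash(src, directory="trashcan"):
    # Name the file after a hash of its contents: saving the same
    # source twice overwrites the same file, so the store is deduped.
    os.makedirs(directory, exist_ok=True)
    name = hashlib.sha1(src.encode("utf-8")).hexdigest() + ".py"
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        f.write(src)
    return path
```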

This is more or less it. About half of the bugs I’ve reported have been from fuzzer output I think (I haven’t actually checked, but that’s about the right fraction). Most of them have been from this version – I had a bit of an experiment running yapf pyformat interactions but they weren’t very interesting.

What’s next?

I think I’m mostly done to be honest. I’ve given the people responsible for yapf more than enough issues to work through, and I should probably give them a chance to catch up. Also I have to actually get back to working on Hypothesis. I’ve got the 1.0 release to handle today (there’s not much to handle. I need to tag a release, update the readme and send some emails), and there’s a whole bunch of post 1.0 work to do.

One thing I probably will try as a follow up to this when I next have a free/bored moment is that I think it would be interesting to come up with a set of minimal pep8 violations: Basically run this entire fuzzer stack without yapf in the middle and just minimize examples subject to the constraint they generate a particular error. The result would be an interesting data set for testing various python tools on, and also it would probably be fun to read through. But maybe I have an odd idea of fun.

This entry was posted in Hypothesis, Python, Uncategorized.

Simplify starts from the wrong end

This is a thing I just noticed when working out some performance bugs in Hypothesis in prep for the 1.0 release: In a lot of cases, I’ve been doing simplification (shrinking in Quickcheck terms) entirely backwards.

It turns out that if I’d paid attention this information was in Quickcheck all along – I haven’t found it in the paper (ok I haven’t looked in the paper) but this is what the API does.

Basically, if you write a function simplify which takes a value and gives you something to iterate over the simpler versions of that value, you will be tempted to start from things that are most similar to that value and work your way down. This is totally wrong. The correct order is to start from the simplest version and work your way up.

The reason for this is that it will result in far fewer calls to simplify in most cases: Most of the time your values will be a lot more complicated than you need them to be, and you will end up with a lot of recursive calls to simplify if you start from the top. If you start from the bottom you will very rapidly converge on the simplest thing that can possibly work.
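A toy demonstration of why the ordering matters, counting how many candidate values get tried (the greedy minimizer and both integer “simplify” orderings are made up for illustration):

```python
def minimize(x, f, simplify):
    # Greedy: move to the first simpler value satisfying f, repeat.
    tries = 0
    while True:
        for s in simplify(x):
            tries += 1
            if f(s):
                x = s
                break
        else:
            return x, tries

similar_first = lambda n: range(n - 1, -1, -1)  # closest to n first: wrong
simplest_first = lambda n: range(n)             # simplest first: right

f = lambda n: n >= 50  # the "interesting" values
print(minimize(1000, f, similar_first))   # (50, 1000): crawls down one step at a time
print(minimize(1000, f, simplest_first))  # (50, 101): converges almost immediately
```

Both orderings find the same minimum, but the most-similar-first ordering makes an order of magnitude more calls.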

I can’t guarantee I’ll catch all the instances of my doing this before the 1.0 release, but I should catch all the instances where this is causing significant performance problems. In particular the performance of simplifying strings is now much faster.

This entry was posted in Hypothesis, Python, Uncategorized.