Hypothesis progress is alarming

I had a brilliant idea just now:

08:54 <DRMacIver> shapr: So I’ve been thinking about your point that you have to save pre-shrinking state because you might have found multiple bugs, and I think it’s wrong.
08:54 <DRMacIver> Because I think you’re much more likely to have found those other bugs in the course of shrinking than you are with the original value
08:54 <DRMacIver> So what will happen if you save the pre-shrinking state is that it’ll rerun and go “Yup, that’s fine”
08:54 <DRMacIver> My solution of saving the post-shrinking state is *also* wrong in this case mind you.
08:55 <DRMacIver> I think I have a solution, but it relies pretty heavily on glassbox testing in order to be any good
08:57 <DRMacIver> I have a really frustrating level of future designs for Hypothesis in my head.
08:58 <DRMacIver> (frustrating because I can tell how far I am from implementing it, and every time I get closer I come up with new ones so the goal recedes into the distance)
09:22 <DRMacIver> Ooh. I’ve just had a *great* idea
09:23 <DRMacIver> It not only solves this problem it solves a bunch of historic problems too
09:24 <DRMacIver> Basically 1) save every interesting example in the database. 2) examples loaded from the database which are valid but uninteresting “radioactively decay”. I.e. they get deleted with some probability once they’ve been run

The context: Both Haskell’s QuickCheck and Hypothesis save the last failing example. QuickCheck saves it prior to minimization, Hypothesis saves it post minimization. Because of the issues pointed out in Reducers are Fuzzers, both are the wrong thing to do.

Additionally, Hypothesis has historically had the following problems:

  • What do you do when the database fills up with more examples than you know what to do with in a given run and none of them are interesting?
  • What do you do if the shrinker gets stuck or the process crashes before you’ve saved the example?

This solves both problems: We save every intermediate shrink as we find it and let the garbage collection process deal with the debris that subsequently proves uninteresting.
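
Spelled out as code, the scheme is roughly the following. This is only a sketch of the idea – db, run_test and the rest are hypothetical stand-ins, not actual Hypothesis APIs:

import random

DECAY_PROBABILITY = 0.25  # "some probability": the exact value is not the point


def on_interesting(db, example):
    # 1) Every interesting example - including each intermediate shrink -
    # gets saved to the database as soon as it is found.
    db.save(example)


def prune_on_load(db, run_test):
    # 2) Saved examples that re-run as valid but uninteresting "radioactively
    # decay": they get deleted with some probability once they've been run.
    for example in list(db.load_examples()):
        result = run_test(example)
        if result.interesting:
            continue  # still reproduces a failure, so keep it
        if result.valid and random.random() < DECAY_PROBABILITY:
            db.delete(example)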

The above is a clever idea but is not the point of this post: The point of this post is that I have a growing sense that I am missing something.

Basically there is no reasonable way that I should be making as much progress on Hypothesis, Conjecture, etc as I am. This is not the first brilliant idea – I’ve had dozens. I’ve had so many brilliant ideas that I haven’t had time to implement all of them yet.

This is a bad sign.

There are three ways to make an astonishing amount of progress:

  1. Be wrong about the amount of progress you are making
  2. Be an unparalleled genius
  3. Be where the low hanging fruit is

I do not think I am wrong about the amount of progress I’m making. I will grant that some of my ideas turn out to be bad ones, or modifiable to make them good ones, but I’ve had enough practical validation that I’m reasonably sure that a lot of these concepts work and are useful. I’m probably off by a factor of two or three, but I doubt I’m out by an order of magnitude.

I am not an unparalleled genius. I am smart, but I’m one in a thousand, not one in a million. There are plenty of other people smarter than me, many of them working in similar fields.

So the only real option is that I’m hanging out in an orchard full of low hanging fruit, and I don’t really understand how that could be. Right now, life feels a bit like James Mickens describes in The Slow Winter:

I think that it used to be fun to be a hardware architect. Anything that you invented would be amazing, and the laws of physics were actively trying to help you succeed. Your friend would say, “I wish that we could predict branches more accurately,” and you’d think, “maybe we can leverage three bits of state per branch to implement a simple saturating counter,” and you’d laugh and declare that such a stupid scheme would never work, but then you’d test it and it would be 94% accurate, and the branches would wake up the next morning and read their newspapers and the headlines would say OUR WORLD HAS BEEN SET ON FIRE

I keep coming up with genuinely useful ideas with practical significance that work really well, and it all feels too easy.

There are a number of reasonable hypotheses about how this could be the case:

  1. I had one clever idea that unlocked everything else.
  2. Nobody cares about this problem, so they haven’t bothered to pick the fruit.
  3. I am ignoring giant swathes of prior art that I just don't know exist, because I'm an outsider to this field and couldn't find it in a Google search, whether through lack of competence on my part or the academic firewall of doom.
  4. I am ignoring huge swathes of prior art because nobody has ever bothered to write it down or it’s locked up in proprietary tools.

Obviously the first is the one that I hope it is. To some extent I even have a plausible candidate for it: The idea that having data generators work on an intermediate representation of the final result rather than the final result directly would be useful. But this isn’t actually a very clever idea. It’s proven very fruitful, but it’s sufficiently obvious that it should have occurred to someone else before.

Two seems plausible: e.g. the idea I started out with is only really interesting if you want to integrate this sort of testing with normal testing workflows, which I do, and some other people do, but not many of those who get to do full-time research on this. The Conjecture concept only really matters if you're trying to make these ideas work in imperative languages. And so on. I'm in enough intersections that it's at least plausible that nobody has cared about this particular intersection before. Additionally, my impression is that random testing isn't very interesting to academics, and most of the people who have put a lot of work into it are security researchers, who have rather different focuses than I do.

So those are the optimistic scenarios. Both are a little depressing, because they suggest there just hasn't been enough interest in something that massively improves software quality to produce even the level of research I've put into it, but at least they suggest I'm not wasting my time.

But I really can’t rule out 3 and 4, both of which are worrying.

The prior art I am mostly aware of and have based my work on is:

And I'm semi-aware of, but have consciously decided not to use, the work on concolic testing, etc.

But I can’t help but feel there’s a lot more out there.

So, um, halp. Any prior art you can send me on the general subject of random testing, etc. would be very much appreciated. Papers, working software, random blog posts where someone wrote up a thing they thought of. It's all good.

This entry was posted in programming, Python.

Superposition values for testing

Cory was talking about using AFL to fuzz test hyper-h2 the other day.

We talked about the difficulty of building a good starter corpus, and I thought I’d pull out some old code I had for using glassbox to do this.

It proceeded to fail utterly.

The problem is that H2 is a complex protocol which is very hard to blindly probe your way through, because of the number of different interactions involved. Simply throwing bytes at it just to see what happens is unlikely to do anything useful. This is similar to why, historically, that approach worked well with binaryornot and chardet but was pretty rubbish on pyasn1.

Yesterday evening I thought about this problem a bit more and then started giggling. This is usually a bad sign.

The idea I'd come up with was this: what if we use a custom type to hold off on deciding the values until the last possible minute? That way we can get values that do interesting things to the internals of complex protocols, by looking at the questions the parser asks about the values and deciding on the answers then, rather than up front.

The way this works is that you have a mock object that internally is saying “I am one of these values but I don’t know which”. Every time you perform a comparison it picks a possible answer at random and uses that to narrow down the list of possible values. At the end, your program should have behaved as if there had just been a really well chosen initial set of integers.

It turns out to work pretty well based on brief experimentation. My initial prototype didn’t support anything more than comparison operators, but after some further pondering I got it to support arbitrary arithmetic expressions. And thus, I present schroedinteger.

>>> x = schroedinteger({1, 3, 7})
>>> y = schroedinteger({-1, 1})
>>> z = x * y
>>> z
indeterminate: {-7, -3, -1, 1, 3, 7}
>>> x == 1
False
>>> x == 3
False
>>> z
indeterminate: {-7, 7}
>>> z == -7
True
>>> z
-7
>>> y
-1

The way this works is to separate out two concepts: you have observables, which are basically just a list of possible integers that gets whittled down over time, and you have schroedintegers, which consist of:

  1. A set of observables they are interested in
  2. A function which, given an assignment of a concrete integer to each of those observables, produces a concrete integer

So when you perform an arithmetic operation on schroedintegers, it just creates a new one that is interested in the union of the two sides' observables and whose function evaluates both sides and combines the results.

Every time you observe something about the system, it looks at the set of possible answers, picks one uniformly at random, collapses the state of possibilities to only those that would have produced that answer, and then returns it.
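
Concretely, the core of the idea looks something like this. Treat it as an illustrative sketch rather than the real schroedinteger code: it only supports multiplication and equality, and leaves out repr, the other operators and all the heuristics.

import random


class Observable(object):
    """A set of possible integer values that gets whittled down over time."""

    def __init__(self, possible):
        self.possible = set(possible)


def all_assignments(observables):
    """Yield every {observable: value} assignment consistent with the
    current sets of possibilities."""
    if not observables:
        yield {}
        return
    first, rest = observables[0], observables[1:]
    for value in first.possible:
        for tail in all_assignments(rest):
            tail[first] = value
            yield tail


class Schroedinteger(object):
    def __init__(self, observables, evaluate):
        self.observables = list(observables)
        self.evaluate = evaluate  # maps an assignment dict to a concrete int

    def __mul__(self, other):
        return Schroedinteger(
            set(self.observables) | set(other.observables),
            lambda a: self.evaluate(a) * other.evaluate(a),
        )

    def __eq__(self, value):
        # Group the consistent assignments by the answer they would give.
        by_answer = {}
        for a in all_assignments(self.observables):
            by_answer.setdefault(self.evaluate(a) == value, []).append(a)
        # Pick a possible answer at random, then collapse each observable to
        # just the values that could have produced that answer.
        answer = random.choice(list(by_answer))
        for obs in self.observables:
            obs.possible = {a[obs] for a in by_answer[answer]}
        return answer


def schroedinteger(possible):
    """An integer that is one of `possible` but hasn't decided which yet."""
    obs = Observable(possible)
    return Schroedinteger([obs], lambda a: a[obs])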

Performance is… OK. Where by OK I mean "not very good". The set of heuristics used keeps it manageable, but no better than that.

It could be improved if I really cared, but right now this project is a bit of a toy. In particular, most operations are currently O(m * n). Many of them could be fixed quite readily with a little more work – currently the implementation is very generic, and many of the operations admit a nicely specific implementation that I haven't used. e.g. a thing that would be pretty easy to do is to track upper and lower bounds for every schroedinteger and use those to exclude many possibilities.

I also investigated using Z3 to do this. It would be an interesting backend that would remove most of the limitations. Unfortunately the results were really slow and kinda crashy (I had at least one hard-to-minimize segfault), so I've given up on that for now.

All told, an interesting result for about a day of hacking. Right now I plan to leave it where it is unless I come up with a particularly interesting use case. Let me know if you have one.


This entry was posted in programming, Python.

Commons: A board game

I’m overdue for a blog post and half asleep, so you get one of my half-arsed board game design posts instead of something deeper.

The idea is it's a game about the tragedy of the commons. It works as follows (as per usual, numbers and details are essentially picked out of a hat and this has never been play-tested, so it is probably not actually fun. It may, however, contain the seed of a fun game):

You have a deck of 60 event cards, a large collection of resource tokens, and a starting player crown.

Resource tokens are split into the bank (an essentially unlimited collection of tokens), one pool per player, and one central pool.

Each player starts with ten tokens and the central pool starts with five per player. The crown is randomly assigned to one player to start.

Each pool will go up and down over the course of the game. If a player’s pool is ever depleted (goes down to zero or lower) that player drops out of the game. If the central pool is ever depleted, the game is over and everyone loses.

Play proceeds until there is only one player left (that player wins) or until the deck of cards runs out, at which point the person with the most tokens wins (if there is a tie they are joint winners).

Play occurs in a series of rounds:

Starting from the player with the crown and proceeding clockwise, each player draws two cards. One of them they play in front of them, the other they place face down in the middle.

Once everybody has played, add one additional face down card to the cards in the middle, shuffle them, and then draw them one at a time and apply their effects to the central pool.

Cards are as follows:

  • 10 x Lose 1 token
  • 10 x Gain 1 token
  • 6 x Lose 2 tokens
  • 6 x Gain 2 tokens
  • 3 x Lose 3 tokens
  • 3 x Gain 3 tokens
  • 1 x Gain 5 tokens
  • 1 x Lose 5 tokens
  • 1 x If you have > 10 tokens, reset to 10, else do nothing
  • 1 x If you have < 10 tokens, reset to 10, else do nothing
  • 2 x Gain one resource token for every 5 you have, rounded up
  • 2 x Lose one resource token for every 5 you have, rounded up [Note: Yes this will kill you if you’re down to 1]
  • 3 x Give each other pool one resource token. If you don't have enough, deal out all you have remaining, starting with the central pool and then going clockwise from the player with the crown.
  • 3 x Take one resource token from each other pool.
  • 4 x Give 5 resource tokens to the pool with the fewest tokens that isn't yours [Tie break: prefer the central pool if it's tied. Else, prefer the pool clockwise closest to the starting player]
  • 4 x Take 5 resource tokens from the pool with the most tokens, or as many as they have if it’s fewer than 5 [Tie break as above]

Each of these should be available in a variety of values of X, and the distribution should be such that every card is paired with one of the opposite effect. The deck should be quite heavily stacked with “Gain X / Lose X” cards, with the more complicated cards making up a bit under half the deck.
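
As a quick sanity check (nothing more than arithmetic), the counts above do add up to the advertised 60 card event deck:

# Card counts as listed above; the sum should match the 60 card deck.
deck = {
    "Lose 1": 10, "Gain 1": 10,
    "Lose 2": 6, "Gain 2": 6,
    "Lose 3": 3, "Gain 3": 3,
    "Gain 5": 1, "Lose 5": 1,
    "Reset down to 10": 1, "Reset up to 10": 1,
    "Gain 1 per 5 held": 2, "Lose 1 per 5 held": 2,
    "Give each other pool 1": 3, "Take 1 from each other pool": 3,
    "Give 5 to the poorest other pool": 4, "Take 5 from the richest pool": 4,
}
assert sum(deck.values()) == 60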

After a round has been completed, the crown moves one player clockwise and the next round begins.

Design notes

One of the interesting features of this game is that it is very hard to tell what other players are doing. You can observe the effects ("Hmm, your pile is suspiciously large for someone who hasn't been fucking over the central pile") but you can't actually tell who has played what card. There is a huge amount of plausible deniability in every move, especially with the random extra card in the center.

There's also no way to target a specific player (there can't be, because every card also has to be playable on the central pool, which can't make choices).

Overall, the game has ended up as a sort of hilarious head on collision between the tragedy of the commons and the prisoner’s dilemma. I think I like that aspect of it.

Thanks to the other David for helping me think through some of the aspects of this.

This entry was posted in Games.

Automated patch minimization for bug origin finding

Ok, so you know git bisect? You take a commit history and try to find the commit that introduced a problem.

Imagine you had the problem that git bisect solves but uh you hadn’t been very good about commit discipline and you had a whole bunch of large messy commits, several consecutive ones not in a working state, and it wasn’t really clear what happened where.

Not that I know anyone in that situation of course. This is a purely hypothetical scenario.

Well it turns out there’s a solution to this too, and it’s possibly better than git bisect.

First you write a script to detect the case you want. You can do this with git bisect too, but people rarely bother because there you only need O(log(n)) tests. In this case you'll need a lot more than that, so you really should write the script. Running a test and grepping its output is usually sufficient here. The script exits 0 if the code exhibits the problem and non-zero otherwise.

These usually end up terrible. It's OK – it's genuinely a one-off and you're not going to have to maintain it. Here's the relevant part of mine:

#!/bin/bash
set -e  # any failing command (e.g. a grep that finds nothing) fails the script

export PYTHONPATH=src
# Run the test, capturing stdout and stderr. It is expected to fail;
# if it passes, this isn't the problem we're looking for.
if python -u -m pytest tests/cover/test_flatmap.py -kmixed_list > test.log 2>&1 ; then
  echo "Test passed"
  exit 1
fi

# It must also fail in the specific way we care about.
grep -q "Extra items in the left set" test.log
grep -q "'0'" test.log

i.e. you run the test and it should fail. Further, its output should contain some particular lines. Sometimes you'll find that your script wasn't specific enough and you'll need to add more conditions.

Now you do what you would do to start git bisect off: you find a pair of commits where the earlier one does not exhibit the property you want to isolate (e.g. "this particular test fails in this particular way") and the later one does.

Now, you take the patch between these two commits:

 git diff --no-prefix master andsoitbegins > working.patch

(yes, my branch is called andsoitbegins. I’m not good at branch names. Or possibly I’m excellent at branch names, you decide)

This produces a patch file. We’re now going to perform file based minimization on that patch to try to find the smallest patch that causes the problem.

I still don't have a good go-to minimizer – I tried using delta for this and continued my streak of never getting a good result from doing so, so once again I wrote my own terribly crappy file-based minimizer.

One of these days I’m going to bite the bullet and just package this all up into a good piece of standalone software. Today might be that day, but it probably isn’t.

What this script does is repeatedly replace the contents of a file and rerun a test script to see whether the problem is still present. It does this by deleting large chunks of lines, then smaller and smaller chunks. If a deletion produces an invalid patch that we can't apply, the test script just exits early and doesn't bother running the test.
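
In outline it's something like the following sketch – not the exact script, and the apply_and_test.py wrapper it calls is a name I'm making up here (it's sketched after the next paragraph):

import subprocess
import sys

PATCH = "working.patch"


def interesting():
    # The full test script exits 0 iff the current working.patch still
    # reproduces the problem (including failing fast if it doesn't apply).
    return subprocess.call([sys.executable, "apply_and_test.py"]) == 0


def try_version(current, candidate):
    # Write the candidate patch; keep it if the bug is still there,
    # otherwise restore the previous known-interesting version.
    with open(PATCH, "w") as f:
        f.writelines(candidate)
    if interesting():
        return True
    with open(PATCH, "w") as f:
        f.writelines(current)
    return False


def minimize():
    with open(PATCH) as f:
        lines = f.readlines()
    chunk = max(len(lines) // 2, 1)
    while chunk > 0:
        i = 0
        while i < len(lines):
            # Try deleting a block of `chunk` lines starting at position i.
            candidate = lines[:i] + lines[i + chunk:]
            if candidate and try_version(lines, candidate):
                lines = candidate
            else:
                i += chunk
        chunk //= 2  # large chunks first, then progressively smaller ones


if __name__ == "__main__":
    minimize()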

Here's what the full test script does: it cleans up the repo, applies the current patch in working.patch, and then runs the tests as described above.
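
In sketch form, assuming the detection script from earlier is saved as check.sh (again, these names are mine, for illustration):

# apply_and_test.py - illustrative sketch of the full test script.
import subprocess
import sys


def run(*cmd):
    return subprocess.call(list(cmd))


# Throw away whatever the previous attempt left in the working tree.
run("git", "checkout", "--", ".")

# Apply the candidate patch. The diff was taken with --no-prefix, so use -p0.
# If it no longer applies, exit early without bothering to run the test.
if run("git", "apply", "-p0", "working.patch") != 0:
    sys.exit(1)

# Run the detection script; its exit status is the verdict.
sys.exit(run("bash", "check.sh"))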

In the end our starting 6kloc patch file is turned into the following:

--- src/hypothesis/searchstrategy/strings.py
@@ -204,10 +204,10 @@ class StringStrategy(MappedSearchStrategy):
         )
 
     def __repr__(self):
-        return u'StringStrategy()'
+        return 'StringStrategy()'
 
     def pack(self, ls):
-        return u''.join(ls)
+        return ''.join(ls)

I had accidentally made the text() strategy return raw strings instead of unicode. This caused surprisingly few problems, but did break at least one unrelated test.

The first part of the patch is actually not required to reproduce the problem. What it was needed for is to locate the relevant lines – there are multiple pack() methods in this file, so apparently patch needed help to disambiguate.

In conclusion: This worked really well. It took what would have been quite a complicated piece of detective work and automated it. A++ experience, and I’ll definitely add this to my general box of tools.

This entry was posted in programming.

2015 in review

I don’t think I’ve done these in previous years, but they seem to be a thing this year and according to todoist I’m due a blog post (overdue by 3 days actually), so here we go.

So… 2015. That was a weird year, huh?

This year I have:

  1. Moved country (again)
  2. Upgraded my internet celebrity status from D-list to C minus-list if you’re in specific niches
  3. Earned precisely £0 (note: This is a slight technicality).

What happened?

Well, as you may recall, in 2014 I moved to Zurich to work for Google. Despite what people thought from the timing of the post, this was not an April Fools' joke but an actual thing.

It didn’t really work out and I lasted about 6 months. For a variety of reasons, Google was not a good environment for me, and between that and it hampering my ability to work on my own stuff in my spare time, I got a lovely inside view of what happens when the sub-clinical depression I’ve been mostly coping with all my life removes the ‘sub-‘ from its name (OK, I didn’t actually get diagnosed, because I figured out how to self-medicate through a judicious application of quitting my job instead).

So, starting the new year unemployed was pretty great, even if I was still in one of the most expensive cities in the world.

I’d intended to work on a variety of projects, and actually spent the initial part of the year brushing up on combinatorial optimisation and learning about integer programming. But I had this weird little project called Hypothesis that I’d written back in 2013 to learn Python, and for some weird reason people had started using it, so I thought I might fix a few bugs, maybe work on some features, etc.

So, um, yeah.

For me 2015 has basically been the year of Hypothesis. This was very much not intended from the outset. I originally intended to put in a couple months work on it. I even made a concerted decision to give it up. Now I’m building a business around it.

It’s been an interesting experience for a number of reasons. It involves research to solve practical problems, randomized algorithms, combinatorial optimization, API design, coming up with good abstractions, and software quality. All it needs is voting to become my ideal project.

…voting and financial viability that is. Which is why I’m trying to build a business around it.

The business side of things is going OK. I’ve had some work this year (which I have not yet paid myself for from the business’s bank account, which is why £0 earned in 2015), and a bunch more scheduled for the new year. If I can keep up my average income over the next three months for the rest of 2016, I’ve got something pretty sustainable. I don’t yet know if I can, but I’m a lot more hopeful about the possibility than I was 3 or 4 months ago.

This has also got me to go to a lot more conferences and give talks, which has been a rather great thing for me. Turns out getting up in front of a few hundred people and talking entertainingly is a thing I can do. I had no idea. I credit this blog with a lot of that.

Other things than Hypothesis have happened this year too…

Writing

Obviously I’ve written stuff this year. I always write stuff. Notable things I’ve written here on this blog include:

This was also the year I got into writing fan-fiction.

It started with Stargate Physics 101. While this is Stargate fan-fiction, it’s really a comedy about software testing and existential threats. “I send this to software developers to teach them about testing” is an actual thing I have been told about this piece. “My job is a lot like this, but with fewer genderqueer werewolves” is an actual thing I have said about this piece.

I’ve also since been writing The Rules of Wishing. This is fan-fiction for Disney’s Aladdin.

I’ve also started writing docs.drmaciver.com, which is facetiously described as a handover manual for being DRMacIver but is mostly a catalogue of stuff right now.

Living situation

For the first time (not counting holidays) in more than a decade, I’ve been living with my parents since May. It’s been an interesting experience.

The living with my parents part has been fine actually. We get along fairly well, even if I occasionally start arguments about shower lengths and ham sandwiches. The real problem with living with them is that they live in the country, and the country is bullshit.

One result of living in the country is that I’m doing much less walking (if this sounds backwards to you, I assure you it is not. Living in the city naturally integrates a lot of walking into your life. Living in the country mostly surrounds you with roads that are not safe to walk on and would take you an hour to get anywhere), which means that my overall fitness level is way down. I’ve been actively trying to combat that in the last month or so, which seems to be having some effect.

Another consequence of living here is that it really helps me grind my driving skill. I’ve probably driven about an order of magnitude more total hours in the last 6 months than in the rest of my life put together.

This has both been interesting and frustrating. I’m still politically quite against cars, and I don’t really like being so dependent on them. However, it’s taken my driving skill from “I don’t think you should trust me behind the wheel of a car” to being pretty OK.

Beeminding

I went from wildly enthusing about beeminder to just quietly using it. It petered out for a while, but now is picking back up again.

My two main things for beeminding are todoist and exercise. I’ve also got some tagtime goals for work and reading non-fiction.

What next?

My plans for 2016 are mostly “do what I’m doing now except more so”. I’m in the process of getting back in shape. I’d like to do that, but more so. I’m in the process of building a business. I’d like to do that, but more so.

I would definitely like to move back to civilization. I’m giving strong consideration to this being not London, but it will have to be somewhere London accessible.

Another thing that will happen in 2016 that I’m excited about is that 2016 will involve travel to at least two countries I’ve never been to before. I’m going to Namibia in January (and offering a great deal on Hypothesis training until I do!), and later in the year I’ll be going to Mexico for my brother’s wedding.

I tend not to travel much, so both should be interesting experiences.

Other than that, I have some interesting research directions I want to take Hypothesis/Conjecture in, and may be exploring some plans for how to focus on that. Watch this space.


This entry was posted in Uncategorized.