Category Archives: Hypothesis

Looking into doing a PhD

As regular readers of this blog have probably figured out, I’m a researchy sort of person.

A lot of my hobbies – maths, voting theory, weird corners of programming, etc – are research oriented, and most of my work has had some sort of research slant to it.

The last two years I’ve basically been engaged in a research project working on Hypothesis. It’s come quite far in that time, and I feel reasonably comfortable saying that it’s the best open source property based testing library on most metrics you’d care to choose. It has a number of novel features and implementation details that advance the state of the art.

It’s been pretty great working on Hypothesis like this, but it’s also been incredibly frustrating.

The big problem is that I do not have an academic background. I have a masters in mathematics (more technically I have a BA, an MA, and a CASM. Cambridge is weird. It’s entirely equivalent to a masters in mathematics though), but that’s where I stopped. Although it says “DR” in my online handle and the domain of this blog, those are just my initials and not my qualification.

As a result, I have little to no formal training or experience in doing academic research, and a similarly low understanding of who’s who and what’s what within the relevant fields. So I’ve been reading papers and trying to figure out the right people to talk to all on my own, and while it’s gone OK it’s still felt like fumbling around in the dark.

Which leads to the obvious solution that I spoilered in the title: If the problem is that I’m trying to do research outside of an academic context, the solution is to do research in an academic context.

So I’d like to do a PhD that is either about Hypothesis, or about something close enough to Hypothesis that each can benefit from the other.

There’s probably enough novel work in Hypothesis already that I could “just” clean it up, factor it out, and turn it into a PhD thesis as it is, but I’m not really expecting to do that (though I’d like that to be part of it). There are a number of additional directions that I think it would be worth exploring, and I expect most PhD funding will come with a focus subject attached which I would be happy to adapt to (a lot of the most interesting innovations in Hypothesis came because some external factor forced me to think about things in ways I wouldn’t otherwise have!). If you’d like to know more, I’ve written up a fairly long article about Hypothesis and why I think it’s interesting research on the main Hypothesis site.

Which, finally, brings me to the main point of the post: What I want from you.

I’m already looking into and approaching potential universities and interesting researchers there who might be good supervisors or able to recommend people who are. I’ve been in touch with a couple (some of whom might be reading this post. Hi), but I would also massively appreciate suggestions and introductions.

So, if you work in relevant areas or know of people who do and think it would be useful for me to talk to, please drop me an email at [email protected]. Or just leave a comment on this blog post, tweet at me, etc.


Randomized exhaustive exploration of a many branching tree

Attention conservation notice: This post is largely an obvious-in-retrospect solution to an obscure problem.

It’s a problem I spent a lot of time considering wrong or complicated solutions to, though (this is what the persistent data structure I was considering was for, but it’s not needed in the end), so I thought it was worth writing up the correct solution.

The problem is this: I have a tree of unknown structure. Each node in the tree is either a leaf or has an ordered sequence of N (variable per node) children (technically the leaves are just a special case of this with N=0, but leaves have special properties as we’ll see in a moment).

I want to do an exhaustive search of the leaves of this tree, but there’s a limitation: when walking this tree I may only walk complete branches starting from the root and finishing at a leaf. I cannot back up and choose another branch.

The tree is potentially very large, to the point where an exhaustive search might be impossible. So this should actually be treated as an anytime algorithm, where we’re iteratively yielding new paths to a leaf which some external algorithm is then doing something with. When yielding those we want to guarantee:

  • We never yield the same path twice
  • We know when we have explored the entire tree and stop

It’s easy to do a lexicographic exploration of the branches in sorted order – just keep a trail of the choices you took. At each step, increment the last choice; if that causes it to exceed the number of choices available, pop it from the end and increment the new last choice (carrying as needed). When making a choice you haven’t made before, always choose zero.
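
Here’s a minimal sketch of that in Python. The num_children and get_child callbacks are hypothetical stand-ins for actually walking the real tree (a node with zero children is a leaf), and each iteration of the outer loop is one complete walk from the root to a leaf:

    def lexicographic_paths(root, num_children, get_child):
        """Lexicographic exploration: yields each root-to-leaf path exactly once."""
        trail = []  # choices to replay on the next walk from the root
        while True:
            # Walk a complete branch: replay the trail, choosing 0 whenever we
            # reach a choice we haven't made before.
            node, path, counts = root, [], []
            while num_children(node) > 0:
                choice = trail[len(path)] if len(path) < len(trail) else 0
                counts.append(num_children(node))
                path.append(choice)
                node = get_child(node, choice)
            yield tuple(path)
            # Increment the last choice, popping while it would exceed the
            # number of choices available at that node.
            while path and path[-1] + 1 >= counts[-1]:
                path.pop()
                counts.pop()
            if not path:
                return  # the whole tree has been explored
            path[-1] += 1
            trail = path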

But there’s a reason you might not want to do this: Under certain reasonable assumptions, you may end up spending a lot of time exploring very deep leaves.

Those reasonable assumptions are this:

  • All else being equal, we should prefer leaves which are closer to the root (e.g. because each walk to a child is expensive, so it takes less time to explore them and we can explore more leaves / unit time)
  • The expected depth of the tree below any node, given uniformly random choices, is usually much lower than the maximum depth of the tree below that point.

Under these assumptions the optimal solution for exploring the tree is instead to just randomly walk it: You will avoid the long paths most of the time because of the expectation assumption, and will do a better job of exploring close to the root.

Unfortunately, pure random exploration does not satisfy our requirements: We might generate duplicate paths, and we don’t know when we’ve explored all the nodes.

So what we want is a hybrid solution that exhaustively searches the tree without duplicates, but chooses its paths as randomly as possible.

Anyway, there turns out to be a very simple way of doing this. It’s as follows:

We maintain a list of paths to unexplored nodes, bucketed by their length. Initially this contains just the empty path, which leads to the root.

We then repeatedly iterate the following steps until there are no unexplored nodes in the list:

  1. Pick, uniformly at random, an unexplored node whose path is of minimal length, and remove it from the list
  2. Walk to that node
  3. Walk randomly below that node. Every time we make a random choice of child, add to the list of unexplored nodes the path we took to get to this point extended by each of the other choices we could have taken.
  4. When we reach a leaf node, yield that path to the controlling algorithm.

Every path we walk goes through a node we have not previously explored, so must be unique. When the list is empty we’ve explored all the nodes and are done.
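
Here’s a rough sketch of that loop in Python, using the same hypothetical num_children/get_child callbacks as before. Paths are stored as plain tuples for clarity; a more efficient representation is discussed below:

    import random
    from collections import defaultdict

    def random_exhaustive_paths(root, num_children, get_child):
        """Yields every root-to-leaf path exactly once, preferring short paths."""
        unexplored = defaultdict(list)  # path length -> paths to unexplored nodes
        unexplored[0].append(())        # the empty path leading to the root
        while unexplored:
            # 1. Pick an unexplored node of minimal path length uniformly at random.
            depth = min(unexplored)
            bucket = unexplored[depth]
            path = bucket.pop(random.randrange(len(bucket)))
            if not bucket:
                del unexplored[depth]
            # 2. Walk from the root to that node.
            node = root
            for choice in path:
                node = get_child(node, choice)
            # 3. Walk randomly below it, recording every alternative not taken.
            while num_children(node) > 0:
                n = num_children(node)
                choice = random.randrange(n)
                for other in range(n):
                    if other != choice:
                        unexplored[len(path) + 1].append(path + (other,))
                path += (choice,)
                node = get_child(node, choice)
            # 4. We've reached a leaf: yield the path to it.
            yield path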

The correct storage format for the paths is just an immutable singly linked list, stored with the last node in the path first – appending a node to the path is just consing it to the front of the list. You’re going to have to do O(n) work to reverse it before walking from the root, but that’s OK: you’re going to have to do O(n) work anyway, with the same n and a higher constant factor, to walk the tree along that path.

The reason for my original need for a data structure with better forward iteration than that was because the algorithm I was originally considering always picked the unexplored node with the shortlex minimal (lexicographically first amongst those with the shortest length) path to it, but there’s no real reason you need to do that and it makes things much more expensive.

If you want to instead do weighted random walks of the tree and are prepared to relax the requirements around exploring shorter paths first a bit, there’s an alternative algorithm you can use (and I’m actually more likely to use this one in practice because I’ll need to do weighted draws) which works as follows:

We build a shadow tree in parallel every time we walk the real tree. The shadow tree is mutable, and each node in it is in one of three states: Unexplored, Active or Finished. It starts with a single root node which is Unexplored.

We maintain the following invariant: Any node in the shadow tree which is not Finished has at least one child node that is not Finished. In particular, leaf nodes are always Finished.

We then decide how to walk the tree as follows: We always walk the shadow tree in parallel, matching our walk of the real tree. Whenever we walk to a node that is Unexplored, we do one of two things:

  1. If the corresponding real node is a leaf, we immediately mark it as Finished (and our walk is now done)
  2. If the corresponding real node has N children, we mark it as Active and create N Unexplored child nodes for it.

Once our walk is complete (i.e. we’ve reached a leaf), we take the list of shadow tree nodes that we encountered and walk it backwards. If all of the children of the current node are Finished, we mark the node as Finished as well; if not, we stop the iteration.

We can now use the shadow tree to guide our exploration of the real tree as follows: Given some random process for picking which child of the real tree we want to navigate to, we simply modify the process to never pick a child whose corresponding shadow node is Finished.

We can also use the shadow tree to determine when to stop: Once the root node is marked as Finished, all nodes have been explored and we’re done.
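
Here’s a sketch of the shadow tree bookkeeping, again with the hypothetical num_children/get_child callbacks, and with an unweighted random choice standing in for whatever weighted process you actually want to use over the non-Finished children:

    import random

    UNEXPLORED, ACTIVE, FINISHED = "unexplored", "active", "finished"

    class ShadowNode:
        def __init__(self):
            self.state = UNEXPLORED
            self.children = []

    def shadow_guided_walk(root, num_children, get_child, shadow):
        """One walk of the real tree guided by the (mutable) shadow tree.
        Returns True once the root is Finished, i.e. the tree is fully explored."""
        assert shadow.state != FINISHED
        node, snode = root, shadow
        trail = [snode]
        while True:
            if snode.state == UNEXPLORED:
                n = num_children(node)
                if n == 0:
                    snode.state = FINISHED  # the real node is a leaf
                    break
                snode.state = ACTIVE
                snode.children = [ShadowNode() for _ in range(n)]
            # Never pick a child whose shadow node is Finished. Any random
            # process restricted to these options will do.
            options = [i for i, c in enumerate(snode.children) if c.state != FINISHED]
            choice = random.choice(options)
            node, snode = get_child(node, choice), snode.children[choice]
            trail.append(snode)
        # Walk the encountered nodes backwards, marking each one Finished
        # once all of its children are Finished.
        for snode in reversed(trail[:-1]):
            if all(c.state == FINISHED for c in snode.children):
                snode.state = FINISHED
            else:
                break
        return shadow.state == FINISHED

Repeated calls with the same shadow root then drive the exploration; you stop as soon as a call returns True.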

Note that these two approaches don’t produce the same distribution at all: if one child has many more branches below it than another, the initial algorithm will spend much more time below that child, while the shadow tree algorithm will explore both equally until it has fully exhausted one of them (even if this results in much deeper paths). Which of these is better probably depends on the use case.

You can also use the shadow tree to guide the exploration more actively if you like, e.g. if there are any Unexplored child nodes you could always prefer one of them. You could also store additional information on it (e.g. you could track the average depth below an explored node and use that to guide the search).

Anyway, both of these algorithms are pretty obvious, but it took me a long time to see that they were pretty obvious so hopefully sharing this is useful.

As an aside: You might be wondering why I care about this problem.

The answer is that it’s for Hypothesis.

As described in How Hypothesis Works, Hypothesis is essentially an interactive byte stream fuzzer plus a library for doing structured testing on top of this. The current approach for generating that bytestream isn’t great, and has a number of limitations. In particular:

  • At the moment it fairly often generates duplicates
  • It cannot currently detect when it can just exhaustively enumerate all examples and then stop (e.g. if you do @given(booleans()) at the moment the result isn’t going to be great).
  • The API for specifying the byte distribution is fairly opaque to Hypothesis and as a result it limits a number of opportunities for improved performance.

So I’m exploring an alternative implementation of the core where you specify an explicit weighted distribution for what the next byte should be and that’s it.

This then opens it up to using something like the above for exploration. The nodes in the tree are prefixes of the bytestream, the leaves are complete test executions. Each non-leaf node has up to 256 (it might be fewer because not every byte can be emitted at every point) child nodes which correspond to which byte to emit next, and then we can treat it as a tree exploration problem to avoid all of the above issues.
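
Purely as an illustration of that mapping (valid_next_bytes is a hypothetical helper standing in for whatever the new API would expose, with a toy placeholder body here), the callbacks from the earlier sketches would be something like:

    def valid_next_bytes(prefix):
        # Hypothetical: ask the generation API which bytes may follow this prefix.
        # Toy placeholder: any byte is allowed up to a fixed depth.
        return list(range(256)) if len(prefix) < 4 else []

    def num_children(prefix):
        return len(valid_next_bytes(prefix))

    def get_child(prefix, i):
        # A child node is just the prefix extended by one of the legal bytes.
        return prefix + bytes([valid_next_bytes(prefix)[i]])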

I’ve currently got a branch exploring just changing the API and keeping the existing implementation to see how feasible rebuilding Hypothesis this way is. So far the answer is that it’ll be a fair bit of work but looks promising. If that promise pans out, one of the above (probably the shadow tree implementation) is likely coming to a Hypothesis near you at some point in 2017.

(Why yes, it is a programming post! I do still do those. I know these are a bit thin on the ground right now. If you want more, consider donating to my Patreon and encouraging me to write more).


Declaring code bankruptcy for the rest of 2016

This is a small PSA.

It probably hasn’t been too visible from the outside, but I’ve not been doing very well recently.

In particular I’ve been finding my productivity has pretty much gone through the floor over the last couple months. 2016 is stressing me out for a whole pile of reasons (about 80% the same ones it’s stressing everyone else out), and I’m not dealing with it very well. This is making it very difficult to stay on track and motivated.

It’s not a problem when I’ve got something external to focus me (e.g. a client), but when I have to self-motivate on a programming project I find that I can’t right now, and the result is that I’m unproductive, which makes me depressed, which makes me even less productive.

So I’ve decided to stop. If I’m not getting any programming done and feeling bad about it, it’s clearly better to not get any programming done and not feel bad about it. Being depressed isn’t doing anyone any favours, let alone me. So, for the rest of the year I will not be writing any code unless someone is explicitly paying me to write it.

I currently have one client and a few potential clients, and my obligations to them will absolutely still be met (and probably be met a lot better than they would have in my previous mood). I am happy to accept new clients, but I probably won’t be actively seeking them out until the new year.

I’m also going to keep reviewing pull requests on Hypothesis and doing Hypothesis releases with other people’s stuff (there will probably be some Hypothesis releases with paid work from me too).

I’m probably also going to keep coding when it’s required to solve an immediate problem I have, and maybe when I need it to answer a question for a blog post or something.

So it’s not a complete cessation, but what it is is a freedom from a sense of obligation: If it doesn’t fall into one of these categories then I shouldn’t be doing it, and as a result I should be looking for something else to do if I don’t have any programming to do in one of those categories rather than procrastinating to avoid some vague sense of obligation to be coding.

This is going to give me a lot of time that I’m currently filling with failing to program in, so I’ll probably end up with some non-programming projects. I don’t yet know what these are going to be, but I’ve got a couple candidates:

  • When I first started working on Hypothesis I was taking a work break in which I’d intended to brush off some mathematics textbooks. This didn’t happen. It might happen this time. I started working through some of the exercises in Bollobás’s Combinatorics earlier and I’m finding it surprisingly enjoyable. I may keep this up.
  • I’ve been working on getting in shape the last couple of months. I don’t want to throw myself into this too vigorously because I’m more likely to just do myself an injury, but I’m likely to step this up at least moderately.
  • The War On Sleep must still be waged.
  • I’ve been thinking of doing NaNoWriMo, but I probably won’t.

Other than that, I don’t know. Watch this space while I find out.


Brand split

Short version

Expect less directly Hypothesis related content here. If you want lots of Hypothesis related content, go over to hypothesis.works and check out the articles.

There will still be brain dumps of research, reflections on the social aspects of Hypothesis, etc. but most of the more directly expository content will be over there.

Note: If you read this from Planet Python, all of the Python specific articles will already be coming in automatically. There are a bunch of non Python related ones you may also wish to check out.

What’s going on?

Basically, I need to make money. I’m making some, but I’m not making enough. I am currently paying myself a salary (yay!), but it’s a salary such that, if any friend of mine were being paid it, I would tell them their employer was screwing them over and they should fire that employer into the sun (note: this is not qualified by “any friend in the software industry”). Some of that is to build up liquidity in my business bank account (e.g. so I can pay contractors); some of it is just that my company’s current revenue stream is not very large. Books are nice, and I’m very grateful to the small number of contributors on Bountysource, but really all of the money comes from training, consulting and development contracts, and right now I am not taking on even close to enough of them.

If I continue making money at the current rate, then at some point later this year (pencilled in for the end of August if my revenue doesn’t increase by then, or the end of 2016 if my savings don’t start going up rather than down by then) I really will have to throw in the towel and get a day job. Ideally a part time one, and either way Hypothesis R&D will continue and I will probably still be available for contracting, training, etc. (and of course any existing commitments will be honoured), but it is going to have to be downgraded to a part time thing.

What am I going to do about it?

The contracts I’ve had so far have gone very well, and both I and my customers have come away from them pretty happy. Some of them will turn into repeat custom, though unfortunately not on a short enough cycle to be a path to sustainability on their own yet.

The problem is not that the contracts don’t pay enough (they do) or that I don’t know what sort of services bring real value to clients, but that I am simply not getting enough of them.

Why am I not getting enough of them? It turns out that the answer is quite straightforward: very few people are aware that they’re an option, and most of those who are aware aren’t really sure it’s an option they need. I think they’re wrong, but clearly my arguments to date have been insufficiently persuasive.

Well, it turns out that there are two closely related disciplines that help with the problem “I have a great product that not enough people know about, and the people who do know about it aren’t buying”. They are, of course, marketing and sales respectively.

These are things that historically I have been completely crap at. A lot of the reason nobody knows this is a thing I do is that I haven’t told them.

Oh, there’s been stuff on this site, but let’s be honest: this site is very obviously just some guy’s blog. If you’re someone with budget and you come here, you don’t see a vendor; you see some guy ranting about social dynamics or game design. The best thing for really selling people on Hypothesis has been the documentation, which as well as being very developer focused is also doing a rubbish job of marketing.

So as well as being a generally great place to go for Hypothesis content, the goal of hypothesis.works is to have a single place which is just about Hypothesis and how great it is, and to make it very obvious that it is a thing you can pay for.

What can you do to help me?

The Hypothesis community is full of lovely people and I do get lots of offers of help. Unfortunately I haven’t made it easy to be helped. Some of this is because it’s intrinsically difficult – a small community of lovely people can’t and shouldn’t donate enough to support me – and some of it is just the aforementioned lack of skill at marketing that I’m working on.

If you don’t want to help, that’s OK. I won’t hate you. If you can’t help that’s definitely OK. But if you want to help, you can do some of the following…

Easy things:

  1. Donate on Bountysource.
  2. Buy me books. This won’t actually help in the financial sense, but it’ll sure make me feel better :-)
  3. Buy my book. OK. You can’t actually do that yet. But watch this space.
  4. Share links to hypothesis.works articles, both on social media and internally to your company.
  5. Slightly harder variation: Write posts about how great Hypothesis is and link to hypothesis.works. If you’d like to guest post on hypothesis.works about how great Hypothesis is, let me know.

Hard things:

The reality is that the best way to help me is by giving me contracts. I don’t want charity, I want to exchange goods and services for currency. As an individual you’re not well placed to help with that.

As an employee (or contractor/consultant) you might be. If you think your company, or a company you know, would find either the services or the training I offer useful, talk to them about it. Also, if I can do anything to improve the site to make that job easier, get in touch and tell me about it.


Language reconstruction based fuzzing without reconstructing languages

Note: This idea has not been implemented yet so may turn out not to work. It all clicked together while I was walking around Paris this afternoon and I’m writing it down while it’s fresh in my mind. I think it will work though.

Secondary note: This post may be too inside baseball for you.

An idea that I’ve been banging my head against recently is that of reconstructing languages corresponding to tests based on some bounded data. I’ve posted a bit about its use for reducing, but I’ve also been interested in its use for fuzzing.

Here’s a problem Hypothesis experiences: The | operator is currently not associative. x | (y | z) will produce a different data distribution than (x | y) | z. This is suboptimal. I could special case it (and for a long time I thought I had, but apparently I dreamed that code), but it’s easy to come up with trivial ways to defeat that. e.g. (x | y).map(lambda x: x) | z should really have the same distribution as x | y | z but doesn’t if you special case |.

But under the hood Hypothesis just works with streams of bytes, so in theory you can represent its data distribution as a finite state machine. You could then randomly walk the state machine, reweighting your choices by the number of reachable states, and thus get much better coverage of the set of available paths by reaching rare paths much more often.

Or, somewhat suggestively, here’s another algorithm that could work well and is a lot simpler than reweighting:

  1. Pick a state uniformly at random
  2. Take the shortest path to that state (or, indeed, any random path to that state).
  3. Continue the random walk from there

Unfortunately we run into a major problem with either of these approaches: Reconstructing regular languages for unknown predicates is really hard.

The insight I had this afternoon is this: We don’t need to! We can use the same trick as I used to produce self-optimizing boolean expressions to just store exemplars of distinct states we’ve found. We then just pick a random state we’ve already discovered and run the generation from there (Hypothesis can extend any string to a complete but possibly invalid example).

How does this work?

We maintain three sets of data:

  1. A corpus of examples such that we know that every element of this list definitely corresponds to a different state in the state machine for our predicate.
  2. A set of previously seen strings (this doesn’t have to be precise and can e.g. be a bloom filter)
  3. A binary search tree with branches labelled by strings and leaves labelled by indices into our list of examples.

We populate this as follows:

  1. The corpus starts off containing just the empty string
  2. The set of previously seen strings contains just the empty string
  3. The binary tree is a single leaf pointing to 0 (the index of the empty string).

We then fuzz as follows:

  1. Pick a random element of our corpus.
  2. Generate an example such that that random element is a strict prefix of the example (it’s easy to do this in Hypothesis).
  3. For each proper prefix of that example, perform the incorporate operation.

The incorporate operation when run on a string s is:

  1. If the string is in our set of seen strings, stop. Else, add it to the set.
  2. Walk the binary tree until we hit a leaf. At each branch look at the label e. If the concatenation of s followed by e is valid, walk right. Else, walk left.
  3. Once we reach a leaf, find the string in our corpus indexed by it. Call that t. By repeatedly fuzzing extensions of s and t, attempt to find an e such that s + e and t + e produce different results. If we do find such an e, add s to the corpus as a new element and split the current tree node, creating a branch labelled by e whose left and right children are leaves containing s and t (which one is which depending on which way around the difference occurred). Otherwise, if s is a better example than t (shorter, or the same length and lexicographically smaller), replace t in the corpus with s.

The experiments we split on witness that two nodes must be in different states (cf. the Myhill-Nerode theorem), so we guarantee that our corpus consists of unique states. The fuzzing step is probabilistic, so we can sometimes discard a state inappropriately, but this shouldn’t be a major problem in practice – states that we can’t distinguish with high probability might as well be the same from a fuzzing point of view.
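
Since none of this is implemented yet, here is only a speculative sketch of the bookkeeping described above, not working Hypothesis code. is_valid, find_distinguishing_suffix and generate_with_prefix are hypothetical callbacks: the first runs the predicate on a completed string, the second fuzzes extensions of two strings looking for a suffix e on which they disagree (returning None if it fails), and the third generates a complete example extending a given prefix:

    import random

    class StateCorpus:
        def __init__(self, is_valid, find_distinguishing_suffix):
            self.is_valid = is_valid
            self.distinguish = find_distinguishing_suffix
            self.corpus = [b""]       # exemplars of provably distinct states
            self.seen = {b""}         # could be a Bloom filter instead of a set
            self.tree = {"leaf": 0}   # decision tree; leaves index into the corpus

        def incorporate(self, s):
            if s in self.seen:
                return
            self.seen.add(s)
            # Walk the decision tree: at a branch labelled e, go right iff s + e
            # is valid, otherwise go left.
            node = self.tree
            while "leaf" not in node:
                node = node["right"] if self.is_valid(s + node["label"]) else node["left"]
            t = self.corpus[node["leaf"]]
            e = self.distinguish(s, t)
            if e is not None:
                # s and t provably lie in different states: split this leaf on e.
                self.corpus.append(s)
                s_leaf = {"leaf": len(self.corpus) - 1}
                t_leaf = {"leaf": node.pop("leaf")}
                node["label"] = e
                if self.is_valid(t + e):
                    node["left"], node["right"] = s_leaf, t_leaf
                else:
                    node["left"], node["right"] = t_leaf, s_leaf
            elif (len(s), s) < (len(t), t):
                # Couldn't distinguish them: keep whichever exemplar is shorter,
                # or lexicographically smaller at the same length.
                self.corpus[node["leaf"]] = s

        def fuzz_step(self, generate_with_prefix):
            # One fuzzing iteration: extend a random known state, then try to
            # incorporate every proper prefix of the resulting example.
            prefix = random.choice(self.corpus)
            example = generate_with_prefix(prefix)
            for i in range(len(example)):
                self.incorporate(example[:i])
            return example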

Anyway, like I said, it’s currently unclear how useful this is because I haven’t actually implemented it! In particular I suspect it’s more useful for long running fuzzing processes than for the sort of fast fuzzing Hypothesis tries to do. Fortunately I’m planning how to incorporate such workflows into Hypothesis. However I definitely still think it’s a good idea and I’m looking forward to exploring it further.
