Category Archives: Python

Randomized exhaustive exploration of a many-branching tree

Attention conservation notice: This post is largely an obvious-in-retrospect solution to an obscure problem.

It’s a problem I spent a lot of time considering wrong or complicated solutions to, though (this is what the persistent data structure I was considering was for, but it’s not needed in the end), so I thought it was worth writing up the correct solution.

The problem is this: I have a tree of unknown structure. Each node in the tree is either a leaf or has an ordered sequence of N (variable per node) children (technically the leaves are just a special case of this with N=0, but leaves have special properties as we’ll see in a moment).

I want to do an exhaustive search of the leaves of this tree, but there’s a limitation: when walking this tree I may only walk complete branches starting from the root and finishing at a leaf. I cannot back up and choose another branch.

The tree is potentially very large, to the point where an exhaustive search might be impossible. So this should actually be treated as an anytime algorithm, where we’re iteratively yielding new paths to a leaf which some external algorithm is then doing something with. When yielding those we want to guarantee:

  • We never yield the same path twice
  • We know when we have explored the entire tree and stop

It’s easy to do a lexicographic exploration of the branches in sorted order – just keep a trail of the choices you took. After each walk, increment the last choice; if that causes it to exceed the number of choices available at that node, pop it from the end and increment the previous choice instead. Whenever you descend past the end of the trail, always choose zero.
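For concreteness, here’s a minimal sketch of that lexicographic walk. The tree interface is hypothetical – a begin_walk() function that starts a fresh walk at the root and returns a walker exposing num_children() (zero for a leaf) and choose(i) to descend to the i-th child – since the only thing the algorithm needs is the ability to walk root-to-leaf.

```python
def lexicographic_paths(begin_walk):
    """Yield every root-to-leaf path (as a list of child indices) in sorted order."""
    trail = []  # (choice taken, number of children) for each node on the current path
    first = True
    while True:
        if not first:
            # Increment the last choice, popping entries with no siblings left.
            while trail and trail[-1][0] + 1 >= trail[-1][1]:
                trail.pop()
            if not trail:
                return  # every branch has been explored
            choice, n = trail[-1]
            trail[-1] = (choice + 1, n)
        first = False
        walker = begin_walk()
        # Replay the prefix we have already committed to...
        for choice, _ in trail:
            walker.choose(choice)
        # ...then always take child zero until we reach a leaf.
        while walker.num_children() > 0:
            trail.append((0, walker.num_children()))
            walker.choose(0)
        yield [choice for choice, _ in trail]
```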

But there’s a reason you might not want to do this: Under certain reasonable assumptions, you may end up spending a lot of time exploring very deep leaves.

Those reasonable assumptions are this:

  • All else being equal, we should prefer leaves which are closer to the root (e.g. because each walk to a child is expensive, so shallower leaves take less time to explore and we can explore more leaves per unit time)
  • The expected depth of the tree below any node, given uniformly random choices, is usually much lower than the maximum depth of the tree below that point.

Under these assumptions the optimal solution for exploring the tree is instead to just randomly walk it: You will avoid the long paths most of the time because of the expectation assumption, and will do a better job of exploring close to the root.

Unfortunately, pure random exploration does not satisfy our requirements: We might generate duplicate paths, and we don’t know when we’ve explored all the nodes.

So what we want is a hybrid solution that exhaustively searches the tree without duplicates, but chooses its paths as randomly as possible.

Anyway, there turns out to be a very simple way of doing this. It’s as follows:

We maintain a list of paths to unexplored nodes, bucketed by their length. This is initially the empty path leading to the root.

We then repeatedly iterate the following steps until there are no unexplored nodes in the list:

  1. Pick an unexplored node of minimal length uniformly at random and remove it from the list
  2. Walk to that node
  3. Walk randomly below that node. Every time we make a random choice of child, add to the list of unexplored nodes the path we took to get to this point extended by each of the other choices we could have made.
  4. When we reach a leaf node, yield that path to the controlling algorithm.

Every path we walk goes through a node we have not previously explored, so must be unique. When the list is empty we’ve explored all the nodes and are done.
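Here is a minimal sketch of this in Python, using the same hypothetical begin_walk()/num_children()/choose(i) interface as above and plain lists for paths (the cons-list representation discussed below is an optimisation of the same idea):

```python
import random
from collections import defaultdict

def random_exhaustive_paths(begin_walk):
    # Paths to unexplored nodes, bucketed by length; initially just the root.
    unexplored = defaultdict(list)
    unexplored[0].append([])
    while unexplored:
        # 1. Pick an unexplored node of minimal length uniformly at random.
        shortest = min(unexplored)
        bucket = unexplored[shortest]
        i = random.randrange(len(bucket))
        bucket[i], bucket[-1] = bucket[-1], bucket[i]  # O(1) removal
        path = bucket.pop()
        if not bucket:
            del unexplored[shortest]
        # 2. Walk to that node.
        walker = begin_walk()
        for choice in path:
            walker.choose(choice)
        # 3. Walk randomly below it, recording every choice we didn't take.
        while walker.num_children() > 0:
            n = walker.num_children()
            choice = random.randrange(n)
            for other in range(n):
                if other != choice:
                    unexplored[len(path) + 1].append(path + [other])
            path.append(choice)
            walker.choose(choice)
        # 4. We've reached a leaf; hand the path to the controlling algorithm.
        yield path
```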

The correct storage format for the paths is just an immutable singly linked list stored with the last node in the path first – appending a node to the path is just consing it to the front of the list. You’re going to have to do O(n) work to reverse it before you can replay it from the root, but that’s OK: re-walking the tree along that path is O(n) work anyway, with the same n and a higher constant factor.
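A minimal sketch of that representation, using Python tuples as cons cells:

```python
EMPTY = None

def cons(choice, path):
    # Appending a choice to a path is O(1): just cons it onto the front.
    return (choice, path)

def to_choices(path):
    # O(n) reversal into root-to-node order, done just before re-walking the tree.
    out = []
    while path is not None:
        choice, path = path
        out.append(choice)
    out.reverse()
    return out
```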

The reason for my original need for a data structure with better forward iteration than that was because the algorithm I was originally considering always picked the unexplored node with the shortlex minimal (lexicographically first amongst those with the shortest length) path to it, but there’s no real reason you need to do that and it makes things much more expensive.

If you want to instead do weighted random walks of the tree and are prepared to relax the requirements around exploring shorter paths first a bit, there’s an alternative algorithm you can use (and I’m actually more likely to use this one in practice because I’ll need to do weighted draws) which works as follows:

We build a shadow tree in parallel every time we walk the real tree. The shadow tree is mutable, and each node in it is in one of three states: Unexplored, Active or Finished. It starts with a single root node which is Unexplored.

We maintain the following invariant: Any node in the shadow tree which is not Finished has at least one child node that is not Finished. In particular, leaf nodes are always Finished.

We then decide how to walk the tree as follows: We always walk the shadow tree in parallel, matching our walk of the real tree. Whenever we walk to a node that is Unexplored, we do one of two things:

  1. If the corresponding real node is a leaf, we immediately mark it as Finished (and our walk is now done)
  2. If the corresponding real node has N children, we mark it as Active and create N Unexplored child nodes for it.

Once our walk is complete (i.e. we’ve reached a leaf), we take the list of shadow tree nodes we encountered and walk it backwards: if all of the children of the current node are Finished, we mark the node as Finished as well; if not, we stop the iteration.

We can now use the shadow tree to guide our exploration of the real tree as follows: Given some random process for picking which child of the real tree we want to navigate to, we simply modify the process to never pick a child whose corresponding shadow node is Finished.

We can also use the shadow tree to determine when to stop: Once the root node is marked as Finished, all nodes have been explored and we’re done.
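Here’s a minimal sketch of the shadow tree approach, again assuming the hypothetical begin_walk()/num_children()/choose(i) interface from above, plus a caller-supplied pick(options) function so you can plug in whatever (possibly weighted) random choice you want:

```python
import random

class ShadowNode:
    def __init__(self):
        self.children = None   # None while Unexplored, a list once Active
        self.finished = False

def shadow_guided_paths(begin_walk, pick=random.choice):
    root = ShadowNode()
    while not root.finished:
        walker = begin_walk()
        shadow, trail, path = root, [root], []
        while True:
            if shadow.children is None:  # Unexplored
                n = walker.num_children()
                if n == 0:
                    shadow.finished = True  # leaves are immediately Finished
                    break
                shadow.children = [ShadowNode() for _ in range(n)]  # now Active
            # Never pick a child whose shadow node is already Finished.
            options = [i for i, c in enumerate(shadow.children) if not c.finished]
            choice = pick(options)
            walker.choose(choice)
            path.append(choice)
            shadow = shadow.children[choice]
            trail.append(shadow)
        # Walk back up the encountered nodes, finishing any whose children
        # are all Finished and stopping at the first that isn't.
        for node in reversed(trail[:-1]):
            if all(c.finished for c in node.children):
                node.finished = True
            else:
                break
        yield path
```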

Note that these two approaches don’t produce at all the same distribution: If one child has many more branches below it than another, the initial algorithm will spend much more time below that child, while the shadow tree algorithm will explore both equally until it has fully exhausted one of them (even if this results in much deeper paths). It probably depends on the use case which of these is better.

You can use the shadow tree to further guide the exploration if you like, e.g. if there are any Unexplored child nodes you could always choose one of them. You could also store additional information on it (e.g. you could track the average depth below each explored node and use that to guide the search).

Anyway, both of these algorithms are pretty obvious, but it took me a long time to see that they were pretty obvious so hopefully sharing this is useful.

As an aside: You might be wondering why I care about this problem.

The answer is that it’s for Hypothesis.

As described in How Hypothesis Works, Hypothesis is essentially an interactive byte stream fuzzer plus a library for doing structured testing on top of this. The current approach for generating that bytestream isn’t great, and has a number of limitations. In particular:

  • At the moment it fairly often generates duplicates
  • It cannot currently detect when it can just exhaustively enumerate all examples and then stop (e.g. if you do @given(booleans()) at the moment the result isn’t going to be great).
  • The API for specifying the byte distribution is fairly opaque to Hypothesis and as a result it limits a number of opportunities for improved performance.

So I’m exploring an alternative implementation of the core where you specify an explicit weighted distribution for what the next byte should be and that’s it.

This then opens it up to using something like the above for exploration. The nodes in the tree are prefixes of the bytestream and the leaves are complete test executions. Each non-leaf node has up to 256 child nodes (it might be fewer, because not every byte can be emitted at every point) corresponding to which byte to emit next, and then we can treat it as a tree exploration problem and avoid all of the above issues.
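A hypothetical sketch of how the two ideas might combine – none of these names are Hypothesis’s real internals, and the shapes of weights and shadow_node.children are assumptions made purely for illustration:

```python
import random

def draw_byte(shadow_node, weights):
    # `weights` maps each byte that may legally be emitted next to its weight;
    # `shadow_node.children` maps those bytes to shadow nodes as in the sketch above.
    options = {
        byte: w
        for byte, w in weights.items()
        if not shadow_node.children[byte].finished
    }
    # The shadow tree invariant guarantees at least one non-Finished option remains.
    return random.choices(list(options), weights=list(options.values()), k=1)[0]
```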

I’ve currently got a branch exploring just changing the API and keeping the existing implementation to see how feasible rebuilding Hypothesis this way is. So far the answer is that it’ll be a fair bit of work but looks promising. If that promise pans out, one of the above (probably the shadow tree implementation) is likely coming to a Hypothesis near you at some point in 2017.

(Why yes, it is a programming post! I do still do those. I know these are a bit thin on the ground right now. If you want more, consider donating to my Patreon and encouraging me to write more).


Declaring code bankruptcy for the rest of 2016

This is a small PSA.

It probably hasn’t been too visible from the outside, but I’ve not been doing very well recently.

In particular I’ve been finding my productivity has pretty much gone through the floor over the last couple months. 2016 is stressing me out for a whole pile of reasons (about 80% the same ones it’s stressing everyone else out), and I’m not dealing with it very well. This is making it very difficult to stay on track and motivated.

It’s not a problem when I’ve got something external to focus me (e.g. a client), but when I have to self-motivate on a programming project I find that I can’t right now, and the result is that I’m unproductive, which makes me depressed, which makes me even less productive.

So I’ve decided to stop. If I’m not getting any programming done and feeling bad about it, it’s clearly better to not get any programming done and not feel bad about it. Being depressed isn’t doing anyone any favours, let alone me. So, for the rest of the year I will not be writing any code unless someone is explicitly paying me to write it.

I currently have one client and a few potential clients, and my obligations to them will absolutely still be met (and probably be met a lot better than they would have in my previous mood). I am happy to accept new clients, but I probably won’t be actively seeking them out until the new year.

I’m also going to keep reviewing pull requests on Hypothesis and doing Hypothesis releases with other people’s stuff (there will probably be some Hypothesis releases with paid work from me too).

I’m probably also going to keep coding when it’s required to solve an immediate problem I have, and maybe when I need it to answer a question for a blog post or something.

So it’s not a complete cessation, but it is a freedom from a sense of obligation: if something doesn’t fall into one of these categories then I shouldn’t be doing it, and when I don’t have any programming to do in one of those categories I should be looking for something else to do rather than procrastinating to avoid some vague sense of obligation to be coding.

This is going to give me a lot of time that I’m currently filling with failing to program, so I’ll probably end up with some non-programming projects. I don’t yet know what these are going to be, but I’ve got a couple of candidates:

  • When I first started working on Hypothesis I was taking a work break in which I’d intended to brush off some mathematics textbooks. This didn’t happen. It might happen this time. I started working through some of the exercises in Bollobas’s Combinatorics earlier and I’m finding it surprisingly enjoyable. I may keep this up.
  • I’ve been working on getting in shape the last couple of months. I don’t want to throw myself into this too vigorously because I’m more likely to just do myself an injury, but I’m likely to step this up at least moderately.
  • The War On Sleep must still be waged.
  • I’ve been thinking of doing NaNoWriMo, but I probably won’t.

Other than that, I don’t know. Watch this space while I find out.


It might be worth learning an ML-family language

It’s long been a popular opinion that learning Haskell or another ML-family language will make you a better programmer. I think this is probably true, but I think it’s an overly specific claim because learning almost anything will make you a better programmer, and I’ve not been convinced that Haskell is a much better choice than many other things in terms of reshaping your thinking. I’ve never thought that you shouldn’t learn Haskell of course, I’ve just not been convinced that learning Haskell purely for the sake of learning Haskell was the best use of time.

But… I’ve been noticing something recently when teaching Python programmers to use Hypothesis that has made me reconsider somewhat. Not so much a fundamental reshaping of the way you think as a highly useful microskill that people seem to struggle to learn in dynamically typed languages.

That skill is this: Keeping track of what the type of the value in a variable is.

That may not seem like an important skill in a dynamic language, but it really is: Although functions will tend to be more lenient about what type of value they accept (is it a list or a tuple? Who cares!), they will tend to go wrong in interesting and confusing ways when you get it too wrong, and you then waste valuable debugging time trying to figure out what you did wrong. A good development workflow will typically let you find the problem, but it will still take significantly longer than just not making the mistake in the first place.

In particular this seems to come up when the types are related but distinct. Hypothesis has a notion of a “strategy”, which is essentially a data generator, and people routinely seem to get confused as to whether something is a value of a type, a strategy for producing values of that type, or a function that returns a strategy for producing the type.
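For concreteness, here’s roughly what the distinction looks like with Hypothesis’s actual strategies module (the test function and its body are made up for illustration):

```python
from hypothesis import given, strategies as st

st.integers      # a *function* that returns a strategy
st.integers()    # a *strategy* for producing integers
42               # a plain *value* of the kind the strategy produces

# The classic mistake is passing the wrong one of these around:
@given(st.integers())   # @given takes a strategy...
def test_roundtrip(x):  # ...and x here is a generated value, not a strategy
    assert int(str(x)) == x
```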

It might be that I’ve just created a really confusing API, but I don’t think that’s it – people generally seem to really like the API and this is by far the second most common basic usage error people make with it (the first most common is confusing the functions one_of and sampled_from, which do similar but distinct things. I’m still trying to figure out better names for them).
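For reference, the distinction between those two: sampled_from takes concrete values, while one_of takes strategies.

```python
from hypothesis import strategies as st

st.sampled_from(["red", "green", "blue"])   # picks one of these *values*
st.one_of(st.integers(), st.booleans())     # draws from one of these *strategies*
```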

It took me a while to notice this because I just don’t think of it as a difficult thing to keep track of, but it’s definitely a common pattern. It also appears to be almost entirely absent from people who have used Haskell (and presumably other ML-family languages – any statically typed language with type inference and a bias towards functional programming, really – but I don’t know of anyone who has tried to use Hypothesis knowing an ML-family language without also knowing Haskell).

I think the reason for this is that in an ML family language, where the types are static but inferred, you are constantly playing a learning game with the compiler as your teacher (literally a typing tutor): Whenever you get this wrong, the compiler tells you immediately that you’ve done so and localises it to the place where you’ve made the mistake. The error messages aren’t always that easy to understand, but it’s a lot easier to figure out where you’ve made the mistake than when the error message is instead “AttributeError: ‘int’ object has no attribute ‘kittens’” in some function unrelated to where you made the actual error. In the dynamically typed context, there’s a larger separation between the observed problem and the solution, which makes it harder to learn from the experience.

This is probably a game worth playing. If people are making this error when using Hypothesis, I’d expect them to be making it in many other places too. I don’t expect many of these errors are making it through to production (especially if your code is well tested), but they’re certainly wasting time while developing.

In terms of which ML-family language to choose for this, I’m not sure. I haven’t actually used it myself yet (I don’t really have anything I want to write in the space that it targets), but I suspect Elm is probably the best choice. They’ve done some really good work on making type errors clear and easy to understand, which is likely exactly what you need for this sort of learning exercise.



You’re now all on commission

I would like more customers. This requires me to do more sales and marketing. I am bad at sales and marketing.

The solution is of course to have someone else do it for me. Which is why you’re all now on commission.

Starting from today, I’m offering literally everyone the following deal:

If you introduce me (personally, rather than just a scattershot “You should talk to these people”, though I’ll probably figure out some way to reward the latter too) to a business or anyone else who becomes a customer of mine, I will pay you 20% commission on up to the first £5000 I invoice to that customer – i.e. 20% of whatever I invoice, capped at a maximum of £1000 per customer you introduce me to. This will apply to all invoices to that customer up to the cap.

The main things I offer customers are listed over on hypothesis.works.

I’m also potentially available for similar work on things that aren’t necessarily directly Hypothesis related – e.g. I have a ScalaCheck related contract coming up soon, and I can offer consulting on a variety of algorithmic or design problems. The commission isn’t limited to anything in particular – it’s for any paid work (though of course I may not accept any particular piece of work just because you suggest it).

If you want to introduce me to someone or have any questions about this, please email me at [email protected].

Notes

  • For the sake of avoiding this being outright bribery, I’m not going to offer commission for introducing me to a company that you actually work for. I might change that later once I’m clearer on the implications.
  • If for some reason you are not able or willing to accept money from me I am happy to donate the commission to a charity of your choice.
  • What I invoice depends a lot on the project and the customer, but hitting that £1000 limit is definitely not implausible.
  • Neither this commission nor the 10% of each invoice that I donate to charity comes out of the other: you get 20%, charity gets 10%, I get 70%.

Shrinking failing input using a SAT solver

Advance warning: This is mostly me writing up a negative-to-neutral result. The technique described in this post works OK, but it doesn’t seem to work better than well optimized delta debugging and as such does not currently justify its complexity. It also doesn’t work very well when you start from large examples, but that’s fortunately not a huge problem given the ideas I outlined in my last post, because you can work with smaller sequences until the example is small enough to be feasible when approached bytewise.

My hope was that it would give a system that was much better at finding longer intervals to delete, by exploiting structure found in the string to find intervals that pure delta debugging would have missed (delta debugging only guarantees that you cannot delete any single byte; guaranteeing that you cannot delete any interval provably requires a quadratic number of tests to check).

I’ve got some ideas I want to pursue to extend it that might make it worth it, but this post just describes the current state of the work in progress. I might also end up incorporating it into my shrinker anyway and hope that I can improve how it works with more cleverness later (I already have a prototype doing this).

Anyway, disclaimers over, on to the idea.

We take a string \(S\) and a predicate \(f\) with \(f(S)\) true, and we are trying to construct a string \(T\) of minimal length such that \(f(T)\) is true.

This builds on my previous idea of using regular language induction to aid shrinking. The version I described in that post appears to be too slow to make work.

The core idea is still that we are looking for a DFA with fewer than \(|S|\) nodes such that every accepted string satisfies \(f\). If we can find such a DFA then we can shrink \(S\), because the shortest path to an accepting node has length no greater than the number of nodes, which is fewer than \(|S|\) by assumption.

The idea is to try to manufacture a DFA from \(S\) using a SAT solver (disclaimer: I actually used the naive algorithm that they aimed to replace, because I didn’t quite understand their work and wanted to prototype something quickly) to colour the indices of \(S\) in a manner consistent with the transitions observed in \(S\) (i.e. if we colour the indices \(i\) and \(j\) the same and \(S[i] = S[j]\) then we colour \(i + 1\) and \(j + 1\) the same).

We maintain a set of pairs of indices which we know do not correspond to the same state in the DFA. We initialize it to \(\{(0, |S|)\}\) at the start and will add to it as we go.

We now produce a colouring consistent with that set and the transition requirement.
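A sketch of one way the colouring could be produced for a fixed number of colours \(k\), using the pysat package’s MiniSat bindings (this encoding is my reconstruction under those assumptions, not the original code; you’d try values of \(k\) below \(|S| + 1\), and unsatisfiability at \(k = |S|\) corresponds to the unique-colouring case below):

```python
from pysat.solvers import Minisat22

def colour_indices(s, inconsistent, k):
    """Colour indices 0..len(s) with k colours, or return None if impossible."""
    n = len(s)

    def v(i, c):  # SAT variable meaning "index i has colour c"
        return i * k + c + 1

    solver = Minisat22()
    for i in range(n + 1):
        # Every index gets exactly one colour.
        solver.add_clause([v(i, c) for c in range(k)])
        for c in range(k):
            for d in range(c + 1, k):
                solver.add_clause([-v(i, c), -v(i, d)])
    # Pairs known to be in distinct states must get different colours.
    for i, j in inconsistent:
        for c in range(k):
            solver.add_clause([-v(i, c), -v(j, c)])
    # Transition consistency: equal colours followed by equal characters
    # must lead to equal colours.
    for i in range(n):
        for j in range(i + 1, n):
            if s[i] == s[j]:
                for c in range(k):
                    for d in range(k):
                        solver.add_clause(
                            [-v(i, c), -v(j, c), -v(i + 1, d), v(j + 1, d)]
                        )
    if not solver.solve():
        return None
    model = set(solver.get_model())
    return [next(c for c in range(k) if v(i, c) in model) for i in range(n + 1)]
```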

If that colouring gives a unique colour to every index, stop: there is no DFA with fewer than \(|S|\) nodes consistent with \(S\) and our set of known inconsistencies.

Otherwise, produce the quotient DFA and look at the minimal path to an accepting node. If it satisfies \(f\), we have shrunk the string. Start again from the beginning with the new string (this usually doesn’t work, but it’s worth checking).
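A sketch of that step, assuming a colouring as returned by colour_indices above – states are colours, reading \(S[i]\) moves from the colour of \(i\) to the colour of \(i + 1\) (the colouring constraints guarantee this is well defined), and the accepting state is the colour of the final index:

```python
from collections import deque

def shortest_accepted(s, colouring):
    transitions = {}
    for i, ch in enumerate(s):
        transitions.setdefault(colouring[i], {})[ch] = colouring[i + 1]
    start, accepting = colouring[0], colouring[-1]
    # Breadth-first search finds a shortest string reaching the accepting state.
    queue = deque([(start, "")])
    seen = {start}
    while queue:
        state, prefix = queue.popleft()
        if state == accepting:
            return prefix
        for ch, nxt in transitions.get(state, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, prefix + ch))
    return None
```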

If that didn’t work, we now produce either a strictly smaller string satisfying \(f\) or we increase the size of our set of known inconsistencies as follows:

For each colour, we look at the smallest and the largest index in it, \(i, j\). If deleting the interval \([i, j)\) from the string produces a string satisfying \(f\) then we have shrunk the string, and we start again from the beginning with the new string. Otherwise, we have proved that the prefixes of \(S\) up to \(i\) and \(j\) respectively must live in distinct states (because following the suffix of \(S\) starting at \(j\) produces different results), so we add \((i, j)\) to the set of inconsistencies. We do this for each colour (starting from the ones which produce the largest gap, so we maximize how much we shrink by), and if none of them produces a shrink we try colouring again.
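A sketch of that step (again assuming the colouring format above, with inconsistent being the mutable set of known-distinct index pairs):

```python
def try_delete_intervals(s, f, colouring, inconsistent):
    # First and last index of each colour.
    spans = {}
    for index, colour in enumerate(colouring):
        lo, hi = spans.get(colour, (index, index))
        spans[colour] = (min(lo, index), max(hi, index))
    # Largest gaps first, so a successful deletion shrinks as much as possible.
    for i, j in sorted(spans.values(), key=lambda span: span[0] - span[1]):
        if i == j:
            continue
        candidate = s[:i] + s[j:]  # delete the interval [i, j)
        if f(candidate):
            return candidate  # shrunk: start again from the top
        inconsistent.add((i, j))
    return None  # no shrink found; recolour with the enlarged set
```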

Eventually we’ll either hit the point where we need all \(|S| + 1\) colours to colour the indices, or we’ll have shrunk the string.

Note: I think this doesn’t actually guarantee even 1-minimality like delta debugging does. The problem is that you might have an index that is not equivalent to its successor index even though you can delete the character there. I’m not currently sure if this algorithm can witness that possibility or not.

Anyway, I’ve been running the example I used to prototype structureshrink (the C++ hello world) with this as the list minimization algorithm. It seems to do all right but is nothing to write home about; still, it’s been an interesting experiment and I’m glad to have had the excuse to play with minisat.
