Attention conservation notice: This post is largely an obvious in retrospect solution to an obscure problem.
It’s a problem I spent a lot of time considering wrong or complicated solutions too though (this is what the persistent data structure I was considering was for, but it’s not needed in the end), so I thought it was worth writing up the correct solution.
The problem is this: I have a tree of unknown structure. Each node in the tree is either a leaf or has an ordered sequence of N (variable per node) children (technically the leaves are just a special case of this with N=0, but leaves have special properties as we’ll see in a moment).
I want to do an exhaustive search of the leaves of this tree, but there’s a limitation: when walking this tree I may only walk complete branches starting from the root and finishing at a leaf. I cannot back up and choose another branch.
The tree is potentially very large, to the point where an exhaustive search might be impossible. So this should actually be treated as an anytime algorithm, where we’re iteratively yielding new paths to a leaf which some external algorithm is then doing something with. When yielding those we want to guarantee:
- We never yield the same path twice
- We know when we have explored the entire tree and stop
It’s easy to do a lexicographic exploration of the branches in sorted order – just build a trail of the choices you took. At each step, increment the last choice. If that causes it to exceed the number of choices available, pop it from the end and increment the previous next choice. When making an unknown choice always choose zero.
But there’s a reason you might not want to do this: Under certain reasonable assumptions, you may end up spending a lot of time exploring very deep leaves.
Those reasonable assumptions are this:
- All else being equal, we should prefer leaves which are closer to the root (e.g. because each walk to a child is expensive, so it takes less time to explore them and we can explore more leaves / unit time)
- The expected number depth of a tree below any node given uniformly at random choices is usaully much lower than the maximum depth of the tree elow that point.
Under these assumptions the optimal solution for exploring the tree is instead to just randomly walk it: You will avoid the long paths most of the time because of the expectation assumption, and will do a better job of exploring close to the root.
Unfortunately, pure random exploration does not satisfy our requirements: We might generate duplicate paths, and we don’t know when we’ve explored all the nodes.
So what we want is a hybrid solution that exhaustively searches the tree without duplicates, but chooses its paths as randomly as possible.
Any way, there turns out to be a very simple way of doing this. It’s as follows:
We maintain a list of paths to unexplored nodes, bucketed by their length. This is initially the empty path leading to the root.
We then repeatedly iterate the following steps until there are no unexplored nodes in the list:
- Pick an unexplored node of minimal length uniformly at random and remove it from the list
- Walk to that node
- Walk randomly below that node. Every time we make a random decision on a child, add the path we took to get to this point plus each other decision we could have taken to the list of unexplored nodes.
- When we reach a leaf node, yield that path to the controlling algorithm.
Every path we walk goes through a node we have not previously explored, so must be unique. When the list is empty we’ve explored all the nodes and are done.
The correct storage format for the paths is just an immutable singly linked list stored with the last node in the path first – appending a node to the path is just consing it to the front of the list. You’re going to have to do O(n) work to reverse it before walking upwards, but that’s OK: You’re going to have to do O(n) work with n the same and a higher constant factor to walk back up the tree along that path anyway.
The reason for my original need for a data structure with better forward iteration than that was because the algorithm I was originally considering always picked the unexplored node with the shortlex minimal (lexicographically first amongst those with the shortest length) path to it, but there’s no real reason you need to do that and it makes things much more expensive.
If you want to instead do weighted random walks of the tree and are prepared to relax the requirements around exploring shorter paths first a bit, there’s an alternative algorithm you can use (and I’m actually more likely to use this one in practice because I’ll need to do weighted draws) which works as follows:
We build a shadow tree in parallel every time we walk the real tree. The shadow tree is mutable, and each node in it is in one of three states: Unexplored, Active or Finished. It starts with a single root node which is Unexplored.
We maintain the following invariant: Any node in the shadow tree which is not Finished has at least one child node that is not Finished. In particular, leaf nodes are always Finished.
We then decide how to walk the tree as follows: We always walk the shadow tree in parallel, matching our walk of the real tree. Whenever we walk to a node that is Unexplored, we do one of two things:
- If the corresponding real node is a leaf, we immediately mark it as Finished (and our walk is now done)
- If the corresponding real node has N children, we mark it as Active and create N Unexplored child nodes for it.
Once our walk has complete (i.e. we’ve reached a leaf), we look at the list of nodes in the shadow tree that we encountered and walk it backwards. If all of the children of the current node are Finished, we mark the node as Finished as well. If not, we stop the iteration.
We can now use the shadow tree to guide our exploration of the real tree as follows: Given some random process for picking which child of the real tree we want to navigate to, we simply modify the process to never pick a child whose corresponding shadow node is Finished.
We can also use the shadow tree to determine when to stop: Once the root node is marked as Finished, all nodes have been explored and we’re done.
Note that these two approaches don’t produce at all the same distribution: If one child has many more branches below it than another, the initial algorithm will spend much more time below that child, while the shadow tree algorithm will explore both equally until it has fully exhausted one of them (even if this results in much deeper paths). It probably depends on the use case which of these are better.
You can use the shadow tree to guide the exploration too if you like. e.g. if there are any unexplored child nodes you could always choose one of them. You could also store additional information on it (e.g. you could track the average depth below an explored node and use that to guide the search).
Anyway, both of these algorithms are pretty obvious, but it took me a long time to see that they were pretty obvious so hopefully sharing this is useful.
As an aside: You might be wondering why I care about this problem.
The answer is that it’s for Hypothesis.
As described in How Hypothesis Works, Hypothesis is essentially an interactive byte stream fuzzer plus a library for doing structured testing on top of this. The current approach for generating that bytestream isn’t great, and has a number of limitations. In particular:
- At the moment it fairly often generates duplicates
- It cannot currently detect when it can just exhaustively enumerate all examples and then stop (e.g. if you do @given(booleans()) at the moment the result isn’t going to be great).
- The API for specifying the byte distribution is fairly opaque to Hypothesis and as a result it limits a number of opportunities for improved performance.
So I’m exploring an alternative implementation of the core where you specify an explicit weighted distribution for what the next byte should be and that’s it.
This then opens it up to using something like the above for exploration. The nodes in the tree are prefixes of the bytestream, the leaves are complete test executions. Each non-leaf node has up to 256 (it might be fewer because not every byte can be emitted at every point) child nodes which correspond to which byte to emit next, and then we can treat it as a tree exploration problem to avoid all of the above issues.
I’ve currently got a branch exploring just changing the API and keeping the existing implementation to see how feasible rebuilding Hypothesis this way is. So far the answer is that it’ll be a fair bit of work but looks promising. If that promise pans out, one of the above (probably the shadow tree implementation) is likely coming to a Hypothesis near you at some point in 2017.
(Why yes, it is a programming post! I do still do those. I know these are a bit thin on the ground right now. If you want more, consider donating to my Patreon and encouraging me to write more).