Faster random sampling by handling rejection well

Attention conservation notice: You probably don’t have the problem this post solves. It might be interesting anyway.

As part of some ongoing development for Hypothesis, I find myself in need of a well optimised solution to the discrete sampling problem: We have weights $$w_1, \ldots, w_n$$ and we want to produce a number in $$\{1, \ldots, n\}$$ such that $$i$$ is drawn with probability proportional to $$w_i$$.

All solutions to this problem are intrinsically $$O(n)$$: If you have $$w_i = 1$$ for $$i \neq j$$ and $$w_j = 10^6$$ then you need to inspect $$j$$ to know whether to generate $$i \neq j$$ often or rarely.

You can however use that $$O(n)$$ work once to build a data structure for repeated drawing. For example you can use the alias method to build a data structure that lets you do future draws in $$O(1)$$.

Which is great and all if you’re expecting to repeatedly draw from the same distribution over and over again. Unfortunately that’s not what my workload looks like – the distributions used are many and varied. Distributions will be reused, but not in any easy to predict manner.

You can of course do cached lookup to an alias method sampler. This works, but if your cache hit rate gets too low (which it often will be in my workload) this doesn’t actually improve matters. Additionally, depending on how good your caching is and how much is in there, it might not be much cheaper to look something up in the cache than it is to just create a new one from scratch.

So what we’d like is a method that is significantly cheaper to initialise. Fortunately there is one, it’s called rejection sampling!

You just need to calculate $$W = \max w_i$$. Now you iterate the following procedure:

1. Pick $$i$$ uniformly at random
2. With probability $$\frac{w_i}{W}$$ return $$i$$
3. Go to 1

This is very cheap to initialise but not so great in terms of run time: If the weights are uniform then it’s optimal (because you always return $$i$$ in step 2), but if they’re very far from uniform then you’re going to make a lot of trips (potentially $$O(n)$$ in the case where one weight dominates everything else) through that loop.
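As a concrete sketch, the rejection loop above might look like this in Python, where `weights` is any sequence of non-negative numbers and `rng` is a `random.Random` instance (both names are just illustrative):

```python
import random

def rejection_sample(weights, rng):
    """Draw i with probability proportional to weights[i] by rejection.
    The only O(n) initialisation work is finding the maximum weight."""
    W = max(weights)
    while True:
        # 1. Pick i uniformly at random.
        i = rng.randrange(len(weights))
        # 2. Accept i with probability weights[i] / W; 3. otherwise retry.
        if rng.random() * W < weights[i]:
            return i
```

Note the strict `<`: a weight equal to `W` is always accepted (since `random()` is in `[0, 1)`), and a zero weight never is.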

But there turns out to be a hybrid solution that works pretty well and gets us most of the cheapness of initialising rejection sampling with most of the performance of our better sampling methods, with a tuning parameter deciding how much of each we want.

Pick an integer parameter $$R > 0$$ (good choices for $$R$$ are probably $$1$$ or $$2$$) and then run the following algorithm:

1. Pick $$i$$ uniformly at random
2. With probability $$\frac{w_i}{W}$$ return $$i$$
3. If we’ve reached this point fewer than $$R$$ times, go to 1.
4. Look up or initialise a new sampler for these weights and draw from that.

So we perform the cheap rejection sampling until it’s obvious that it’s not helping much and then move to a proper sampling method.
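A sketch of the hybrid scheme. Here `get_sampler` and its `.sample` method are hypothetical stand-ins for whatever cache lookup / alias-table initialisation you use for the fallback:

```python
import random

def bounded_rejection_sample(weights, get_sampler, R=2, rng=None):
    """Try rejection sampling at most R times, then fall back to looking up
    or building a proper sampler. `get_sampler(weights)` is assumed to
    return an object with a .sample(rng) method."""
    rng = rng or random.Random()
    W = max(weights)
    for _ in range(R):
        i = rng.randrange(len(weights))
        if rng.random() * W < weights[i]:
            return i
    return get_sampler(weights).sample(rng)
```

With near-uniform weights the fallback is almost never reached; with highly skewed weights we pay for at most `R` wasted draws before switching to the proper sampler.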

For cases where we’re near the optimum for rejection sampling we’ll do well and mostly use it; for the rest of the cases we’ll pay a bit of extra cost. Potentially we could also do some extra bookkeeping – if we find that rejection sampling rarely succeeds we could just turn off that phase of the algorithm entirely.

Experiments so far in the thing I’m implementing suggest that this idea provides a modest improvement compared to always initialising the sampler. (I’m actually using a data structure from Succinct Sampling from Discrete Distributions, on the grounds that it is substantially cheaper to initialise at the cost of being slightly more expensive to run. Plausibly this would give a much bigger improvement if I were using the alias method.)

This probably isn’t that useful a trick. The work load I’m optimising for is a bit odd, and the performance improvements under that workload are only modest. Additionally, the baseline is already fairly fast. But it’s not something I’d seen before and it seemed worth noting down.

(Something something Patreon something. It feels a bit cheap to plug it here given the relatively niche nature of this post, but hey, I was doing the Hypothesis research that led to it for free, so I guess it balances out?)

This entry was posted in programming.

Proving or refuting regular expression equivalence

Suppose we are interested in the following problem: We have two regular expressions. Do these match exactly the same strings, and if not what is the smallest (shortlex minimal) string that one matches but not the other?

i.e. we either want to prove they are equivalent or provide the best possible refutation that they are not.

How do we do this?

More specifically, let’s consider this problem for extended regular expressions. That is, we have the following types of expressions:

• $$\epsilon$$, which matches only the empty string
• $$c$$, for $$c$$ in our alphabet, which matches only the character $$c$$
• $$\emptyset$$, which matches no strings
• $$\omega$$ which matches every string
• $$x^*$$ which matches $$a_1\ldots a_n$$ for any $$n \geq 0$$ and strings $$a_i$$ each matched by $$x$$ (in particular it matches the empty string).
• $$xy$$ where $$x$$ and $$y$$ are extended regular expressions, which matches any strings $$ab$$ where $$x$$ matches $$a$$ and $$y$$ matches $$b$$
• $$x \vee y$$ which matches any string matched by either or both of $$x$$ or $$y$$
• $$x \wedge y$$ which matches any string matched by both $$x$$ and $$y$$
• $$\neg x$$ which matches any string not matched by $$x$$.

That is, we extend the normal regular expressions with intersection and negation. This doesn’t actually increase the set of languages we can match, but it can lead to significantly more concise representations (I think, but am not sure, that there are extended regular expressions which require exponentially larger standard regular expressions to represent).

This reduces to the problem of finding the shortlex minimal member of the regular language for an expression or showing that it’s empty: Given two extended regular expressions $$x, y$$, we can define $$x \oplus y = (x \vee y) \wedge (\neg (x \wedge y))$$ – the symmetric difference. This matches any string which is matched by one but not both of $$x$$ and $$y$$, so a string is a member of this regular language precisely if it is a refutation of the claim that $$x$$ and $$y$$ are equivalent.

If we can compile extended regular expressions into a deterministic finite automaton (which we can), then the problem is easy: To find the shortlex minimal element of a regular language or show that it’s empty given a deterministic finite automaton for it, you just use Dijkstra’s Algorithm to search the graph for an accepting node. You stop when you either find one (in which case you’ve constructed a shortlex minimal path to it along the way) or you’ve explored all the states (in which case there is no path to an accepting node and the language is empty). Job’s done.
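Since every transition has the same cost, Dijkstra’s algorithm here degenerates into breadth-first search, and trying the characters in sorted order makes the first accepting path found the shortlex minimal one. A sketch, assuming a hypothetical DFA interface with `.start`, a sorted `.alphabet`, `.is_accepting(state)` and `.transition(state, c)`:

```python
from collections import deque

def shortlex_member(dfa):
    """Return the shortlex-minimal string the DFA accepts, or None if its
    language is empty. BFS finds a shortest path; visiting characters in
    sorted order makes it lexicographically first among those."""
    queue = deque([(dfa.start, ())])
    seen = {dfa.start}
    while queue:
        state, path = queue.popleft()
        if dfa.is_accepting(state):
            return ''.join(path)
        for c in dfa.alphabet:
            nxt = dfa.transition(state, c)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + (c,)))
    return None
```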

That was easy. Are we done?

Well, no. There are a lot of evils hidden in that sentence “just construct a deterministic finite automaton”. The automaton corresponding to a regular expression may be exponentially larger than the expression, and even when it’s not, constructing it is still an expensive operation. Can we avoid that?

First, let’s take a detour: There are a number of obvious rewrite rules we can use to simplify our lives, by replacing regular expressions with straightforwardly equivalent versions of them. We use $$x \approx y$$ to indicate that we should consider $$x$$ and $$y$$ the same under these rewrite rules:

• $$x \wedge x \approx x$$
• $$x \wedge y \approx y \wedge x$$
• $$x \wedge (y \wedge z) \approx (x \wedge y) \wedge z$$
• $$x \wedge \emptyset \approx \emptyset$$
• $$x \wedge \omega \approx x$$
• $$x \vee x \approx x$$
• $$x \vee y \approx y \vee x$$
• $$x \vee (y \vee z) \approx (x \vee y) \vee z$$
• $$x \vee \emptyset \approx x$$
• $$x \vee \omega \approx \omega$$
• $$x(yz) \approx (xy)z$$
• $$x\emptyset \approx \emptyset$$
• $$\emptyset x \approx \emptyset$$
• $$x \epsilon \approx x$$
• $$\epsilon x \approx x$$
• $$(x^*)^* \approx x^*$$
• $$\epsilon^* \approx \epsilon$$
• $$\emptyset^* \approx \epsilon$$
• $$\omega^* \approx \omega$$
• $$\neg \emptyset \approx \omega$$
• $$\neg \omega \approx \emptyset$$
• $$\neg(\neg x) \approx x$$

If $$x \approx y$$ we say that $$x$$ and $$y$$ are similar.

All of these correspond to rules that apply for the resulting languages, so if $$x \approx y$$ then $$x$$ and $$y$$ match the same language. In general though $$x$$ and $$y$$ might match the same languages without being similar.

Why do we care about this? Well the first reason is that it lets us have a fast test for when two regular expressions definitely are equivalent: If they are similar then they are equivalent. If they’re not we have to do more work to test this.

In particular, we can use smart constructors for our regular expression data type to apply all of these rewrite rules at construction time, which then makes similarity simply a matter of testing for structural equality. If we use hash consing (essentially: memoizing all constructors so that structurally equal values are represented by the exact same object) we can even make testing for similarity O(1).
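Here’s a minimal sketch of what hash consing plus a smart constructor might look like. Only a few of the rewrite rules are shown, and the `Regex`/`union` names are just illustrative:

```python
class Regex:
    """Hash-consed expression nodes: memoising construction means that
    structurally equal expressions are the same object, so testing
    similarity is just an `is` check, O(1)."""
    _interned = {}

    def __new__(cls, op, *args):
        key = (op, args)
        node = cls._interned.get(key)
        if node is None:
            node = super().__new__(cls)
            node.op, node.args = op, args
            cls._interned[key] = node
        return node

EMPTY = Regex('empty')      # matches no strings
EPSILON = Regex('epsilon')  # matches only the empty string

def union(x, y):
    """Smart constructor for x ∨ y applying some of the rewrite rules."""
    if x is y:         # x ∨ x ≈ x
        return x
    if x is EMPTY:     # ∅ ∨ y ≈ y
        return y
    if y is EMPTY:     # x ∨ ∅ ≈ x
        return x
    if id(x) > id(y):  # pick a canonical argument order for commutativity
        x, y = y, x
    return Regex('union', x, y)
```

Canonicalising the argument order means the commutativity rule falls out of interning for free; associativity can be handled similarly by flattening nested unions into a canonical form.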

Which is good, because we’re going to need it a lot: The real reason to care about similarity of regular expressions is that we can construct a deterministic finite automaton using the Brzozowski derivative of regular expressions, with each of our states labelled by a regular expression. If we have some character $$c$$ in our alphabet and a regular language $$L$$ then the Brzozowski derivative is $$d(L, c) = \{v: cv \in L\}$$ – the set of suffixes of strings in $$L$$ that start with $$c$$. We can extend this to regular expressions with the following set of rules:

• $$d(\epsilon, c) = \emptyset$$
• $$d(b, c) = \emptyset$$ if $$b \neq c$$, else $$d(c, c) = \epsilon$$
• $$d(\emptyset, c) = \emptyset$$
• $$d(\omega, c) = \omega$$
• $$d(x^*, c) = d(x, c) x^*$$
• $$d(xy, c) = d(x, c)y \vee \nu(x) d(y, c)$$
• $$d(x \vee y, c) = d(x, c) \vee d(y, c)$$
• $$d(x \wedge y, c) = d(x, c) \wedge d(y, c)$$
• $$d(\neg x, c) = \neg d(x, c)$$

Where $$\nu(x)$$ is $$\emptyset$$ if the corresponding regular language doesn’t match the empty string, or $$\epsilon$$ otherwise (this can be calculated directly from the expressions by similar rules).

The idea is that we can construct an automaton for a regular language by labelling states by expressions. States are accepting if the corresponding expression matches the empty string, and the transition from state $$x$$ given character $$c$$ goes to $$d(x, c)$$.
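The derivative rules translate almost directly into code. A sketch over a plain tuple representation, with no simplification applied (a real implementation would route every construction through the smart constructors, since without the similarity rules the expressions grow without bound):

```python
# Expressions as tuples: ('empty',), ('eps',), ('omega',), ('char', c),
# ('star', x), ('cat', x, y), ('or', x, y), ('and', x, y), ('not', x).

def nu(x):
    """('eps',) if x matches the empty string, else ('empty',)."""
    op = x[0]
    if op in ('eps', 'omega', 'star'):
        return ('eps',)
    if op in ('empty', 'char'):
        return ('empty',)
    if op in ('cat', 'and'):
        return ('eps',) if nu(x[1]) == nu(x[2]) == ('eps',) else ('empty',)
    if op == 'or':
        return ('eps',) if ('eps',) in (nu(x[1]), nu(x[2])) else ('empty',)
    assert op == 'not'
    return ('empty',) if nu(x[1]) == ('eps',) else ('eps',)

def d(x, c):
    """The Brzozowski derivative, rule by rule."""
    op = x[0]
    if op in ('eps', 'empty'):
        return ('empty',)
    if op == 'char':
        return ('eps',) if x[1] == c else ('empty',)
    if op == 'omega':
        return ('omega',)
    if op == 'star':
        return ('cat', d(x[1], c), x)
    if op == 'cat':
        return ('or', ('cat', d(x[1], c), x[2]),
                      ('cat', nu(x[1]), d(x[2], c)))
    if op == 'or':
        return ('or', d(x[1], c), d(x[2], c))
    if op == 'and':
        return ('and', d(x[1], c), d(x[2], c))
    assert op == 'not'
    return ('not', d(x[1], c))
```

Matching a string is then just repeated differentiation: `x` matches `s` iff `nu` of the derivative of `x` by each character of `s` in turn is $$\epsilon$$.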

If our similarity rule was enough to guarantee equivalence then this would be a minimal deterministic finite automaton for the language. However, even without that there are only finitely many states reachable from a given regular expression (this is where similarity becomes important! Without it this is not true. e.g. if you consider iterating the derivatives of $$c^*$$ under $$c$$ you’ll get infinitely deep nested chains of similar but unequal expressions).

So far it isn’t obvious why this is interesting, but it has one very important property: You can walk this automaton without having to compute any of the states in advance. This means that you can build it as you explore, so we can repeat the above approach of just using Dijkstra’s algorithm to find a refutation but we don’t have to build the automaton in advance.

It’s particularly easy to calculate the derivative of the symmetric difference language too:

$$d(x \oplus y, c) = d(x \vee y, c) \wedge d(\neg (x \wedge y), c) = (d(x, c) \vee d(y, c)) \wedge (\neg (d(x, c) \wedge d(y, c))) = d(x, c) \oplus d(y, c)$$.

So if we’ve got the derivatives of our original regular expressions cached in some way it’s very cheap to calculate derivatives of our compound regular expression.

So that’s a major improvement on our original algorithm – in a number of cases it will be exponentially faster!

There are a number of ways you can improve it further: You can prune down the possible set of starting characters for a string in an imprecise way to reduce the number of branches you explore when doing Dijkstra. You can also precalculate when two characters will lead to the same state and always consider only the smallest. This is a worthwhile set of optimisations to the algorithm, but doesn’t affect it in any really fundamental way.

Can we do better yet?

The problem is that although it’s easy to calculate, our new automaton potentially has a lot more states than we want. In general it will have fewer, but if there are $$M$$ states in the automaton for $$x$$ and $$N$$ in the one for $$y$$ then there might be as many as $$MN$$ in the automaton for the symmetric difference expression. We’ll likely not have to explore all of them, but it’s unclear how many we might have to explore.

But you can do equivalence testing in nearly O(M + N) using an algorithm due to Hopcroft and Karp based on union-find (note: the algorithm called the Hopcroft-Karp algorithm is, confusingly, something completely different). The idea is that we assume the two automata are equivalent and merge states until we find a contradiction, and that by aggressively merging states we can consider a lot fewer of them:

Given two automata $$S$$ and $$T$$ with initial states $$s$$ and $$t$$, we test equivalence as follows:

def equivalent(S, T):
    s = S.start
    t = T.start
    merges = UnionFind()
    merges.union(s, t)
    stack = [(s, t)]
    while stack:
        p, q = stack.pop()
        for a in ALPHABET:
            pa = merges.find(S.transition(p, a))
            qa = merges.find(T.transition(q, a))
            if pa.accepting != qa.accepting:
                return False
            if pa != qa:
                merges.union(pa, qa)
                stack.append((pa, qa))
    return True

The idea is that if two states are equivalent then the states they transition to under the same character must be equivalent, and if we’ve proved that an accepting and non-accepting state are equivalent then we’ve reached a contradiction and our initial assumption must have been false. This merges two states if and only if there is some string that leads to one in one automaton and the other in the other, so if there is some string that one matches but not the other it will eventually merge those two states (if it doesn’t stop earlier).

Because of the way we aggressively merge states, this runs in time O(k(M + N)) where k is the size of the alphabet and M and N are the number of states in the respective automata (there’s actually an inverse Ackermann factor in there too, but the inverse Ackermann function is so close to constant it might as well be ignored).

The problem with this algorithm is that it doesn’t produce a refutation, only a yes or no. We can trivially modify it by pushing paths along with the states, but the paths generated aren’t necessarily useful and in particular don’t necessarily go to a state that is accepting in one and not in the other – we’ve replaced paths to states we believed to be equivalent, but because the core assumption is wrong this didn’t necessarily work. In particular there’s no guarantee that either of the states that we use to refute the result are accepting, only that we falsely believe them to be equivalent to two states where one is accepting and the other is not.

We can still make use of it as a fast initial test: use Hopcroft and Karp’s algorithm to check equivalence, and if it returns false then construct a refutation separately, now that we know one must exist. This approach also still works well with the lazy construction.

But it would be nice to be able to produce a witness as an integrated part of running this algorithm.

And it turns out we can: The idea is to hybridise the two algorithms – use Dijkstra’s algorithm to build the witness, but instead of operating on a graph corresponding to a single automaton, we build a graph on pairs of states, one from each automaton – an edge then involves following the corresponding edge in each automaton. Rather than searching for an accepting node, we are now searching for a node where one state is accepting and the other is not.

The key thing we then borrow from the algorithm of Hopcroft and Karp is that we can skip states where we have merged the pairs, because there is nothing interesting down that route even if we haven’t seen it.

Here’s some code for the resulting algorithm:

def witness_difference(left, right):
    if left is right:
        return None
    if left.matches_empty != right.matches_empty:
        return b''
    merges = UnionFind()
    merges.union(left, right)
    queue = deque()
    queue.append((left, right, ()))
    while len(queue) > 0:
        p, q, s = queue.popleft()
        for a in ALPHABET:
            pa = derivative(p, a)
            qa = derivative(q, a)
            if pa is qa:
                continue
            t = (a, s)
            if merges.find(pa) != merges.find(qa):
                if pa.matches_empty != qa.matches_empty:
                    result = bytearray()
                    while t:
                        a, t = t
                        result.append(a)
                    result.reverse()
                    return bytes(result)
                merges.union(pa, qa)
                queue.append((pa, qa, t))
    return None

One important difference here is that we always put the actual reached pairs of states onto the queue rather than the collapsed pair. This ensures that the paths we take are real paths to real states in the corresponding automaton. Also, because we stop as soon as we would ever merge an accepting and non-accepting state, the problem we worried about above can’t happen.

Also it’s worth noting that I’m not sure this algorithm produces a minimal refutation. It should produce a fairly small one, but there may be a smaller one reachable through two states which the algorithm has merged but which are not actually equivalent.

The requirement for witnesses is a slightly unusual one – most algorithms just focus on testing for equivalence – so it would be reasonable to ask why we care.

The reason I care is that I’m interested in the following more general question: I want to maintain a set of N regular languages as witnessed by regular expressions, and I want to be able to test for membership of this set and add new elements to it.

I’m still not sure what the best way to do this is, but all of the best ideas I have require refutations to prevent you from having to do O(N) equivalence tests. For example, the following works: Maintain a binary tree, where each branch node has a string and each child node is a regular expression. When testing for membership, walk the tree as follows: At a branch, if the language contains the string then go to the left child else go to the right child. At a leaf, perform an equivalence test. If it returns true then the language is a member.

Insertion is then nearly the same: We walk until we get to a leaf. At the leaf we perform an equivalence test with a refutation. If the two languages are equivalent then there’s no insertion to perform. If they’re not equivalent, then we create a split node with the refutation string as the key, put whichever language contains the string as its left child leaf, and put the other as its right.

This isn’t a great algorithm, in that it doesn’t have any easy way to balance the tree, but it’s always better than doing N equivalence checks: Matching a short string is very cheap, and we always perform at most N – 1 matches (potentially many fewer) and exactly one equivalence check.
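A sketch of that tree, using toy stand-ins: languages just need a `.matches(string)` method, and `refute` is assumed to return `None` for equivalent languages and a witness string otherwise (in the real setting it would be the refutation machinery developed above). All names here are illustrative:

```python
class LanguageSet:
    """Binary tree keyed by refutation strings. Each branch holds a string;
    languages matching it go down the 'yes' side, the rest down the 'no'
    side. Leaves hold languages."""

    def __init__(self, refute):
        self._refute = refute
        self._root = None  # {'leaf': lang} or {'key': s, 'yes': n, 'no': n}

    def _leaf_for(self, lang):
        node = self._root
        while node is not None and 'key' in node:
            node = node['yes'] if lang.matches(node['key']) else node['no']
        return node

    def __contains__(self, lang):
        leaf = self._leaf_for(lang)
        return leaf is not None and self._refute(leaf['leaf'], lang) is None

    def add(self, lang):
        leaf = self._leaf_for(lang)
        if leaf is None:
            self._root = {'leaf': lang}
            return
        witness = self._refute(leaf['leaf'], lang)
        if witness is None:
            return  # already represented
        # Split the leaf: whichever language matches the witness goes 'yes'.
        old = leaf.pop('leaf')
        yes, no = (old, lang) if old.matches(witness) else (lang, old)
        leaf['key'] = witness
        leaf['yes'] = {'leaf': yes}
        leaf['no'] = {'leaf': no}
```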

In fact, it is not in general possible to balance the tree; Consider N strings where string k matches only the string consisting of a sequence of k 0s. Then each test string has exactly one language which matches it, so every split must have one of its children be a leaf and the look up problem degenerates to always testing N strings.

Another algorithm is to maintain a trie of all the refutations found so far. You can then simultaneously walk a regular expression’s deterministic finite automaton and the trie to produce a bit vector where bit i is set if the language contains the i’th string. This bit vector is then the same as that of at most one existing language, which can be looked up using e.g. a hash table.

So to test membership you produce the bit vector and check for presence in the hash table. If it’s not present then the language isn’t present. If it is present, then check equivalence with the unique existing language with that bit vector.

Insertion is similar: If the bit vector isn’t present, add the language to the hash table and leave the set of refutations alone. Else, check for equivalence with a refutation to the unique existing language with the same bit vector. If they’re equivalent, do nothing. If they aren’t, add the refutation to the trie and update the bit vector for all existing languages (you can actually do this last step incrementally as you go, but it may not be worth it).

The second algorithm is probably strictly worse than the first because it involves an O(N) computation on a lot of inserts, as well as testing for membership being more expensive. It really depends on how much cheaper equivalence is than matching for your language workload, and how likely existing refutations are to witness future ones.

One reason why being able to do the above is interesting is that it allows us to extend the lazy construction of a deterministic finite automaton to a lazy construction of a minimal deterministic finite automaton: If we can maintain such a set relatively efficiently, then instead of merely conflating similar expressions into the same states we can label states with regular languages, giving us the proper derivative automaton rather than just our approximation to it.

So that’s where my ability to answer this leaves off a bit. There may be a good data structure for this, but if there is I haven’t found the right corner of the literature to tell me what it is.

If you want to have a play with some of this, here’s some (mostly untested so probably badly wrong in places) Python code implementing a lot of these ideas:

If you’d like to learn more about this, most of the information in this post either came from or was based on material from Regular-expression derivatives reexamined,  Testing the Equivalence of Regular Languages and Algorithms for testing equivalence of finite automata, with a grading tool for JFLAP (which are in turn based on a lot of other papers you can get to from their citations).

(And, as always, if you’d like to see more posts like this and feel able to spare the price of less than a coffee per month, I encourage you to donate to my Patreon for supporting my writing.)

This entry was posted in Automata Theory, programming.

Randomized exhaustive exploration of a many branching tree

Attention conservation notice: This post is largely an obvious in retrospect solution to an obscure problem.

It’s a problem I spent a lot of time considering wrong or complicated solutions to, though (this is what the persistent data structure I was considering was for, but it’s not needed in the end), so I thought it was worth writing up the correct solution.

The problem is this: I have a tree of unknown structure. Each node in the tree is either a leaf or has an ordered sequence of N (variable per node) children (technically the leaves are just a special case of this with N=0, but leaves have special properties as we’ll see in a moment).

I want to do an exhaustive search of the leaves of this tree, but there’s a limitation: when walking this tree I may only walk complete branches starting from the root and finishing at a leaf. I cannot back up and choose another branch.

The tree is potentially very large, to the point where an exhaustive search might be impossible. So this should actually be treated as an anytime algorithm, where we’re iteratively yielding new paths to a leaf which some external algorithm is then doing something with. When yielding those we want to guarantee:

• We never yield the same path twice
• We know when we have explored the entire tree and stop

It’s easy to do a lexicographic exploration of the branches in sorted order – just maintain a trail of the choices you took. At each step, increment the last choice. If that causes it to exceed the number of choices available, pop it from the end and increment the previous choice. When making a choice at a depth you haven’t reached before, always choose zero.
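The trail-of-choices procedure above can be sketched as a generator. Here `num_children` is a hypothetical stand-in for walking the real tree: it returns the branching factor at a node identified by its path, with 0 meaning a leaf:

```python
def lexicographic_paths(num_children):
    """Yield every root-to-leaf path in lexicographic order, as a tuple
    of child indices."""
    trail = []
    while True:
        # Walk down to a leaf, choosing 0 at every new depth.
        while num_children(tuple(trail)) > 0:
            trail.append(0)
        yield tuple(trail)
        # Increment the last choice; pop exhausted levels as we go.
        while trail:
            trail[-1] += 1
            if trail[-1] < num_children(tuple(trail[:-1])):
                break
            trail.pop()
        else:
            return
```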

But there’s a reason you might not want to do this: Under certain reasonable assumptions, you may end up spending a lot of time exploring very deep leaves.

Those reasonable assumptions are this:

• All else being equal, we should prefer leaves which are closer to the root (e.g. because each walk to a child is expensive, so it takes less time to explore them and we can explore more leaves / unit time)
• The expected depth of the tree below any node, given uniformly random choices, is usually much lower than the maximum depth of the tree below that point.

Under these assumptions the optimal solution for exploring the tree is instead to just randomly walk it: You will avoid the long paths most of the time because of the expectation assumption, and will do a better job of exploring close to the root.

Unfortunately, pure random exploration does not satisfy our requirements: We might generate duplicate paths, and we don’t know when we’ve explored all the nodes.

So what we want is a hybrid solution that exhaustively searches the tree without duplicates, but chooses its paths as randomly as possible.

Anyway, there turns out to be a very simple way of doing this. It’s as follows:

We maintain a list of paths to unexplored nodes, bucketed by their length. This is initially the empty path leading to the root.

We then repeatedly iterate the following steps until there are no unexplored nodes in the list:

1. Pick an unexplored node of minimal length uniformly at random and remove it from the list
2. Walk to that node
3. Walk randomly below that node. Every time we make a random decision on a child, add the path we took to get to this point plus each other decision we could have taken to the list of unexplored nodes.
4. When we reach a leaf node, yield that path to the controlling algorithm.

Every path we walk goes through a node we have not previously explored, so must be unique. When the list is empty we’ve explored all the nodes and are done.
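A sketch of this procedure, again with a hypothetical `num_children` function describing the tree and tuples standing in for paths:

```python
import random

def random_exhaustive_paths(num_children, rng=None):
    """Yield every root-to-leaf path exactly once, always restarting from a
    minimal-length unexplored prefix chosen uniformly at random and walking
    randomly below it."""
    rng = rng or random.Random()
    unexplored = {0: [()]}  # paths to unexplored nodes, bucketed by length
    while unexplored:
        depth = min(unexplored)
        bucket = unexplored[depth]
        path = bucket.pop(rng.randrange(len(bucket)))
        if not bucket:
            del unexplored[depth]
        # Walk randomly below the chosen node; every alternative child we
        # didn't take becomes a new unexplored node.
        while num_children(path) > 0:
            n = num_children(path)
            choice = rng.randrange(n)
            for other in range(n):
                if other != choice:
                    alt = path + (other,)
                    unexplored.setdefault(len(alt), []).append(alt)
            path += (choice,)
        yield path
```

Each walk passes through exactly one freshly removed unexplored node, and every alternative branch along the walk is recorded, which is why no leaf is yielded twice and the empty dictionary signals completion.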

The correct storage format for the paths is just an immutable singly linked list stored with the last node in the path first – appending a node to the path is just consing it to the front of the list. You’re going to have to do O(n) work to reverse it before walking upwards, but that’s OK: You’re going to have to do O(n) work with n the same and a higher constant factor to walk back up the tree along that path anyway.

The reason for my original need for a data structure with better forward iteration than that was because the algorithm I was originally considering always picked the unexplored node with the shortlex minimal (lexicographically first amongst those with the shortest length) path to it, but there’s no real reason you need to do that and it makes things much more expensive.

If you want to instead do weighted random walks of the tree and are prepared to relax the requirements around exploring shorter paths first a bit, there’s an alternative algorithm you can use (and I’m actually more likely to use this one in practice because I’ll need to do weighted draws) which works as follows:

We build a shadow tree in parallel every time we walk the real tree. The shadow tree is mutable, and each node in it is in one of three states: Unexplored, Active or Finished. It starts with a single root node which is Unexplored.

We maintain the following invariant: Any node in the shadow tree which is not Finished has at least one child node that is not Finished. In particular, leaf nodes are always Finished.

We then decide how to walk the tree as follows: We always walk the shadow tree in parallel, matching our walk of the real tree. Whenever we walk to a node that is Unexplored, we do one of two things:

1. If the corresponding real node is a leaf, we immediately mark it as Finished (and our walk is now done)
2. If the corresponding real node has N children, we mark it as Active and create N Unexplored child nodes for it.

Once our walk is complete (i.e. we’ve reached a leaf), we look at the list of nodes in the shadow tree that we encountered and walk it backwards. If all of the children of the current node are Finished, we mark the node as Finished as well. If not, we stop the iteration.

We can now use the shadow tree to guide our exploration of the real tree as follows: Given some random process for picking which child of the real tree we want to navigate to, we simply modify the process to never pick a child whose corresponding shadow node is Finished.

We can also use the shadow tree to determine when to stop: Once the root node is marked as Finished, all nodes have been explored and we’re done.
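Here is a sketch of the shadow tree scheme, with uniform random choices standing in for whatever weighted draw you actually use, and `num_children` again a hypothetical description of the real tree:

```python
import random

UNEXPLORED, ACTIVE, FINISHED = 0, 1, 2

class Shadow:
    """A node of the shadow tree built alongside the real tree."""
    def __init__(self):
        self.state = UNEXPLORED
        self.children = []

def shadow_guided_paths(num_children, rng=None):
    """Yield every root-to-leaf path exactly once by walking randomly but
    never descending into a Finished shadow node."""
    rng = rng or random.Random()
    root = Shadow()
    while root.state != FINISHED:
        path, node, visited = (), root, [root]
        while True:
            if node.state == UNEXPLORED:
                n = num_children(path)
                if n == 0:
                    node.state = FINISHED  # leaves finish immediately
                    break
                node.state = ACTIVE
                node.children = [Shadow() for _ in range(n)]
            live = [i for i, c in enumerate(node.children)
                    if c.state != FINISHED]
            i = rng.choice(live)
            path, node = path + (i,), node.children[i]
            visited.append(node)
        # Walk back up, finishing nodes whose children are all Finished.
        for node in reversed(visited[:-1]):
            if all(c.state == FINISHED for c in node.children):
                node.state = FINISHED
            else:
                break
        yield path
```

To use a weighted draw instead, replace `rng.choice(live)` with a weighted pick restricted to the live indices.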

Note that these two approaches don’t produce at all the same distribution: If one child has many more branches below it than another, the initial algorithm will spend much more time below that child, while the shadow tree algorithm will explore both equally until it has fully exhausted one of them (even if this results in much deeper paths). It probably depends on the use case which of these are better.

You can use the shadow tree to guide the exploration too if you like. e.g. if there are any unexplored child nodes you could always choose one of them. You could also store additional information on it (e.g. you could track the average depth below an explored node and use that to guide the search).

Anyway, both of these algorithms are pretty obvious, but it took me a long time to see that they were pretty obvious so hopefully sharing this is useful.

As an aside: You might be wondering why I care about this problem.

The answer is that it’s for Hypothesis.

As described in How Hypothesis Works, Hypothesis is essentially an interactive byte stream fuzzer plus a library for doing structured testing on top of this. The current approach for generating that bytestream isn’t great, and has a number of limitations. In particular:

• At the moment it fairly often generates duplicates
• It cannot currently detect when it can just exhaustively enumerate all examples and then stop (e.g. if you do @given(booleans()) at the moment the result isn’t going to be great).
• The API for specifying the byte distribution is fairly opaque to Hypothesis and as a result it limits a number of opportunities for improved performance.

So I’m exploring an alternative implementation of the core where you specify an explicit weighted distribution for what the next byte should be and that’s it.

This then opens it up to using something like the above for exploration. The nodes in the tree are prefixes of the bytestream, the leaves are complete test executions. Each non-leaf node has up to 256 (it might be fewer because not every byte can be emitted at every point) child nodes which correspond to which byte to emit next, and then we can treat it as a tree exploration problem to avoid all of the above issues.

I’ve currently got a branch exploring just changing the API and keeping the existing implementation to see how feasible rebuilding Hypothesis this way is. So far the answer is that it’ll be a fair bit of work but looks promising. If that promise pans out, one of the above (probably the shadow tree implementation) is likely coming to a Hypothesis near you at some point in 2017.

(Why yes, it is a programming post! I do still do those. I know these are a bit thin on the ground right now. If you want more, consider donating to my Patreon and encouraging me to write more).

This entry was posted in Hypothesis, programming, Python.

A simple purely functional data structure for append and forward traversal

For reasons I’ll get into in a later post I’m interested in a data structure with the following properties:

• It represents a sequence of values
• It is extremely cheap to append a value to the end in a way that returns an entirely new sequence. The same sequence will typically be extended many times, so no copy-on-write or copy-on-append tricks allowed here (no VLists for you); it has to be genuinely cheap
• Forward traversal is very cheap, especially when the traversal stops after only a small number of elements.
• It needs to be very simple to implement in a variety of languages including C

The fact that you’re appending to the end and traversing from the beginning is the annoying bit.

The data structure to beat here is a normal linked list, interpreted as storing the sequence backwards and reversed on traversal. If we expected to do a full traversal each time that would probably be pretty close to optimal, but the fact that we often want to traverse only a small number of elements makes it annoying.

(Spoilers for later post: these represent partial paths through a rose tree. The reason why we want fast traversal forwards over a couple of elements is that we want to compare them lexicographically. The reason why we want fast appends is that every time we make a choice we create path prefixes for all the choices we could have taken and didn’t. The reason it needs to be simple is that I might want to use this in Hypothesis, which has “be easily portable to new languages” as an explicit design goal).

The “right” data structure here is probably a finger tree. I don’t want to use a finger tree for one very simple reason: Finger trees are a complete pain to implement in Haskell, let alone another language.

It turns out there’s a data structure that achieves these goals. I haven’t benchmarked it, so it’s unclear at present how useful it is in practice, but for the specific requirements I have it’s probably pretty good. It’s also pretty simple to implement.

The initial idea is that we store our elements as a left-leaning binary tree. That is, a binary tree where the left hand branch always points to a complete binary tree.

Appending then consists of either creating a new split node with the current tree on the left and the new leaf on the right (if the existing tree is already complete), or appending the element to the right hand branch (if the tree is not yet complete then the right hand branch must not yet be complete). i.e. we just do the natural thing for adding it to the rightmost side of a left-leaning binary tree.

This is already pretty good. Append is log(n) in the worst case, but in practice the right hand side of the tree will usually be very cheap to update. Traversing a binary tree from left to right is pretty straightforward.

There are only two problems with it:

• Forward traversal isn’t that good, because we have to descend into the deepest part of the tree to get to the first element.
• We need to decide what sort of bookkeeping we want to do to track the size of the tree.

We’ll solve the second problem first. We could do this by size annotating every tree node (that was my original idea), but we actually don’t need to, because the total size of the tree is enough to work out the number of elements below any node: the left hand node always contains the largest power of two strictly smaller than the total number of elements in the tree, and the right hand node contains whatever is left.
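Concretely, that size rule can be captured by a tiny helper (in Haskell, since that's where this post is headed; the name is mine, not from the original code):

```haskell
-- Number of elements in the left hand subtree of a tree holding n
-- elements (n >= 2): the largest power of two strictly smaller than n.
-- The right hand subtree then holds the remaining n - leftSize n elements.
leftSize :: Int -> Int
leftSize n = until (\p -> 2 * p >= n) (2 *) 1
```

So a tree holding 12 elements splits 8 to the left and 4 to the right, while 9 elements split 8 and 1.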

So instead of size annotating every node we work with bare trees and define a wrapper type which is simply a tree and a count of elements. For some workloads this won’t always be an improvement (because it may reduce sharing), but for most it should be.

The first problem is also straightforward to solve (the solution is not one I had seen before, but is certainly not novel): we “hoist” the first element of the left subtree up to the split node. So where a binary tree normally either stores elements only at its leaves or stores an element between its two subtrees, here we store the leftmost element that “should” be in the left subtree. Note that this means that our balance condition is actually that the number of elements in the left subtree is one fewer than the number of elements in our right subtree (because the element on the current node is implicitly part of the left subtree).

The nice feature of this is that it means that the first element is accessible in O(1). Indeed, the i’th element is accessible in O(min(i, log(n))) – if i < log(n) then finding the i’th element is just a question of reading down the left hand side of the tree, same as it would be if this were a normal cons list.

The following Haskell code (yes I can still write Haskell, thank you very much) provides a minimal implementation of this concept:
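(A minimal implementation along those lines, reconstructed from the description above rather than taken from the original snippet – in particular the names, the helper functions `merge` and `shift`, and the exact handling of the complete-tree case are my own.)

```haskell
-- The bare tree carries no size annotations; a Seq pairs a tree with
-- its element count. Each Split holds the element hoisted from the
-- front of its left block.
data Tree a = Nil | Leaf a | Split a (Tree a) (Tree a)

data Seq a = Seq Int (Tree a)

empty :: Seq a
empty = Seq 0 Nil

-- Largest power of two strictly smaller than n (n >= 2): the size of
-- the left block (hoisted element plus left subtree).
leftSize :: Int -> Int
leftSize n = until (\p -> 2 * p >= n) (2 *) 1

-- The first element is always hoisted to the root, so this is O(1).
first :: Tree a -> a
first (Leaf a)      = a
first (Split a _ _) = a
first Nil           = error "first: empty tree"

append :: Seq a -> a -> Seq a
append (Seq n t) x = Seq (n + 1) (go n t)
  where
    go 0 _        = Leaf x
    go 1 (Leaf a) = Split a Nil (Leaf x)
    go m (Split a l r)
      -- m is a power of two, so the tree is complete and a new level
      -- starts: everything except the (still hoisted) first element is
      -- regrouped into the new left block. This is the one expensive
      -- case, and it only fires when the size crosses a power of two.
      | m == 2 * p = Split a (merge p l r) (Leaf x)
      -- Otherwise there is room on the right: descend into it.
      | otherwise  = Split a l (go (m - p) r)
      where p = leftSize m
    go _ _ = error "malformed tree"

    -- merge q l r: concatenate l (q-1 elements) and r (q elements)
    -- into one valid (2q-1)-element tree.
    merge _ Nil r = r
    merge q l (Split c rl rr) =
      Split (first l) (shift q l c) (merge (q `div` 2) rl rr)
    merge _ _ _ = error "malformed tree"

    -- shift q l c: drop the first element of l (q-1 elements) and
    -- append c, preserving the shape invariant.
    shift _ (Leaf _) c = Leaf c
    shift q (Split _ l2 r2) c =
      Split (first l2) (shift h l2 (first r2)) (shift h r2 c)
      where h = q `div` 2
    shift _ Nil _ = error "malformed tree"

-- O(min(i, log n)) indexing: small i just reads down the left spine.
index :: Seq a -> Int -> a
index (Seq n t0) i0 = go n t0 i0
  where
    go _ (Leaf a) 0 = a
    go m (Split a l r) i
      | i == 0    = a
      | i < p     = go (p - 1) l (i - 1)
      | otherwise = go (m - p) r (i - p)
      where p = leftSize m
    go _ _ _ = error "index out of bounds"

-- Left-to-right traversal.
toList :: Seq a -> [a]
toList (Seq _ t0) = go t0 []
  where
    go Nil acc           = acc
    go (Leaf a) acc      = a : acc
    go (Split a l r) acc = a : go l (go r acc)
```

Note that the regrouping in `merge` rebuilds a whole subtree, but since it only happens when the size crosses a power of two, building a sequence of n elements by successive appends does O(n) total restructuring work; the common case just copies the right spine.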

This turns out to be very close to an existing purely functional data structure which I hadn’t previously known about – Okasaki’s Purely Functional Random-Access Lists. It’s somewhat simpler in places, mostly because of the operations it doesn’t need to bother supporting, but the binary nodes which contain their first child element are the same, and the power-of-two structure is similar but different (in ways where Okasaki’s might just be better – I’m not totally sure, but this is definitely simpler).

I’m still hoping that the CS Theory Stack Exchange will give me a better answer here, but to be honest this is probably more than good enough for my purposes and any further work on this question is basically yak shaving.

This entry was posted in programming.

Review of a book that reviews (code) reviewing

In an earlier version of my recent software development advice post, I said the following:

I also think this will not significantly reduce the number of bugs in your software. Code review is just not a cost effective bug finding tool compared to almost anything else.

I appear to be wrong in this claim, if the literature is to be believed. I don’t have a strong reason to doubt the literature, and on examination I’m not actually certain why I believed the contrary (it appears to be cached knowledge), so I’m going to change my opinion on this: Code review is probably a very cost effective way of finding bugs if you do it well, and may be reasonable at it even if you do it badly.

The thing that’s updated my opinion on the subject is reading “Best Kept Secrets of Peer Code Review”, after Richard Barrell sent me a link to a chapter from the middle of it reviewing the literature.

With a couple of caveats (the terrible title definitely counted amongst them), the book is pretty good. There are two chapters worth skipping, but that’s more a function of the fact that the book is from 2006 than anything else (GitHub’s pull requests, while far from a stellar code review system, are in all ways better than what was widely available back in 2006, and for the common path probably not much worse than the state of the art).

The authors clearly have an agenda and the book is there to promote that, but that’s OK. I have an agenda too – it’s very hard to have expertise without also having an agenda in support of that expertise. And even if I believe they’re overstating the case, the case they present is reasonably convincing. It’s also refreshingly empirical – most software engineering advice is full of anecdotes and opinions and, while this book absolutely contains some of those, it’s also got a really pleasant amount of actual research backed by actual data. The data is often not of amazing quality, in that many of the studies are quite small scale so the statistics are likely underpowered, but that’s still a better standard of evidence than I’m used to.

It’s also pleasantly short. The whole book is only around 160 pages, and I read it over the course of an afternoon.

Anyway, that’s the review section of this post done. Now it’s time for the review section.

I have a couple take homes from this book which are currently in the state of “Plausible sounding statements that I will think about further and probably upgrade most of to actual beliefs”.

• Code review is a very cost effective way of finding defects compared to manual QA of the application – it both has a higher rate of finding them per unit of time and also finds bugs that QA were probably never going to find.
• As your notion of “defect” broadens to include things like your coworkers having no idea what you wrote down, the above statement becomes stronger.
• Code review should be done by individuals (possibly multiple individuals working in parallel) rather than teams working together. As well as being more expensive per unit time, team code review seems to be less effective in absolute terms at finding defects.
• The generally believed opinion that a 10 line change gets more useful review than a 500 line change seems to be actually true but too generous – the actual cut off seems to be more like 200 lines. It’s unclear to me whether this means 200 lines added or a 200 line diff, but I believe it means the former.
• Code review effectiveness drops off dramatically as the time taken approaches about an hour. It’s unclear to me if you can fix this by walking away from the review and coming back later. This may also be the origin of the 200 lines limit – it may be impossible for most people to effectively review code at a rate much higher than 200 lines / hour.
• Code review benefits greatly from a careful first reading of code before you do any reviewing (and doing this stage more slowly is better. Presumably this hits diminishing returns at some point).
• Code review benefits greatly from being structured with checklists.

The checklist one is interesting, and I am actually pretty convinced by it: The overall literature on checklists helping construct reliable processes seems to be reasonably convincing (but my opinion here is very trend driven and I have not actually read any primary research on this subject. It conforms to my biases about the world though, and thus must be true), and the specific data presented in favour of checklists for reviewing is moderately convincing. This is the advice that is the most likely to make an actual practical difference to how I review code in future.

The advice on checklists is interesting in particular in how it’s constructed: They strongly recommend not having just a “standard” checklist, but actively curating one over time: When something is found in review that seems like it would make a good check list item, add it to the list! When an item hasn’t found anything useful in review in a while (say 6 months), think about removing it.

Here are a couple example check list items from the book that look generally useful to me (the example list this is from is much longer):

• Documentation: All subroutines are commented in clear language
• Documentation: Describe what happens with corner-case input.
• Testing: Unit tests are added for new code paths or behaviours.
• Testing: Unit tests cover errors and invalid parameters.
• Testing: Don’t write new code that is already implemented in an existing, tested API.
• Error handling: Invalid parameter values are handled properly early in the subroutine.

There are plenty more. The book suggests that checklists should really only have about the top 10-20 items that you’ve found most useful over time (the sample checklist has 25 items, so apparently they didn’t test the invalid input case here).

One interesting thing worth highlighting is that at least some checklist items may be worth automating out of the review entirely. e.g. “Testing: Unit tests are added for new code paths or behaviours” would mostly be better handled by enforcing a code coverage metric I think.

As well as improving the quality of the review itself, the book highlights another interesting application of checklists: It doesn’t have to just be the reviewer who goes over them! Doing a sort of self review by following the checklist yourself before submitting seems to be nearly as effective at removing bugs as the review itself would be (it is unclear to me if that remains true if you skip the review altogether – my feeling is that people are likely to get sloppy at the checklist if they imagine someone is not going to be checking it later). This presumably speeds up the review process a lot by reducing the number of back and forth exchanges required to pass it.

One suggestion they made off the back of this self review (which seems to come from the Personal Software Process world) is that as well as maintaining a global checklist for reviews it might be worth each person maintaining an individual checklist too: their own list of things worth pre-checking because they often make a mistake in that area which then comes up in review.

Anyway, I’m really glad I read the book. I can’t swear to its correctness, but it’s an interesting and plausible perspective that I’ll be thinking about further.

PS. If you’d like to fuel my book habit, I have a public Amazon wishlist. I’ll happily review any books you buy me…

This entry was posted in Books, programming.