Author Archives: david

Proving or refuting regular expression equivalence

Suppose we are interested in the following problem: We have two regular expressions. Do these match exactly the same strings, and if not what is the smallest (shortlex minimal) string that one matches but not the other?

i.e. we either want to prove they are equivalent or provide the best possible refutation that they are not.

How do we do this?

More specifically, let’s consider this problem for extended regular expressions. That is, we have the following types of expressions:

  • \(\epsilon\), which matches only the empty string
  • \(c\), for \(c\) in our alphabet, which matches only the character \(c\)
  • \(\emptyset\), which matches no strings
  • \(\omega\) which matches every string
  • \(x^*\) which matches \(a_1\ldots a_n\) for any strings \(a_i\) matched by \(x\).
  • \(xy\) where \(x\) and \(y\) are extended regular expressions, which matches any strings \(ab\) where \(x\) matches \(a\) and \(y\) matches \(b\)
  • \(x \vee y\) which matches any string matched by either or both of \(x\) or \(y\)
  • \(x \wedge y\) which matches any string matched by both \(x\) and \(y\)
  • \(\neg x\) which matches any string not matched by \(x\).

That is, we extend the normal regular expressions with intersection and negation. This doesn’t actually increase the set of languages we can match, but it can lead to significantly more concise representations (I think, but am not sure, there are extended regular expressions which require exponentially larger regular expressions to represent).

This reduces to the problem of finding the lexmin smallest member of the regular language for an expression or showing that it’s empty: Given two extended regular expressions \(x, y\), we can define \(x \oplus y = (x \vee y) \wedge (\neg (x \wedge y))\) – the symmetric difference. This matches any string which is matched by one but not both of \(x\) and \(y\), so a string is a member of this regular language precisely if it is a refutation of the claim that \(x\) and \(y\) are equivalent.

If we can compile extended regular expressions into a deterministic finite automaton (which we can), then the problem is easy:  To find the lexmin smallest element of a regular language or show that it’s empty given a deterministic finite automaton for it, you just use Dijkstra’s Algorithm to search the graph for an accepting node. You stop when you either find one (in which case you’ve constructed a lexmin path to it along the way) or you’ve explored all the states (in which case there is no path to an accepting node and the language is empty). Job’s done.

That was easy. Are we done?

Well, no. There are a lot of evils hidden in that sentence “just construct a deterministic finite automaton”. The corresponding automaton to a regular language may be exponentially larger than the language, and even when it’s not it’s still an expensive operation. Can we avoid that?

First, let’s take a detour: There are a number of obvious rewrite rules we can use to simplify our lives, by replacing regular expressions with straightforwardly equivalent versions of them. We use \(x \approx y\) to indicate that we should consider \(x\) and \(y\) the same under these rewrite rules:

  • \(x \wedge x \approx x\)
  • \(x \wedge y \approx y \wedge x\)
  • \(x \wedge (y \wedge z) \approx (x \wedge y) \wedge z\)
  • \(x \wedge \emptyset \approx \emptyset\)
  • \(x \wedge \omega \approx x\)
  • \(x \vee x \approx x\)
  • \(x \vee y \approx y \vee x\)
  • \(x \vee (y \vee z) \approx (x \vee y) \vee z\)
  • \(x \vee \emptyset \approx x \)
  • \(x \vee \omega \approx \omega \)
  • \(x(yz) \approx (xy)z\)
  • \(x\emptyset \approx \emptyset\)
  • \(\emptyset x \approx \emptyset\)
  • \(x \epsilon \approx x\)
  • \(\epsilon x \approx x\)
  • \((x^*)^* \approx x^*\)
  • \(\epsilon^* \approx \epsilon\)
  • \(\emptyset^* \approx \epsilon\)
  • \(\omega^* \approx \omega\)
  • \(\neg \emptyset \approx \omega\)
  • \(\neg \omega \approx \emptyset\)
  • \(\neg(\neg x) \approx x\)

If \(x \approx y\) we say that \(x\) and \(y\) are similar.

All of these correspond to rules that apply for the resulting languages, so if \(x \approx y\) then \(x\) and \(y\) match the same language. In general though \(x\) and \(y\) might match the same languages without being similar.

Why do we care about this? Well the first reason is that it lets us have a fast test for when two regular expressions definitely are equivalent: If they are similar then they are equivalent. If they’re not we have to do more work to test this.

In particular, we can use smart constructors for our regular expression data type to apply all of these rewrite rules at construction time, which then makes similarity simply a matter of testing for structural equality. If we use hash consing (essentially: memoizing all constructors so that structurally equal values are represented by the exact same object) we can even make testing for similarity O(1).

Which is good, because we’re going to need it a lot: The real reason to care about similarity of regular expressions is that we can construct the a deterministic finite automaton using the Brzozowski derivative of regular expressions, with each of our states labelled by a regular expression – if we have some character in our alphabet \(c\) and a regular language \(L\) then the Brzozowski derivative is \(d(L, c) = \{v: cv \in L\}\) – the set of suffixes of strings starting with \(c\) in \(L\). We can extend this to regular expressions with the following set of rules:

  • \(d(\epsilon, c) = \emptyset\)
  • \(d(b, c) = \emptyset\) if \(b \neq c\), else \(d(c, c) = \epsilon\)
  • \(d(\emptyset, c) = \emptyset\)
  • \(d(\omega, c) = \omega\)
  • \(d(x^*, c) = d(x, c) x^*\)
  • \(d(xy, c) = d(x, c)y \vee \nu(x) d(y, c)\)
  • \(d(x \vee y, c) = d(x, c) \vee d(y, c)\)
  • \(d(x \wedge y, c) = d(x, c) \wedge d(y, c)\)
  • \(d(\neg x, c) = \neg d(x, c)\)

Where \(\nu(x)\) is \(\emptyset\) if the corresponding regular language doesn’t match the empty string, or \(\epsilon\) otherwise (this can be calculated directly from the expressions by similar rules).

The idea is that we can construct an automaton for a regular language by labelling states by expressions. States are accepting if the corresponding expression matches the empty string, and the transition from state \(x\) given character \(c\) goes to \(d(x, c)\).

If our similarity rule was enough to guarantee equivalence then this would be a minimal deterministic finite automaton for the language. However, even without that there are only finitely many states reachable from a given regular expression (this is where similarity becomes important! Without it this is not true. e.g. if you consider iterating the derivatives of \(c^*\) under \(c\) you’ll get infinitely deep nested chains of similar but unequal expressions).

So far it isn’t obvious why this is interesting, but it has one very important property: You can walk this automaton without having to compute any of the states in advance. This means that you can build it as you explore, so we can repeat the above approach of just using Dijkstra’s algorithm to find a refutation but we don’t have to build the automaton in advance.

It’s particularly easy to calculate the derivatives of the disjoint sum language too:

\(d(x \oplus y, c) = d(x \vee y, c) \wedge d(\neg (x \wedge y), c) =  (d(x, c) \vee (y, c)) \wedge (\neg (d(x, c) \wedge d(y, c))) = d(x, c) \oplus d(y, c)\).

So if we’ve got the derivatives of our original regular expressions cached in some way it’s very cheap to calculate derivatives of our compound regular expression.

So that’s a major improvement on our original algorithm – in a number of cases it will be exponentially faster!

There are a number of ways you can improve it further: You can prune down the possible set of starting characters for a string in an imprecise way to reduce the number of branches you explore when doing Dijkstra. You can also precalculate when two characters will lead to the same state and always consider only the smallest. This is a worthwhile set of optimisations to the algorithm, but doesn’t affect it in any really fundamental way.

Can we do better yet?

The problem is that although it’s easy to calculate, our new automaton potentially has a lot more states than we want. In general it will be fewer, but if there are \(M\) states in \(x\) and \(N\) in \(y\) then there might be as many as \(MN\) in the automaton for the symmetric difference regular expression. We’ll likely not have to explore all of them, but it’s unclear how many you might have to explore.

But you can do equivalence testing in nearly O(M + N) using an algorithm due to Hopcroft and Karp based on union-find (Note: the algorithm called the Hopcroft-Karp algorithm is confusingly something completely different). The idea is that assume that they are equivalent and we merge states until we find a contradiction, and that by aggressively merging states we can consider a lot fewer of them:

Given two automata \(S\) and \(T\) with initial states \(s\) and \(t\), we test equivalence as follows:

def equivalent(S, T):
    s = S.start
    t = T.start
    merges = UnionFind()
    merges.union(s, t)
 
    stack = [(s, t)]
    while stack:
        p, q = stack.pop()
        for a in ALPHABET:
            pa = merges.find(S.transition(p, a))
            qa = merges.find(S.transition(q, a))
            if pa.accepting != qa.accepting:
               return False
            if pa != qa:
                merges.union(pa, qa)
                stack.append((pa, qa))
    return True

The idea is that if two states are equivalent then the states they transition to under the same character must be equivalent, and if we’ve proved that an accepting and non-accepting state are equivalent then we’ve reached a contradiction and our initial assumption must have been false. This merges two states if and only if there is some string that leads to one in one automaton and the other in the other, so if there is some string that one matches but not the other it will eventually merge those two states (if it doesn’t stop earlier).

Because of the way we aggressively merge states, this runs in time O(k(M + N)) where k is the size of the alphabet and M and N are the size of the respective states (there’s actually an inverse Ackermann factor in there too, but the inverse Ackermann function is so close to constant it might as well be ignored).

The problem with this algorithm is that it doesn’t produce a refutation, only a yes or no. We can trivially modify it by pushing paths along with the states, but the paths generated aren’t necessarily useful and in particular don’t necessarily go to a state that is accepting in one and not in the other – we’ve replaced paths to states we believed to be equivalent, but because the core assumption is wrong this didn’t necessarily work. In particular there’s no guarantee that either of the states that we use to refute the result are accepting, only that we falsely believe them to be equivalent to two states where one is accepting and the other is not.

We can still make use of this by using this as a fast initial test; Use Hopcroft and Karp’s algorithm as a fast initial test for equivalence and if it returns false then construct a refutation once we know one must exist. This algorithm also still works well with the lazy construction.

But it would be nice to be able to produce a witness as an integrated part of running this algorithm.

And it turns out we can: The idea is to hybridise the two algorithms – use Dijkstra’s algorithm to build the witness, but instead of operating on a graph corresponding to a single automaton, we build a graph on pairs of states, one from each automaton – an edge then involves following the corresponding edge in each automaton. Rather than searching for an accepting node, we are now searching for a node where one state is accepting and the other is not.

The key thing we then borrow from the algorithm of Hopcroft and Karp is that we can skip states where we have merged the pairs, because there is nothing interesting down that route even if we haven’t seen it.

Here’s some code for the resulting algorithm:

def witness_difference(left, right):
    if left is right:
        return None
    if left.matches_empty != right.matches_empty:
        return b''
    merges = UnionFind()
    merges.union(left, right)
 
    queue = deque()
    queue.append((left, right, ()))
    while len(queue) > 0:
        p, q, s = queue.popleft()
        for a in ALPHABET:
            pa = derivative(p, a)
            qa = derivative(q, a)
            if pa is qa:
                continue
            t = (a, s)
            if merges.find(pa) != merges.find(qa):
                if pa.matches_empty != qa.matches_empty:
                    result = bytearray()
                    while t:
                        a, t = t
                        result.append(a)
                    result.reverse()
                    return bytes(result)
                merges.union(pa, qa)
                queue.append((pa, qa, t))
    return None

One important difference here is that we always put the actual reached pairs of states onto the queue rather than the collapsed pair. This ensures that the paths we take are real paths to real states in the corresponding automaton. Also, because we stop as soon as we ever would merge an accepting and no-accepting state, the problem we worried about above can’t happen.

Also it’s worth noting that I’m not sure this algorithm produces a minimal refutation. It should produce a pretty small one, but there could be a smaller one reachable through two states which the algorithm has merged but that are not actually equivalent.

The requirement for witnesses is a slightly unusual one – most algorithms just focus on testing for equivalence – so it would be reasonable to ask why we care.

The reason I care is that I’m interested in the following more general question: I want to maintain a set of N regular languages as witnessed by regular expressions, and I want to be able to test for membership of this set and add new elements to it.

I’m still not sure what the best way to do this is, but all of the best ideas I have require refutations to prevent you from having to do O(N) equivalence tests. For example, the following works: Maintain a binary tree, where each branch node has a string and each child node is a regular expression. When testing for membership, walk the tree as follows: At a branch, if the language contains the string then go to the left child else go to the right child. At a leaf, perform an equivalence test. If it returns true then the language is a member.

Insertion is then nearly the same: We walk until we get a leaf. At the leaf we perform an equivalence test with a refutation. If the two languages are equivalent then there’s no insertion to perform. If they’re not equivalent, then we create a split node with the refutation string as the key and put whichever one contains it as a leaf for the left child and the other one as a right.

This isn’t a great algorithm, in that it doesn’t have any easy way to balance the tree, but it’s always better than doing N equivalence checks: Matching a short string is very cheap, and we always perform at most N – 1 matches (potentially many fewer) and exactly one equivalence check.

In fact, it is not in general possible to balance the tree; Consider N strings where string k matches only the string consisting of a sequence of k 0s. Then each test string has exactly one language which matches it, so every split must have one of its children be a leaf and the look up problem degenerates to always testing N strings.

Another algorithm is to maintain a trie of all the refutations found so far. You can then simultaneously walk the regular expression’s deterministic finite automaton and the trie to produce a bit vector where bit i is set if the language contains the i’th string. This bitvector is then the same as that of at most one existing language, which can be looked up using e.g. a hash table. So to test membership you produce the bitvector and check for presence in the hash table. If it’s not present then the language isn’t present. If it is present, then check equivalence with the unique existing language with that bitvector. Insertion is similar: If the bitvector isn’t present, add the language to the hash table and leave the set of refutations alone. Else, check for equivalence with a refutation to the unique existing language with the same bitvector. If they’re equivalent, do nothing. If they aren’t, add the refutation to the trie and update the bitvector for all existing languages (you can actually do this last step incrementally as you go, but it may not be worth it).

The second algorithm is probably strictly worse than the first because it involves an O(N) computations on a lot of inserts, as well as testing for membership being more expensive. It really depends on how much cheaper equivalence is than matching for your language workload, and how likely existing refutations are to witness future ones.

One reason why being able to do the above is interesting is that it allows us to extend the lazy construction of a deterministic finite automaton to a lazy construction of a minimal deterministic finite automaton: If we can maintain such a set relatively efficiently, then instead of merely conflating similar expressions into the same states we can label states with regular languages, giving us the proper derivative automaton rather than just our approximation to it.

So that’s where my ability to answer this leaves off a bit. There may be a good data structure for this, but if there is I haven’t found the right corner of the literature to tell me what it is.

if you want to have a play with some of this, here’s some (mostly untested so probably badly wrong in places) Python code implementing a lot of these ideas:

If you’d like to learn more about this, most of the information in this post either came from or was based on material from Regular-expression derivatives reexamined,  Testing the Equivalence of Regular Languages and Algorithms for testing equivalence of finite automata, with a grading tool for JFLAP (which are in turn based on a lot of other papers you can get to from their citations).

(And, as always, if you’d like to see more posts like this and feel able to spare the price of less than a coffee per month, I encourage you to donate to my Patreon for supporting my writing.)

This entry was posted in Automata Theory, programming on by .

Randomized exhaustive exploration of a many branching tree

Attention conservation notice: This post is largely an obvious in retrospect solution to an obscure problem.

It’s a problem I spent a lot of time considering wrong or complicated solutions too though (this is what the persistent data structure I was considering was for, but it’s not needed in the end), so I thought it was worth writing up the correct solution.

The problem is this: I have a tree of unknown structure. Each node in the tree is either a leaf or has an ordered sequence of N (variable per node) children (technically the leaves are just a special case of this with N=0, but leaves have special properties as we’ll see in a moment).

I want to do an exhaustive search of the leaves of this tree, but there’s a limitation: when walking this tree I may only walk complete branches starting from the root and finishing at a leaf. I cannot back up and choose another branch.

The tree is potentially very large, to the point where an exhaustive search might be impossible. So this should actually be treated as an anytime algorithm, where we’re iteratively yielding new paths to a leaf which some external algorithm is then doing something with. When yielding those we want to guarantee:

  • We never yield the same path twice
  • We know when we have explored the entire tree and stop

It’s easy to do a lexicographic exploration of the branches in sorted order – just build a trail of the choices you took. At each step, increment the last choice. If that causes it to exceed the number of choices available, pop it from the end and increment the previous next choice. When making an unknown choice always choose zero.

But there’s a reason you might not want to do this: Under certain reasonable assumptions, you may end up spending a lot of time exploring very deep leaves.

Those reasonable assumptions are this:

  • All else being equal, we should prefer leaves which are closer to the root (e.g. because each walk to a child is expensive, so it takes less time to explore them and we can explore more leaves / unit time)
  • The expected number depth of a tree below any node given uniformly at random choices is usaully much lower than the maximum depth of the tree elow that point.

Under these assumptions the optimal solution for exploring the tree is instead to just randomly walk it: You will avoid the long paths most of the time because of the expectation assumption, and will do a better job of exploring close to the root.

Unfortunately, pure random exploration does not satisfy our requirements: We might generate duplicate paths, and we don’t know when we’ve explored all the nodes.

So what we want is a hybrid solution that exhaustively searches the tree without duplicates, but chooses its paths as randomly as possible.

Any way, there turns out to be a very simple way of doing this. It’s as follows:

We maintain a list of paths to unexplored nodes, bucketed by their length. This is initially the empty path leading to the root.

We then repeatedly iterate the following steps until there are no unexplored nodes in the list:

  1. Pick an unexplored node of minimal length uniformly at random and remove it from the list
  2. Walk to that node
  3. Walk randomly below that node. Every time we make a random decision on a child, add the path we took to get to this point plus each other decision we could have taken to the list of unexplored nodes.
  4. When we reach a leaf node, yield that path to the controlling algorithm.

Every path we walk goes through a node we have not previously explored, so must be unique. When the list is empty we’ve explored all the nodes and are done.

The correct storage format for the paths is just an immutable singly linked list stored with the last node in the path first – appending a node to the path is just consing it to the front of the list. You’re going to have to do O(n) work to reverse it before walking upwards, but that’s OK: You’re going to have to do O(n) work with n the same and a higher constant factor to walk back up the tree along that path anyway.

The reason for my original need for a data structure with better forward iteration than that was because the algorithm I was originally considering always picked the unexplored node with the shortlex minimal (lexicographically first amongst those with the shortest length) path to it, but there’s no real reason you need to do that and it makes things much more expensive.

If you want to instead do weighted random walks of the tree and are prepared to relax the requirements around exploring shorter paths first a bit, there’s an alternative algorithm you can use (and I’m actually more likely to use this one in practice because I’ll need to do weighted draws) which works as follows:

We build a shadow tree in parallel every time we walk the real tree. The shadow tree is mutable, and each node in it is in one of three states: Unexplored, Active or Finished. It starts with a single root node which is Unexplored.

We maintain the following invariant: Any node in the shadow tree which is not Finished has at least one child node that is not Finished. In particular, leaf nodes are always Finished.

We then decide how to walk the tree as follows: We always walk the shadow tree in parallel, matching our walk of the real tree. Whenever we walk to a node that is Unexplored, we do one of two things:

  1. If the corresponding real node is a leaf, we immediately mark it as Finished (and our walk is now done)
  2. If the corresponding real node has N children, we mark it as Active and create N Unexplored child nodes for it.

Once our walk has complete (i.e. we’ve reached a leaf), we look at the list of nodes in the shadow tree that we encountered  and walk it backwards. If all of the children of the current node are Finished, we mark the node as Finished as well. If not, we stop the iteration.

We can now use the shadow tree to guide our exploration of the real tree as follows: Given some random process for picking which child of the real tree we want to navigate to, we simply modify the process to never pick a child whose corresponding shadow node is Finished.

We can also use the shadow tree to determine when to stop: Once the root node is marked as Finished, all nodes have been explored and we’re done.

Note that these two approaches don’t produce at all the same distribution: If one child has many more branches below it than another, the initial algorithm will spend much more time below that child, while the shadow tree algorithm will explore both equally until it has fully exhausted one of them (even if this results in much deeper paths). It probably depends on the use case which of these are better.

You can use the shadow tree to guide the exploration too if you like. e.g. if there are any unexplored child nodes you could always choose one of them. You could also store additional information on it (e.g. you could track the average depth below an explored node and use that to guide the search).

Anyway, both of these algorithms are pretty obvious, but it took me a long time to see that they were pretty obvious so hopefully sharing this is useful.

As an aside: You might be wondering why I care about this problem.

The answer is that it’s for Hypothesis.

As described in How Hypothesis Works, Hypothesis is essentially an interactive byte stream fuzzer plus a library for doing structured testing on top of this. The current approach for generating that bytestream isn’t great, and has a number of limitations. In particular:

  • At the moment it fairly often generates duplicates
  • It cannot currently detect when it can just exhaustively enumerate all examples and then stop (e.g. if you do @given(booleans()) at the moment the result isn’t going to be great).
  • The API for specifying the byte distribution is fairly opaque to Hypothesis and as a result it limits a number of opportunities for improved performance.

So I’m exploring an alternative implementation of the core where you specify an explicit weighted distribution for what the next byte should be and that’s it.

This then opens it up to using something like the above for exploration. The nodes in the tree are prefixes of the bytestream, the leaves are complete test executions. Each non-leaf node has up to 256 (it might be fewer because not every byte can be emitted at every point) child nodes which correspond to which byte to emit next, and then we can treat it as a tree exploration problem to avoid all of the above issues.

I’ve currently got a branch exploring just changing the API and keeping the existing implementation to see how feasible rebuilding Hypothesis this way is. So far the answer is that it’ll be a fair bit of work but looks promising. If that promise pans out, one of the above (probably the shadow tree implementation) is likely coming to a Hypothesis near you at some point in 2017.

(Why yes, it is a programming post! I do still do those. I know these are a bit thin on the ground right now. If you want more, consider donating to my Patreon and encouraging me to write more).

This entry was posted in Hypothesis, programming, Python on by .

Aligning incentives

As I mentioned in my recurring revenue goal post I now have a Patreon for supporting my blogging here. I’ve only got a couple supporters on it so far, but huge thanks to all of them!

The current rewards are mostly access to the Patreon feed where I have some additional commentary, post a few ideas for posts that I’ve not fully formed yet, etc.

But there’s also a tacit “if I make money from this I’ll blog more” promise that I thought might be worth turning into an explicit promise (I like systems. If you’re someone who would support me on Patreon this probably isn’t a surprise).

I already have a Beeminder goal for my blogging, committing to an average number of blog posts a week. It’s not actually what keeps me blogging most of the time – the rate it’s been on for ages is 1/week which is a very easy goal for me to achieve and I currently could fail to blog for 66 days without failing the goal as a result – but it’s there to keep me blogging even when life conspires against me doing so.

And it also makes a pretty good way of providing a system for my Patreon.

So here are the rules I’m going to be trying:

  • The rate on my Beeminder goal is 1 + (money I get from patreon each month) / 200 per week. This gets up to daily blogging if the Patreon somehow hits the ridiculous amount of $1200/month, but I’m not really anticipating that happening. I’ve correspondingly raised my Beeminder rate to 1.19/week now (note for those unfamiliar with Beeminder: Rate changes take a week to take effect)
  • Every time I go past a new $50/month boundary I’ll clear the safety buffer down to a week. e.g.  if I get another $12/month from where I currently am then that 66 days worth of safety buffer goes bye.

(Side note: Depending on exactly what I post, this isn’t really a very good rate of “pay” for writing blog posts and might not even be minimum wage, so it’s best for me if I don’t think of it that way, hence thinking about this as an incentive scheme instead)

If I ever feel like the promised rate gets high enough that it’s started to impact post quality I’ll declare a cutoff and drop it back down and see if I can figure out a new way to reward patrons. I’m pretty confident that I’m nowhere near that though given that my actual weekly blogging rate is about 3 a week, which I’m not even 10% of the way towards reaching yet…

So, if you want more blogging, or even if you want the current amount of blogging to continue come what may, do consider supporting me on Patreon.

This entry was posted in Admin on by .

A simple purely functional data structure for append and forward traversal

For reasons I’ll get into in a later post I’m interested in a data structure with the following properties:

  • It represents a sequence of values
  • It is extremely cheap to append a value to the end in a way that returns an entirely new sequence. The same sequence will typically be extended many times, so no copy on write or append tricks allowed here (no vlists for you) it has to be genuinely cheap
  • Forward traversal is very cheap, especially when the traversal stops after only a small number of elements.
  • It needs to be very simple to implement in a variety of languages including C

The fact that you’re appending to the end and traversing from the beginning is the annoying bit.

The data structure to beat here is to just use a normal linked list, interpreted as storing the list backwards, and then reverse on traversal. If we expected to do a full traversal each time that would probably be pretty close to optimal, but the fact that we often want to traverse only a small number of elements makes it annoying.

(Spoilers for later post: these represent partial paths through a rose tree. The reason why we want fast traversal forwards over a couple of elements is that we want to compare them lexicographically. The reason why we want fast appends is that every time we make a choice we create path prefixes for all the choices we could have taken and didn’t. The reason it needs to be simple is that I might want to use this in Hypothesis, which has “be easily portable to new languages” as an explicit design goal).

The “right” data structure here is probably a finger tree. I don’t want to use a finger tree for one very simple reason: Finger trees are a complete pain to implement in Haskell, let alone another language.

It turns out there’s a data structure that achieves these goals. I haven’t benchmarked it, so unclear at present how useful it is in practice, but for the specific requirements I have it’s probably pretty good. It’s also pretty simple to implement.

The initial idea is that we store our elements as a left-leaning binary tree. That is, a binary tree where the left hand branch always points to a complete binary tree.

Appending then consists of either creating a new split node with the current tree on the left and the new leaf on the right (if the existing tree is already complete), or appending the element to the right hand branch (if the tree is not yet complete then the right hand branch must not yet be complete). i.e. we just do the natural thing for adding it to the rightmost side of a left-leaning binary tree.

This is already pretty good. Append is log(n) in principle, but in practice that side of the binary tree will usually be very cheap. Traversing a binary tree from left to right is pretty straightforward.

There are only two problems with it:

  • Forward traversal isn’t that good, because we have to descend into the deepest part of the tree to get to the first element.
  • We need to decide what sort of book keeping we want to do to check the size of the tree.

We’ll solve the second problem first. We could do this by size annotating every tree node (that was my original idea), but we actually don’t need to because we only need to know the total size of the tree to work out the number of elements below any node: The left hand node always contains the largest power of two strictly smaller than the total number of elements in the tree, the right hand node contains whatever is left.

So instead of size annotating every node we work with bare trees and define a wrapper type which is simply a tree and a count of elements. For some work loads this won’t always be an improvement (because it may reduce sharing), but for most it should.

The first problem is also straightforward to solve (the solution is not one I had seen before, but is certainly not novel to me): We “hoist” the first element of the left subtree up to the split node. So where a binary tree is normally either key list or stores an element between its two subtrees, here we store the leftmost element that “should” be in the left subtree. Note that this means that our balance condition is actually that the number of elements in the left subtree is one fewer than the number of elements in our right subtree (because the element on the current node is implicitly part of the left subtree).

The nice feature of this is that it means that the first element is accessible in O(1). Indeed, the i’th element is accessible in O(min(i, log(n))) – if i < log(n) then finding the i’th element is just a question of reading down the left hand side of the tree, same as it would be if this were a normal cons list.

The following Haskell code (yes I can still write Haskell, thank you very much) provides a minimal implementation of this concept:

This turns out to be very close to an existing purely functional data structure which I hadn’t previously known about – Okasaki’s Purely Functional Random-Access Lists. It’s somewhat simpler in places, mostly because of the operations it doesn’t need to bother supporting, but the binary nodes which contain the first child element is the same and the power-of-two structure is similar but different (in ways where Okasaki’s might just be better. I’m not totally sure, but this is definitely simpler).

I’m still hoping that the CS Theory Stack Exchange will give me a better answer here, but to be honest this is probably more than good enough for my purposes and any further work on this question is basically yak shaving.

This entry was posted in programming on by .

Randomly productive

Note: This is me pre-registering a Beeminder based productivity experiment. I don’t yet know if it’s going to work, but it feels like a good idea.

You know that thing where you have multiple projects and you use them all to procrastinate on each other? And you know how it results in some projects getting sidelined even though you actually really do want/need to work on them?

I’ve got a lot of that going on right now, and I’m not very happy about it.

I’ve been trying to figure out how to do better about it. Given my proclivities, this really means that I want to figure out how to define a Beeminder goal that will reduce the incidence of it.

Well, I think I’ve figured one out.

The idea is to structure the goal in a way that forces me to keep on the wagon for my active projects. It’s not there to make me “do enough” on them, only to keep the amount of time since I last worked on them roughly bounded. This should be sufficient to end the procrastination loop where I keep putting them off – once I’ve done some work on them I’m more likely to do more.

And the way to achieve this is the productivity deck.

The productivity deck is a deck of index cards each of which has a project name on it. They’re not necessarily disjoint projects – some may overlap, and the same project may even be in the deck multiple times.

The productivity deck is used to select today’s nominated project by randomly shuffling the deck and picking out a card. That is now today’s project. There is then a beeminder goal (using tagtime, so this is doubly random) which is “time spent working on today’s project”.

Rules:

  1. If I pick a card and go “This is a stupid project I no longer want to work on it” then I may rip up the card and pick a new one. Note that the requirement is “I no longer want to work on it at all”, not “I no longer want this project to be in the deck”.
  2. If I just don’t want the project in the deck I may accept it for the day but not put it back in the deck.
  3. Cards don’t really have to be “projects” per se. e.g. I’ve got “brainstorm” and “Start a new project” as cards in the deck. I also have a “Pick a card of your choice” card.
  4. I do not have to pick a card each day, but on days where I don’t pick a card I just don’t get to record any progress on the goal.
  5. I may not add new cards to the deck until I have picked the day’s card.
  6. If there is legitimately nothing I can do on this project today because I am blocked on something external I may put the card aside and pick a new one. I may also do this after I have done some work on the nominated project if I get blocked.
  7. If I can’t think of anything to do on today’s nominated project then I should be planning new things to do on it (which counts as working on it).
  8. I am of course allowed to work on projects that are in the deck but not today’s nominated project, I just don’t get any points for them.

Attempts to describe this as “gamifying productivity” will be met with swift and harsh retribution.

I have already started this experiment and today’s card was “blogging”. Tagtime pinged while I was writing this, so I get to record 24 minutes of progress for today, yay (update: And pinged again while I was crafting the tweet to post this, which I’m totally counting)! Points for me! In a totally not gamified sense at all!

As I said, this is mostly a pre-registering of this experiment. We’ll see whether it works – it has the characteristic complexity of my trying stupid things with Beeminder that turn out not to work, but I’m still reasonably optimistic and think it’s a good idea. Of course, that is also characteristic of things that end up not working. But hey, it might work and hopefully it’ll be fun finding out whether it does.