For reasons I’ll get into in a later post I’m interested in a data structure with the following properties:
- It represents a sequence of values
- It is extremely cheap to append a value to the end in a way that returns an entirely new sequence. The same sequence will typically be extended many times, so no copy on write or append tricks allowed here (no vlists for you) it has to be genuinely cheap
- Forward traversal is very cheap, especially when the traversal stops after only a small number of elements.
- It needs to be very simple to implement in a variety of languages including C
The fact that you’re appending to the end and traversing from the beginning is the annoying bit.
The data structure to beat here is to just use a normal linked list, interpreted as storing the list backwards, and then reverse on traversal. If we expected to do a full traversal each time that would probably be pretty close to optimal, but the fact that we often want to traverse only a small number of elements makes it annoying.
(Spoilers for later post: these represent partial paths through a rose tree. The reason why we want fast traversal forwards over a couple of elements is that we want to compare them lexicographically. The reason why we want fast appends is that every time we make a choice we create path prefixes for all the choices we could have taken and didn’t. The reason it needs to be simple is that I might want to use this in Hypothesis, which has “be easily portable to new languages” as an explicit design goal).
The “right” data structure here is probably a finger tree. I don’t want to use a finger tree for one very simple reason: Finger trees are a complete pain to implement in Haskell, let alone another language.
It turns out there’s a data structure that achieves these goals. I haven’t benchmarked it, so unclear at present how useful it is in practice, but for the specific requirements I have it’s probably pretty good. It’s also pretty simple to implement.
The initial idea is that we store our elements as a left-leaning binary tree. That is, a binary tree where the left hand branch always points to a complete binary tree.
Appending then consists of either creating a new split node with the current tree on the left and the new leaf on the right (if the existing tree is already complete), or appending the element to the right hand branch (if the tree is not yet complete then the right hand branch must not yet be complete). i.e. we just do the natural thing for adding it to the rightmost side of a left-leaning binary tree.
This is already pretty good. Append is log(n) in principle, but in practice that side of the binary tree will usually be very cheap. Traversing a binary tree from left to right is pretty straightforward.
There are only two problems with it:
- Forward traversal isn’t that good, because we have to descend into the deepest part of the tree to get to the first element.
- We need to decide what sort of book keeping we want to do to check the size of the tree.
We’ll solve the second problem first. We could do this by size annotating every tree node (that was my original idea), but we actually don’t need to because we only need to know the total size of the tree to work out the number of elements below any node: The left hand node always contains the largest power of two strictly smaller than the total number of elements in the tree, the right hand node contains whatever is left.
So instead of size annotating every node we work with bare trees and define a wrapper type which is simply a tree and a count of elements. For some work loads this won’t always be an improvement (because it may reduce sharing), but for most it should.
The first problem is also straightforward to solve (the solution is not one I had seen before, but is certainly not novel to me): We “hoist” the first element of the left subtree up to the split node. So where a binary tree is normally either key list or stores an element between its two subtrees, here we store the leftmost element that “should” be in the left subtree. Note that this means that our balance condition is actually that the number of elements in the left subtree is one fewer than the number of elements in our right subtree (because the element on the current node is implicitly part of the left subtree).
The nice feature of this is that it means that the first element is accessible in O(1). Indeed, the i’th element is accessible in O(min(i, log(n))) – if i < log(n) then finding the i’th element is just a question of reading down the left hand side of the tree, same as it would be if this were a normal cons list.
The following Haskell code (yes I can still write Haskell, thank you very much) provides a minimal implementation of this concept:
This turns out to be very close to an existing purely functional data structure which I hadn’t previously known about – Okasaki’s Purely Functional Random-Access Lists. It’s somewhat simpler in places, mostly because of the operations it doesn’t need to bother supporting, but the binary nodes which contain the first child element is the same and the power-of-two structure is similar but different (in ways where Okasaki’s might just be better. I’m not totally sure, but this is definitely simpler).
I’m still hoping that the CS Theory Stack Exchange will give me a better answer here, but to be honest this is probably more than good enough for my purposes and any further work on this question is basically yak shaving.