Alexander Shorin’s work on more configurable unicode generation in Hypothesis has to do some interesting slicing of ranges of unicode categories. Doing both generation and shrinking in particular either required two distinct representations of the data or something clever. Fortunately I’d previously figured out the details of the sort of data structure that would let you do the clever thing a while ago and it was just a matter of putting the pieces together.
The result is an interesting purely functional data structure based on Okasaki and Gill’s “Fast Mergeable Integer Maps”. I’m not totally sure we’ll end up using it, but the data structure is still interesting in its own right.
The original data structure, which is the basis of Data.IntMap in Haskell, is essentially a patricia trie treating fixed size machine words as strings of 0s and 1s (effectively a crit-bit trie). It’s used for implementing immutable mappings of integers with fast operations on them (O(log(n)) insert, good expected complexity on union).
With some small twists on the data structure you can do some interesting things with it.
- Ditch the values (i.e. we’re just representing sets)
- Instead of tips being a single key, tips are a range of keys start <= x < end.
- Split nodes are annotated with their size and the smallest interval [start, end) containing them.
When using this to represent sets of unicode letters this is extremely helpful – most of the time what we’re doing is we’re just removing one or two categories, or restricting the range, which results in a relatively small number of intervals covering a very large number of codepoints.
Let T be the number of intervals and W the word size. The data structure has the following nice properties:
- Getting the size of a set is O(1) (because everything is size annotated or can have its size calculated with a single arithmetic operation)
- Indexing to an element in sorted order is O(log(T)) because you can use the size annotation of nodes to index directly into it – when indexing a split node, check the size of the left and right subtrees and choose which one to recurse to.
- The tree can be automatically collapse tointervals in many cases, because a split node is equivalent to an interval if end = start + size, which is a cheap O(1) check
- Boolean operations are generally O(min(W, T)), like with the standard IntSet (except with intervals instead of values)
- Range restriction is O(log(T)).
Note that it isn’t necessarily the case that a tree with intervals [x, y) and [y, z) in it will compress this into the interval [x, z) because their common parent might be further up the tree.
An extension I have considered but not implemented is that you could potentially store very small subtrees as arrays in order to flatten it out and reduce indirection.
In particular the efficient indexing is very useful for both simplification and generation, and the fact that merging efficiently is possible means that we can keep two representations around: One for each permitted category (which helps give a better distribution when generating) and one for the full range (which makes it much easier to simplify appropriately).
Here is an implementation in Python. It’s not as fast as I’d like, but it’s not unreasonably slow. A C implementation would probably be a nice thing to have and is not too difficult to do (no, really. I’ve actually got a C implementation of something similar lying around), but wouldn’t actually be useful for the use case of inclusion in Hypothesis because I don’t want to add a C dependency to Hypothesis just for this.