A feature I’ve been wondering about, as an extension to the original datastructure intmap is based on, is whether one can reasonably give it a more efficient representation for contiguous ranges – rather than having a separate node for every single key in the range, maybe we could just store the start and end of the range? Essentially, a hybrid between this very trivial integer set idea and intmap’s patricia trie, with an additional node type for a contiguous range.
A complicating factor is that intmap has not just keys but also values, and you don’t want to store a contiguous range with distinct values as an array, because then you lose the cheap update properties that make intmap nice in the first place – e.g. insertion becomes O(n) instead of O(log(n)).
But there turns out to be a relatively easy and nice compromise: storing compressed ranges that map to a single value. For the use case where intmap is basically an intset this works out pretty well, and for use cases with a lot of distinct values there is essentially no performance degradation (you lose a little to the extra checks, but not much).
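To make the idea concrete, here is a minimal sketch of the node types such a hybrid trie might use. The names and field layout are my invention for illustration, not intmap’s actual representation; the point is just that a range node carries one value for a whole run of keys.

```python
# Hypothetical node types for a patricia trie with compressed ranges.
# A Range node stands in for (end - start + 1) leaves that all share
# one value, so a contiguous run costs O(1) space instead of O(n).
from dataclasses import dataclass

@dataclass(frozen=True)
class Leaf:
    key: int
    value: str

@dataclass(frozen=True)
class Range:
    start: int   # first key in the range (inclusive)
    end: int     # last key in the range (inclusive)
    value: str   # the single value shared by every key in the range

@dataclass(frozen=True)
class Split:
    prefix: int      # common high bits of all keys below this node
    branch_bit: int  # bit position that distinguishes the two children
    left: object
    right: object
```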
The way ranges are incorporated into intmap is relatively simple: most operations get some special treatment for ranges. For example, if you intersect two ranges with the same value then the result is immediately a new range with that value, and similarly for unioning two ranges that overlap. Subtracting ranges from things is usually much more efficient because it’s very easy to tell if all keys in a map lie in a range (the prefix of the patricia trie defines an upper and lower bound), and a whole bunch of other operations have similar special cases.
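The intersection and union special cases can be sketched as plain interval arithmetic. These helpers are illustrative stand-ins (not intmap’s real API): when both ranges carry the same value, the result is computed directly on the endpoints with no per-key work.

```python
# Hypothetical helpers for the range-on-range fast paths described above.
# Ranges are (start, end) with both endpoints inclusive, sharing `value`.

def intersect_ranges(a_start, a_end, b_start, b_end, value):
    """Intersection of two same-valued ranges; None if they are disjoint."""
    lo, hi = max(a_start, b_start), min(a_end, b_end)
    return (lo, hi, value) if lo <= hi else None

def union_ranges(a_start, a_end, b_start, b_end, value):
    """Union of two same-valued ranges, if it is itself contiguous.

    Returns None when there is a gap between them, i.e. the union
    cannot coalesce into a single range node.
    """
    if a_start > b_end + 1 or b_start > a_end + 1:
        return None  # a gap separates them
    return (min(a_start, b_start), max(a_end, b_end), value)
```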
And when there’s nothing sensible to do with a range – e.g. we’re removing a hole from the middle of a range, or unioning together two ranges that won’t coalesce, etc – we just turn it into a split node with two ranges as children. Often we’ll then want to recurse and will end up expanding more range nodes, but for many operations if we start with a range we’ll still only end up with a tree with O(log(n)) nodes for O(n) keys – e.g. if we insert or delete a key from a range.
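The "punch a hole in a range" case can be sketched the same way: deleting one key from an n-key range produces at most two smaller ranges, which is why the result is still O(log(n)) nodes for O(n) keys rather than O(n) leaves. Again the function is an invented illustration, not intmap’s actual code.

```python
# Hypothetical sketch of expanding a range node on deletion: removing a
# key from the middle of [start, end] yields at most two child ranges.

def delete_from_range(start, end, value, key):
    """Return the ranges left after removing `key` from [start, end]."""
    if key < start or key > end:
        return [(start, end, value)]            # key not in range: unchanged
    pieces = []
    if start <= key - 1:
        pieces.append((start, key - 1, value))  # piece below the hole
    if key + 1 <= end:
        pieces.append((key + 1, end, value))    # piece above the hole
    return pieces
```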
The code for this has ended up a bit hairy in places (honestly, intmap could use a bit of a cleanup already I think), but it appears to work rather well, so I’m quite pleased with the result.
In writing this, testmachine has been an absolute godsend. I found so many bugs in the implementation with it.
The process for writing things against testmachine is that I provide a parallel implementation of the functionality in Python, except that implementation can be as basic as I like. In the parallel Python representation, intmaps are just dicts mapping ints to strings, and the operations are correspondingly simple, with little concern for efficiency.
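For flavour, a parallel model along those lines might look something like this. The function names and the collision rule for union are my own choices for the sketch, not testmachine’s or intmap’s actual API; the point is that the model is allowed to be naively O(n) per operation.

```python
# A deliberately naive dict-based model of intmap operations, of the
# kind used as a testmachine reference implementation.

def model_insert_range(m, start, end, value):
    """Model of inserting a contiguous range: just set every key."""
    for k in range(start, end + 1):
        m[k] = value
    return m

def model_union(a, b):
    """Model of union (here, b's values win on colliding keys)."""
    out = dict(a)
    out.update(b)
    return out
```

Discrepancies between this model and the real C implementation are exactly what the tool hunts for.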
Then it finds bugs. Almost inevitably. This is a combination of two factors:
- intmap code is quite tricky to get right. I’m emulating a lot of things C isn’t designed to do in C, and C is tricky to get right in the first place, and frankly datastructure implementations are always a bit of a pain.
- testmachine is almost magically good at finding bugs.
I’m not kidding about point 2. I’ve more than doubled the size of the deterministic test suite as part of this chunk of work, and the new tests all come from things testmachine has found. As I opined on Twitter earlier: when you first encounter Quickcheck-style testing it feels like you’ve been given a magical bug finding tool. This is like that experience all over again, with the baseline moved up to Quickcheck.