David R. MacIver's Blog: Integer sets with O(1) range allocation and O(log(n)) deletion

Integer sets with O(1) range allocation and O(log(n)) deletion

30 January 2012

This is another post brought to you by the school of the bloody obvious. Or at least it should be, but despite having implemented a trivial variation on this data structure before it still took me far too long to remember it existed.

My previous use case was the original version of IntAllocator in the rabbitmq Java client API (it’s an internal class used for allocating channel numbers). In looking it up I’ve noticed that it no longer does, for reasons that do make sense, but those reasons don’t apply for my current use case (basically the new version uses an internal bitset, trading space for time. RabbitMQ only needs to allocate one of these, and the range isn’t that large, so trading space for time is completely acceptable).

My current use case was for building separation graphs for metric spaces. Part of an algorithm I’m playing with requires keeping a candidate list for each element: That is, a list of elements which might be > t away but we’re not sure. This needs to not take up O(n) space when the answer is “the whole set” and we need to be able to remove elements from it efficiently. Hence the constraints in the title.

The resulting structure is basically a range compressed bitset built out of a balanced binary tree. It works as follows:

A node in the tree may be either empty, an inclusive range represented by its endpoints or a split around a pivot of m such that the left hand side of the tree contains elements <= m and the right contains elements > m. We enforce the invariants that neither child of a split may be empty, and that ranges must have endpoints which admit at least one element.

Allocating an entire range is then just a matter of creating a range node holding two integers, i.e. O(1).

Deleting an element is harder, but not much.

Deleting an element from an empty node does nothing.

Deleting an element from a range has three distinct cases:

If the element is not contained within the range, do nothing
If the element is one of the endpoints, decrement or increment the relevant endpoint. If this results in the range being empty, replace this with an empty node
If the element is internal to the range, split this range around its midpoint and delete from the split instead

Deleting an element from a split is even more straightforward: You determine which child the element should belong in and delete from that. If this results in that child being empty, replace this node with the other child.

Notes

I’ve put together a Haskell implementation to demonstrate the algorithm
There are at least two sensible ways to split a range. You can either do it around the element being deleted (which guarantees you’ll not have to recurse - you just allocate a split and two new ranges) or you can do it around the midpoint of the range. The former is cheaper but can result in the tree becoming unbalanced, the latter guarantees you’ll never need more than log(n) levels of recursion but also means most of the time you’ll use log(n) levels of recursion. You could also implement a hybrid where you split around the deleted point unless you’re below a certain depth in the tree, but I’m not sure it’s worth the added implementation cost
It is useful for split nodes to keep track of the smallest interval containing them as you can then shortcut if the element to be deleted lies outside that interval
You can also implement deleteRange efficiently (I think in O(log(n), though there’s a detail I haven’t checked there). This is left as an exercise for the interested reader
When I implemented IntAllocator I added a cache for this: A bounded size stack which you pushed elements to delete onto. When the stack grew full you sorted it into ranges and deleted the ranges all at once. There was a specific reason for this that doesn’t apply (sometimes you wanted to say “pop an element, any element” and that came from the stack if it was non-empty), but this might still be a useful thing to do for imperative versions of this code (it’s a bad idea for pure ones)

Comments

David R. MacIver on 2012-01-30 15:57:20:

[...] which is cheap to allocate an instance containing the set and cheap to delete from. This is where the previous post about integer sets comes in (we represent a node by its [...]

Tony Garnock-Jones on 2012-01-31 14:42:28:

I’d be interested to hear if you come up with sensible approaches to union, intersection, and set subtraction for this datastructure as well. I’ve attempted to come up with a solution before, but it was difficult and messy and the resulting code ended up with weird time and space behaviour.

david on 2012-01-31 15:12:06:

Hm. That’s a good question. I’ve not needed to do that, so haven’t worried about it yet.

Set removal and intersection don’t look like they’d be *too* awkward, but I’m not sure you can do them with sensible bounds because you can’t easily ensure both argument shrink when recursing. I’m not even sure what union looks like at the moment.

I’ll have a think about it.

Compressed ranges for intmap | David R. MacIver on 2014-01-04 22:56:40:

[…] the range, maybe we could just store the start and end of the range? Essentially, a hybrid between this very trivial integer set idea and intmap’s patricia trie, which added an additional node type for a contiguous […]