David R. MacIver's Blog: Keeping track of the Kth smallest element

Keeping track of the Kth smallest element

15 December 2011

This post is brought to you by the school of bloody obvious, but the details seemed just odd enough compared to normal selection that I thought it might be worth mentioning.

For the Java vantage tree thing I found myself needing to do a select smallest k elements when doing a nearest neighbour search. selection algorithms are pretty well covered, but this wasn’t quite that straightforward.

The key thing that made this different is that I wanted to keep track of the largest of the k values I’d found so far, because this meant I could prune entire subtrees from the search immediately rather than descending into them (if it’s obvious there’s nothing closer than the largest value found so far in the subtree then there’s no point in searching it).

So what I wanted was a data structure that supported the following operations:

Give me the largest current element in the structure
Add an element if there are currently fewer than k elements or if it’s smaller than one of the current elements
Give me a list of all the elements in the structure

What turns out to work well here is to just use a binary heap, but rather than using the min-heap you’d expect to use for selecting the smallest elements you use a max-heap. The operations are then implemented as follows:

The largest element is just the head of the heap (i.e. the zero index in a heap array).
Adding an element is a normal heap add if the heap has fewer than k elements. If the heap has k elements, you check if the element is < the head and if it is replace the head with it and rebalance the heap
Getting a list is just a matter of taking everything in the heap

Some Java code implementing this is here, though it doesn’t exactly follow what I’ve described (rather than comparing elements directly it has double scores, it returns infinity as the max element if there are fewer than k elements and it doesn’t heapify until the heap is full).

Not rocket science, but it took me a few minutes to get the details right (I was initially trying to use a min heap and it wasn’t working for all the obvious reasons), so hopefully this post will be useful or interesting to someone.

Comments

David R. MacIver on 2012-01-30 13:23:41:

[...] is another post brought to you by the school of the bloody obvious. Or at least it should be, but despite having implemented a trivial variation on this data [...]