Really simple spatial indices

It’s something of a truism that when N is small, asymptotic complexity doesn’t matter. As a result, for small data sets it’s often the case that doing the dumbest thing that could possibly work is faster than using a really clever data structure or algorithm. For small N, insertion sort probably beats quicksort. For small N, you may be better off just storing everything in an array of key-value pairs and sorting on demand rather than using a binary tree, etc.

It is, however, often the case that you can take this observation and incorporate it into your clever approach by bucketing the base case: you can switch to insertion sort once your quicksort is over a short enough range (and indeed this happens recursively), you can make the leaves in your binary tree contain lists and only split them when they get too large, etc.

Similarly, when I was playing around with metric search implementations, I found that with a vantage point tree there really wasn’t much point in letting your leaf nodes be of size 1. At some point in the recursion the vantage point tree stops being a win and you’re better off just storing a list of points and doing a distance calculation on each one.

But then I started thinking: distance calculations are potentially quite expensive. In my test cases they were only expensive-ish (standard Euclidean space, albeit high dimensional). In principle, if storing a little bit of extra information along with the list could save us even a few distance calculations on average, we’d probably be doing quite well.

This article is about some ideas I have in this space. I’ve only tried some of them, and I don’t have sensible benchmarks of any of this, but anecdotally they do seem helpful.

Tracking diameters

By storing a single float, you potentially save yourself a lot of distance computations. The caveat is that this is only helpful in the cases where your result is going to be one of two trivial values: the empty set or all of A.

Suppose I have a set of elements A and I know that \(diam(A) = t\). Given a query point x and radius e, I want to find all elements a of A such that \(d(x, a) \leq e\). We’ll be doing the simple thing of querying every point for its distance.

But we can shortcut. Suppose we calculate \(d(a, x)\) for some point a in A. Because we know the diameter, the triangle inequality may let us do one of two things:

  1. If \(d(a, x) > t + e\) then the query completely misses A and the result is empty: every other point b satisfies \(d(x, b) \geq d(x, a) - d(a, b) > e\)
  2. If \(d(a, x) + t \leq e\) then every point in A is within e of x and we can just return all of A: every point b satisfies \(d(x, b) \leq d(x, a) + d(a, b) \leq e\)

Neither of these is a terribly likely case, but they’re very easy to check for and can save you a goodly number of distance calculations when they do hit.
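
To make this concrete, here’s a minimal sketch of a leaf query with these checks. All the names are my own inventions for illustration: points is the leaf’s list, diameter its precomputed diameter, dist the metric, x the query point and e the query radius.

    def query_leaf(points, diameter, dist, x, e):
        # Empty leaves have nothing to prune or return.
        if not points:
            return []
        # One distance we pay for regardless: the first point in the leaf.
        d0 = dist(points[0], x)
        if d0 > diameter + e:
            return []            # shortcut 1: the query ball misses the whole leaf
        if d0 + diameter <= e:
            return list(points)  # shortcut 2: the query ball contains the whole leaf
        # Neither shortcut fired: check every remaining point.
        # points[0] is already decided by d0, so reuse it.
        result = [points[0]] if d0 <= e else []
        result.extend(a for a in points[1:] if dist(a, x) <= e)
        return result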

It’s also worth noting that this idea can potentially be applied further up the tree. If you track the diameter of the current subtree while you’re querying, you can potentially eliminate entire subtrees very quickly. I’m unsure, however, whether that’s worth the maintenance overhead (it’s a little difficult to figure out where you can backtrack to).

Ring searching

This one can be regarded as something of a vantage point tree lite (it isn’t one, but there are similarities).

The idea is this: we pick a distinguished center point, c. I’m not entirely sure what the best way to do this is. Picking at random may be good enough; picking to maximize some spread score such as \( S(c) = \mathrm{Var}(\{ d(a, c) : a \in A \}) \), or to minimize the radius around c, might be better. I think experimentation is needed. We now store A in order of increasing \(d(c, a)\), along with those distance values.
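
As a sketch of the construction (build_ring and dist are hypothetical names; center selection is just random choice here, since the best strategy is an open question):

    import random

    def build_ring(points, dist):
        # Pick a distinguished center. Random choice is the simplest option;
        # maximizing spread or minimizing radius are the alternatives above.
        c = random.choice(points)
        # Sort by distance to the center, keeping the distances around.
        # The key avoids comparing the points themselves on ties.
        pairs = sorted(((dist(c, a), a) for a in points), key=lambda p: p[0])
        distances = [d for d, _ in pairs]
        ordered = [a for _, a in pairs]
        return c, distances, ordered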

Why is this useful?

Well, we can change our query pattern as follows:

First we calculate \(d(x, c)\). We need to do this anyway in order to determine whether c is in the query, so no loss.

Now, we know the radius around c (it’s just the last of the stored distances); let’s call it r. So we may be able to shortcut in the same way we used the diameter: we return all of A if \(d(x, c) + r \leq e\), or \(\emptyset\) if \(d(x, c) > r + e\).

But that’s a fairly trivial case. It might be useful to store just a center and its radius to do that (I don’t know if it is), but it’s certainly not necessary to store all the distances.

So, why store all the distances?

Because you know by the triangle inequality that every hit a in A must satisfy

$$d(x, c) - e \leq d(a, c) \leq d(x, c) + e$$

This gives you a filter that can cut out a lot of the points immediately if e is small compared to r. Further, because we’ve stored the points in sorted order, rather than scanning the whole list and using this as a first pass, we can use two binary searches to find the exact interval of interest and scan only that.
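
Putting the pieces together, a sketch of the query against the structure from build_ring above (again, all names are mine):

    from bisect import bisect_left, bisect_right

    def ring_query(c, distances, ordered, dist, x, e):
        # distances is sorted ascending, with distances[i] == dist(c, ordered[i]).
        dxc = dist(x, c)
        r = distances[-1]          # radius around c: the largest stored distance
        if dxc > r + e:
            return []              # the query ball misses the whole set
        if dxc + r <= e:
            return list(ordered)   # the query ball contains the whole set
        # Triangle inequality: any hit a satisfies
        #   dxc - e <= d(a, c) <= dxc + e,
        # so two binary searches give the only interval worth scanning.
        lo = bisect_left(distances, dxc - e)
        hi = bisect_right(distances, dxc + e)
        return [a for a in ordered[lo:hi] if dist(a, x) <= e]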

Conclusion

Like I said, I don’t know whether either of these is helpful. I’ve implemented the ring searching as a metric search structure in its own right, and it seems pretty competitive with the vantage point tree on highish dimensional spaces – for a lot of queries you can use the ring to cut the search space down to about 10% or less of the whole thing immediately, then just do a scan on those points. The lower constant overheads seem to be a win. I haven’t tried the diameter based pruning yet, so I don’t know how much it helps, but it would be surprising if it weren’t at least a bit of a win.
