Category Archives: programming

An algorithm for incrementally building separation graphs

You know that annoying thing when you’re reading a paper and it’s fairly clear they’ve not actually tried that algorithm in practice? Don’t you hate that?

Anyway, here’s an algorithm I’ve not tried in practice. I haven’t even written code for it yet. This post is as much to get the ideas clear in my head as anything else. This algorithm may turn out to be worse than doing it the stupid way in practice.

This is the problem we’re trying to solve:
We have n points, \(X = \{x_1, \ldots, x_n\}\), a metric d, and a distance threshold t. We are trying to build the undirected graph with edges \(E = \{ (i, j) : d(x_i, x_j) > t \} \).

This is motivated by trying to find small diameter partitions of X, which we will do by trying either to prove that (X, E) is not bipartite or to find a bipartition of it (see this previous post).

The question is how to do this efficiently. The obvious brute force approach is \(O(n^2)\), and indeed any solution must take \(\Omega(n^2)\) time in the worst case. We’d like to do well in practice by applying the following two key ideas:

  1. By using the triangle inequality, we may often be able to avoid doing distance calculations at all, because we can infer that a distance will be too large or too small
  2. We are likely not to need the entire graph – a proof that a graph is not bipartite may involve only a very small subset of the nodes, so if we build the graph incrementally we may be able to stop early

The way we do this will be as follows:

For each node x, we keep track of two sets, Definite(x) and Candidates(x). We will also index distances as we find them by tracking Distances(x), which will track all points we’ve already calculated the distance from x to.

Roles and implementation requirements

Definite(x) is a set containing all nodes we know are > t away from x, and starts empty. We require efficient addition of new elements without introducing duplicates, and an iterator which is guaranteed to cover every element of the set even if new ones are added during iteration, even after it has previously claimed there are no elements left. One way to do this is to back the set with both an array and a hash set.

Candidates(x) contains all points for which we don’t yet know whether or not they are > t from x, and starts equal to X. As a result we want a set implementation for which it is cheap to allocate an instance containing everything and cheap to delete from. This is where the previous post about integer sets comes in (we represent a node by its index).

Distances(x) is some sort of ordered map from distances to lists of points. It needs to support efficient range queries (i.e. give me everything nearer than s or further than s).
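
To make those requirements concrete, here’s a rough sketch of what the per-node state might look like in Java. All the names are mine, and a plain HashSet stands in for the integer set representation from the previous post:

  import java.util.*;

  // Hypothetical per-node bookkeeping. Nodes are represented by their index
  // into the point array.
  class NodeState {
    // Definite(x): nodes known to be > t away from x. We append to the list and
    // iterate over it by index, so elements added mid-iteration still get seen;
    // the hash set is only there to prevent duplicates.
    final List<Integer> definiteOrder = new ArrayList<>();
    final Set<Integer> definiteMembers = new HashSet<>();

    // Candidates(x): nodes whose distance from x we know nothing about yet.
    // Starts out containing every node.
    final Set<Integer> candidates = new HashSet<>();

    // Distances(x): distance -> nodes already measured at that distance from x.
    // A TreeMap gives us the range queries we need via headMap/tailMap.
    final TreeMap<Double, List<Integer>> distances = new TreeMap<>();

    void addDefinite(int node) {
      if (definiteMembers.add(node)) {
        definiteOrder.add(node);
      }
    }
  }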

The Algorithm

Our two basic operations for manipulating the data we’re tracking are Close(x, y) and Far(x, y). Close(x, y) removes x from Candidates(y) and y from Candidates(x). Far does the same thing but also adds x to Definite(y) and y to Definite(x).

Our iterator protocol is that an iterator has an operation Next which returns either None or an Element. We will construct Neighbours(x), which creates an iterator that incrementally returns all points > t away from x, and uses the information discovered whilst doing so to flesh out what we know about the other nodes too.
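
In terms of the hypothetical NodeState sketch above, Close and Far are just symmetric bookkeeping updates (again my own illustration rather than anything I’ve actually run):

  // states[i] holds the NodeState for point i.
  // Close(x, y): we now know d(x, y) <= t, so neither is an edge of the other.
  static void close(NodeState[] states, int x, int y) {
    states[x].candidates.remove(y);
    states[y].candidates.remove(x);
  }

  // Far(x, y): we now know d(x, y) > t, so each is a definite neighbour of the other.
  static void far(NodeState[] states, int x, int y) {
    close(states, x, y);
    states[x].addDefinite(y);
    states[y].addDefinite(x);
  }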

Here is the algorithm, written in some bizarre pseudocode language that looks like nothing I actually write (no, I don’t know why I’m using it either):

 
  Neighbours(x) = NI(x, Iterator(Definite(x)))

  Next(NI(x, found))
    
    # Keep pulling candidates until the iterator over Definite(x) yields something new or we run out
    while ((result = Next(found)) == None && not IsEmpty(Candidates(x))) do
      candidate = Pop(Candidates(x))    
      s = d(x, candidate)

      if s > t then Far(x, candidate) else Close(x, candidate)
      
      # Because d(y, candidate) <= d(y, x) + d(candidate, x) <= t - s + s = t
      for(y in Distances(x) where d(y, x) <= t - s) Close(candidate, y)

      # Because if d(y, candidate) <= t then d(y, x) <= d(y, candidate) + d(candidate, x) <= t + s
      for(y in Distances(x) where d(y, x) > t + s) Far(candidate, y)

      # Because if d(y, x) < s - t then d(candidate, y) >= d(candidate, x) - d(y, x) > s - (s - t) = t
      if s > t then for(y in Distances(x) where d(y, x) < s - t) Far(candidate, y)

      # If we have no more candidates left then Distances(x) will never be used again and is wasted space
      if IsEmpty(Candidates(x)) then delete Distances(x) else Put(Distances(x), s, candidate)
    done

    return result

The idea being that as we're searching for edges for a given node we're also using this to fill out the edges for the other nodes. This should work particularly well for doing depth first search, because in particular it often means that the nodes we're going to transition to will have some of their neighbours filled in already, and we may in fact be able to spend most of our time just chasing through Definite sets rather than having to do new distance calculations. Even where we haven't been able to do that, hopefully we've managed to prune the lists of elements to consider significantly before then.

Notes

  • It should be fairly evident that this is not only \(O(n^2)\) worst case, it's probably \(O(n^2)\) expected case for exploring the whole graph, even in reasonable metric spaces - it's likely that one or the other of the two sets of points we're looking at during each iteration step will be \(O(n)\) - so the only possible win over the brute force algorithm is if it does significantly fewer expensive distance calculations
  • I've not yet written any real code for this, so I'm sure some of the details are wrong or awkward
  • Even if it's not wrong or awkward it may well turn out to have sufficiently bad constant factors, or save few enough distance invocations, that it's worthless in practice

A cute algorithm for a problem you’ve never cared to solve

As part of my recent playing with metric search structures, I was wondering about the following problem: Given a metric space X I want to partition X into two sets A and B such that each of A and B has a small diameter. Equivalently, I’m trying to find A that minimizes max(diam(A), diam(\(A^c\))).

The brute force solution for this would be fairly awful: There are \(2^n\) partitions, calculating the score of each is \(O(n^2)\), so the brute force solution takes \(O(2^n n^2)\). So let’s not do that.

I was expecting that I’d have to find an approximate solution, as this looked like a classic hairy NP-hard optimization problem. I’d started contemplating local optimizations, simulated annealing, etc. Then after doing a bit more searching I happened across this paper, which in fact has a polynomial time exact solution to the problem: Diameter Partitioning by David Avis.

The algorithm is incredibly cute – it uses absolutely no advanced ideas, just two clever observations which reduce the entire problem to elementary algorithms. It’s the sort of thing which as soon as you see it you just kick yourself for not having thought of it. So I thought I’d write about it to share how much I enjoyed it, even if no one else is actually likely to care about solving this problem.

The first observation is this:

We can find a partition A, B which has max(diam(A), diam(B)) \(\leq\) t if and only if we can find two sets A, B such that whenever \(d(x, y) > t\), x and y are in different halves of the partition (this is just restating the definition).

That is, if we draw an edge between x and y if and only if \(d(x, y) > t\), then a partition with max(diam(A), diam(B)) \(\leq\) t is precisely a bipartition (a proper 2-colouring) of this graph.

But bipartitions are easy to find! Or at least, \(O(n^2)\) to find, which is no worse than calculating the diameter of a set in the first place. You just do a graph search on each of the components, marking nodes alternately black and white (i.e. if you’re coming from a white node you mark this node black, if you’re coming from a black node you mark this node white) as you find them, and yell “Impossible!” if you ever try to paint a node a different colour than you’ve already painted it. Done.
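
For concreteness, here’s roughly what that check looks like in Java over an adjacency list representation of the separation graph (a throwaway sketch of the standard 2-colouring search, not code from the paper):

  import java.util.*;

  // Returns a 2-colouring (0 or 1 per node) of the graph if it is bipartite,
  // or null if some component contains an odd cycle. adj.get(i) lists the
  // neighbours of node i, i.e. the points that are > t away from point i.
  static int[] twoColour(List<List<Integer>> adj) {
    int n = adj.size();
    int[] colour = new int[n];
    Arrays.fill(colour, -1);                  // -1 means "not yet visited"
    Deque<Integer> queue = new ArrayDeque<>();
    for (int start = 0; start < n; start++) {
      if (colour[start] != -1) continue;
      colour[start] = 0;
      queue.add(start);
      while (!queue.isEmpty()) {
        int node = queue.remove();
        for (int next : adj.get(node)) {
          if (colour[next] == -1) {
            colour[next] = 1 - colour[node];  // paint it the opposite colour
            queue.add(next);
          } else if (colour[next] == colour[node]) {
            return null;                      // "Impossible!"
          }
        }
      }
    }
    return colour;
  }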

So for any given t we have an \(O(n^2)\) algorithm that either finds us a partition at least that good or tells us that none exists. Now how do we find a best possible partition?

The obvious answer is binary search, but binary searching on floats is never going to give us the exact answer, only a series of successively better approximations. This then motivates the second observation: our distances are discrete – the optimal threshold must be one of the \(\binom{n}{2}\) pairwise distances.

If we work out the distance between every pair (we already have to do that to build the graph) then we can put these pairs together with their distances into an array and sort that array. Then it’s just a simple matter of binary searching that array, and additionally we can use it to build the graph without calculating any more distances – just take the pairs to the right of the current point in the array.

Building the sorted array is \(O(n^2 \log(n))\), and the binary search performs \(O(\log(n))\) feasibility checks, each an \(O(n^2)\) operation, so in total the cost of this algorithm is \(O(n^2 \log(n))\). A lot better than brute force.
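
Putting the two observations together, the whole search might look something like this sketch (reusing twoColour from above; the Pair record, names and structure are all mine, and it assumes the pairwise distances have already been computed):

  import java.util.*;

  // A pair of point indices together with their precomputed distance.
  record Pair(int i, int j, double distance) {}

  // Finds the smallest threshold among the pairwise distances for which the
  // separation graph is still 2-colourable, and returns that colouring; the
  // two colour classes are the partition.
  static int[] bestPartition(int n, Pair[] pairs) {
    if (pairs.length == 0) return new int[n];   // 0 or 1 points: nothing to do
    Arrays.sort(pairs, Comparator.comparingDouble(Pair::distance));
    int lo = -1, hi = pairs.length - 1;         // the largest distance always works: no edges at all
    int[] best = colourAbove(n, pairs, hi);
    while (lo + 1 < hi) {                       // invariant: hi is feasible, lo is not (or the untested sentinel)
      int mid = (lo + hi) / 2;
      int[] colouring = colourAbove(n, pairs, mid);
      if (colouring != null) { hi = mid; best = colouring; } else { lo = mid; }
    }
    return best;
  }

  // Builds the graph whose edges are all pairs strictly further apart than
  // pairs[k].distance (the pairs to the right of k, skipping ties) and tries
  // to 2-colour it.
  static int[] colourAbove(int n, Pair[] pairs, int k) {
    double t = pairs[k].distance();
    List<List<Integer>> adj = new ArrayList<>();
    for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
    for (int m = k + 1; m < pairs.length; m++) {
      if (pairs[m].distance() > t) {
        adj.get(pairs[m].i()).add(pairs[m].j());
        adj.get(pairs[m].j()).add(pairs[m].i());
      }
    }
    return twoColour(adj);
  }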

Notes

  • It is apparently NP-hard to do this if you want a partition into more than two parts
  • Any algorithm which guarantees better than a factor of two of the right answer has to examine every pair of points and thus must take \(\Omega(n^2)\) time. You can see this by constructing a metric space of n points, picking a distinguished pair x, y and setting \(d(x, y) = 2\) whilst setting all other distances to 1. Any partition which separates x and y will have a score of 1, any other will have a score of 2, and by randomly picking x and y you can see that there’s no way to know this except by examining every pair
  • The above example also demonstrates that it takes \(\Omega(n^2)\) time to even calculate a diameter, which means that this is also a lower bound on knowing for sure how good your partition is
  • If you don’t care about solving this exactly and only care that the result is “good” in some fuzzy sense I think you can do a lot better in practice. There are good approximate diameter algorithms, and we didn’t take advantage of the triangle inequality at all in our algorithm for building the separation graph, so I think it’s possible to adjust the partition testing code in a way that is much better than \(O(n^2)\) in many nice cases, even if it necessarily degenerates to \(O(n^2)\) in the worst cases. I’ll write more about this once I’ve worked out the details.

Really simple spatial indices

It’s something of a truism that when N is small asymptotic complexity doesn’t matter. As a result, for small data sets it’s often the case that doing the dumbest thing that could possibly work is faster than using a really clever data structure or algorithm. For small N insertion sort probably beats quicksort. For small N you may be better off just storing everything in an array of key value pairs and sorting on demand rather than using a binary tree, etc.

It is however often the case that you can take this observation and incorporate it into your clever approach by bucketing the base case: You can do insertion sort if your quicksort is over a short range (and indeed this happens recursively), you can make the leaves in your binary tree contain lists and only split them when the leaves get too large, etc.

Similarly, when I was playing around with metric search implementations, I found that using a vantage point tree there really wasn’t much point in letting your leaf nodes be of size 1. At some point in the recursion the vantage point tree stops being a win and you’re really better off just storing a list of points and doing a distance calculation on each one.

But then I started thinking: distance calculations are potentially quite expensive. In my test cases they were only expensive-ish (standard Euclidean space, albeit high dimensional). In principle, if storing a little bit of extra information along with the list saves us even a few distance calculations on average, we’re probably doing quite well.

This article is about some ideas I have in this space. I’ve only tried some of them, and I don’t have sensible benchmarks of any of this, but anecdotally they do seem helpful.

Tracking diameters

By storing a single float, you can potentially save yourself a lot of distance computations. The caveat is that this is only helpful when the result is going to be one of two trivial values: nothing or everything.

Suppose I have a set of elements A and I know that \(diam(A) = t\). I want to query A for all elements a such that \(d(x, a) \leq e\). We’ll be doing the simple thing of querying every point for its distance.

But we can shortcut. Suppose we calculate \(d(a, x)\) for some point \(a \in A\). Because we know the diameter, we may be able to use this to do one of two things:

  1. If \(d(a, x) > t + e\) then the triangle inequality gives us that the query completely misses A and the result is empty
  2. If \(d(a, x) + t \leq e\) then every point in A is within e of x and we can just return all of A

Neither of these is a terribly likely case, but they’re very easy to check for and potentially save you a goodly number of distance calculations in the cases where you hit them.
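
A minimal sketch of those checks in Java (my own illustration; Metric is a stand-in interface for the distance function, and diam is assumed to have been computed once when the list was built):

  import java.util.*;

  interface Metric<P> { double distance(P a, P b); }

  // All points of A within distance e of x, where diam is a precomputed upper
  // bound on the diameter of A.
  static <P> List<P> queryLeaf(List<P> A, double diam, P x, double e, Metric<P> d) {
    List<P> hits = new ArrayList<>();
    if (A.isEmpty()) return hits;
    double first = d.distance(A.get(0), x);
    if (first > diam + e) return hits;                  // case 1: the query misses A entirely
    if (first + diam <= e) return new ArrayList<>(A);   // case 2: every point of A is a hit
    if (first <= e) hits.add(A.get(0));                 // otherwise fall back to checking everything
    for (int i = 1; i < A.size(); i++) {
      if (d.distance(A.get(i), x) <= e) hits.add(A.get(i));
    }
    return hits;
  }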

It’s also worth noting that this idea can potentially also be applied further up the tree. If you track the diameter of the current subtree while you’re querying you can potentially eliminate entire subtrees very quickly. I’m unsure however as to whether that is worth the maintenance overhead (it’s a little difficult to figure out where you can backtrack to).

Ring searching

This one can be regarded as something of a vantage point tree lite (it’s not, but there are similarities).

The idea is this: We pick a distinguished center point, c. I’m not entirely sure what the best way to do this is. Picking at random may be good enough, picking to maximize some spread score such as \( S(c) = Var(\{ d(a, c) : a \in A \}) \) might be better, or to minimize the radius around c. I think experimentation is needed. We now store A in order of increasing \(d(c, a)\), along with those distance values.

Why is this useful?

Well, we can change our query pattern as follows:

First we calculate \(d(x, c)\). We need to do this anyway in order to determine whether c is in the query, so no loss.

Now, we know the radius around c (it’s just the largest of the stored distances), let’s call it r. So we may be able to shortcut in the same way we used the diameter: we return A if \(d(x, c) + r \leq e \), or \(\emptyset\) if \(d(x, c) > e + r\).

But that’s a fairly trivial case. It might be useful to just store a center and the radius around it to do that (I don’t know if it is), but it’s certainly not necessary to store all the distances.

So, why store all the distances?

Because you know by the triangle inequality that any hit \(a \in A\) must satisfy

$$d(x, c) - e \leq d(a, c) \leq d(x, c) + e$$

This gives you a filter that may cut out a lot of the points immediately if e is small compared to r. Further, because we’ve stored the points in sorted order, rather than scanning the whole list and using this as a first pass we can use two binary searches to find the exact interval of interest and just search that.
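
Here’s a sketch of that query (again my own code, reusing the Metric stand-in from the earlier sketch; points and dists are sorted together by distance from the center c):

  import java.util.*;

  // Points of the ring within distance e of x. dists[i] = d(c, points[i]),
  // and both arrays are sorted by that value.
  static <P> List<P> ringQuery(P[] points, double[] dists, P c, P x, double e, Metric<P> d) {
    List<P> hits = new ArrayList<>();
    if (points.length == 0) return hits;
    double dxc = d.distance(x, c);
    double r = dists[dists.length - 1];                             // radius around c
    if (dxc > r + e) return hits;                                   // the whole ring is missed
    if (dxc + r <= e) return new ArrayList<>(Arrays.asList(points)); // the whole ring is hit
    // Triangle inequality: any hit a must have d(c, a) within e of d(c, x),
    // so two binary searches give the only interval worth scanning.
    int lo = lowerBound(dists, dxc - e);
    int hi = upperBound(dists, dxc + e);
    for (int i = lo; i < hi; i++) {
      if (d.distance(x, points[i]) <= e) hits.add(points[i]);
    }
    return hits;
  }

  // First index i with a[i] >= key.
  static int lowerBound(double[] a, double key) {
    int lo = 0, hi = a.length;
    while (lo < hi) { int mid = (lo + hi) / 2; if (a[mid] < key) lo = mid + 1; else hi = mid; }
    return lo;
  }

  // First index i with a[i] > key.
  static int upperBound(double[] a, double key) {
    int lo = 0, hi = a.length;
    while (lo < hi) { int mid = (lo + hi) / 2; if (a[mid] <= key) lo = mid + 1; else hi = mid; }
    return lo;
  }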

Conclusion

Like I said, I don’t know if either of these is helpful. I’ve implemented the ring searching as a metric search structure in its own right, and it seems pretty competitive with the vantage point tree on highish dimensional spaces – for a lot of queries you can use the ring to cut the search space down to about 10% or less of the whole thing immediately, then just do a scan on those points. The lower constant overheads seem to be a win. I haven’t tried the diameter based pruning yet, so I don’t know how much it helps, but it would be surprising if it’s not at least a bit of a win.


Keeping track of the Kth smallest element

This post is brought to you by the school of bloody obvious, but the details seemed just odd enough compared to normal selection that I thought it might be worth mentioning.

For the Java vantage point tree thing I found myself needing to select the smallest k elements when doing a nearest neighbour search. Selection algorithms are pretty well covered, but this wasn’t quite that straightforward.

The key thing that made this different is that I wanted to keep track of the largest of the k values I’d found so far, because this meant I could prune entire subtrees from the search immediately rather than descending into them (if it’s obvious that nothing in a subtree can be closer than the largest value found so far then there’s no point in searching it).

So what I wanted was a data structure that supported the following operations:

  • Give me the largest current element in the structure
  • Add an element if there are currently fewer than k elements or if it’s smaller than one of the current elements
  • Give me a list of all the elements in the structure

What turns out to work well here is to just use a binary heap, but rather than using the min-heap you’d expect to use for selecting the smallest elements you use a max-heap. The operations are then implemented as follows:

  • The largest element is just the head of the heap (i.e. the zero index in a heap array).
  • Adding an element is a normal heap add if the heap has fewer than k elements. If the heap has k elements, you check if the element is < the head and if it is replace the head with it and rebalance the heap
  • Getting a list is just a matter of taking everything in the heap

Some Java code implementing this is here, though it doesn’t exactly follow what I’ve described (rather than comparing elements directly it has double scores, it returns infinity as the max element if there are fewer than k elements and it doesn’t heapify until the heap is full).
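
As a stripped-down version of the same idea (not the linked code), using Java’s PriorityQueue as the max-heap:

  import java.util.*;

  // Keeps the k smallest elements seen so far and exposes the largest of them,
  // so that callers can prune. Backed by a max-heap.
  class KSmallest<T> {
    private final int k;
    private final Comparator<T> order;
    private final PriorityQueue<T> heap;   // max-heap: the head is the largest kept element

    KSmallest(int k, Comparator<T> order) {
      this.k = k;
      this.order = order;
      this.heap = new PriorityQueue<>(order.reversed());
    }

    // The largest of the kept elements, or null while we still have fewer than k.
    T worst() {
      return heap.size() < k ? null : heap.peek();
    }

    // Add an element if there is still room, or if it is smaller than the current worst.
    void add(T element) {
      if (heap.size() < k) {
        heap.add(element);
      } else if (order.compare(element, heap.peek()) < 0) {
        heap.poll();          // drop the current largest
        heap.add(element);    // and let the heap rebalance around the new element
      }
    }

    // Everything currently kept, in no particular order.
    List<T> elements() {
      return new ArrayList<>(heap);
    }
  }

In the nearest neighbour search you’d keep a KSmallest of candidate points ordered by distance to the query, and skip any subtree whose closest possible distance already exceeds worst() (once it is non-null).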

Not rocket science, but it took me a few minutes to get the details right (I was initially trying to use a min heap and it wasn’t working for all the obvious reasons), so hopefully this post will be useful or interesting to someone.


A Vantage Point Tree implementation in (shudder) Java

My friend Dave Stark has an ongoing project to write an OCR engine in Java called Longan. It’s still very much a work in progress, but I’ve been enjoying hearing about that progress and it sounds like it’s going well.

He’s currently experimenting with a new backend for the recognition based on a metric search approach instead of the neural networks he’s been using so far and mentioned that it was currently rather slow. So I was all “Ooh! Vantage Point Trees! You should use Vantage Point Trees! They’re fun and awesome. Here, I will write you one!”

Err, so I did. It took me about two hours all told, plus some tinkering afterwards to fix the bits I was ashamed of. It’s a reasonably complete implementation – it does sensible center selection when building the tree, it buckets leaves into small numbers of points rather than making you do distance comparisons all the way down to the bottom and it tracks the radius of subtrees to let you quickly eliminate trees which aren’t going to hit at all.

Currently supported API:

  • Given a metric and a collection of points, build a vantage point tree
  • Given a vantage tree and a point p find all points in the tree within epsilon of p
  • Given a vantage tree and a point p find the n nearest points to p in the tree

In particular there are no mutation operations on the tree to speak of, mostly because I couldn’t be bothered.

It’s currently what you might call “lightly tested” in the sense that there isn’t even the pretence of a test suite. I’ve tried it and it seems to work (and my trying it was thorough enough that I found and fixed various bugs) but I haven’t got around to writing proper tests for it. The performance is surprisingly good – on the order of a few ms for querying out of a million (reasonably well distributed – I haven’t tried it on less favourable cases) points.

Although it works, it should probably be considered more of a tech demo than a real library. I’m happy to accept patches and I may fix reported bugs, but I’m not likely to otherwise maintain it and I’m certainly not going to sort out packaging for it. If you would find such a thing useful, be my guest to take it, run with it and otherwise do whatever you want with it.
