An algorithm for incrementally building separation graphs

You know that annoying thing when you’re reading a paper and it’s fairly clear they’ve not actually tried that algorithm in practice? Don’t you hate that?

Anyway, here’s an algorithm I’ve not tried in practice. I haven’t even written code for it yet. This post is as much to get the ideas clear in my head as anything else. This algorithm may turn out to be worse than doing it the stupid way in practice.

This is the problem we’re trying to solve:
We have n points, \(X = \{x_1, \ldots, x_n\}\), and a metric d. For some threshold t, we are trying to build the undirected graph with edge set \(E = \{ (i, j) : d(x_i, x_j) > t \} \).
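
For concreteness, here is the brute force baseline in Python (a sketch; the function name and signature are mine):

    from itertools import combinations

    def separation_edges(points, d, t):
        # The "stupid way": compare every pair, O(n^2) distance calculations.
        return [(i, j)
                for i, j in combinations(range(len(points)), 2)
                if d(points[i], points[j]) > t]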

This is motivated by trying to find small-diameter partitions of X, which we will do by either proving (X, E) is not bipartite or finding a bipartition of it (see this previous post).

The question is how to do this efficiently. The obvious brute force approach is \(O(n^2)\), and indeed any solution must take \(\Omega(n^2)\) time in the worst case, since the graph may contain \(\Theta(n^2)\) edges. We'd like to do well in practice by applying the following two key ideas:

  1. By using the triangle inequality, we may often be able to avoid distance calculations entirely, because we can infer that a distance will be too large or too small (a small illustration follows this list)
  2. We are likely not to need the entire graph – a proof that a graph is not bipartite may involve only a very small subset of the nodes, so if we build the graph incrementally we may be able to stop early
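
To make idea 1 concrete: once we know d(x, y) and d(x, candidate), the triangle inequality brackets d(y, candidate) without our ever computing it. A toy illustration (the function is mine, purely for exposition):

    def bounds(d_xy, d_xc):
        # Triangle inequality: |d(x,y) - d(x,c)| <= d(y,c) <= d(x,y) + d(x,c)
        return abs(d_xy - d_xc), d_xy + d_xc

    # With t = 10, knowing d(x, y) = 2 and d(x, candidate) = 3 gives
    # d(y, candidate) <= 5 <= t, so (y, candidate) is definitely not an
    # edge and we never need to compute that distance.
    print(bounds(2, 3))   # (1, 5)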

The way we do this will be as follows:

For each node x, we keep track of two sets, Definite(x) and Candidates(x). We will also index distances as we find them in Distances(x), which tracks all the points we've already calculated the distance from x to.

Roles and implementation requirements

Definite(x) is a set containing all nodes we know are > t away from x; it starts empty. We require efficient, duplicate-free addition of new elements, and the ability to create an iterator which is guaranteed to visit every element of the set even if new ones are added during iteration, and even after it has previously claimed there are no elements left. One way to do this would be to have it consist of both an array and a hash set.
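
Here's a minimal Python sketch of that array-plus-hash-set idea (the names are mine; it's one possible realisation, not the only one):

    class IterableSet:
        # Append-only set: an array gives iteration order, a hash set
        # rejects duplicates.
        def __init__(self):
            self._items = []
            self._seen = set()

        def add(self, x):
            if x not in self._seen:
                self._seen.add(x)
                self._items.append(x)

        def iterator(self):
            return Cursor(self._items)

    class Cursor:
        # A resumable iterator: next() returns None when it runs out,
        # but can return fresh elements later if more have been added.
        def __init__(self, items):
            self._items, self._i = items, 0

        def next(self):
            if self._i < len(self._items):
                x = self._items[self._i]
                self._i += 1
                return x
            return None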

Candidates(x) contains all points which we don't yet know for sure to be > t from x or not, and starts equal to X. As a result we want a set implementation where allocating an instance containing all of X is cheap and deletion is cheap. This is where the previous post about integer sets comes in (we represent a node by its index).
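
A Python sketch of a suitable integer set, using the familiar swap-with-last-element trick (my guess at a reasonable representation; initialisation here is O(n), and the integer sets post covers making that cheaper):

    class IntSet:
        # Dense set over {0, ..., n-1}, initially full; O(1) pop and discard.
        def __init__(self, n):
            self._items = list(range(n))   # members, in arbitrary order
            self._where = list(range(n))   # _where[v] = position of v, or -1

        def __len__(self):
            return len(self._items)

        def pop(self):
            v = self._items.pop()
            self._where[v] = -1
            return v

        def discard(self, v):
            i = self._where[v]
            if i >= 0:
                self._where[v] = -1
                last = self._items.pop()
                if i < len(self._items):   # v wasn't the last element
                    self._items[i] = last
                    self._where[last] = i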

Distances(x) is some sort of ordered map from distances to lists of points. It needs to support efficient range queries (i.e. give me everything nearer than s or further than s).
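
A Python sketch using a sorted array and bisect (inserts are O(n) here; a balanced tree would fix that, but this shows the interface the algorithm needs):

    import bisect

    class Distances:
        # Ordered map from distance to point, with range queries.
        def __init__(self):
            self._keys = []     # sorted distances
            self._points = []   # points, parallel to _keys

        def put(self, s, point):
            i = bisect.bisect_left(self._keys, s)
            self._keys.insert(i, s)
            self._points.insert(i, point)

        def at_most(self, bound):      # all y with stored d(x, y) <= bound
            return self._points[:bisect.bisect_right(self._keys, bound)]

        def less_than(self, bound):    # all y with stored d(x, y) < bound
            return self._points[:bisect.bisect_left(self._keys, bound)]

        def greater_than(self, bound): # all y with stored d(x, y) > bound
            return self._points[bisect.bisect_right(self._keys, bound):]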

The Algorithm

Our two basic operations for manipulating the data we're tracking are Near(x, y) and Far(x, y). Near(x, y) removes x from Candidates(y) and y from Candidates(x). Far does the same thing but also adds x to Definite(y) and y to Definite(x).
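
In Python, with the per-node state held in lists indexed by node (the layout is mine; these compose with the sketches above and are used again in the full sketch after the pseudocode):

    # Assumed module-level state, one entry per node:
    #   candidates[i] : IntSet, definite[i] : IterableSet
    def near(a, b):
        # d(a, b) <= t: neither can be the other's neighbour.
        candidates[a].discard(b)
        candidates[b].discard(a)

    def far(a, b):
        # d(a, b) > t: record the edge on both sides.
        near(a, b)
        definite[a].add(b)
        definite[b].add(a)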

Our iterator protocol is that an iterator has an operation Next which returns either None or an element. We will construct Neighbours(x), which creates an iterator that incrementally returns all points > t away from x, using the information discovered along the way to flesh out what we know about the other nodes too.

Here is the algorithm, written in some bizarre pseudocode language that looks like nothing I actually write (no, I don’t know why I’m using it either):

 
  Neighbours(x) = NI(x, Iterator(Definite(x)))

  Next(NI(x, found))
    
    # Keep resolving candidates until the iterator produces a new
    # neighbour or we run out of candidates
    while ((result = Next(found)) == None && not IsEmpty(Candidates(x))) do
      candidate = Pop(Candidates(x))    
      s = d(x, candidate)

      if s > t then Far(x, candidate) else Near(x, candidate)
      
      # Because d(y, candidate) <= d(y, x) + d(candidate, x) <= t - s + s = t
      for(y in Distances(x) where d(y, x) <= t - s) Near(candidate, y)

      # Because if d(y, candidate) <= t then d(y, x) <= d(y, candidate) + d(candidate, x) <= t + s
      for(y in Distances(x) where d(y, x) > t + s) Far(candidate, y)

      # Because if d(candidate, y) <= t then d(y, x) >= d(candidate, x) - d(candidate, y) >= s - t
      if s > t then for(y in Distances(x) where d(y, x) < s - t) Far(candidate, y)

      # If we have no more candidates left then Distances(x) will never be used again and is wasted space
      if IsEmpty(Candidates(x)) then delete Distances(x) else Put(Distances(x), s, candidate)
    done

    return result
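
For the avoidance of bizarre pseudocode, here is the same thing as a Python generator, wired up to the sketches above (in keeping with the post's disclaimer this is a sketch, and the toy setup at the end is mine):

    def neighbours(x):
        # Yields every point > t from x, filling in other nodes' sets
        # as a side effect of each distance calculation.
        found = definite[x].iterator()
        while True:
            result = found.next()
            while result is None and len(candidates[x]) > 0:
                candidate = candidates[x].pop()
                s = d(points[x], points[candidate])
                if s > t:
                    far(x, candidate)
                else:
                    near(x, candidate)
                dx = distances[x]
                for y in dx.at_most(t - s):       # d(candidate, y) <= t
                    near(candidate, y)
                for y in dx.greater_than(t + s):  # d(candidate, y) > t
                    far(candidate, y)
                if s > t:
                    for y in dx.less_than(s - t): # d(candidate, y) > t
                        far(candidate, y)
                if len(candidates[x]) > 0:
                    dx.put(s, candidate)
                else:
                    distances[x] = None           # never needed again
                result = found.next()
            if result is None:
                return
            yield result

    # Toy usage: four points on a line, threshold t = 4.
    points, t = [0.0, 1.0, 6.0, 7.0], 4.0
    d = lambda a, b: abs(a - b)
    n = len(points)
    definite = [IterableSet() for _ in range(n)]
    candidates = [IntSet(n) for _ in range(n)]
    distances = [Distances() for _ in range(n)]
    print(sorted(neighbours(0)))   # [2, 3]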

The idea is that as we search for the edges of a given node, we also use what we learn to fill out the edges of the other nodes. This should work particularly well for depth first search, because it often means that the nodes we transition to will already have some of their neighbours filled in, and we may in fact be able to spend most of our time just chasing through Definite sets rather than doing new distance calculations. Even where we can't, hopefully we'll have significantly pruned the lists of elements to consider before we get there.

Notes

  • It should be fairly evident that this is not only \(O(n^2)\) worst case, it's probably \(O(n^2)\) expected case for exploring the whole graph, even in reasonable metric spaces - at each iteration step it's likely that one or the other of the two sets of points we're looking at will be O(n). So the only possible win over the brute force algorithm is doing significantly fewer expensive distance calculations
  • I've not yet written any real code for this, so I'm sure some of the details are wrong or awkward
  • Even if it's not wrong or awkward it may well turn out to have sufficiently bad constant factors, or save few enough distance invocations, that it's worthless in practice