An algorithm for incrementally building separation graphs

You know that annoying thing when you’re reading a paper and it’s fairly clear they’ve not actually tried that algorithm in practice? Don’t you hate that?

Anyway, here’s an algorithm I’ve not tried in practice. I haven’t even written code for it yet. This post is as much to get the ideas clear in my head as anything else. This algorithm may turn out to be worse than doing it the stupid way in practice.

This is the problem we’re trying to solve:
x
We have n points, \(X = \{x_1, \ldots, x_n\}\), and a metric function d. We are trying to build the unordered graph with edges \(E = \{ (i, j) : d(x_i, x_j) > t \} \).

This is motivated by trying to find small diameter partitions of X, which we will be doing by trying to either prove (X, E) is not bipartite or finding a bipartition of it (see this previous post)

The question is how to do this efficiently? The obvious brute force approach is \(O(n^2)\), and indeed any solution must have \(O(n^2)\) worst-case. We’d like to do well in practice by applying the following two key ideas:

  1. By using the triangle inequality, we may often be able to escape having to do distance calculations because we can infer the distance will be too large / too small
  2. We are likely not to need the entire graph – a proof that a graph is not bipartite may involve only a very small subset of the nodes, so if we build the graph incrementally we may be able to stop early

The way we do this will be as follows:

For each node x, we keep track of two sets, Definite(x) and Candidates(x). We will also index distances as we find them by tracking Distances(x), which will be track all points we’ve already calculated the distance from x to.

Roles and implementation requirements

Definite(x) is a set containing all nodes we know are > t away from x and starts empty. We require efficient addition of new elements to it without introducing duplicates and the creation of an iterator which is guaranteed to iterate over all elements in the set even if new ones are added during iteration, even after it has previously claimed there are no elements left. One way to do this would be to have it consist of both an array and a hash set.

Candidates(x) contains all points which we don’t yet know for sure whether or not they are > t from x and starts equal to X. As a result we want a set implementation which is cheap to allocate an instance containing the set and cheap to delete from. This is where the previous post about integer sets comes in (we represent a node by its index).

Distances(x) is some sort of ordered map from distances to lists of points. It needs to support efficient range queries (i.e. give me everything nearer than s or further than s).

The Algorithm

Our two basic operations for manipulating the data we’re tracking are Close(x, y) and Far(x, y). Close(x, y) removes x from Candidates(y) and y from Candidates(x). Far does the same thing but also adds x to Definite(y) and y to Definite(x).

Our iterator protocol is that an iterator has an operation Next which returns either None or an Element. We will construct Neighbours(x), which creates an iterator that incrementally returns all points > t away from the x and uses information discovered whilst doing so to flesh out the information we know about other nodes too.

Here is the algorithm, written in some bizarre pseudocode language that looks like nothing I actually write (no, I don’t know why I’m using it either):

  Neighbours(X) = NI(x, Iterator(Definite(x)))

  Next(NI(x, found))

    # If we
    while ((result = Next(found)) == None && not IsEmpty(Candidates)) do
      candidate = Pop(Candidates(x))
      s = d(x, candidate)

      if s > t then Far(x, candidate) else Near(x, candidate)

      # Because d(y, candidate) <= d(y, x) + d(candidate, x) <= t - s + s = t
      for(y in Distances(x) where d(y, x) <= t - s) Near(candidate, y)

      # Because if d(y, candidate) <= t then d(y, x) <= d(candidate, x) + t <= s + t
      for(y in Distances(x) where d(y, x) > t + s) Far(candidate, y)

      # Because if d(candidate, y) <= t then d(candidate, x) <= d(candidate, x) + d(y, x) <= t - s + s = t
      if s > t then for(y in Distances(x) where d(y, x) <= t - s) Far(candidate, y)

      # If we have no more candidates left then Distances(x) will never be used again and is wasted space
      if IsEmpty(Candidates(x)) then delete Distances(x) else Put(Distances(x), s, y))
    done

    return result

The idea being that as we're searching for edges for a given node we're also using this to fill out the edges for the other nodes. This should work particularly well for doing depth first search, because in particular it often means that the nodes we're going to transition to will have some of their neighbours filled in already, and we may in fact be able to spend most of our time just chasing through Definite sets rather than having to do new distance calculations. Even where we haven't been able to do that, hopefully we've managed to prune the lists of elements to consider significantly before then.

Notes

  • It should be fairly evident that this is not only \(O(n^2)\) worst case, it's probably \(O(n^2)\) expected case for exploring the whole graph, even in reasonable metric spaces - it's likely that one or the other of the two sets of points we're looking at during each iteration step will be O(n) - so the only possible win over the brute force algorithm is if it does significantly fewer expensive distance calculations
  • I've not yet written any real code for this, so I'm sure some of the details are wrong or awkward
  • Even if it's not wrong or awkward it may well turn out to have sufficiently bad constant factors, or save few enough distance invocations, that it's worthless in practice

Integer sets with O(1) range allocation and O(log(n)) deletion

This is another post brought to you by the school of the bloody obvious. Or at least it should be, but despite having implemented a trivial variation on this data structure before it still took me far too long to remember it existed.

My previous use case was the original version of IntAllocator in the rabbitmq Java client API (it’s an internal class used for allocating channel numbers). In looking it up I’ve noticed that it no longer does, for reasons that do make sense, but those reasons don’t apply for my current use case (basically the new version uses an internal bitset, trading space for time. RabbitMQ only needs to allocate one of these, and the range isn’t that large, so trading space for time is completely acceptable).

My current use case was for building separation graphs for metric spaces. Part of an algorithm I’m playing with requires keeping a candidate list for each element: That is, a list of elements which might be > t away but we’re not sure. This needs to not take up O(n) space when the answer is “the whole set” and we need to be able to remove elements from it efficiently. Hence the constraints in the title.

The resulting structure is basically a range compressed bitset built out of a balanced binary tree. It works as follows:

A node in the tree may be either empty, an inclusive range represented by its endpoints or a split around a pivot of m such that the left hand side of the tree contains elements <= m and the right contains elements > m. We enforce the invariants that neither child of a split may be empty, and that ranges must have endpoints which admit at least one element.

Allocating an entire range is then just a matter of creating a range node holding two integers, i.e. O(1).

Deleting an element is harder, but not much.

  • Deleting an element from an empty node does nothing.
  • Deleting an element from a range has three distinct cases:
    1. If the element is not contained within the range, do nothing
    2. If the element is one of the endpoints, decrement or increment the relevant endpoint. If this results in the range being empty, replace this with an empty node
    3. If the element is internal to the range, split this range around its midpoint and delete from the split instead
  • Deleting an element from a split is even more straightforward: You determine which child the element should belong in and delete from that. If this results in that child being empty, replace this node with the other child.
  • Notes

    • I’ve put together a Haskell implementation to demonstrate the algorithm
    • There are at least two sensible ways to split a range. You can either do it around the element being deleted (which guarantees you’ll not have to recurse – you just allocate a split and two new ranges) or you can do it around the midpoint of the range. The former is cheaper but can result in the tree becoming unbalanced, the latter guarantees you’ll never need more than log(n) levels of recursion but also means most of the time you’ll use log(n) levels of recursion. You could also implement a hybrid where you split around the deleted point unless you’re below a certain depth in the tree, but I’m not sure it’s worth the added implementation cost
    • It is useful for split nodes to keep track of the smallest interval containing them as you can then shortcut if the element to be deleted lies outside that interval
    • You can also implement deleteRange efficiently (I think in O(log(n), though there’s a detail I haven’t checked there). This is left as an exercise for the interested reader
    • When I implemented IntAllocator I added a cache for this: A bounded size stack which you pushed elements to delete onto. When the stack grew full you sorted it into ranges and deleted the ranges all at once. There was a specific reason for this that doesn’t apply (sometimes you wanted to say “pop an element, any element” and that came from the stack if it was non-empty), but this might still be a useful thing to do for imperative versions of this code (it’s a bad idea for pure ones)

Butternut squash, coconut and chickpea curry

We had our housewarming party last night, and I’d made this as food. A friend asked me for the recipe, and I was surprised it wasn’t already on here due to it being one of my standards, so I’m now fixing that.

I should warn you that this recipe is largely a lie, in that the details change every time I make it. This is the version I made last night though (well, this is a halved version of that).

Ingredients

  • 1 white onion
  • A smallish knob of fresh ginger
  • Fresh chilli to taste (depends on type of chilli and spiciness you want. Do use fresh rather than dried though)
  • One medium butternut squash
  • One can chickpeas
  • One small can of coconut cream
  • A bit of sugar
  • Quite a lot of salt
  • Quite a lot of garam massala
  • Plenty of sunflower oil

I can’t really give precise measurements on the salt and garam massala because I was adding them continuously throughout until I got the taste right. My guess is several tea spoons of salt and several table spoons of garam massala.

Cooking

  1. Dice the ginger, onion and chilli finely.
  2. Peel the squash and cut it into roughly 1cm cubes
  3. Fry the ginger, onion and chilli with salt and sugar until the onion starts to go transparent
  4. Add the squash and garam massala. Mix it all up thoroughly and fry for a little bit longer
  5. Add boiling water until the squash is nearly covered. Allow to simmer for a while, stirring occasionally
  6. When the squash is starting to go soft, add the coconut cream. Continue simmering and stirring
  7. Eventually the squash should be breaking up, causing the sauce to thicken. Once it’s starting to taste nearly cooked, add the chickpeas and continue simmering

At this point it’s basically ready to serve whenever, but it improves significantly by leaving it to cook on a low heat for longer. Also, keep tasting it whilst it’s cooking and add more salt and garam massala to taste.

A cute algorithm for a problem you’ve never cared to solve

As part of my reason playing with metric search structures, I was wondering about the following problem: Given a metric space X I want to partition X into two sets A and B such that each of A and B has a small diameter. Equivalently, I’m trying to find A that minimizes max(diam(A), diam(\(A^c\))).

The brute force solution for this would be fairly awful: There are \(2^n\) partitions, calculating the score of each is \(O(n^2)\), so the brute force solution takes \(O(2^n n^2)\). So let’s not do that.

I was expecting that I’d have to find an approximate solution, as this looked like a classic hairy NP-hard optimization problem. I’d started contemplating local optimizations, simulated annealing, etc. Then after doing a bit more searching I happened across this paper, which in fact has a polynomial time exact solution to the problem: Diameter Partitioning by David Avis.

The algorithm is incredibly cute – it uses absolutely no advanced ideas, just two clever observations which reduce the entire problem to elementary algorithms. It’s the sort of thing which as soon as you see it you just kick yourself for not having thought of it. So I thought I’d write about it to share how much I enjoyed it, even if no one else is actually likely to care about solving this problem.

The first observation is this:

We can find a partition A, B which has max(diam(A), diam(B))) \(\leq\) t if and only if we can find two sets A, B such that if \(d(x, y) > t\) they are in different halves of the partition (this is restating the definition).

That is, if we draw an edge between x and y if and only if \(d(x, y) > t\), then a partition with max(diam(A), diam(B))) \(\leq\) t is precisely a bipartite matching for this graph.

But bipartite matchings are easy to find! Or at least, \(O(n^2)\) to find, which is no worse than calculating the diameter of a set in the first place. You just do a graph search on each of the components, marking nodes alternatively black and white (i.e. if you’re coming from a white node you mark this node black, if you’re coming from a black you mark this node white) as you find them, and yell “Impossible!” if you ever try to paint a node a different colour than you’ve already painted it. Done.

So for any given t we have an \(O(n^2)\) algorithm that either finds us a partition at least that good or tells us that none exists. Now how do we find a best possible partition?

The obvious answer is binary search, but binary searching on floats is never going to give us the exact answer, only a series of successively better approximations. This then motivates the second observation: Our distances are discrete.

If we work out the distance between every pair (we already have to do that for building the graph) then you can put these pairs + their distances in an array and sort that array. Then it’s just a simple matter of binary searching that array, and additionally you can use it to build the graph without calculating any more distances – just take the pairs to the right of your current point in the array.

Building the array is \(O(n^2 log(n))\), and we will do \(log(n)\) binary searches, each performing an \(O(n^2)\) operation, so in total the cost of this algorithm is \(O(n^2 log(n))\). A lot better than brute force.

Notes

  • It apparently is NP-hard to do this if you want a partition of more than two elements
  • Any algorithm which guarantees within a factor of two of the right answer has to examine every pair of points and thus must be \(O(n^2)\). You can see this by constructing a metric space of n points, picking a distinguished pair x, y and setting \(d(x, y) = 2\) whilst setting all other distances for 1. Any partition which separates x and y will have a score of 1, any other will have a score of 2, and by randomly picking x and y you can see that there’s no way to know this except examining every pair
  • The above example also demonstrates that it’s \(O(n^2)\) to even calculate a diameter, which means that this is also a lower bound on even knowing for sure how good your partition is
  • If you don’t care about solving this exactly and only care that the result is “good” in some fuzzy sense I think you can do a lot better in practice. Tere are good approximate diameter algorithms, and we didn’t take advantage of the triangle equality at all in our algorithm for building the separation graph, so I think it’s possible to adjust the partition testing code in a way that is much better than \(O(n^2)\) in many nice cases, even if it necessarily degenerates to \(O(n^2)\) in the worst cases. I’ll write more about this once I’ve worked out the details.

Story concept: Dead man walking

I often sketch out stories in my head, just because it’s something I enjoy doing. Occasionally I turn them into shorts, but I’ve neither the patience nor the talent to really turn them into anything longer so most of the concepts just wither and die.

This seems like a shame, so I thought I might post some of them here as I come up with them. Feel free to take them and run with them if the idea grabs you, though I imagine that’s fairly unlikely as I suspect most people with the will to write are not short of ideas to write about.

Concept:

Protagonist is some sort of research into personal software assistance and/or cybernetics. He has a lot of software which is basically designed to emulate him in mundane tasks – dealing with phone calls, doing drudge research, etc. Over time, both with new software being written and the software developing better models of his behaviour, it gets to the point where the software can fake him to a very convincing degree. Additionally the more he relies on it the more it becomes an integrated part of his personality to the point where it’s difficult to tell exactly where he ends and the software begins (there’s probably no full mind-brain link so much as a very good wearable interface).

At some point he realises that he’s having to deal with a lot of boring face to face stuff while the software is automating a lot of things he’d rather be doing. He takes advantage of latest developments in cybernetic prosthetics for dealing with e.g. spine damage and enables it so that the software is actually fully capable of running his body while his mind wanders. Division blurs further.

Then he dies. Severe brain aneurysm or some such – leaves the body basically intact but brain dead.

…but the software is still happily running and, as per above, is quite capable of controlling the body. Moreover it basically behaves like him in most cases. So despite being brain dead, he is perfectly capable of carrying on as he always did in most circumstances. What does he do now?

Themes:

  • Social issues. Friends and family are going to be severely freaked out – it’s unlikely they had any idea how much of their interactions with protagonist were already just the software, and it will be very hard to figure out how they should treat him now
  • Legal issues. By all medical definitions our protagonist is legally brain dead and unable to make his own decisions. Except that he appears to be walking around, talking and carrying about his daily business
  • Personal issues. The protagonist is having a pretty serious identity crisis. Is he dead? Is he an upload? Is he an AI or merely a really good non-sentient fake? Additionally, he has to cope with the fact that he feels like he’s become stupid. The biological brain was the real creative problem solver – even the software that was designed to cope with doing research and problem solving was really there to do the drudge work while he sorted out the real thing

And that’s about as far as I’ve got with the concept. I like the idea, but I’m not sure how I’d flesh it out into a full story even if I wanted to.