As part of my recent playing with metric search structures, I was wondering about the following problem: Given a metric space X I want to partition X into two sets A and B such that each of A and B has a small diameter. Equivalently, I’m trying to find A that minimizes max(diam(A), diam(\(A^c\))).
The brute force solution for this would be fairly awful: There are \(2^n\) partitions, calculating the score of each is \(O(n^2)\), so the brute force solution takes \(O(2^n n^2)\). So let’s not do that.
I was expecting that I’d have to find an approximate solution, as this looked like a classic hairy NP-hard optimization problem. I’d started contemplating local optimizations, simulated annealing, etc. Then after doing a bit more searching I happened across this paper, which in fact has a polynomial time exact solution to the problem: Diameter Partitioning by David Avis.
The algorithm is incredibly cute – it uses absolutely no advanced ideas, just two clever observations which reduce the entire problem to elementary algorithms. It’s the sort of thing which as soon as you see it you just kick yourself for not having thought of it. So I thought I’d write about it to share how much I enjoyed it, even if no one else is actually likely to care about solving this problem.
The first observation is this:
We can find a partition A, B which has max(diam(A), diam(B))) \(\leq\) t if and only if we can find two sets A, B such that if \(d(x, y) > t\) they are in different halves of the partition (this is restating the definition).
That is, if we draw an edge between x and y if and only if \(d(x, y) > t\), then a partition with max(diam(A), diam(B))) \(\leq\) t is precisely a bipartite matching for this graph.
But bipartite matchings are easy to find! Or at least, \(O(n^2)\) to find, which is no worse than calculating the diameter of a set in the first place. You just do a graph search on each of the components, marking nodes alternatively black and white (i.e. if you’re coming from a white node you mark this node black, if you’re coming from a black you mark this node white) as you find them, and yell “Impossible!” if you ever try to paint a node a different colour than you’ve already painted it. Done.
So for any given t we have an \(O(n^2)\) algorithm that either finds us a partition at least that good or tells us that none exists. Now how do we find a best possible partition?
The obvious answer is binary search, but binary searching on floats is never going to give us the exact answer, only a series of successively better approximations. This then motivates the second observation: Our distances are discrete.
If we work out the distance between every pair (we already have to do that for building the graph) then you can put these pairs + their distances in an array and sort that array. Then it’s just a simple matter of binary searching that array, and additionally you can use it to build the graph without calculating any more distances – just take the pairs to the right of your current point in the array.
Building the array is \(O(n^2 log(n))\), and we will do \(log(n)\) binary searches, each performing an \(O(n^2)\) operation, so in total the cost of this algorithm is \(O(n^2 log(n))\). A lot better than brute force.
- It apparently is NP-hard to do this if you want a partition of more than two elements
- Any algorithm which guarantees within a factor of two of the right answer has to examine every pair of points and thus must be \(O(n^2)\). You can see this by constructing a metric space of n points, picking a distinguished pair x, y and setting \(d(x, y) = 2\) whilst setting all other distances for 1. Any partition which separates x and y will have a score of 1, any other will have a score of 2, and by randomly picking x and y you can see that there’s no way to know this except examining every pair
- The above example also demonstrates that it’s \(O(n^2)\) to even calculate a diameter, which means that this is also a lower bound on even knowing for sure how good your partition is
- If you don’t care about solving this exactly and only care that the result is “good” in some fuzzy sense I think you can do a lot better in practice. There are good approximate diameter algorithms, and we didn’t take advantage of the triangle equality at all in our algorithm for building the separation graph, so I think it’s possible to adjust the partition testing code in a way that is much better than \(O(n^2)\) in many nice cases, even if it necessarily degenerates to \(O(n^2)\) in the worst cases. I’ll write more about this once I’ve worked out the details.