I posted the following problem on Twitter yesterday: I have intervals \(L_1, \ldots, L_m\) which are represented as half-open intervals \(L_k = [u_k, v_k)\). I know that the intervals are all distinct and that for any two intervals \(L_i, L_j\) with \(i \neq j\) either one is contained in the other or two are disjoint.

I want to sample uniformly from all pairs \(\{(i, j) : L_i \cap L_j = \emptyset \}\), or determine that there are no such pairs. How?

A solution now follows. It might not be the optimal solution, but it’s pretty close.

The two observations that make the solution work are:

- If every interval is disjoint from at least one other interval, then rejection sampling runs in expected time at most \(O(m)\), because the probability of accepting a random pair is at least the lowest probability of selecting \(j\) disjoint from \(i\) over all \(i\).
- A root interval (one containing everything else) is contained in no disjoint pair and thus can be removed from the list without changing the distribution. If there is no root interval then every interval is disjoint from at least one other.

To unpack the second observation: If there is no root interval, take some \(L_i\). Then \(L_i\) is contained in some maximal length interval. This is not a root interval, so there is some \(L_j\) that is not contained in it (and thus is disjoint from it). Such an \(L_j\) is necessarily disjoint from \(L_i\).

It’s easy to find a root interval in \(O(m)\), but removing it might introduce a new root-interval (in the degenerate case the whole list could be a chain of the form \(L_i = [i, m)\)).

Here’s how to simultaneously remove *all* root intervals in \(O(m \log(m))\):

First note that a root interval is necessarily of maximum length. Thus if we sort the intervals in order of decreasing length (that’s the \(O(m \log(m) \) bit. You can do it in \(O(m)\) if the interval boundaries are not-too-large intervals using bucket sort), the root intervals must form some initial prefix of the list. By renumbering where necessary lets assume that the \(L_i\) are sorted in that order.

Now calculate the intervals \(B_i = [\min\limits_{k \geq i} u_i, \max\limits_{k \geq i} v_i) \). These are the bounding intervals of the tail sets and can easily be worked out in \(O(m)\) by just iterating backwards over the list.

\(L_i\) contains all \(k \geq i\) if and only if it contains \(U_i\). So now we find the first index \(k\) that is not a root index just by iterating forward through the \(L_i\) until \(U_i\) is not contained in \(L_i\).

We now take all intervals from that point and discard everything earlier. If the set is now empty, we raise an error because there are no disjoint pairs. If it’s not empty, we can now run the rejection sampling algorithm on it as desired and have it terminate in \(O(m)\).

I *think* there’s another solution that may sometimes be better involving explicitly tracking the tree structure and counting the number of disjoint pairs that live in each sub-interval, but the solution fell apart the more I looked at it and I wasn’t convinced it was ever actually better.