Category Archives: Numbers are hard

A nicer way to complete metric spaces

It’s a standard result – one you learn almost as soon as you learn about metric spaces – that every metric space has a completion.

The normal way to prove this is to construct a pseudometric on the space of Cauchy sequences on the metric space and then quotient out pairs with zero distance to get a metric space. I think this is a bit of a waste of time – the details are fiddly and annoying and it’s not terribly enlightening. Here’s a nicer way.

Let \(X\) be a set. It’s a standard result that \(l^\infty(X)\), the set of bounded functions \(X \to \mathbb{R}\) together with the uniform metric, is a complete metric space.

Now let \(X\) be a metric space and fix arbitrary \(c \in X\).

Claim: The function \(\iota : x \mapsto f_x\), where \(f_x(y) = d(x, y) - d(c, y)\), is an isometric embedding of \(X\) into \(l^\infty(X)\).

If this claim is true we’re done. We’ve got an isometric embedding of \(X\) into a complete metric space, so the closure of the image is its completion.

In order to prove this we’ll need the following elementary result:

Lemma: Let \(x, y, z \in X\). Then \(|d(x, z) – d(y, z)| \leq d(x, y)\).
Proof:

Simple application of the triangle inequality. \(d(x, z) \leq d(x, y) + d(y, z)\) so \(d(x, z) - d(y, z) \leq d(x, y) \). Similarly \(d(x, z) - d(y, z) \geq -d(x, y) \). QED

Now to prove that \(\iota\) is an isometric embedding.

First we must show that \(f_x \in l^\infty(X)\). This is a simple consequence of the above lemma: $$|f_x(y)| = |d(x, y) - d(c, y)| \leq d(x, c)$$

Hence \(f_x\) is bounded.

Now for isometry:

Consider \(x, w \in X\). Then $$|f_x(y) - f_w(y)| = |d(x, y) - d(c, y) + d(c, y) - d(w, y)| = |d(x, y) - d(w, y)| \leq d(x, w)$$

So, taking the supremum over \(y\), we have \(||f_x – f_w|| \leq d(x, w)\). We now need only show equality.

For this, consider \(|f_x(w) - f_w(w)| = |d(x, w) - d(w, w)| = d(x, w)\). Hence we must have \(||f_x - f_w|| \geq d(x, w)\) and the result is proved.
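Since the supremum over a finite space is just a maximum, it’s easy to check this numerically. Here’s a quick Python sketch (my own illustration, not part of the proof), using a few points in the plane as the metric space:

```python
import itertools

# A small finite metric space: a few points in the plane with the
# Euclidean metric, and an arbitrary base point c.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
c = X[0]

def d(x, y):
    return ((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2) ** 0.5

def iota(x):
    # f_x(y) = d(x, y) - d(c, y), represented as a table of values
    return {y: d(x, y) - d(c, y) for y in X}

def uniform_dist(f, g):
    # The uniform metric; over a finite X the sup is a max
    return max(abs(f[y] - g[y]) for y in X)

for x, w in itertools.combinations(X, 2):
    assert abs(uniform_dist(iota(x), iota(w)) - d(x, w)) < 1e-9
print("iota is an isometry on this example")
```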

Notes

  • I’m afraid I don’t remember where I first saw this so I can’t provide a reference, but it’s definitely not original to me. I probably encountered it in some textbook long ago.
  • The thing I like about this is that it’s quite explicit, and at each step along the way you do the only thing that could possibly work once you’ve had the initial idea of trying to embed it into \(l^\infty(X)\): The only function you know about on it is the metric, so you have to use that. It’s not bounded, so what’s a non-artificial way to make it so, etc. The details are also much less fiddly than the normal proof because we did all the hard work of showing completeness when we proved that \(l^\infty(X)\) was complete.

The power of fixed point theorems in Set Theory

For reasons that aren’t relevant here, I was reminded earlier of the power of fixed point theorems on partially ordered sets for proving some fundamental results in set theory. This is an expository post about that. If you don’t care about set theory you probably won’t care about this.

I’m going to mention two fixed point theorems and use each of them to give a simple proof of a well known result in set theory.

Preliminaries

A poset is a set \(X\) with a relation \(\leq\) on it that is

  1. Reflexive: \( \forall x. x \leq x \)
  2. Antisymmetric: \(\forall x, y. x \leq y\) and \(y \leq x \implies x = y\)
  3. Transitive: \(\forall x, y, z. x \leq y\) and \(y \leq z \implies x \leq z\)

Additional reminders:

  • A poset is said to be totally ordered if additionally it is Total: \( \forall x, y. x \leq y\) or \(y \leq x\).
  • A chain is a subset of a poset which is totally ordered under the restriction of the relation to it.
  • \(x\) is an upper bound for a set \(A\) if \(\forall a \in A. a \leq x \)
  • \(x\) is a least upper bound for a set \(A\) if it is an upper bound and \(x \leq y\) for every upper bound \(y\) of \(A\). When such an upper bound exists it is unique (by antisymmetry) and we write it as \(\sup A\)
  • A poset is said to be complete (resp. chain-complete) if every subset (resp. chain) has a least upper bound
  • Every chain-complete poset has a least element (the least upper bound of the empty set).

Examples of posets are any of the normal sets of numbers – e.g. \(\mathbb{N}, \mathbb{Q}, \mathbb{R}\). All of these are total orders, none of them are complete. \([0, 1] \subseteq \mathbb{R}\) is a complete total order.

For any \(X\), the power set \(\mathcal{P}(X)\) is a complete poset with the order being set inclusion.

Our fixed point theorems are both about particular types of mappings of posets to themselves.

Knaster–Tarski fixed point theorem

(This is actually a slightly weakened form of the theorem because it’s slightly easier to prove and we don’t need the full thing)

Theorem: Let \(X\) be a complete poset and \(f : X \to X\) be order preserving. That is, \(x \leq y \implies f(x) \leq f(y)\). Then \(f\) has a fixed point.

Proof:

Let \(A = \{ x : x \leq f(x) \} \).

If \(x \in A\) then \(f(x) \in A\): if \(x \leq f(x)\) then \(f(x) \leq f(f(x))\) because \(f\) is order preserving.

Now let \(y = \sup A\) (which exists because \(X\) is complete). If \(x \in A\) we have \(x \leq y\) and so \(f(x) \leq f(y)\). But \(x \leq f(x)\) by hypothesis of \(x \in A\), so we have \(x \leq f(x) \leq f(y)\). Therefore \(f(y)\) is also an upper bound of \(A\) and thus, because \(y\) is the least upper bound, \(y \leq f(y)\) and thus \(y \in A\).

But then as per above we must also have \(f(y) \in A\), and thus \(f(y) \leq y\). Therefore \(f(y) = y\).

QED.
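Here’s the proof’s recipe executed on a small example – a Python sketch of mine on the complete poset \(\mathcal{P}(\{0, 1, 2\})\), with an order preserving map that is not simply inflationary:

```python
from itertools import combinations

ground = {0, 1, 2}

def subsets(s):
    s = sorted(s)
    return [set(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def f(s):
    # Order preserving on (P(ground), subset inclusion), but not
    # inflationary: it inserts 0 and deletes 2.
    return (s | {0}) - {2}

# Exactly as in the proof: A = { x : x <= f(x) }, fixed point y = sup A.
A = [s for s in subsets(ground) if s <= f(s)]
y = set().union(*A)
assert f(y) == y
print("fixed point:", y)  # {0, 1}
```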

This then lets us provide a very short proof of one of the basic results of set theory.

Cantor–Bernstein–Schroeder theorem

Let \(A, B\) be sets, and let \(f : A \to B\) and \(g : B \to A\) be injections. Then there exists a bijection \(h : A \to B\).

Proof:

The idea is that we want to find some subset \(H \subseteq A\) for which we can define our function \(h\) by \(h|_H = f\) and \(h|_{H^c} = g^{-1}\).

Claim: If \(g(f(H)^c) = H^c\) then the function \(h\) is well defined and is a bijection.

Well-definedness is easy: by the assumption, every element of \(H^c\) is in the image of \(g\), so \(g^{-1}\) is well defined on it.

Injective: It’s injective when restricted to \(H\) and to \(H^c\); the only question is whether we can have \(h(x) = h(y)\) with \(x \in H\) and \(y \in H^c\). But by the assumption and injectivity of \(g\) we have \(g^{-1}(H^c) \subseteq f(H)^c\), so this is impossible.

Surjective: By assumption, \(g^{-1}(H^c) = f(H)^c\), so the function is surjective.

So now all that remains is to find such an \(H\), which we will do by using Knaster–Tarski to pull it out of a hat.

Rewrite the requirement as \(H = g(f(H)^c)^c\)

i.e. we have the function \(S(U) = g(f(U)^c)^c\) on subsets \(U \subseteq A\) and are looking for \(H\) with \(S(H) = H\).

Suppose \(U \subseteq V\). Then \(f(U) \subseteq f(V)\), so \(f(V)^c \subseteq f(U)^c\), so \(g(f(V)^c) \subseteq g(f(U)^c)\), so \( g(f(U)^c)^c \subseteq g(f(V)^c)^c\). i.e. \(S(U) \subseteq S(V)\).

So \(S\) is an order preserving map on the complete poset \((\mathcal{P}(A), \subseteq)\). Thus the Knaster–Tarski fixed point theorem gives us the \(H\) we want.
QED
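Here’s a concrete sketch of the construction (the injections are toy examples of my own devising; with finite sets the example is necessarily a bit degenerate, but the mechanics are identical). Rather than taking \(\sup A\) as in the Knaster–Tarski proof, it finds a fixed point of \(S\) by iterating from the empty set, which works because \(S\) is order preserving and everything here is finite:

```python
A = {0, 1, 2}
B = {"a", "b", "c"}
f = {0: "a", 1: "b", 2: "c"}   # an injection A -> B
g = {"a": 1, "b": 2, "c": 0}   # an injection B -> A (not the inverse of f)

def S(H):
    # S(H) = g(f(H)^c)^c, complements taken in B and A respectively
    fH = {f[a] for a in H}
    return A - {g[b] for b in B - fH}

# Iterate S from the empty set: the iterates increase to a fixed point.
H = set()
while S(H) != H:
    H = S(H)

g_inv = {a: b for b, a in g.items()}
h = {a: f[a] if a in H else g_inv[a] for a in A}
assert sorted(h.values()) == sorted(B)  # h is a bijection A -> B
print("H =", H, "h =", h)
```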

I find Knaster–Tarski a particularly “magic” fixed point theorem to use here because there’s no obvious connection between \(U\) and \(S(U)\) – we’ve just defined \(S\) so that a fixed point of it would be the thing we wanted, and thus a solution appears.

Our next fixed point theorem is one I find much more intuitive, which makes it perhaps slightly odd that the proof we’ll use it in is intimately connected with the Axiom of Choice (which is intuitive itself but produces unintuitive results) and Zorn’s lemma (which is too technical to really have much intuition about).

Bourbaki–Witt fixed point theorem

Let \(X\) be a chain-complete poset and \(f : X \to X\) be such that \(f(x) \geq x\) for all \(x\). Then for every \(x \in X\), \(f\) has a fixed point \(y \geq x\).

Proof:

The proof is by transfinite induction. That is, we’ll take some ordinal \(\kappa\) with no injection into \(X\) (such an ordinal exists by Hartogs’ lemma) and define \(g : \kappa \to X\) as follows:

  1. \(g(0) = x\)
  2. \(g(\alpha + 1) = f(g(\alpha))\)
  3. \(g(\beta) = \sup \{ g(\alpha) : \alpha < \beta\}\) when \(\beta\) is a limit ordinal (this is well defined because \(g\) produces a chain by construction and \(X\) is chain-complete)

\(g\) is, by construction and the fact that \(x \leq f(x)\), a monotone increasing function, i.e. \(\alpha \leq \beta \implies g(\alpha) \leq g(\beta)\). It cannot be injective (by our choice of \(\kappa\)), so we can find some \(\alpha < \beta\) with \(g(\alpha) = g(\beta)\). Then \(\alpha + 1 \leq \beta\), so \(g(\alpha) \leq g(\alpha + 1) \leq g(\beta) = g(\alpha)\). Therefore \(g(\alpha) = g(\alpha + 1) = f(g(\alpha))\) and \(g(\alpha)\) is our desired fixed point (and is \(\geq x\) because \(g(0) = x\)). QED
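Transfinite induction doesn’t translate directly into code, but in a finite poset the successor steps alone already reach a fixed point – no limit stages are needed. A toy sketch of mine, on the divisors of 24 under divisibility (chain-complete, being finite with a least element):

```python
# Divisors of 24 under divisibility. f is inflationary: n divides f(n).
divisors = [1, 2, 3, 4, 6, 8, 12, 24]

def f(n):
    return 2 * n if 24 % (2 * n) == 0 else n

def fixed_point_above(x):
    # In a finite poset the successor steps g(a + 1) = f(g(a)) must
    # eventually repeat, exactly as in the proof, so iteration suffices.
    while f(x) != x:
        x = f(x)
    return x

for x in divisors:
    y = fixed_point_above(x)
    assert f(y) == y and y % x == 0  # a fixed point, and above x
print({x: fixed_point_above(x) for x in divisors})
```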

Now, how do we use this? Well, it provides a very nice proof of Zorn’s lemma by way of a closely related result.

Hausdorff maximality theorem

Let \(X\) be a poset. Every chain \(C \subseteq X\) is contained in a maximal chain (that is, one not properly contained in any other chain).

This is a theorem very closely related to Zorn’s lemma – either can very easily be proved from the other. But Zorn’s lemma is a simple corollary of Hausdorff’s maximality theorem (whereas the other direction requires at least a bit of machinery), and this way round we get to apply a nice fixed point theorem to give a very simple proof.

Proof:

Let \(\mathcal{C}\) be the set of chains of \(X\), ordered by inclusion.

Claim: This is a chain-complete poset.
Proof: Let \(\mathcal{A} \subseteq \mathcal{C}\) be a chain. Let \(A = \bigcup \mathcal{A}\). If \(x, y \in A\) then \(x, y \in B\) for some \(B \in \mathcal{A}\) (because \(\mathcal{A}\) is a chain: find a member containing each and take the larger of the two). Therefore \(x \leq y\) or \(y \leq x\). Hence \(A \in \mathcal{C}\). Clearly it’s a least upper bound for \(\mathcal{A}\), as it is one in the poset \(\mathcal{P}(X)\).

So, we’ll now apply the Bourbaki–Witt fixed point theorem. Define \(f : \mathcal{C} \to \mathcal{C}\) as follows:

  1. If \(A\) is maximal, \(f(A) = A \)
  2. If \(A\) is not maximal, use the axiom of choice to arbitrarily pick some chain strictly containing \(A\) to be \(f(A)\)

By construction \(f(A) \supseteq A\), and by the above \(\mathcal{C}\) is chain-complete, so by Bourbaki–Witt \(f\) has a fixed point above every element of \(\mathcal{C}\). The elements of \(\mathcal{C}\) are the chains of \(X\), and the fixed points of \(f\) must be maximal. Hence the result is proved.

QED
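In general this is non-constructive, but in a finite poset the function \(f\) can be implemented directly, and iterating it from any chain produces a maximal chain containing it. A small sketch of mine on a divisibility poset:

```python
# The poset {1,...,12} under divisibility; a chain is a set of numbers
# totally ordered by divisibility.
X = range(1, 13)

def comparable(a, b):
    return a % b == 0 or b % a == 0

def extend_to_maximal(chain):
    # Mirrors f in the proof: while the chain is not maximal, pick some
    # chain strictly containing it (here: add any compatible element).
    chain = set(chain)
    while True:
        candidates = [x for x in X if x not in chain
                      and all(comparable(x, y) for y in chain)]
        if not candidates:
            return chain  # maximal: f(A) = A
        chain.add(candidates[0])

print(extend_to_maximal({2, 6}))  # {1, 2, 6, 12}: nothing else fits
```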

As promised, Zorn’s lemma is just a simple corollary:

Zorn’s Lemma

Let \(X\) be a poset such that every chain has an upper bound (not necessarily a least one). Then \(X\) has a maximal element.

Proof:

Pick a maximal chain \(C\) (one exists by the maximality theorem, applied to the empty chain, say). Let \(x\) be an upper bound for \(C\). If there were some element \(y\) with \(x < y\) then \(C \cup \{y\}\) would be a chain (everything in \(C\) is \(\leq x < y\)) strictly containing \(C\), contradicting its maximality. Therefore no such \(y\) exists and \(x\) is maximal.

Conclusion

A lot of the proofs usually presented for both of these results are pretty inscrutable. They’re not hard to follow, but there are quite a lot of details to get right.

I’ve got to admit, this mostly exhausts my list of cute things in set theory you can prove by way of fixed point theorems. I’m sure there are more, and I’d be interested to hear about them, but it’s been a while since I’ve looked at this and these are the only two I remember.

Update: In talking to Imre Leader, who taught me the course in which I learned Bourbaki–Witt, he points out that as far as set theory is concerned Bourbaki–Witt is almost entirely useless. It can essentially be regarded as “the choice-free part of Zorn’s lemma”, and as a result no one actually uses it, because the hypotheses of Zorn’s lemma are so much easier to verify.


A cute lower bound for weighted FAS tournaments

As mentioned, for my random hack project du jour I am currently working on an optimiser for weighted feedback arc sets.

This problem can largely be summarised as follows: I have a non-negative weight matrix \(A\) and I want to find a permutation \(\pi\) that maximizes the score $$w(\pi) = \sum_{\pi(i) < \pi(j)} A_{ij}$$ (It’s more normal to phrase this in terms of minimizing the sum going the other way, but the problems are equivalent.)

The thing I was wondering about recently: what sort of scores am I guaranteed to be able to achieve? There’s the trivial lower bound $$w(\pi) \geq \sum_{i < j} \min(A_{ij}, A_{ji})$$ but, equally trivially, when \(A\) is not a constant matrix this is easy to exceed. My question is: what values are we guaranteed some \(\pi\) will meet or exceed?

I realized earlier that there’s a really nice approach to this. Let \(\chi\) be a permutation picked uniformly at random. Then we can write $$w(\chi) = \sum_{i < j} S_{ij}$$ where \(S_{ij}\) is a random variable that takes the value \(A_{ij}\) if \(\chi(i) < \chi(j)\) and the value \(A_{ji}\) otherwise. It’s easy to see that \(S_{ij}\) takes its two values with equal probability, so $$E(S_{ij}) = \frac{1}{2}(A_{ij} + A_{ji})$$ But then we have $$E(w(\chi)) = E\left(\sum_{i < j} S_{ij}\right) = \sum_{i < j} E(S_{ij}) = \frac{1}{2} \sum_{i \neq j} A_{ij}$$ So, given that this is the expected weight of a random permutation, there certainly exists a permutation with weight \(\geq\) this value.

This is nice and all, but it turns out we can plausibly do better. When \(\{i, j\}\) and \(\{k, l\}\) are disjoint, you can prove by inspection of cases that \(S_{ij}\) and \(S_{kl}\) are independent of each other. (Pairs sharing an index are in general correlated, and the full family of the \(S_{ij}\) is certainly not jointly independent, so treating all the variances as simply adding – which is what follows – is an approximation rather than an exact computation.) Doing so gives $$Var(w(\chi)) \approx \sum_{i < j} Var(S_{ij})$$ It’s a simple calculation to show that $$Var(S_{ij}) = \frac{1}{4} (A_{ij} - A_{ji})^2$$ So $$Var(w(\chi)) \approx \frac{1}{4} \sum_{i < j} (A_{ij} - A_{ji})^2$$

Why is this important? Well, a bounded random variable always takes some value at least one standard deviation away from its mean, and the distribution of \(w(\chi)\) is symmetric about its mean (you can see this by considering what happens under \(\pi \to \tau \cdot \pi\) where \(\tau\) is the order-reversing permutation), so there must be a value at least the mean plus the standard deviation. This suggests we can find a permutation \(\pi\) with $$w(\pi) \geq \frac{1}{2}\left(\sum_{i \neq j} A_{ij} + \sqrt{\sum_{i < j} (A_{ij} - A_{ji})^2} \right)$$

Some closing notes:

  1. Yay for MathJax. This is my first post using it.
  2. My optimiser pretty consistently exceeds this bound anyway (it would be sad if it were doing worse than moderately determined random sampling), but I’ve now added a first pass which shuffles until it does.
  3. I’m nearly certain this bound is not tight except in the trivial case of the constant matrix. When plotting graphs of the distribution of \(w(\chi)\) the result has a distinctly bell-curved shape (although obviously it’s actually discrete) which drops off smoothly rather than cutting off at the SD. However calculating more about the distribution of random sampling is hard, and I suspect may not be that useful, so for now I’m leaving it at this.
  4. I have no idea if this is a known result, but it’s pretty trivial to calculate once you have the basic idea, so it’s likely it’s one of those things that has been discovered several times but wasn’t regarded as interesting enough to publish. Yay for blogs.
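For anyone who wants to experiment, here’s a small Python sketch (mine, separate from the optimiser) that computes \(w(\pi)\), the expectation bound and the heuristic mean-plus-SD figure, and compares them with random shuffling:

```python
import random

def score(A, perm):
    # w(pi): sum of A[i][j] over pairs where i is ranked before j;
    # perm[p] is the item placed in position p.
    n = len(perm)
    return sum(A[perm[p]][perm[q]] for p in range(n) for q in range(p + 1, n))

def bounds(A):
    n = len(A)
    mean = 0.5 * sum(A[i][j] for i in range(n) for j in range(n) if i != j)
    # The heuristic variance from the post (treating variances as additive).
    var = 0.25 * sum((A[i][j] - A[j][i]) ** 2
                     for i in range(n) for j in range(i + 1, n))
    return mean, mean + var ** 0.5

A = [[0, 3, 0, 1],
     [1, 0, 2, 0],
     [2, 1, 0, 4],
     [0, 2, 1, 0]]

mean, estimate = bounds(A)
best = max(score(A, random.sample(range(4), 4)) for _ in range(1000))
print(f"mean {mean}, mean + sd {estimate:.2f}, best of 1000 shuffles {best}")
```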

Generating random spheres

So when I was in the shower earlier I wondered to myself “Hm. How would I generate random points on an N-sphere?” (true story).

I thought about it for a bit and generated the two obvious solutions. Unfortunately the two obvious solutions are, respectively, wrong and massively inefficient (actually the inefficient one isn’t too bad for low dimensions, but I was interested in the high dimensional case where it rapidly becomes massively inefficient). Once out of the shower I looked the answer up, and found the whole thing interesting enough that I figured someone else might too, so this is a post about that.

Obvious but wrong solution

Generate N random uniform numbers between -1 and 1. Divide the resulting vector by its norm.

Why is this wrong? Well, the initial N random numbers will lie uniformly in the N-cube. So the probability density at a given point of the sphere is proportional to the Nth power of the length of the line that goes from the origin, through that point, out to the boundary of the cube. This length is not constant, for the very simple reason that cubes aren’t spheres.

Obvious but slow solution

Generate N uniform random numbers between -1 and 1. Retry this until the resulting norm is <= 1. Divide the resulting vector by its norm.

Why does this work? Because we’re using a rejection method to make sure we get points inside the unit ball. The resulting point is uniformly distributed in the unit ball – though all we actually need is that its distribution is (evidently) spherically symmetrical. By dividing by the norm we’re then projecting that spherically symmetrical distribution onto the sphere and thus getting a uniform distribution on it.

Why is it slow? Because of the curse of dimensionality. When we do the rejection we’re effectively generating a geometric distribution with probability of success equal to the proportion of the N-cube the N-ball occupies. This means we expect to perform the generation 1 / that probability times, which tends towards infinity quite rapidly as N gets large.
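To make the cost concrete, here’s a quick sketch of the rejection method in Python (my own illustration; the numbers in the final comment follow from the volume of the unit N-ball):

```python
import math
import random

def sphere_by_rejection(n):
    # Sample uniformly from the N-cube until we land inside the unit
    # ball, then project onto the sphere by dividing by the norm.
    while True:
        v = [random.uniform(-1, 1) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        if 0 < norm <= 1:
            return [x / norm for x in v]

# Expected attempts = 2^n / (volume of the unit n-ball): about 1.9
# attempts for n = 3, but around 400 for n = 10, and it only gets worse.
```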

Obvious in retrospect solution

At this point I looked up the answer, though I should have been able to figure it out.

What we want is a spherically symmetric distribution a la the previous solution but which is more efficient to calculate.

It turns out that a good choice for this is an N-dimensional normal distribution, which can be obtained by simply generating N Normal(0, 1) random variables. You can then divide the resulting vector by its norm to get a uniform distribution on the unit sphere.
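In code this is very short. A Python sketch of mine (the C version linked below is the same idea):

```python
import math
import random

def sphere_by_gaussians(n):
    # N independent standard normals give a spherically symmetric
    # distribution, so normalising lands uniformly on the sphere.
    v = [random.gauss(0, 1) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

print(sphere_by_gaussians(10))
```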

This is apparently not the best solution (generating normal random numbers is a little expensive), but I couldn’t find a reference to a better version which works in N dimensions for arbitrary N.

Here’s some C code implementing this: https://gist.github.com/1354700.


Rank aggregation basics: Local Kemeny optimisation

The technical content of this post is based heavily on “Rank Aggregation Revisited” by Ravi Kumar, Moni Naor, D. Sivakumar and Cynthia Dwork. It’s purely expository in nature on top of that.

You’ve all seen Hammer Principle now, right? If not, go check it out.

The task it performs of aggregating all the individual opinions into a single overall ranking is harder than it looks. This is a post about some of the technical details involved.

The setup is as follows: We have a bunch of items, and a bunch of votes placing a subset of those items in order.

The naive idea one starts with is as follows: if the majority of people prefer A to B, rank A higher than B. This is an obvious thing to aim for. Unfortunately it’s impossible, even if everyone ranks every item. Consider the following set of votes:

1: A, B, C
2: B, C, A
3: C, A, B

Then 1 and 3 think A < B, 1 and 2 think B < C, and 2 and 3 think C < A. So if we tried to order by majority opinion we’d have A < B < C < A. This is called Condorcet’s paradox: with more than 2 items, ordering by majority vote is impossible in general.

Nevertheless, we hope to be able to do at least reasonably well.

One natural approach is called Kemeny optimisation: we try to find a ranking which minimizes the number of pairwise disagreements between the final ranking and the individual votes. That is, for a set of items I, votes V and an aggregate ranking r, the score is

K(r) = #{ (i, j, t) : i, j in I, t in V, (r(i) < r(j)) != (t(i) < t(j)) }

If r is a minimum for this score we say it’s Kemeny optimal. There needn’t be a unique Kemeny optimal solution: in the above example, all three of the individual votes are Kemeny optimal rankings.
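For concreteness, here’s this score in code – a sketch of mine, with rankings and votes represented as ordered lists (nothing here is from the paper):

```python
from itertools import combinations

def kemeny_score(ranking, votes):
    # ranking and each vote are lists of items, best first; votes may
    # rank only a subset of the items in the ranking.
    pos = {item: i for i, item in enumerate(ranking)}
    score = 0
    for vote in votes:
        vpos = {item: i for i, item in enumerate(vote)}
        for i, j in combinations(vote, 2):
            if (pos[i] < pos[j]) != (vpos[i] < vpos[j]):
                score += 1
    return score

votes = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
# Each vote scores 4, and no ranking does better, so all three of the
# individual votes are Kemeny optimal here:
print([kemeny_score(v, votes) for v in votes])  # [4, 4, 4]
```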

Kemeny optimal rankings have the following nice majority voting property:

Let r be Kemeny optimal. If U and W partition the set of items, and for every u in U and w in W the majority think u < w, then r(u) < r(w) for every u in U and w in W.

Proof: Suppose we had r(u) > r(w) for some such u, w. By passing to a smaller u and a larger w we can ensure that there is no z with r(u) > r(z) > r(w) (we can take the smallest u such that r(u) > r(w), then the largest w such that r(u) > r(w)).

But then if we were to define the ranking r’ which is exactly r except that u and w are swapped, this would decrease K: because there are no points between them, the only pairwise disagreements it changes are those on u and w, so K(r’) – K(r) = #{ t in V : t(u) > t(w) } – #{ t in V : t(u) < t(w) }. But we know the majority think u < w, so this change in score must be negative, i.e. K(r’) < K(r), contradicting minimality. QED

We call the conclusion of this theorem the generalised Condorcet criterion (the Condorcet criterion is that if there’s a single element that beats all the others in a majority vote then it should be the winner). It’s a nice property – it is in some sense an approximation of majority voting. In many cases satisfying it will uniquely determine a great deal about the resulting ranking – it’s only when there’s ambiguity (as in the Condorcet paradox example) that it fails to do so. This should give us confidence that a Kemeny optimal solution is the right thing to aim for.

So, we want to calculate a Kemeny optimal solution. There’s just one problem: we’ve defined the problem as a minimization over all rankings of n items, of which there are n!. This search space is, to borrow a technical term, freaking huge. This makes it a hard problem. In fact finding a Kemeny optimal solution for rankings on n items is NP hard, even with only four votes. I won’t prove this here. Read the paper if you care.

So, no Kemeny optimal solutions for us. How sad. What would work well as a substitute? Well, examining our proof that Kemeny optimal solutions satisfy the generalised Condorcet condition, an interesting feature appears: we don’t actually use anywhere that the ranking is a global minimum. We only use the fact that swapping two adjacent elements can’t decrease the score. This suggests the following relaxation of the condition:

A ranking r is locally Kemeny optimal if swapping two adjacent elements of the ranking does not decrease K.

By the above observation, locally Kemeny optimal solutions satisfy the generalized Condorcet condition.

This turns our problem from a global search to a local one: basically we can start from any point in the search space and search locally by swapping adjacent pairs until we hit a minimum. This turns out to be quite easy to do. We basically run insertion sort: at step n we have the first n items in a locally Kemeny optimal order. We swap the (n+1)th item backwards until the majority think its predecessor is < it. This ensures all adjacent pairs are in the majority order, so swapping them would result in a greater than or equal K. (There’s a sketch of this in code below.)

This is of course an O(n^2) algorithm. In fact, merely finding a locally Kemeny optimal solution can be done in O(n log(n)) (for much the same reason as you can sort better than insertion sort): you just take the directed graph of majority votes and find a Hamiltonian path. The nice thing about the above version of the algorithm, though, is that it gives you a lot of control over where you start your search.
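Here’s the insertion sort version in Python (a sketch of mine; same vote representation as the earlier snippet):

```python
def majority_prefers(i, j, votes):
    # True iff strictly more votes rank i above j than j above i.
    count = 0
    for vote in votes:
        if i in vote and j in vote:
            count += 1 if vote.index(i) < vote.index(j) else -1
    return count > 0

def local_kemenization(start, votes):
    # Insertion sort: each item moves backwards past any predecessor
    # that the majority think should come after it. Ties (and pairs no
    # vote compares) preserve the starting order.
    result = []
    for item in start:
        k = len(result)
        while k > 0 and majority_prefers(item, result[k - 1], votes):
            k -= 1
        result.insert(k, item)
    return result

votes = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
print(local_kemenization(["C", "B", "A"], votes))  # ['B', 'C', 'A']
```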

The above algorithm, however, produces an ordering which is consistent with its starting point in the following sense:

If we start with a ranking r and then run the above algorithm on it to get a ranking r’, then if r’(u) < r’(v) we must have r(u) < r(v) or a majority vote that u < v. This is easy to see: if r(u) > r(v) and the majority don’t think u < v, then when moving u backwards we would have to stop at v at the latest.

In fact for any starting point there is only one locally Kemeny optimal solution which is consistent with it in this way. We can see this inductively: if it’s true for the first n – 1 items, then where can we put the nth item? The only possible place is where the above algorithm puts it: if we were to put it after that place, the result wouldn’t be locally Kemeny optimal; if we were to put it before, it wouldn’t be consistent, because it would be less than some item which it was greater than in the original ordering.

The unique locally Kemeny optimal solution consistent with a ranking in this way is called the local Kemenization of the ranking.

So, this process gives us the core idea of the rank aggregation mechanism we’re using: we pick a starting point according to some heuristic (I’ll explain the heuristic we’re using in a later post), and from that starting point calculate its local Kemenization. This gives us a ranking which is guaranteed to satisfy the generalised Condorcet condition, and thus likely to be a good ranking, but which takes into account whatever our heuristic thought was important as well – this can matter a lot in cases where there is ambiguity in the ranking data.
