Zorn’s lemma is what happens when you get bored of transfinite induction

I was reminded of a quote earlier:

“The Axiom of Choice is obviously true, the well-ordering principle obviously false, and who can tell about Zorn’s lemma?” — Jerry Bona

I find I am slightly annoyed by this quote, because the equivalence between these three is so direct. I know that they’re equivalent is the joke, but the equivalence is close enough if you look at it the right way then building intuition about one should help you build intuition about the others.

Moreover, I think the way Zorn’s lemma is taught has a problem-solution ordering issue.

Zorn’s lemma, at its core, takes a common style of proof and isolates the boring mechanical parts into a self-contained tool that abstracts them away. This is great, but the problem is that it so effectively supplants the method it’s trying to abstract that nobody teaches that method any more so it feels like it comes out of nowhere.

That method is transfinite induction. Or, induction over infinite well ordered sets.

To explain how to use this, we need to use the following results:

Well ordered induction principle

Let \(X\) be some well ordered set, and let \(P\) be some property of elements of \(X\) such that for all \(x \in X\), if \(P(y)\) holds for all \(y < x\) then \(P(x)\). Then \(P(x)\) holds for all \(x \in X\).

Proof: Suppose not, let \(x\) be the smallest element for which it does not hold. Then it holds for all \(y < x\) by the fact that \(x\) is the smallest. Thus it holds for \(x\) by assumptions on \(P\). This is a contradiction. QED

We can use this much like we would use induction on the natural numbers.

In particular it allows us to define functions recursively: If we can define a function \(f(x)\) in terms of its values for \(y < x\) this allows us to define a function on the whole of \(X\) (consider the property \(P(X)\) = “\(f\) is uniquely defined for all \(y < x\)”

Hartog’s Lemma

Let \(X\) be some set. There is a well-ordered set \(W\) such that there is no injective function \(f: W \to X\).

Note: This doesn’t require the axiom of choice, but without the axiom of choice we can’t then conclude that there is some injective function \(f: X \to W\), because we could use such an injective function to well-order \(X\). This then implies the well-ordering theorem, which implies the axiom of choice.

Proof: I’m only going to sketch this. The core idea is that you define \(W\) as the set of all well-orders of subsets of \(X\), up to order isomorphism. If there were an injective function from \(W\) to \(X\) then we could construct an element \(x \in W\) such that the initial segment \(\{y : y < x\}\) is isomorphic to \(W\). This gives us a function \(f : W \to W\) such that \(f(x) < x\). Then \(x > f(x) > f^2(x) > \ldots\) gives us an infinite decreasing sequence in a well ordered set, which is impossible.

If you didn’t understand that sketch, don’t worry about it. It won’t be on the test. Just trust the theorem.

Putting this all together

Lets see how you use this to prove something that you’d normally prove with Zorn’s lemma: Let’s prove that every vector space has a basis.

Let \(V\) be a vector space and let \(L\) be the set of all linearly independent subsets of \(V\).

Let \(W\) be some well ordered set with no injective functions into \(L\).

Define a function \(f: W \to \mathcal{P}(V)\) recursively as follows:

If \(A = \bigcup\limits_{y < x} f(y) \) spans the whole vector space, let \(f(x) = A\). Else, let \(f(x) = A \cup \{ v \} \) where \(v \in V\) is picked using the axiom of choice to be any element outside the span of \(A\).

The intuition here is that we keep adding in new elements which are linearly independent of the existing elements and we’re going to keep going until we can’t go any further, at which point we must span the whole space.

Claim 1: If \(y < x\) then \(f(y) \subseteq f(x)\).

Proof: This is by construction. We defined \(f(x)\) in terms of a union that contains \(f(y)\).

Claim: \(f(x)\) is always linearly independent.

Proof: Induction! Suppose it’s true for all \(y < x\). Then we can find \(v_1, \ldots, v_n \in f(x)\) linearly dependent. But there are \(y_1, \ldots, y_n\) with \(v_i \in f(y_i)\). But because there are only finitely many \(y_i\) we can find \(y = \max y_i\) such that \(v_1, \ldots, v_n \in f(y)\) (because of the previous claim). So \(f(y)\) is linearly dependent, contradicting the inductive assumption. QED

So actually \(f: W \to L\).

But this means that \(f\) cannot be injective (by choice of \(W\)). So at some point we must have \(y < x\) with \(f(x) = f(y)\). But this can only happen if at some point our choice of \(A\) spanned the whole space. Thus \(f(x)\) must have spanned the whole space, and thus is a linearly independent spanning set. i.e. a basis.


And thus Zorn

It may not be obvious what about vector spaces in the above mattered and what was generalizable in the above, but after you’ve done a few of those it starts to become clearer. Rather than making you sit through that, let me highlight what I think were the salient details:

  1. We have some partially ordered set – in this case linearly independent subsets of a vector space, ordered by inclusion.
  2. We take our well ordered set and construct a function into that partially ordered set which is strictly increasing until it hits a maximal element.
  3. Because the function cannot be injective, it cannot be strictly increasing, so there must be a maximal element.

In the case of a vector space, a maximal linearly independent set must span the whole space (because otherwise you could add in another element), so that maximal element is what we were looking for.

So all we need to know now is when we can construct such a function. What was the property of linearly independent sets that let us do this?

The property was that given our choice of \(f(y)\) for \(y < x\) we were always able to choose \(f(x)\) to be greater than or equal to every \(f(y)\). i.e. we choose \(f(x)\) to be an upper bound.

If every subset of our target partially ordered set had an upper bound, then we could always construct this choice. This is however too strong a condition: e.g. it’s not the case that any two linearly independent subsets have a common upper bound. The sets \(\{v\}\) and \(\{-v\}\) do not for any non-zero \(v\).

However, all we really need is that the sort of sets we get during our transfinite induction have upper bounds.

And the important feature of these sets is that they come from an increasing sequence. In particular, any two elements of them are comparable. They form a chain.

This leads us to the notion required for Zorn’s lemma: We are interested in partially ordered sets such that every chain has an upper bound. This allows us to construct our function, and transfinite induction then gives us a maximal element.


Zorn’s lemma

Let \(T\) be a partially ordered set such that every chain has an upper bound.Then there is some maximal element \(t \in T\). i.e. there is no element \(s \in T\) such that \(t < s\).


Our proof will be very similar to our proof for vector space having a basis. The only major difference is that we’ll have to work a little harder at the recursive definition of our increasing function because where for the basis form we could construct the function and then show its output was always in \(L\), here we have to simultaneously construct the function and show that it’s well defined.

In order to do this we’ll introduce a special value \(\infty \not\in T\). If things go wrong and we fail to construct a maximal element we’ll return \(\infty\) instead. This simplifies things to showing that we never return \(\infty\).

First, lets be more explicit about our use of the axiom of choice. Let \(f\) be some choice function. We’re going to define an ‘upper bound’ function \(u(A)\) as follows:

Let \(Q = \{x: y < x \forall y \in A\}\).

Let \(R = \{x: y \leq x \forall y \in A\}\).

If \(Q\) is non-empty, let \(u(A) = f(Q)\), i.e. any element which is strictly greater than all of the elements of \(A\).

Else if \(R\) is non-empty, let \(u(A) = f(R)\) – i.e. any element which is \(\geq\) every element of \(A\).

Else, there are no upper bounds. Return \(\infty\).

Let \(W\) be some well ordered set. Define \(f : W \to T \cup \{\infty\}\) recursively as \(f(x) = \infty\) if \(f(y) = \infty\) for any \(y < x\), else \(f(x) = u(\{f(y): y < x\}))\)

Claim: This function never returns \(\infty\).

Proof by induction:

Suppose this is true for all \(y < x\). Then the set \(\{f(y): y < x\}\) must form a chain, if \(u < v < x\) we picked \(f(v) \geq f(w)\) by our definition of \(u\).

Thus, by our assumption on the partially ordered set \(T\), this set has an upper bound, so \(R\) in our definition of \(u\) is non-empty. This means that \(f(x)\) is chosen to be an element of \(R\) and thus is not \(\infty\).

So we’ve constructed an increasing \(f : W \to T\). By choosing \(W\) appropriately, this cannot be an injection. So find \(y < x\) with \(f(x) = f(y)\). This can only happen if there is no element \(s\) such that \(s > f(y)\). i.e. \(f(y)\) is a maximal element.


Although the details differed, hopefully this should look structurally pretty similar to the more concrete form with the vector space basis.

Why is this useful?

The main reason it’s useful is that the property of chains having upper bounds comes up a lot. In particular it comes up with things that are “essentially finite” in nature. Most applications of Zorn seem to boil down to the following more specific lemma:

Teichmüller–Tukey lemma

Let \(X\) be some set and let \(L \subset \mathcal{P}(X)\) be some family of sets with the property that \(A \in L\) if and only if \(B \in L\) for every finite \(B \subseteq A\).

Then there is some \(A \in L\) such that for all \(x \in X\), \(A \cup \{x\} \not\in L\).

Note: In particular, as we saw in our original proof, the linearly ordered subsets of a vector space satisfy these conditions. Most essentially “algebraic” conditions tend to satisfy it.


We consider \(L\) partially ordered by subset inclusion. We’ll show that the union of any chain of sets is in \(L\).

Suppose \(C\) were some chain of sets in \(L\) such that \(\bigcup C \not \in L\). Then we can find a finite set \(\{x_1, \ldots, x_n\} \subseteq \bigcup C\) not in \(L\).

But then we can find \(U_i\) such that \(x_i \in U_i \in C\). Because \(C\) is a chain we can thus find a maximum \(U\). But then \(\{x_1, \ldots, x_n\}\) is a finite subset of \(U\) which is not in \(L\), contradicting the assumption that \(U \in L\).


In parting

Hopefully even if you didn’t follow all the details this demystified Zorn’s lemma a bit.

In general, there is a rich theory of well ordered sets and it seems to often be skipped or deferred until after Zorn’s lemma has already been taught. I can understand why – if you fill in all the details it feels like a pretty complex piece of machinery to be introducing when all you’re going to do is use it to prove Zorn’s lemma and then forget about it.

There’s a lot more to the theory than that though. Some of it is pretty interesting, and some of it I just think is useful in demystifying where a lot of this sort of thing comes from.

This entry was posted in Numbers are hard on by .

One thought on “Zorn’s lemma is what happens when you get bored of transfinite induction

  1. Pingback: Direct proofs from the well ordering theorem | David R. MacIver

Comments are closed.