Category Archives: Numbers are hard

The littlest infinity

Note: This is mostly all stuff you’d probably learn in your first six months of a maths degree, I just like occasionally chewing on the fundamentals and seeing how they fit together. There’s nothing very deep here, but it might be pedagogically interesting.

You’re probably familiar with the natural numbers: They’re a set \(\mathbb{N}\) with a distinguished element \(0 \in \mathbb{N}\) and a successor function \(S: \mathbb{N} \to \mathbb{N}\) which is injective, has \(0 \ne S(n)\) for all \(n \in \mathbb{N}\), and has the property that if \(A \subseteq \mathbb{N}\) with \(0 \in A\) and \(S(A) \subseteq A\) then \(A = \mathbb{N}\) (“the induction principle”).
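As a toy illustration (not part of the argument), here’s a Python sketch that models the naturals as nested tuples, with `zero` and `succ` as invented names, and spot-checks the first two axioms on a finite sample:

```python
# Model 0 as the empty tuple and S(n) as (n,), so each natural is a
# distinct nested tuple: (), ((),), (((),),), ...
zero = ()

def succ(n):
    """Successor function: wrap n in a one-element tuple."""
    return (n,)

# Build a finite sample 0, 1, ..., 10.
sample = [zero]
for _ in range(10):
    sample.append(succ(sample[-1]))

# 0 is never a successor: succ(n) is a length-1 tuple, zero has length 0.
assert all(succ(n) != zero for n in sample)
# succ is injective on the sample: distinct inputs give distinct outputs.
assert len({succ(n) for n in sample}) == len(sample)
```

The induction principle, of course, is exactly the part a finite check can’t exhibit.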

The natural numbers are the most basic example of an infinite set, and are usually postulated to exist as an axiom of set theory. We then construct all other infinite sets starting from \(\mathbb{N}\) using e.g. power set operations.

But what is an infinite set?

There turn out to be more or less three ways of defining this, all of which are equivalent (under certain assumptions).

One way of defining it is: any set which is not a finite set. But that reduces to the question “What is a finite set?”, to which the answer is: any set which is in bijection (one to one correspondence) with one of the sets \(\{i \in \mathbb{N} : i < n\}\) for some \(n \in \mathbb{N}\) (you need to define the relation \(<\) on \(\mathbb{N}\) for this, but let’s assume we’ve done that in some standard way, e.g. by induction).

You can show inductively that \(\mathbb{N}\) has no injection into a finite set (and thus is infinite) as follows: It certainly has no injection into the empty set, which is \(\{i : i < 0\}\). Suppose you had an injection \(f: \mathbb{N} \to \{i \in \mathbb{N} : i < n + 1\}\). By composing it with a swap if necessary, you can assume that \(f(0) = n\). Since \(f\) is injective and \(f(0) = n\), we have \(f(S(m)) \ne n\) for all \(m\), so \(f \circ S\) is an injection into \(\{i \in \mathbb{N} : i < n \}\), which contradicts our induction hypothesis.

Another definition of a set being infinite is that it contains a bijective copy of \(\mathbb{N}\) (i.e. there is an injection from \(\mathbb{N}\) into it).

If there is such an injection then the set is certainly infinite in the sense of not being finite: otherwise we could compose the injection from \(\mathbb{N}\) into it with its bijection to a finite set, giving an injection from \(\mathbb{N}\) into a finite set, which contradicts what we showed above.

To show the opposite direction we need the axiom of choice (we actually only need a weak form of it, but we’ll ignore that detail): For any set \(X\) there is a choice function \(c : \mathcal{P}(X) \setminus \{\emptyset\} \to X\) with \(c(A) \in A\).

If \(X\) is infinite in the sense of not being finite, we can then construct a function \(f: \mathbb{N} \to X\) as follows:

  • \(f(0) = c(X)\)
  • \(f(S(n)) = c(X \setminus \{f(0), \ldots, f(n)\})\)

This is well defined because the set we’re passing to \(c\) must be non-empty: if it were empty, \(f\) restricted to \(\{0, \ldots, n\} = \{i : i < n + 1\}\) would be a bijection with \(X\), making \(X\) finite.
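A finite cartoon of this recursion, with `min` standing in as a concrete choice function \(c\) (the names are mine): on a finite \(X\) the process halts exactly because we eventually hand \(c\) the empty set, which is what cannot happen when \(X\) is infinite.

```python
def c(a):
    """A concrete choice function on non-empty sets of integers: pick the minimum."""
    assert a, "c is only defined on non-empty sets"
    return min(a)

X = {3, 1, 4, 5, 9, 2, 6}      # a finite stand-in for an infinite X

f = []                          # f[n] plays the role of f(n)
while set(f) != X:              # for an infinite X this loop would never stop
    f.append(c(X - set(f)))     # f(S(n)) = c(X \ {f(0), ..., f(n)})

assert len(set(f)) == len(f)    # the constructed f is injective
assert set(f) == X
```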

This definition of infinity justifies the fundamental role of the natural numbers (and the title) to some degree: They’re literally the smallest infinite set, because a copy of them is contained in every other infinite set.

But finally, there is a third definition of infinity which has no reference to the natural numbers at all: \(X\) is infinite if there is some injective function \(f: X \to X\) whose range is a proper subset of \(X\). i.e. \(f\) is injective but not surjective. You can justify this by analogy with the natural numbers as our basic example of an infinite set: \(S\) is such a function, because it is injective but does not map anything to \(0\). You can also show inductively that all of the sets \(\{i: i < n\}\) do not have this property (more or less by the same argument we used to show that there was no injection from \(\mathbb{N}\) into them).

Let’s see that this is equivalent to the previous definition.

First, suppose we’ve got some injection \(h: \mathbb{N} \to X\). We can construct such an \(f\) as follows:

  • If \(x \neq h(n)\) for all \(n\), then \(f(x) = x\)
  • If \(x = h(n)\), then \(f(x) = h(S(n))\)

i.e. we define our function to be the successor function on our copy of \(\mathbb{N}\) and the identity elsewhere.

This is still injective, because it’s injective on each member of the partition into \(h(\mathbb{N})\) and \(X \setminus h(\mathbb{N})\), and it doesn’t map either part into the other. It is however not surjective, because it maps no values to \(h(0)\).
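Here’s a sketch of that construction in Python, taking \(X\) to be the set of strings and embedding \(\mathbb{N}\) as the strings of \(\texttt{"a"}\)s (all the names are invented for illustration):

```python
def h(n):
    """Embed a copy of N into the strings as "", "a", "aa", ..."""
    return "a" * n

def f(x):
    """Successor on the copy of N, identity everywhere else."""
    if set(x) <= {"a"}:        # x is h(n) for some n (including h(0) = "")
        return x + "a"         # f(h(n)) = h(S(n))
    return x                   # identity off the copy of N

sample = ["", "a", "aa", "b", "cat", "ab"]
images = [f(x) for x in sample]

assert len(set(images)) == len(sample)   # injective on the sample
assert h(0) not in images                # nothing is mapped to h(0) = ""
```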

So a set which contains a copy of \(\mathbb{N}\) must necessarily be infinite in this sense.

Now, suppose we have such an \(f\). We can use this to construct an injection from \(\mathbb{N}\) into \(X\) as follows:

  • Pick \(h(0)\) as an arbitrary element of \(X \setminus f(X)\)
  • Let \(h(S(n)) = f(h(n))\)

We can prove this is injective by inductively showing that \(h(n) \not\in \{h(m): m < n\}\).

Proof: Trivially true for \(0\) because that’s the empty set. For \(S(n)\), note first that \(h(S(n)) \ne h(0)\), because by choice of \(h(0)\) we must have \(f(x) \ne h(0)\) for all \(x\).

Supposing \(h(S(n)) = h(m)\) for some \(0 < m \leq n\), we must have \(f(h(n)) = h(m)\). Since \(m \neq 0\) we can write \(m = S(k)\) for some \(k < n\), so we have \(f(h(n)) = f(h(k))\). \(f\) is an injection, so \(h(n) = h(k)\). But \(k < n\), so this contradicts our inductive hypothesis that \(h(n) \not\in \{h(m): m < n\}\). Therefore \(h\) is an injection as desired.
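Concretely, the construction just iterates \(f\) starting from a point outside its range. A sketch, taking \(f(x) = x + 2\) on the non-negative integers as the injective, non-surjective map (an invented example):

```python
def f(x):
    """Injective but not surjective on the non-negative integers: misses 0 and 1."""
    return x + 2

h = [1]                  # h(0): any element outside the range of f
for _ in range(20):
    h.append(f(h[-1]))   # h(S(n)) = f(h(n))

# The orbit 1, 3, 5, ... never repeats, giving an injective prefix of N -> X.
assert len(set(h)) == len(h)
```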

So we have three notions of infinity, all equivalent (up to the axiom of choice). Why do we care?

Well, one interesting thing about the last definition is that it’s a notion of infinity that has nothing to do with the natural numbers. This has the nice benefit that we can run it backwards: If we have a set that is infinite in that sense, then we can construct the natural numbers out of it. This is a pretty strong justification of the natural numbers as our fundamental notion of an infinite set.

To see how to do this, suppose we’ve got a set \(X\) and an injective function \(S : X \to X\) which is not surjective.

Pick some distinguished element \(0 \in X \setminus S(X)\).

Now consider the family \(\mathcal{F}\) of sets \(U \subseteq X\) with the properties that \(0 \in U\) and \(S(U) \subseteq U\). This family is nonempty because trivially \(X \in \mathcal{F}\).

Now, define \(N = \bigcap \mathcal{F}\).

We must have \(N \in \mathcal{F}\), because \(0 \in N\) and if \(U \in \mathcal{F}\) then \(S(N) \subseteq S(U) \subseteq U\), so taking intersections over all \(U \in \mathcal{F}\) we have \(S(N) \subseteq \bigcap \mathcal{F} = N\).

Moreover, if \(A \subseteq N\) with \(0 \in A\) and \(S(A) \subseteq A\), then \(A \in \mathcal{F}\), and thus \(N \subseteq A\), and so \(N = A\).

But these are precisely the requirements for \(N\) to be a model of the natural numbers, \(\mathbb{N}\)! We have a distinguished element \(0\), an injective function \(S\) that doesn’t map anything to \(0\), and it satisfies the induction principle. And, importantly, we got here without assuming that anything that looked like the natural numbers existed: All we required was the apparently weaker claim about some form of infinite set existing.
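The intersection \(\bigcap \mathcal{F}\) is the least fixed point of “contains \(0\) and is closed under \(S\)”, which we can compute iteratively in a finite cartoon (the bound below is purely an artifact of working on a computer, and all the names are mine):

```python
def S(n):
    return n + 1

BOUND = 50          # truncation: a computer can only hold finitely many elements

# Start from {0} and close under S: this computes the smallest set containing
# 0 and closed under S, i.e. a (truncated) copy of the naturals.
N = {0}
while True:
    closed = N | {S(n) for n in N if S(n) < BOUND}
    if closed == N:
        break
    N = closed

assert N == set(range(BOUND))
```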

So pretty much wherever you start in defining infinite sets, you’re going to end up with the natural numbers at some point.

In the absence of at least some weak form of the axiom of choice these equivalences break down a bit, but they’re still compelling enough and strong enough that they fairly firmly place the natural numbers as the foundational object of infinite set theory.

This entry was posted in Numbers are hard.

Metric space completeness is a restricted form of topological compactness

This is one of those facts that is obvious once you’ve seen it but I sometimes forget isn’t well known.

Topological compactness admits the following slightly less common but directly equivalent definition:

A family of sets has the finite intersection property if every finite subfamily of it has non-empty intersection. A topological space is said to be compact if every family of closed sets with the finite intersection property has non-empty intersection.

(The reason this is directly equivalent is that a family of closed sets with empty intersection is precisely the set of complements of an open cover.)

Starting from this definition, we can get an alternative definition for what it means for a metric space to be complete that makes the relationship between completeness and compactness much more obvious:

A metric space is complete if and only if every family of closed sets with the finite intersection property that contains sets of arbitrarily small diameter has non-empty intersection.

i.e. metric completeness is “compactness for arbitrarily small sets”.

Let’s show that this is equivalent to the normal Cauchy sequence definition.

First, assume a metric space has this property, let’s show it’s complete in the Cauchy sequence sense.

So, let \(x_n\) be a Cauchy sequence. Now, define \(F_n = \overline{\{x_m: m \geq n\}}\).

The family \(\{F_n\}\) is nested and non-empty, so certainly has the finite intersection property. Its elements are closed by construction.

To see that it contains sets of arbitrarily small diameter, we just need to show that the tail sets \(\{x_m: m \geq n\}\) can have arbitrarily small diameter, as the diameter of the closure of a set is the same as the diameter of the set. But \(x_n\) is a Cauchy sequence, so for \(\epsilon > 0\) we can find \(N\) such that for \(m, n \geq N\), \(d(x_n, x_m) < \epsilon\). Necessarily then the tail set \(\{x_m: m \geq N\}\) has diameter \(\leq \epsilon\).
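Numerically the shrinking tails are easy to see. A sketch with \(x_n\) the partial sums of a geometric series (a standard Cauchy sequence; the specific numbers are just for illustration):

```python
# x_n = 1 + 1/2 + ... + 1/2^n, a Cauchy sequence converging to 2.
xs = []
total = 0.0
for k in range(40):
    total += 0.5 ** k
    xs.append(total)

def diam(points):
    """Diameter of a finite set of reals."""
    return max(points) - min(points)

# Diameters of the tail sets {x_m : m >= n} for n = 0, 10, 20.
tail_diams = [diam(xs[n:]) for n in (0, 10, 20)]

assert all(a > b for a, b in zip(tail_diams, tail_diams[1:]))  # shrinking
assert tail_diams[-1] < 1e-5                                   # arbitrarily small
```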

Now suppose \(x \in \bigcap F_n\). Then \(x_n \to x\): pick \(N\) such that \(diam(F_N) < \frac{1}{2}\epsilon\). Because \(x\) is in the closure of the tail set, we can find \(x_m\) with \(m \geq N\) and \(d(x_m, x) < \frac{1}{2}\epsilon\), so for all \(n \geq N\), \(d(x_n, x) \leq d(x_n, x_m) + d(x_m, x) < \epsilon\), and the sequence converges to \(x\) as desired.

The other direction now. Suppose we have a Cauchy sequence complete metric space and a family of closed sets with the finite intersection property and arbitrarily small diameter sets.

Let \(E_n\) be a sequence of sets in that family with \(diam(E_n) \to 0\).

Pick \(x_n \in \bigcap\limits_{m \leq n} E_m\) (we can do this because of the finite intersection property). Then \(x_n\) is a Cauchy sequence: pick \(N\) such that \(diam(E_N) < \epsilon\), then for \(m, n \geq N\), \(x_m, x_n \in E_N\), so necessarily \(d(x_m, x_n) \leq diam(E_N) < \epsilon\).

Because it’s a Cauchy sequence, it converges, say to \(x\). Because the \(E_n\) are closed, \(x\) is necessarily a member of \(\bigcap E_n\).

Suppose now that \(F\) is some other member of our family. We can repeat the above construction replacing \(E_n\) with \(E_n \cap F\), and get an element \(y\) with \(y \in F \cap \bigcap E_n\). But \(x, y \in E_n\), so \(d(x, y) \leq diam(E_n) \to 0\). So \(d(x, y) = 0\) and thus \(x = y\). Hence \(x\) must be in the intersection of the whole family.


Why does this characterisation matter?

Well, clearly it doesn’t matter that much. The vast majority of people who do analysis have probably never encountered it and their lives are not substantially worse off for that lack.

There are basically two major reasons I like this fact:

  1. I think it makes the relationship between compactness and completeness much clearer.
  2. What Zorn’s lemma does for transfinite induction, this fact does for sequences in analysis – it essentially lets you factor out a repetitive mechanical part of the proof and just reuse it without having to repeat the machinery over and over again. I’m slightly allergic to sequences, so this tends to play well with how I like to do analysis.

How to read a mathematics textbook

Miikka asked me how I read a maths textbook the other day, and I didn’t have a better answer than “badly”. I’ve mostly tried to read textbooks linearly cover to cover, and this doesn’t actually work – I sometimes end up understanding the material, but it’s usually not until long after I’ve finished or given up on the book.

After some thought and experimentation, I think I now have a better answer that helps integrate understanding and reading immediately. It’s very much a work in progress, and I’m sure I will refine the process over time, but initial experience with it seems promising.

It’s designed along a couple of important principles:

  1. The goal is not, in fact, to read the textbook. The goal is to understand the material the textbook covers, and reading it helps you achieve that.
  2. Your obligation to the author ended when you bought the book. You may make use of it in whatever way you like, regardless of how the author intended it to be used. In particular you are under no obligation to read the material in the order in which it is presented, and you don’t have to read the bits you don’t care about.
  3. If you’re constantly making progress in the right direction, you will benefit even if you stop midway through the process.
  4. If it feels easy you’re probably not learning very much.

It’s very loosely based on some of the principles of active reading outlined in How To Read A Book, but they mostly don’t cover this subject so I had to invent my own techniques.

The technique we’re going to use is one of goal-directed learning. If we try to take in the entire textbook at once we’ll be overwhelmed completely (textbooks are large), so instead we repeatedly pick a goal and try to achieve that goal. You’ll end up essentially rewriting just enough of the book in your own words to understand the parts of the material that you need to achieve that goal.

A good goal is a theorem or an exercise (really an exercise is just a theorem with the proof omitted): You want to be able to prove the theorem or solve the exercise. For these purposes a proposition or a lemma is just a small theorem.

Pick one. It doesn’t really matter how. If there’s something specific you want to learn, use that. Otherwise, open the book randomly somewhere in the middle and scout around until you find something that looks interesting.

Now, get a large quantity of paper and a pencil or pen. First dedicate two sheets and create two lists: One is your Current list, the other is your Pending list. Items on the Current list are numbered, items on the Pending list are unnumbered.

The Pending list starts empty, the Current list starts with one item on it (numbered item 1), which is your chosen starting point.

Now, repeatedly iterate the following procedure:

  1. If every item on your Current list has a tick next to it, move an item from your Pending list to your Current list by crossing it off from Pending and adding it to the bottom of Current. If your Pending list is empty, stop. You’re done for now.
  2. Pick an item on your Current list that does not have a tick next to it. The highest number is usually a good default.
  3. Read the part of the book (only a paragraph or two usually) that describes that item.
  4. For any definitions or theorems you don’t currently know, add them to the Current list if they’re not already on it. If any of these were already on the Pending list, cross them off it. If you added anything, go back to step 2.
  5. If you understand all the referenced items, start a new sheet of paper with the item number and its name at the top. Write that sheet (see below), then put a tick next to the item on your current list.
  6. Anything interesting you encounter in steps 3 and 4 that is not directly part of solving your current goal, add to the Pending list.
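To make the bookkeeping concrete, here’s a toy simulation of the loop in Python, with an invented miniature “book” mapping each item to the items its section references (I’ve dropped the Pending list to keep the sketch short):

```python
# A toy "book": each item points at the items its section references.
# The titles and dependencies are invented for illustration.
book = {
    "Edmonds' algorithm": ["matroid intersection", "matroid"],
    "matroid intersection": ["matroid"],
    "matroid": ["independence system"],
    "independence system": [],
}

current = ["Edmonds' algorithm"]    # the Current list, starting from our goal
ticked = set()                      # items with a tick next to them

while len(ticked) < len(current):
    # Step 2: pick an unticked item, highest number (latest added) first.
    item = next(i for i in reversed(current) if i not in ticked)
    # Steps 3-4: read it; add any unknown references to the Current list.
    unknown = [r for r in book[item] if r not in current]
    if unknown:
        current.extend(unknown)
    else:
        # Step 5: all references understood; write the sheet and tick the item.
        ticked.add(item)

assert ticked == set(book)          # every item needed for the goal got a sheet
```

Prerequisites get ticked before the things that depend on them, which is the point of working from the highest number.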

A sheet for an entry on your list depends on the type of item:

If it’s a theorem, the sheet is a proof of the theorem. You should write this out by hand. Try to prove as much of it yourself as you can. If the proof is complicated and you don’t feel it provides much insight into your current goal, add “Prove this theorem” to your pending list and instead just write out the statement of the theorem in a way you understand and treat the proof as a black box for now.

A sheet for a definition consists of a statement of that definition that you understand followed by the answer to two or three questions:

  1. What is an example of this?
  2. Why do we care? (This may be just “We need it for X on the list”)
  3. Why is this well defined? (Optional as this is often obvious)

At the end of this, you should have written a proof that you understand of the thing that you wanted to understand, and also picked up a significant chunk of the textbook along the way. Congratulations! You should definitely now take a break (to be honest, this process probably took you multiple days so you’d better have been taking breaks along the way anyway). When you’re refreshed, you can either decide you’ve got enough out of the textbook or pick another starting point. Try to pick one that is at least related to your last one so you can build on previous understanding.

Worked Example

I am applying this technique to the combinatorial optimization textbook I got as a gift, and it seems to be helping a lot, though it’s early in the process so it’s hard to say for sure.

For my purposes I picked “Edmonds’ Matroid Intersection Algorithm” because it sounded interesting and I didn’t know what a Matroid was so it seemed like a good test. I’m in the middle of it still, so the following is a work in progress.

My Current list is:

  1. Edmonds’ Algorithm
  2. Matroid intersection problem ✓
  3. Matroid ✓
  4. Independence Oracle ✓
  5. Circuit ✓
  6. Independence System ✓
  7. Proposition 13.4
  8. Lower Rank
  9. Rank Quotient
  10. Proposition 13.29
  11. Rank Function ✓
  12. C(X, e) ✓
  13. Theorem 13.12(b)
  14. Theorem 13.8
  15. Proposition 13.7
  16. Theorem 13.10 ✓

My Pending list is:

  • Duality of Matroids
  • Best-in greedy algorithm
  • Worst-out greedy algorithm
  • Theorem 13.19
  • Theorem 13.20

How does it feel?

Well, it feels like a lot of work. But it feels like useful work. I think I have a decent grasp on what a Matroid is and why they’re interesting, and how we can work with them.

Also, understanding this book was always going to be a lot of work. The problem was that I wasn’t doing the work, and the great thing about this technique is that it’s given me a process which actually lets me do the work without feeling overwhelmed, because I can let go of the idea of trying to read the whole book and just focus on the specific goal and the process of getting there.

Once I’ve completed this section I’m going to turn my attention to the Simplex algorithm section, which is entirely different in that I am familiar with the Simplex algorithm but have never really managed to understand it to any significant degree. If this approach helps me achieve that then I’ll definitely regard it as a winner.


Direct proofs from the well ordering theorem

In the course of writing yesterday’s post about Zorn’s lemma I noticed a different proof of the theorem that every vector space has a basis:

Let \(V\) be a vector space and define some well-ordering \(<\) on it. Let \(B\) be the set of \(x\) such that \(x\) is not in the span of \(\{y: y < x\}\).

Claim: B is a basis.

Proof: Certainly \(B\) is linearly independent: If not, let \(x_1 < \ldots < x_n \in B\) be linearly dependent. Then \(x_n\) is in the span of \(x_1, \ldots, x_{n-1}\), which are \(< x_n\), contradicting the definition of \(B\).

Claim: For all \(x\), \(x\) is in the span of \(\{y \in B: y \leq x\}\).

Proof: Transfinite induction. Either \(x\) is not in the span of \(\{y: y < x\}\), in which case \(x \in B\) and thus certainly in the span. Or we can find \(y_1, \ldots, y_n < x\) such that \(x\) is in their span. But by the induction hypothesis, each of these is in the span of \(\{y \in B: y < x\}\), and thus so is \(x\).


To some extent this is just the normal proof in disguise, but I thought it was cute. I highly doubt it’s in any way original to me, but I hadn’t seen it before.

This made me wonder what other things you could prove this way. I’ve rarely seen the well ordering theorem used as a direct proof method except when you want to make some careful argument about cardinalities (e.g. one way the Continuum Hypothesis is often used is to put a well-ordering on \(\mathbb{R}\) such that all the initial segments are countable).

For starters, the above proof can relatively easily be turned into a proof of the Teichmüller–Tukey lemma using essentially the same construction:

Let \(X\) be some set, and let \(L \subseteq \mathcal{P}(X)\) be some family of sets of finite character. Pick a well-ordering \(<\) of \(X\) and define the function \(f : X \to \{0, 1\}\) recursively as follows:

Let \(T = \{y < x: f(y) = 1\}\).

Let \(f(x) = 1\) if \(T \cup \{x\} \in L\), else \(f(x) = 0\).

Claim: \(S = \{x : f(x) = 1\}\) is a maximal element of \(L\).

Proof: First we show that it’s in \(L\). If not, there is some finite subset \(x_1 < \ldots < x_n \in S\) such that \(\{x_1, \ldots, x_n\} \not\in L\). But \(f(x_n) = 1\), so with \(T\) as in our definition of \(f\), \(T \cup \{x_n\} \in L\). Thus every finite subset of \(T \cup \{x_n\}\) is in \(L\), and in particular \(\{x_1, \ldots, x_n\} \in L\), contradicting our assumption.

Now we show it’s maximal: Suppose \(S \cup \{x\} \in L\). Then we must have \(f(x) = 1\), because \(T \cup \{x\} \subseteq S \cup \{x\}\), and families of sets of finite character are closed under taking subsets. Thus \(x \in S\).
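A finite sketch of this construction in Python: take \(L\) to be the sets of pairwise coprime integers (a family of finite character, since a set is in \(L\) iff every two-element subset is), and let list order stand in for the well-order. The example data is invented:

```python
from math import gcd

def in_L(a):
    """Membership in L: the integers in a are pairwise coprime."""
    return all(gcd(x, y) == 1 for i, x in enumerate(a) for y in a[i + 1:])

X = [6, 10, 15, 7, 4, 9, 49]   # list position plays the role of the well-order

S = []
for x in X:                     # f(x) = 1 iff T ∪ {x} ∈ L, with T the earlier picks
    if in_L(S + [x]):
        S.append(x)

assert in_L(S)                                            # S is in L
assert all(not in_L(S + [x]) for x in X if x not in S)    # and S is maximal
```

Here the greedy pass keeps \(6\) and \(7\) and rejects everything else, and indeed nothing rejected can be added back.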


Slightly further adaptation allows us to prove Zorn’s lemma itself via this construction, and I think it actually is a slightly nicer proof than the one I presented before:

Let \(T\) be some partially ordered set with relation \(\prec\) such that every chain in \(T\) has an upper bound.

Let \(<\) be a well-order on \(T\) (not necessarily compatible with \(\prec\)).

Define \(f : T \to \{0, 1\}\) as \(f(x) = 1\) if \(y \prec x\) for every \(y < x\) such that \(f(y) = 1\), else \(f(x) = 0\).

Let \(C = \{x: f(x) = 1\}\).

Claim: \(C\) is a chain for \(\prec\).

Proof: Let \(x, y \in C\) with \(x \neq y\). By reordering if necessary we can assume \(x < y\). Then by construction of \(f\), we must have \(x \prec y\).

Thus, by assumption, there is some upper bound for \(C\), call it \(s\).

Claim: \(s\) is maximal.

Proof: We will show that \(s \in C\). This proves that \(s\) is maximal, because if there were another element \(t \succ s\), it would also be an upper bound for \(C\), so it would also have to be in \(C\), contradicting that \(s\) is an upper bound for \(C\).

To show that \(s\) is in \(C\): we must have \(f(s) = 1\), because \(\{y : y < s, f(y) = 1\} \subseteq C\), \(y \preceq s\) for all \(y \in C\), and \(y \neq s\) for \(y < s\), so \(y \prec s\) for every \(y < s\) with \(f(y) = 1\). These are the conditions in our definition of \(f\), so \(f(s) = 1\) and thus \(s \in C\).


Side note 1: Note that all we’ve really done in the above proof is inlined a proof of the Hausdorff maximal principle. This follows fairly straightforwardly from the Teichmüller–Tukey lemma, so having its own proof by this mechanism is not actually super useful.

Side note 2: You may have only seen the well-ordering theorem proven using Zorn’s lemma so this may seem circular, but you can prove it directly from Hartogs’ lemma (which does not require the axiom of choice) as follows:

Let \(X\) be some set, and let \(c\) be a choice function for subsets of \(X\). Let \(W\) be some well ordered set that does not inject into \(X\). Define \(f : W \to X \cup \{\infty\}\) as:

If \(X \subseteq \{ f(s) : s < t \}\) then \(f(t) = \infty\). Else, \(f(t) = c(X \setminus \{ f(s) : s < t \})\).

This must eventually hit \(\infty\), as otherwise it would be injective. Restricted to \(\{t: f(t) \neq \infty\}\), \(f\) gives a bijection between \(X\) and a well-ordered set, so you can use this to define a well-ordering on \(X\).
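A finite cartoon of this construction: with \(X\) a finite set, `min` as the choice function, and \(W\) just “long enough”, the recursion enumerates \(X\) and then sticks at \(\infty\). (All names are invented; the real content is that \(W\) is too big to inject into \(X\).)

```python
INF = object()                   # the extra point "infinity"

def c(a):
    """Choice function: pick the minimum of a non-empty set."""
    return min(a)

X = {"b", "d", "a", "c"}
W = range(len(X) + 2)            # a well-ordered set too long to inject into X

f = {}
for t in W:
    used = {f[s] for s in range(t)} - {INF}
    f[t] = INF if X <= used else c(X - used)

enumeration = [f[t] for t in W if f[t] is not INF]
assert enumeration == sorted(X)  # f hits each element of X exactly once...
assert f[len(X)] is INF          # ...and then hits infinity forever after
```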


Is this style of proof really any better than the alternatives? I don’t know. You probably still want to use Zorn’s lemma and similar most of the time anyway. I like these proofs though, and they seem like a slightly less messy way to get a feel for transfinite induction.


Zorn’s lemma is what happens when you get bored of transfinite induction

I was reminded of a quote earlier:

“The Axiom of Choice is obviously true, the well-ordering principle obviously false, and who can tell about Zorn’s lemma?” — Jerry Bona

I find I am slightly annoyed by this quote, because the equivalence between these three is so direct. I know the fact that they’re equivalent is the joke, but the equivalence is so close that, if you look at it the right way, building intuition about one should help you build intuition about the others.

Moreover, I think the way Zorn’s lemma is taught has a problem-solution ordering issue.

Zorn’s lemma, at its core, takes a common style of proof and isolates the boring mechanical parts into a self-contained tool that abstracts them away. This is great, but the problem is that it so effectively supplants the method it’s trying to abstract that nobody teaches that method any more so it feels like it comes out of nowhere.

That method is transfinite induction. Or, induction over infinite well ordered sets.

To explain how to use this, we need to use the following results:

Well ordered induction principle

Let \(X\) be some well ordered set, and let \(P\) be some property of elements of \(X\) such that for all \(x \in X\), if \(P(y)\) holds for all \(y < x\) then \(P(x)\). Then \(P(x)\) holds for all \(x \in X\).

Proof: Suppose not, let \(x\) be the smallest element for which it does not hold. Then it holds for all \(y < x\) by the fact that \(x\) is the smallest. Thus it holds for \(x\) by assumptions on \(P\). This is a contradiction. QED

We can use this much like we would use induction on the natural numbers.

In particular it allows us to define functions recursively: If we can define \(f(x)\) in terms of the values \(f(y)\) for \(y < x\), this allows us to define a function on the whole of \(X\) (consider the property \(P(x)\) = “\(f\) is uniquely defined for all \(y \leq x\)”).

Hartogs’ Lemma

Let \(X\) be some set. There is a well-ordered set \(W\) such that there is no injective function \(f: W \to X\).

Note: This doesn’t require the axiom of choice, but without the axiom of choice we can’t then conclude that there is some injective function \(f: X \to W\), because we could use such an injective function to well-order \(X\). This then implies the well-ordering theorem, which implies the axiom of choice.

Proof: I’m only going to sketch this. The core idea is that you define \(W\) as the set of all well-orders of subsets of \(X\), up to order isomorphism. If there were an injective function from \(W\) to \(X\) then we could construct an element \(x \in W\) such that the initial segment \(\{y : y < x\}\) is isomorphic to \(W\). This gives us a function \(f : W \to W\) such that \(f(x) < x\). Then \(x > f(x) > f^2(x) > \ldots\) gives us an infinite decreasing sequence in a well ordered set, which is impossible.

If you didn’t understand that sketch, don’t worry about it. It won’t be on the test. Just trust the theorem.

Putting this all together

Let’s see how you use this to prove something that you’d normally prove with Zorn’s lemma: let’s prove that every vector space has a basis.

Let \(V\) be a vector space and let \(L\) be the set of all linearly independent subsets of \(V\).

Let \(W\) be some well ordered set with no injective functions into \(L\).

Define a function \(f: W \to \mathcal{P}(V)\) recursively as follows:

If \(A = \bigcup\limits_{y < x} f(y) \) spans the whole vector space, let \(f(x) = A\). Else, let \(f(x) = A \cup \{ v \} \) where \(v \in V\) is picked using the axiom of choice to be any element outside the span of \(A\).

The intuition here is that we keep adding in new elements which are linearly independent of the existing elements and we’re going to keep going until we can’t go any further, at which point we must span the whole space.

Claim 1: If \(y < x\) then \(f(y) \subseteq f(x)\).

Proof: This is by construction. We defined \(f(x)\) in terms of a union that contains \(f(y)\).

Claim 2: \(f(x)\) is always linearly independent.

Proof: Induction! Suppose it’s true for all \(y < x\), but that we can find \(v_1, \ldots, v_n \in f(x)\) linearly dependent. None of these can be the newly added element \(v\) (a dependence involving \(v\) would put \(v\) in the span of \(A\)), so there are \(y_1, \ldots, y_n < x\) with \(v_i \in f(y_i)\). But because there are only finitely many \(y_i\) we can find \(y = \max y_i\) such that \(v_1, \ldots, v_n \in f(y)\) (because of the previous claim). So \(f(y)\) is linearly dependent, contradicting the inductive assumption. QED

So actually \(f: W \to L\).

But this means that \(f\) cannot be injective (by choice of \(W\)). So at some point we must have \(y < x\) with \(f(x) = f(y)\). But this can only happen if at some point our choice of \(A\) spanned the whole space. Thus \(f(x)\) must have spanned the whole space, and thus is a linearly independent spanning set. i.e. a basis.
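In finite dimensions the whole transfinite apparatus collapses to a greedy algorithm: walk the vectors in some fixed order, adding each one that is outside the span so far. A sketch over GF(2), with vectors represented as bitmasks (everything here is invented for illustration):

```python
from functools import reduce
from operator import xor
from itertools import combinations

def greedy_basis(vectors, nbits=8):
    """Greedily select a basis over GF(2): keep each vector that is outside
    the span of those kept so far, using a table indexed by leading bit."""
    table = [0] * nbits          # table[i] holds a reduced vector with leading bit i
    kept = []
    for v in vectors:
        x = v
        for i in reversed(range(nbits)):
            if not (x >> i) & 1:
                continue
            if table[i] == 0:    # leading bit i is new: v is outside the span
                table[i] = x
                kept.append(v)
                break
            x ^= table[i]        # reduce and keep going; x hits 0 if dependent
    return kept

X = [3, 5, 6, 1, 4, 7, 2]        # vectors in GF(2)^3, in our chosen "well-order"
B = greedy_basis(X)
assert B == [3, 5, 1]            # the greedily chosen basis

# B spans: every vector of X is a XOR combination of elements of B.
span = {reduce(xor, c, 0) for r in range(len(B) + 1) for c in combinations(B, r)}
assert all(v in span for v in X)
```

The transfinite version is this same loop with the list replaced by a well-ordered index set long enough that the greed must eventually stall.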


And thus Zorn

It may not be obvious what about vector spaces in the above mattered and what was generalizable in the above, but after you’ve done a few of those it starts to become clearer. Rather than making you sit through that, let me highlight what I think were the salient details:

  1. We have some partially ordered set – in this case linearly independent subsets of a vector space, ordered by inclusion.
  2. We take our well ordered set and construct a function into that partially ordered set which is strictly increasing until it hits a maximal element.
  3. Because the function cannot be injective, it cannot be strictly increasing, so there must be a maximal element.

In the case of a vector space, a maximal linearly independent set must span the whole space (because otherwise you could add in another element), so that maximal element is what we were looking for.

So all we need to know now is when we can construct such a function. What was the property of linearly independent sets that let us do this?

The property was that given our choice of \(f(y)\) for \(y < x\) we were always able to choose \(f(x)\) to be greater than or equal to every \(f(y)\). i.e. we choose \(f(x)\) to be an upper bound.

If every subset of our target partially ordered set had an upper bound, then we could always construct this choice. This is however too strong a condition: e.g. it’s not the case that any two linearly independent subsets have a common upper bound. The sets \(\{v\}\) and \(\{-v\}\) do not for any non-zero \(v\).

However, all we really need is that the sort of sets we get during our transfinite induction have upper bounds.

And the important feature of these sets is that they come from an increasing sequence. In particular, any two elements of them are comparable. They form a chain.

This leads us to the notion required for Zorn’s lemma: We are interested in partially ordered sets such that every chain has an upper bound. This allows us to construct our function, and transfinite induction then gives us a maximal element.


Zorn’s lemma

Let \(T\) be a partially ordered set such that every chain has an upper bound. Then there is some maximal element \(t \in T\), i.e. there is no element \(s \in T\) such that \(t < s\).


Our proof will be very similar to our proof that every vector space has a basis. The only major difference is that we’ll have to work a little harder at the recursive definition of our increasing function: where for the basis we could construct the function and then show its output was always in \(L\), here we have to simultaneously construct the function and show that it’s well defined.

In order to do this we’ll introduce a special value \(\infty \not\in T\). If things go wrong and we fail to construct a maximal element we’ll return \(\infty\) instead. This simplifies things to showing that we never return \(\infty\).

First, let’s be more explicit about our use of the axiom of choice. Let \(c\) be some choice function. We’re going to define an ‘upper bound’ function \(u(A)\) as follows:

Let \(Q = \{x: y < x \forall y \in A\}\).

Let \(R = \{x: y \leq x \forall y \in A\}\).

If \(Q\) is non-empty, let \(u(A) = c(Q)\), i.e. any element which is strictly greater than all of the elements of \(A\).

Else if \(R\) is non-empty, let \(u(A) = c(R)\), i.e. any element which is \(\geq\) every element of \(A\).

Else, there are no upper bounds; let \(u(A) = \infty\).

Let \(W\) be some well ordered set. Define \(f : W \to T \cup \{\infty\}\) recursively as \(f(x) = \infty\) if \(f(y) = \infty\) for any \(y < x\), else \(f(x) = u(\{f(y): y < x\})\).

Claim: This function never returns \(\infty\).

Proof by induction:

Suppose this is true for all \(y < x\). Then the set \(\{f(y): y < x\}\) must form a chain: if \(y < z < x\) then we picked \(f(z) \geq f(y)\) by our definition of \(u\).

Thus, by our assumption on the partially ordered set \(T\), this set has an upper bound, so \(R\) in our definition of \(u\) is non-empty. This means that \(f(x)\) is chosen to be an element of \(R\) and thus is not \(\infty\).

So we’ve constructed an increasing \(f : W \to T\). By choosing \(W\) appropriately (using Hartogs’ lemma), this cannot be an injection. So find \(y < x\) with \(f(x) = f(y)\). This can only happen if there is no element \(s\) with \(s > f(y)\), i.e. \(f(y)\) is a maximal element.


Although the details differed, hopefully this should look structurally pretty similar to the more concrete form with the vector space basis.

Why is this useful?

The main reason it’s useful is that the property of chains having upper bounds comes up a lot. In particular it comes up with things that are “essentially finite” in nature. Most applications of Zorn seem to boil down to the following more specific lemma:

Teichmüller–Tukey lemma

Let \(X\) be some set and let \(L \subset \mathcal{P}(X)\) be some family of sets with the property that \(A \in L\) if and only if \(B \in L\) for every finite \(B \subseteq A\).

Then there is some \(A \in L\) such that for all \(x \in X \setminus A\), \(A \cup \{x\} \not\in L\).

Note: In particular, as we saw in our original proof, the linearly independent subsets of a vector space satisfy these conditions. Most essentially “algebraic” conditions tend to satisfy it.


We consider \(L\) partially ordered by subset inclusion. We’ll show that the union of any chain of sets in \(L\) is in \(L\), so that every chain has an upper bound and Zorn’s lemma gives us a maximal element.

Suppose \(C\) were some chain of sets in \(L\) such that \(\bigcup C \not \in L\). Then we can find a finite set \(\{x_1, \ldots, x_n\} \subseteq \bigcup C\) not in \(L\).

But then we can find \(U_i\) such that \(x_i \in U_i \in C\). Because \(C\) is a chain we can thus find a maximum \(U\). But then \(\{x_1, \ldots, x_n\}\) is a finite subset of \(U\) which is not in \(L\), contradicting the assumption that \(U \in L\).


In parting

Hopefully even if you didn’t follow all the details this demystified Zorn’s lemma a bit.

In general, there is a rich theory of well ordered sets and it seems to often be skipped or deferred until after Zorn’s lemma has already been taught. I can understand why – if you fill in all the details it feels like a pretty complex piece of machinery to be introducing when all you’re going to do is use it to prove Zorn’s lemma and then forget about it.

There’s a lot more to the theory than that though. Some of it is pretty interesting, and some of it I just think is useful in demystifying where a lot of this sort of thing comes from.
