An introduction to infinity

I was talking about infinity with some friends recently. I did a maths degree during which I was quite keen on set theory, so there’s a lot of stuff I take for granted as making sense that is actually obscure and hard to understand if you’ve not got that background. I thought I’d do my best to make some of the more important basic results clear. Basically the goal is to get you from a fairly elementary understanding of sets and functions to the point where you understand what \(\aleph_1 \leq 2^\aleph_0\) means.

I’m not going to assume a lot of background, but if you’ve not done at least some maths or computer science this is going to feel an awful lot like being dropped in the deep end. Although I’m going to sketch out the meaning of some of the basic terms, a basic familiarity and sets and functions is going to be pretty helpful here.

Sets and functions

This section is going to be extremely terse. I fully intend to explain the later concepts more thoroughly, don’t worry. This should be half reminder and half getting on board with the same notation.

A set is a collection of objects. These objects may themselves be other sets. We use curly brackets to denote a set containing a specific list of elements. e.g. write \({1, 2, 3}\) for the set containing the numbers 1, 2, 3. A set ignores duplicates. \({1, 1}\) is the same as \({1}\).

Two sets are regarded as equal if they have precisely the same elements. We write \(A \subseteq B\) if every element of \(A\) is an element of \(B\). Thus if \(A \subseteq B\) and \(B \subseteq A\) then \(A = B\).

We write \(a \in A\) to mean that \(a\) is a member of \(A\). We write \(b \not\in A\) to indicate that it is not. i.e. \(1 \in \{1, 2\}\) but \(3 \not\in \{1, 2\}\).

Given a set \(A\) we can define a restriction of it as \(\{ x \in A : p(x) \}\) as the set of all elements of \(A\) satisfying some property \(p\). e.g. we could define \(\{x \in \{1, 2, 3, 4, 5\} : x < 3\}\) and we would get \(\{1, 2\}\).

The set \(\emptyset\) is the set with no elements. For every set \(A\), \(\emptyset \subseteq A\), because the requirement that every element of \(\emptyset\) is in \(A\) is satisfied vacuously (i.e. there are no elements to falsify it).

We write \(\mathbb{N}\) for the set of all natural numbers (that is \(\{0, 1, 2, \ldots\}\)).

We write \(\mathcal{P}(A)\) to mean the set of all subsets of \(A\). e.g. \(\mathcal{P}\{1, 2\} = \{\emptyset, \{1\}, \{2\}, \{1, 2\}\}\). Note that \(\mathcal{P}(\emptyset) = \{\emptyset\}\).

Given two sets \(A \cup B\) is the set which contains precisely the elements that are in at least one of \(A\) or \(B\), \(A \cap B\) is the set which contains elements which are in both and \(A \setminus B\) is the set which contains all the elements of \(A\) that are not in \(B\).

Given a set containing only other sets, \(\bigcup A\) is the set of elements contained in at least one element of \(A\) and \(\bigcap A\) that containing only elements which are in all of them. Note that \(\bigcup \{A, B\} = A \cup B\) and \(\bigcap \{A, B\} = A \cap B\).

A rule \(f\) is said to be a function \(A \to B\) (written sometimes \(f : A \to B\)) if for every element \(a \in A\) it assigns a single element \(b \in B\). We write this as \(f(a) = b\). We call \(A\) the domain of \(f\) and \(B\) the range of \(f\).

If \(f : A \to B\) and \(U \subseteq A\) we can define the restriction of \(f\) to \(U\) as doing exactly the same thing as \(f\) but we consider its domain as \(U\). We write this \(f_U\). So \(f_U(a) = f(a)\), but \(f_U\) is only defined for elements of \(U\).

If we have \(f : A \to B\) and \(g : B \to C\) then we can define \(g \cdot f : A \to C\) as the rule which first applies \(f\) and then \(g\). i.e. \(g \cdot f)(x) = g(f(x))\).

\(f\) is said to be injective/an injection (sometimes “one-to-one”) if \(a \neq b\) implies that \(f(a) \neq f(b)\).

It is said to be surjective/a surjection (sometimes “onto”) if for every \(b\) there is some \(a\) with \(f(a) = b\).

Note that if \(f\) and \(g\) are injective then so is \(g \cdot f\). Similarly surjective.

\(f\) is said to be bijective/a bijection if it is both injective and surjective. i.e. it maps every element of \(A\) to a distinct element of \(B\) and every element of \(B\) is mapped to.

If \(f\) is a bijection then we write \(f^{-1} : B \to A\) to be the function such that \(f^{-1}(b)\) is the unique \(a\) such that \(f(a) = b\). This is a function because surjectivity means there is always at least one such \(a\) and injectivity means there can only be one such. Note that \(f^{-1}\) is also a bijection. Also note that \((g \cdot f)^{-1} = f^{-1} \cdot g^{-1}\).

If any of this was very surprising or unfamiliar to you you are welcome to continue reading but it may be a bit of a struggle. I’ll try and find a decent set of slightly less terse notes to link to.


We say two sets \(A\) and \(B\) are the same size if there is a bijective function \(A \to B\). i.e. we can define a rule that maps every element of one to a unique element of the other and vice versa. We write this as \(|A| = |B|\). For now just treat the bars as notation meaning “the size of”. We’ll later define what \(|A|\) actually means, but don’t worry about it for now.

Lets see that this make sense. Does this behave like we’d expect it to?

The identity function \(\mathrm{id}(a) = a\) is a bijection \(\mathrm{id}  : A \to A\). So \(|A| = |A|\). That’s good. We’d not get very far if that was not the case. This property is called reflexivity.

If \(|A| = |B|\) then there is a bijection \(f : A \to B\). Then \(f^{-1} : B \to A\) is a bijection, so \(|B| = |A|\). This property is called symmetry.

If \(|A| = |B|\) and \(|B| = |C|\) then we can find bijections \(f : A \to B\) and \(g : B \to C\), so \(g \cdot f : A \to C\) is a bijection. Thus \(|A| = |C|\). This property is called transitivity.

So this relationship behaves more or less like we would expect “the same size” to.

As part of both making sense of this and a teaser of what \(|A|\) will later come to mean, lets see how this works for finite sets.

Say that \(|A| = n\) for some natural number \(n\) if \(|A| = |\{x \in \mathbb{N} : x < n\}|\). so e.g. \(|\emptyset| = 0\), \(|\{1, 4, 7\}| = 3\) (because you can use the function \(f(0) = 1, f(1) = 4, f(2) = 7\). These sets basically give us “canonical” sets of a specific size. If a set \(A\) has \(|A| = n\) for some \(n \in \mathbb{N}\) then we call it finite, otherwise it’s infinite.

Of course, we first have to prove that they’re not the same size. Well, we don’t have to, but the notation would pretty weird if you could have e.g. \(|A| = 1\) and \(|A| = 3\).

Because we have transitivity, all we need to do is show that none of these sets are the same size – then if any other set were the same size as two of them we’d have a contradiction.

For convenience, lets label \(S_n = \{x \in \mathbb{N} : x < n\}\). We’ll first need a lemma:

Lemma: If \(|A| = n\) and \(b \not\in A\) then \(|A \cup \{b\}| = n + 1\)

Proof: Let \(f : S_n \to A\) be a bijection. Define \(g : S_{n+1} \to A \cup \{b\}\) by \(g(m) = f(m)\) if \(m < n\) and \(g(n) = b\).

This is an injection because \(f\) is and because \(b \not\in A\) (consider \(x, y\) with \(x \neq y\). At most one of \(x\) and \(y\) can be \(b\). If neither is, then \(g(x) = f(x) \neq f(y) = g(y)\). If one is then say it’s \(y\). Then \(g(x) \in A\) but \(g(y) \not\in A\) so \(g(x) \neq g(y)\)).

This is a surjection because \(f\) hits every element of \(A\) and we’ve additionally defined \(g\) to hit \(b\). QED

A corollary of this is that if \(b \in B\) and \(|B| = n\) then \(|B \setminus \{b\}| = n – 1\). You can prove this by setting \(A = B \setminus \{b\}\) in the lemma.

Theorem: If \(|S_m| = |S_n|\) then \(m = n\).


We can use reflexivity to swap the order if necessary to assume that \(m \leq n\). We’re going to prove by induction on \(n\) that if \(m \leq n\)  and \(|S_m| = |S_n|\) then \(m = n\).

It’s vacuously true for \(n = 0\) simply because there are no \(m < n\).

It’s true for \(n = 1\) because the only possible value of \(m\) other than \(1\) is \(0\). You cannot define a bijection from a non-empty set to the empty one because there are no elements to map to.

Now suppose we’ve proven it for \(n\) and want to prove it for \(n+1\).

Suppose \(|S_{n+1}| = m\) . Then \(|S_n| = |S_{n+1} \setminus \{n\}| = m – 1\). But by our inductive hypothesis this implies that \(m – 1 = n\) and thus \(m = n + 1\). QED

So this confirms that for the finite case at least, size really does behave like we want it to – you can count sets with natural and two sets are the same size precisely when counting them gives the same answer.

But we don’t want to just talk about whether two sets are the same size as each other. We want to talk about whether one is smaller or larger.

We do this by building on the definition of the same size as. We say \(A\) is no larger than \(B\) if \(|A| = |C|\) for some \(C \subseteq B\) (equivalently, if there is an injective function from \(A\) to \(B\)). We write this \(|A| \leq |B|\). We say \(A\) is smaller than \(B\) , or \(|A| < |B|\), if it’s no larger than \(B\) but not the same size as it.

We’re going to want to show that this behaves sensibly. First note that you can show that if \(|A| = |B|\), \(|B| \leq |C|\) and \(|C| = |D|\) then \(|A| \leq |D|\) just by composing functions (do check the details, but I’m going to start eliding them now for things that are similar to things I’ve already proven). Other similar relationships apply.

What we’re going to have to work quite a bit harder to prove is something that may seem like it should be obvious: If \(|A| \leq |B|\) and \(|B| \leq |A|\) then \(|A| = |B|\).

This is easy for finite sets: If \(|A| = m\) and \(|B| = n\) then \(|A| \leq |B|\) if and only if \(m \leq n\), because \(|S_m| \leq |S_n|\) if and only if \(m \leq n\). So if \(|A| \leq |B|\) and \(|B| \leq |A|\) then \(m \leq n\) and \(n \leq m\), so \(m = n\) and thus \(|A| = |B|\).

For infinite sets, it’s rather more complicated. Why?

Well the first hint that it’s complicated is that you can remove elements from an infinite set without changing its size. In our proof that sizes of finite sets corresponded to natural numbers we relied on the fact that if you removed an element from a set of size \(n\) you got a set of size \(n – 1\). This no longer works.

As an example, consider the set \(\mathbb{N}^+ = \mathbb{N} \setminus \{0\}\). The function \(f(n) = n + 1\) is a bijection \(f : \mathbb{N} \to \mathbb{N}^+\). In fact, any infinite subset of the natural numbers has the same size as the natural numbers (sketch proof: Let \(f(n)\) be the nth smallest element of the set).

So things are not quite so well behaved. But they are well enough behaved:

Theorem: If \(|A| \leq |B|\) and \(|B| \leq |A|\) then \(|A| = |B|\).

Warning: This proof is quite long and involved compared to what has come so far. I’ve done my best to break it down and make it as comprehensible as possible, but you might want to take a break and go get a cup of tea or something if your brain is feeling full.

Proof: Rephrasing the statement, we have injective functions \(f : A \to B\) and \(g : B \to A\) and we want to construct a bijection \(h : A \to B\)

The idea behind our construction is “simple”: We will piece together a mix of \(f\) and \(g^{-1}\) (the latter being only defined on the subset of \(A\) which \(g\) hits).

Pick \(H \subseteq g(B)\) (the notation \(g(B)\) means the set of points in \(a\) which are \(g(b)\) for some \(b \in B\)).

Define \(h(a) = g^{-1}(a)\) if \(a \in H\) else \(h(a) = f(a)\).

We want to find a condition on \(H\) such that \(h\) is a bijection.

First, note that if \(h\) is injective then \(h(U \setminus V) = h(U) \setminus h(V)\) (it’s always going to be a superset of \(h(U) \setminus h(V)\) and if there were some element of \(h(V)\) in it then this would contradict injectivity because we’d be able to find \(x \in V\) and \(x \not\in V\) with \(h(x) = h(v)\).

So if \(h\) is a bijection then we have \(h(A \setminus H) = h(A) \setminus h(H) = B \setminus h(H)\). Rearranging, \(H = h^{-1}(B \setminus h(A \setminus H)\). But \(h(A \setminus H) = f(A \setminus H\) by the definition of \(h\). And \(h^{-1}(b) = g(b)\) for \(x \not \in f(A \setminus H)\), so we have \(H = g(B \setminus f(A \setminus H))\).

This turns out to be a sufficient condition to make \(h\) a bijection. Why?

What we’re essentially saying here is that we’re carefully splitting \(A\) and \(B\) into two parts in such a way that we can use \(f\) and \(g\) to form a bijection between one of each of these parts on either side. We then combine these together to get a bijection between the whole thing.

Lets see that this works. Suppose \(H\) satisfies this condition:

\(h\) is injective: If \(x, y \in A\). If \(x, y \in H\) or \(x, y \in A \setminus H\) then we have \(h(x) \neq h(y)\) because each of \(f\) and \(g^{-1}\) are injective. So assume \(x \in H\) and \(y \in A \setminus H\) and \(h(x) = h(y)\). Then \(h(x) = f(x)\) and \(h(y) = g^{-1}(y)\). So \(f(x) = g^{-1}(y)\) and thus \(y = g(f(x)\). But \(g\) by the assumption there is some \(b \not\in f(A \setminus H)\)  (and in particular \(b \neq f(x)\) with \(g(b) = y\). This would violate the injectivity of \(b\).

\(h\) is surjective: If \(b \not\in f(A \setminus H )\) then \(g(b) \in H\) by assumption. Therefore \(h(g(b)) = b\). Hence surjectivity.

So we’ve found a condition that if we manage to find exactly the right \(H\) will give us a bijection between \(A\) and \(B\). How are we going to find such an \(H\)?

We’ll do this by considering the function \(S : \mathcal{P}(A) \to \mathcal{P}(A)\) defined by \(S(T) = g(B \setminus f(A \setminus T))\). We then want to find a fixed point of \(S\), which will give us the desired \(H\).

First: Note that if \(R \subseteq T\) then \(S(R) \subseteq S(T)\). This is because \(A \setminus R \supseteq A \setminus T\), so \(f(A \setminus R) \supseteq f(A \setminus T)\) , so \(B \setminus f(A \setminus R) \subseteq B \setminus f(A \setminus T)\), so \(g(B \setminus f(A \setminus R)) \subseteq g(B \setminus f(S \setminus T))\). We call this being order-preserving.

Our goal will basically be to do this by finding the “largest possible” fixed point.

So first we want to consider a set of candidates for fixed points which we know to be non-empty: Let \(\mathcal{F} = \{ T \subseteq A : T \subseteq S(T) \}\). This set is non-empty because \(\emptyset \in \mathcal{F}\). Further it must contain every fixed point (if there are any) because if \(T = S(T)\) then certainly \(T \subseteq S(T)\). We’re going to show that there’s a largest element of this set and that it must be a fixed point.

Note that if \(T \in \mathcal{F}\) then \(S(T) \in \mathcal{F}\). This is because of the above property: If \(T \subseteq S(T)\) then \(S(T) \subseteq S(S(T))\) by the order preserving property. Therefore \(S(T) \in \mathcal{F}\).

Now let \(H = \bigcup \mathcal{F}\).

Claim: this \(H\) is a fixed point of \(S\).

Proof of claim: We first must show that \(H \in \mathcal{F}\).

Suppose \(T \in \mathcal{F}\). Then \(T \subseteq H\). Therefore \(S(T) \subseteq S(H)\). But \(T \subseteq S(T)\), so \(T \subseteq S(H)\). So \(H\) is a union of sets each of which is a subset of \(S(H)\), so we must have \(H \subseteq S(H)\). So \(H \in \mathcal{F}\).

But we have shown that if \(H \in \mathcal{F}\) then \(S(H) \in \mathcal{F}\). Therefore \(S(H) \subseteq H\) and thus \(H = S(H)\).

One way to think of this proof is that we’re starting with the empty-set and repeatedly iterating \(S\) on it – because at every point our current set has the property that \(T \subseteq S(T)\) this will always result in growing the set. Eventually this set must stop growing and that gives us a fixed point.

You actually can prove it that way, but it turns out you need a lot of machinery to make sense of “eventually”, so this proof with unions is nicer.


This is called the Cantor–Bernstein–Schroeder theorem. When I did my first course in set theory the proof came with a big disclaimer saying “Don’t worry, you’re not expected to understand this”. In fairness, the provided proof was I think much more complicated than this one (this one steals a trick from a more advanced version of the proof and basically inlines the theorem it uses into the proof. I’ve previously written about the real version of this). Still, I’d recommend that if you didn’t understand that you might want to reread it a few times to see if you can. It’s worth doing.

Or, you know, not. Your call really.

So we’ve now shown that “is at least as large as” behaves sensibly in the sense that two sets can’t be at least as large as each other but not be the same size.

There are two important questions we haven’t answered:

  1. Given any two sets, is one at least as large as the other? i.e. can we have sets which we just can’t compare the sizes of?
  2. Are all infinite sets the same size?

The answer to the former I will defer to a later post (that given my track record of saying that may or may not get written).

The latter though, we can answer right now!

Theorem: \(|A| < |\mathcal{P}(A)|\). i.e. the power set of \(A\) is strictly bigger than \(A\).


First lets get out of the way that \(|A| \leq \mathcal{P}(A)\). The function \(f(a) = \{a\}\) is an injection into the power set.

So now we just need to show that they’re not equal in size. In order to do that, we’ll employ something that looks a lot like one of the classic paradoxes of set theory.

Suppose \(f : A \to \mathcal{P}(A)\) is injective. We’ll try to construct a point in \(\mathcal{P}(A)\) that it can’t hit.

You know Russel’s paradox? The set of all sets that don’t contain themselves? Lets build one of those.

Let \(R = \{x \in A : x \not\in f(x) \}\)

OK, it’s close anyway.

Now suppose \(f(y) = R\).

Is \(y \in R\)? If yes, it contradicts the definition of \(R\) because \(y \in f(y)\). If no, it should be in \(R\) because \(R\) is the set of elements not contained in their image under \(f\).

Therefore there cannot exist such a \(y\) and \(f\) is not surjective. Hence there is no bijection \(f : A \to \mathcal{P}(A)\), and thus \(|A| < |\mathcal{P}(A)|\). QED

Note that we can keep iterating this: It’s not just the case that there are different sizes of infinite set. It’s the case that there are arbitrarily large sizes of infinite set.

I promised that at the end of this you’d understand what it means that \(\aleph_1 \leq 2^{\aleph_0}\). This was perhaps ambitious of me, but I can at least provide you with a sketch of what it means:

First, note that if \(|A| = |B|\) then \(|\mathcal{P}(A)| = |\mathcal{P}(B)|\) because we can turn a bijection between \(A\) and \(B\) into a bijection between \(\mathcal{P}(A)\) and \(\mathcal{P}(B)\) by applying the bijection to the individual elements of the set.

This sorta justifies the notation \(|\mathcal{P}(A)| = 2^{|A|}\). The exponentiation probably doesn’t make any sense right now. There is a general notion of exponentiation of cardinals, but it’s not very important right now so lets not worry about it and just say that it’s a convenient notation because it holds true for finite sets (there are \(2^n\) elements in the power set of a set of \(n\) elements).

The last piece you then need to make sense of the statement is that when a set is the same size as \(\mathbb{N}\) we write \(|A| = \aleph_0\).

This means that as a special case of our last theorem we can write \(\aleph_0 < 2^{\aleph_0}\).

But what’s \(\aleph_1\)?

Well. The idea is that \(\aleph_0\) is the smallest size of infinity and that \(\aleph_1\) is the second smallest size of infinity. Note that we don’t currently know that this statement makes sense (it turns out it does, but I haven’t proven nearly enough in this post to show that).

The continuum hypothesis is that there is no set \(A\) with \(\aleph_0 < |A| < 2^{\aleph_0}\).

Rephrasing this, that means that the continuum hypothesis is \(\aleph_1 = 2^{\aleph_0}\). i.e. that \(2^{\aleph_0}\) is the smallest size of infinity larger than \(\aleph_0\).

The answer to the continuum hypothesis turns out to be “It’s complicated”. I confess I never got to the point where I really understood the proofs that it was complicated – I can sketch them out in layman speak, but not enough to really recreate the details.

But hopefully I can get you to the point where all the details I’ve elided here at least make sense.

In a later post that is. In the meantime, feedback as to whether this one made any sense to anyone who didn’t already understand the material would be useful to calibrate what that later post contains.

This entry was posted in Numbers are hard on by .