Category Archives: Numbers are hard

Some things that are the same size as each other

In my last post I introduced the notion of the size of an infinite set and proved two of the fundamental theorems about such sizes.

In particular I pointed out that infinite sets could be the same size as proper subsets of themselves, and that they were strictly smaller than power sets of themselves.

One of the reasons why the fact that there are different sizes of infinity is a little surprising is that it turns out that a lot of things you’d think of as being different sizes are actually the same size.

This post is mostly worked examples of that. Along the way I’ll also use these examples to introduce some of the basic ideas of cardinal arithmetic, but this is mostly about showing how some of this works in practice.

A piece of notation: The set of real numbers is denoted \(\mathbb{R}\), the set of rational numbers \(\mathbb{Q}\) (a rational number is anything that can be represented as \(\frac{m}{n}\) for integers \(m\) and \(n\) with \(n \neq 0\)) and the set of integers \(\mathbb{Z}\). I’m not going to attempt to give a formal definition of the real numbers here, but I also won’t rely on anything too surprising about them.

How large are these sets? They’re the same size as the two main sets we’ve already studied.

  • \(|\mathbb{Z}| = \aleph_0\)
  • \(|\mathbb{Q}| = \aleph_0\)
  • \(|\mathbb{R}| = 2^{\aleph_0}\)

In order to prove the first two, we’re first going to look at a slightly different set. Consider \(\mathbb{N}^2\), the set of ordered pairs of natural numbers (e.g. \((1, 2)\), \((7, 3)\), etc). How large is this set?

In fact, \(|\mathbb{N}^2| = \aleph_0\).

Why? Well, certainly \(|\mathbb{N}^2| \geq \aleph_0\), because the function \(f(n) = (n, 0)\) is an injection \(f : \mathbb{N} \to \mathbb{N}^2\). But we can also construct an injection \(g : \mathbb{N}^2 \to \mathbb{N}\) as \(g((i, j)) = 2^i 3^j\). Uniqueness of prime decomposition then gives us that this is an injection, and thus \(|\mathbb{N}^2| = \aleph_0\).

You can also construct a bijection explicitly. It’s kinda informative to do so, but the details are fiddly enough that I confess I can’t really be bothered with them.
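(For the curious, here’s a minimal sketch in Python of the usual construction, which walks along successive diagonals \(i + j = 0, 1, 2, \ldots\). The helper names are mine, and the assert just spot-checks bijectivity on an initial segment.)

```python
def cantor_pair(i, j):
    """The Cantor pairing function: a bijection N^2 -> N that walks
    along the diagonals i + j = 0, 1, 2, ..."""
    d = i + j
    return d * (d + 1) // 2 + j

def cantor_unpair(n):
    """The inverse: recover the unique (i, j) with cantor_pair(i, j) == n."""
    d = 0
    while (d + 1) * (d + 2) // 2 <= n:
        d += 1
    j = n - d * (d + 1) // 2
    return (d - j, j)

assert all(cantor_pair(*cantor_unpair(n)) == n for n in range(1000))
```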

But this fact now gives us the cardinality of \(\mathbb{Z}\) nearly for free. Let \(f : \mathbb{Z} \to \mathbb{N}^2\) be such that \(f(n) = (s, \mathrm{abs}(n))\), where \(s = 0\) if \(n < 0\) and \(s = 1\) otherwise. Then this is an injection \(f : \mathbb{Z} \to \mathbb{N}^2\). We already know \(|\mathbb{Z}| \geq \aleph_0\) (because it contains \(\mathbb{N}\) as a subset), so \(|\mathbb{Z}| = \aleph_0\).

We can then do the same with \(\mathbb{Q}\). Express a rational \(x = \frac{m}{n}\) as a reduced fraction (i.e. \(n > 0\) and \(m\) and \(n\) have no common divisor). Let \(j : \mathbb{Z} \to \mathbb{N}\) be an injection (one exists because \(|\mathbb{Z}| = \aleph_0\)) and define \(g(x) = (j(m), n)\). Reduced fractions are unique, so this is an injection into \(\mathbb{N}^2\) and thus \(|\mathbb{Q}| = \aleph_0\).
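Made concrete, the chain of injections looks something like this (a sketch with helper names of my own; `Fraction` conveniently produces the reduced form):

```python
from fractions import Fraction

def z_to_n(m):
    """An injection Z -> N: encode the sign as in the post, then use
    the prime-power trick g((s, a)) = 2^s * 3^a to land in N."""
    s = 1 if m >= 0 else 0
    return 2 ** s * 3 ** abs(m)

def q_to_n2(x):
    """An injection Q -> N^2: the reduced fraction m/n maps to (z_to_n(m), n)."""
    x = Fraction(x)  # normalises to lowest terms with a positive denominator
    return (z_to_n(x.numerator), x.denominator)

assert q_to_n2(Fraction(-3, 6)) == q_to_n2(Fraction(-1, 2))  # same rational
assert q_to_n2(Fraction(1, 2)) != q_to_n2(Fraction(2, 1))    # distinct rationals
```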

The use of pairs for studying cardinality turns out to be quite useful. In fact it’s the case that for any infinite set \(A\), \(|A^2| = |A|\). It will be some time before we have the tools to prove this and I don’t know if I’m going to, but it’s worth knowing. We will prove it later for another concrete example though.

In general, pairs like this give us a notion of multiplication for cardinal numbers. We can construct the pair set \(A \times B\) as the set of ordered pairs where the first is an element of \(A\) and the second is an element of \(B\). Then we write \(|A \times B| = |A| |B|\).

If \(|A| \leq |C|\) and \(|B| \leq |D|\) then \(|A| |B| \leq |C| |D|\). We can see this by just pairing together the two injections we have from this inequality: if \(f : A \to C\), \(g : B \to D\) are injections then the function \(h((a, b)) = (f(a), g(b))\) is an injection \(A \times B \to C \times D\). This also shows that our notion of multiplication is well defined in the sense that if different sets have the same cardinality then their pair sets also have the same cardinality.

In particular this means that e.g. \(|\mathbb{Q}^2| = |\mathbb{Z} \times \mathbb{N}| = \aleph_0\). Any pairing of sets of size \(\aleph_0\) will give a set of size \(\aleph_0\).

Now let’s look at the real numbers. Why is \(|\mathbb{R}| = 2^{\aleph_0}\)?

Well. First consider the set of all functions \(f : \mathbb{N} \to \{0,1\}\). We’ll call this set \(2^\mathbb{N}\). This has a natural bijection with \(\mathcal{P}(\mathbb{N})\) by identifying a set with its indicator function. i.e. \(f(A)(n) = 1\) if \(n \in A\) else \(f(A)(n) = 0\). We’ll show this set is in bijection with the real numbers.

First, here’s an injection \(f : 2^\mathbb{N} \to \mathbb{R}\): let \(f(t)\) be the number whose \(n\)th decimal digit is \(1\) if \(t(n) = 1\) and \(0\) otherwise, i.e. \(f(t) = \sum_n t(n) 10^{-(n+1)}\). This is an injection because decimal representations are unique if they contain no infinite sequence of trailing 9s.

In the other direction, choose a binary representation for \(x\) (it doesn’t matter which, but let’s arbitrarily pick the version without trailing 1s where there’s ambiguity). Then the function \(g(x)(n) = \) the \(n\)th binary digit of \(x\) is an injection \(g : \mathbb{R} \to 2^\mathbb{N}\) (to handle the digits before the binary point and the sign, you can e.g. interleave them with the fractional digits, which doesn’t affect injectivity). Hence \(|\mathbb{R}| = |2^\mathbb{N}| = 2^{\aleph_0}\).

Note the bit at the end there where we wrote \(|2^\mathbb{N}| = 2^{\aleph_0}\)? This is of course where the notation \(2^{\aleph_0}\) actually comes from.
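As an aside, the digit-extraction direction is easy to play with, at least for exact rationals in \([0, 1)\) (a sketch only; conveniently, it picks the expansion without trailing 1s for dyadic rationals):

```python
from fractions import Fraction

def binary_digit(x, n):
    """The nth binary digit after the point of x in [0, 1)."""
    return int(x * 2 ** (n + 1)) % 2

x = Fraction(5, 8)  # 0.101000... in binary
assert [binary_digit(x, n) for n in range(5)] == [1, 0, 1, 0, 0]
```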

In general we can define a notion of cardinal exponentiation: \(|A|^{|B|} = |A^B|\), where \(A^B\) is the set of functions \(f : B \to A\). This is consistent with the above notation because \(|\{0,1\}| = 2\).

Note that if \(|A| \leq |C|\) and \(|B| \leq |D|\) then \(|A|^{|B|} \leq |C|^{|D|}\). Let \(g : A \to C\) and \(f : B \to D\) be our injections, and pick some point \(c \in C\) (assuming \(C\) is non-empty; the empty cases are easy to check directly). Define \(h : A^B \to C^D\) as \(h(t)(d) = g(t(f^{-1}(d)))\) where \(f^{-1}\) is defined, else \(h(t)(d) = c\). This is an injection because if \(t \neq t'\) then we can find \(b\) with \(t(b) \neq t'(b)\). Therefore \(h(t)(f(b)) \neq h(t')(f(b))\) (because \(g\) is injective) and so \(h(t) \neq h(t')\).

While we’re here, we should also define cardinal addition. The idea of cardinal addition is that we want a disjoint union operation: \(|A| + |B|\) is \(|A \cup B|\), ignoring the fact that \(A\) and \(B\) might overlap. We can do this by defining \(|A| + |B| = |\{0\} \times A \cup \{1\} \times B|\), with the pairing just there to force that disjointness. We’ll write that disjoint union operation \(A \sqcup B\).

Turns out a lot of the nice properties you’d hope held do hold. In particular:

  • \(|A| |B| = |B| |A|\)
  • \(|A| + |B| = |B| + |A|\)
  • \(|A| (|B| + |C|) = |A| |B| + |A| |C|\)

I’m going to label these “exercise for the reader”. There are fairly natural bijections in all cases.

One thing which I will do because it’s important for the next bit is show that the following exponentiation law holds: \(|A|^{|B|} \times |A|^{|C|} = |A|^{|B| + |C|}\).

Proof: This is because we can define a function on \(B \sqcup C\) by defining it piecewise on \(B\) and \(C\). So given a pair \(t = (f, g) \in A^B \times A^C\) we can define \(h(t)((0, b)) = f(b)\), \(h(t)((1, c)) = g(c)\). \(h\) is a bijection. QED

This will let us prove that \(|\mathbb{R}^2| = |\mathbb{R}|\).

First note that \(\aleph_0 + \aleph_0 = \aleph_0\): the definition of \(\mathbb{N} \sqcup \mathbb{N}\) makes it a subset of \(\mathbb{N}^2\), so \(|\mathbb{N} \sqcup \mathbb{N}| \leq \aleph_0\), and it contains a copy of \(\mathbb{N}\), so \(|\mathbb{N} \sqcup \mathbb{N}| \geq \aleph_0\).

This then gives us that \(|\mathbb{R}^2| = |\mathbb{R}|^2 = (2^{\aleph_0})^2 = 2^{\aleph_0 + \aleph_0} = 2^{\aleph_0} = |\mathbb{R}|\).

So there are as many pairs of real numbers as real numbers.

It turns out to be more extreme than that. \(|\mathbb{R}^\mathbb{N}| = |\mathbb{R}|\). i.e. there are as many infinite sequences of real numbers as there are real numbers.

Why?

Well, first: we have the exponentiation rule that \((|A|^{|B|})^{|C|} = |A|^{|B| |C|}\). This is because given a function \(f : B \times C \to A\) you can map it to the function \(f'(c)(b) = f((b, c))\), and this mapping is a bijection \(A^{B \times C} \to (A^B)^C\).

This means that \(|\mathbb{R}^\mathbb{N}| = (2^{\aleph_0})^{\aleph_0} = 2^{\aleph_0 \aleph_0} = 2^{\aleph_0} = |\mathbb{R}|\).
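Incidentally, the bijection behind that exponentiation rule is exactly what functional programmers call currying. A tiny sketch (for computable functions only, of course; the cardinality statement is about arbitrary ones):

```python
def curry(f):
    """Map f : B x C -> A to f' with f'(c)(b) = f((b, c))."""
    return lambda c: lambda b: f((b, c))

def uncurry(g):
    """The inverse mapping, witnessing that the two function spaces biject."""
    return lambda bc: g(bc[1])(bc[0])

power = lambda bc: bc[0] ** bc[1]
assert uncurry(curry(power))((2, 10)) == power((2, 10)) == 1024
```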

One thing that is strictly larger than \(\mathbb{R}\) is of course \(\mathbb{R}^\mathbb{R}\). This is because \(|\mathbb{R}^\mathbb{R}| \geq |2^\mathbb{R}| > |\mathbb{R}|\). So there are strictly more functions from \(\mathbb{R} \to \mathbb{R}\) than there are real numbers.

This isn’t true for continuous functions though. Let \(\mathcal{C}\) be the set of continuous functions \(\mathbb{R} \to \mathbb{R}\). Then \(|\mathcal{C}| = |\mathbb{R}|\).

Why? Well, because a continuous function is uniquely defined by its values on the rationals (every real number has rationals arbitrarily close to it), the function \(h : \mathcal{C} \to \mathbb{R}^\mathbb{Q}\) defined by \(h(f) = f|_\mathbb{Q}\) is an injection, so \(|\mathcal{C}| \leq |\mathbb{R}^\mathbb{Q}| = |\mathbb{R}^\mathbb{N}| = |\mathbb{R}|\). And \(|\mathcal{C}| \geq |\mathbb{R}|\) because the constant functions give an injection \(\mathbb{R} \to \mathcal{C}\).

So, there you have it. Some examples of infinity, some things that are the same size as each other, and some that aren’t. Hopefully this makes some of the ideas slightly more familiar.


An introduction to infinity

I was talking about infinity with some friends recently. I did a maths degree during which I was quite keen on set theory, so there’s a lot of stuff I take for granted as making sense that is actually obscure and hard to understand if you’ve not got that background. I thought I’d do my best to make some of the more important basic results clear. Basically the goal is to get you from a fairly elementary understanding of sets and functions to the point where you understand what \(\aleph_1 \leq 2^{\aleph_0}\) means.

I’m not going to assume a lot of background, but if you’ve not done at least some maths or computer science this is going to feel an awful lot like being dropped in the deep end. Although I’m going to sketch out the meaning of some of the basic terms, a basic familiarity with sets and functions is going to be pretty helpful here.

Sets and functions

This section is going to be extremely terse. I fully intend to explain the later concepts more thoroughly, don’t worry. This should be half reminder and half getting on board with the same notation.

A set is a collection of objects. These objects may themselves be other sets. We use curly brackets to denote a set containing a specific list of elements, e.g. we write \(\{1, 2, 3\}\) for the set containing the numbers 1, 2, 3. A set ignores duplicates: \(\{1, 1\}\) is the same as \(\{1\}\).

Two sets are regarded as equal if they have precisely the same elements. We write \(A \subseteq B\) if every element of \(A\) is an element of \(B\). Thus if \(A \subseteq B\) and \(B \subseteq A\) then \(A = B\).

We write \(a \in A\) to mean that \(a\) is a member of \(A\). We write \(b \not\in A\) to indicate that it is not. i.e. \(1 \in \{1, 2\}\) but \(3 \not\in \{1, 2\}\).

Given a set \(A\) we can define \(\{ x \in A : p(x) \}\), the set of all elements of \(A\) satisfying some property \(p\). e.g. we could define \(\{x \in \{1, 2, 3, 4, 5\} : x < 3\}\) and we would get \(\{1, 2\}\).

The set \(\emptyset\) is the set with no elements. For every set \(A\), \(\emptyset \subseteq A\), because the requirement that every element of \(\emptyset\) is in \(A\) is satisfied vacuously (i.e. there are no elements to falsify it).

We write \(\mathbb{N}\) for the set of all natural numbers (that is \(\{0, 1, 2, \ldots\}\)).

We write \(\mathcal{P}(A)\) to mean the set of all subsets of \(A\). e.g. \(\mathcal{P}(\{1, 2\}) = \{\emptyset, \{1\}, \{2\}, \{1, 2\}\}\). Note that \(\mathcal{P}(\emptyset) = \{\emptyset\}\).
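If code helps, here’s one way to enumerate power sets of finite sets (a throwaway sketch, with names of my own choosing):

```python
from itertools import chain, combinations

def power_set(a):
    """All subsets of a finite iterable, as frozensets."""
    items = list(a)
    return [frozenset(c)
            for c in chain.from_iterable(
                combinations(items, r) for r in range(len(items) + 1))]

assert len(power_set({1, 2})) == 4          # the four subsets listed above
assert power_set(set()) == [frozenset()]    # P(empty set) has one element
```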

Given two sets \(A\) and \(B\): \(A \cup B\) is the set which contains precisely the elements that are in at least one of \(A\) or \(B\), \(A \cap B\) is the set which contains the elements which are in both, and \(A \setminus B\) is the set which contains all the elements of \(A\) that are not in \(B\).

Given a set containing only other sets, \(\bigcup A\) is the set of elements contained in at least one element of \(A\) and \(\bigcap A\) that containing only elements which are in all of them. Note that \(\bigcup \{A, B\} = A \cup B\) and \(\bigcap \{A, B\} = A \cap B\).

A rule \(f\) is said to be a function \(A \to B\) (written sometimes \(f : A \to B\)) if for every element \(a \in A\) it assigns a single element \(b \in B\). We write this as \(f(a) = b\). We call \(A\) the domain of \(f\) and \(B\) the range of \(f\).

If \(f : A \to B\) and \(U \subseteq A\) we can define the restriction of \(f\) to \(U\) as doing exactly the same thing as \(f\) but considering its domain to be \(U\). We write this \(f|_U\). So \(f|_U(a) = f(a)\), but \(f|_U\) is only defined for elements of \(U\).

If we have \(f : A \to B\) and \(g : B \to C\) then we can define \(g \cdot f : A \to C\) as the rule which first applies \(f\) and then \(g\), i.e. \((g \cdot f)(x) = g(f(x))\).

\(f\) is said to be injective/an injection (sometimes “one-to-one”) if \(a \neq b\) implies that \(f(a) \neq f(b)\).

It is said to be surjective/a surjection (sometimes “onto”) if for every \(b\) there is some \(a\) with \(f(a) = b\).

Note that if \(f\) and \(g\) are injective then so is \(g \cdot f\). Similarly surjective.

\(f\) is said to be bijective/a bijection if it is both injective and surjective. i.e. it maps every element of \(A\) to a distinct element of \(B\) and every element of \(B\) is mapped to.

If \(f\) is a bijection then we write \(f^{-1} : B \to A\) to be the function such that \(f^{-1}(b)\) is the unique \(a\) such that \(f(a) = b\). This is a function because surjectivity means there is always at least one such \(a\) and injectivity means there can only be one such. Note that \(f^{-1}\) is also a bijection. Also note that \((g \cdot f)^{-1} = f^{-1} \cdot g^{-1}\).

If any of this was very surprising or unfamiliar to you you are welcome to continue reading but it may be a bit of a struggle. I’ll try and find a decent set of slightly less terse notes to link to.

Cardinalities

We say two sets \(A\) and \(B\) are the same size if there is a bijective function \(A \to B\). i.e. we can define a rule that maps every element of one to a unique element of the other and vice versa. We write this as \(|A| = |B|\). For now just treat the bars as notation meaning “the size of”. We’ll later define what \(|A|\) actually means, but don’t worry about it for now.

Let’s see that this makes sense. Does this behave like we’d expect it to?

The identity function \(\mathrm{id}(a) = a\) is a bijection \(\mathrm{id}  : A \to A\). So \(|A| = |A|\). That’s good. We’d not get very far if that was not the case. This property is called reflexivity.

If \(|A| = |B|\) then there is a bijection \(f : A \to B\). Then \(f^{-1} : B \to A\) is a bijection, so \(|B| = |A|\). This property is called symmetry.

If \(|A| = |B|\) and \(|B| = |C|\) then we can find bijections \(f : A \to B\) and \(g : B \to C\), so \(g \cdot f : A \to C\) is a bijection. Thus \(|A| = |C|\). This property is called transitivity.

So this relationship behaves more or less like we would expect “the same size” to.

As part of both making sense of this and a teaser of what \(|A|\) will later come to mean, let’s see how this works for finite sets.

Say that \(|A| = n\) for some natural number \(n\) if \(|A| = |\{x \in \mathbb{N} : x < n\}|\). So e.g. \(|\emptyset| = 0\) and \(|\{1, 4, 7\}| = 3\) (because you can use the function \(f(0) = 1, f(1) = 4, f(2) = 7\)). These sets basically give us “canonical” sets of a specific size. If a set \(A\) has \(|A| = n\) for some \(n \in \mathbb{N}\) then we call it finite, otherwise it’s infinite.

Of course, we first have to prove that these canonical sets are all of different sizes. Well, we don’t have to, but the notation would be pretty weird if you could have e.g. \(|A| = 1\) and \(|A| = 3\).

Because we have transitivity, all we need to do is show that none of these sets are the same size – then if any other set were the same size as two of them we’d have a contradiction.

For convenience, let’s label \(S_n = \{x \in \mathbb{N} : x < n\}\). We’ll first need a lemma:

Lemma: If \(|A| = n\) and \(b \not\in A\) then \(|A \cup \{b\}| = n + 1\)

Proof: Let \(f : S_n \to A\) be a bijection. Define \(g : S_{n+1} \to A \cup \{b\}\) by \(g(m) = f(m)\) if \(m < n\) and \(g(n) = b\).

This is an injection because \(f\) is and because \(b \not\in A\) (consider \(x, y\) with \(x \neq y\). At most one of \(x\) and \(y\) can be \(b\). If neither is, then \(g(x) = f(x) \neq f(y) = g(y)\). If one is then say it’s \(y\). Then \(g(x) \in A\) but \(g(y) \not\in A\) so \(g(x) \neq g(y)\)).

This is a surjection because \(f\) hits every element of \(A\) and we’ve additionally defined \(g\) to hit \(b\). QED

A corollary of this is that if \(b \in B\) and \(|B| = n\) then \(|B \setminus \{b\}| = n - 1\). You can prove this by setting \(A = B \setminus \{b\}\) in the lemma.

Theorem: If \(|S_m| = |S_n|\) then \(m = n\).

Proof:

We can use symmetry to swap the order if necessary and assume that \(m \leq n\). We’re going to prove by induction on \(n\) that if \(m \leq n\) and \(|S_m| = |S_n|\) then \(m = n\).

It’s trivially true for \(n = 0\), simply because the only \(m \leq 0\) is \(0\) itself.

It’s true for \(n = 1\) because the only possible value of \(m\) other than \(1\) is \(0\). You cannot define a bijection from a non-empty set to the empty one because there are no elements to map to.

Now suppose we’ve proven it for \(n\) and want to prove it for \(n+1\).

Suppose \(|S_m| = |S_{n+1}|\) for some \(m \leq n + 1\), i.e. \(|S_{n+1}| = m\). Then \(|S_n| = |S_{n+1} \setminus \{n\}| = m - 1\). But by our inductive hypothesis this implies that \(m - 1 = n\) and thus \(m = n + 1\). QED

So this confirms that for the finite case at least, size really does behave like we want it to – you can count sets with natural numbers, and two sets are the same size precisely when counting them gives the same answer.

But we don’t want to just talk about whether two sets are the same size as each other. We want to talk about whether one is smaller or larger.

We do this by building on the definition of the same size as. We say \(A\) is no larger than \(B\) if \(|A| = |C|\) for some \(C \subseteq B\) (equivalently, if there is an injective function from \(A\) to \(B\)). We write this \(|A| \leq |B|\). We say \(A\) is smaller than \(B\), or \(|A| < |B|\), if it’s no larger than \(B\) but not the same size as it.

We’re going to want to show that this behaves sensibly. First note that you can show that if \(|A| = |B|\), \(|B| \leq |C|\) and \(|C| = |D|\) then \(|A| \leq |D|\) just by composing functions (do check the details, but I’m going to start eliding them now for things that are similar to things I’ve already proven). Other similar relationships apply.

What we’re going to have to work quite a bit harder to prove is something that may seem like it should be obvious: If \(|A| \leq |B|\) and \(|B| \leq |A|\) then \(|A| = |B|\).

This is easy for finite sets: If \(|A| = m\) and \(|B| = n\) then \(|A| \leq |B|\) if and only if \(m \leq n\), because \(|S_m| \leq |S_n|\) if and only if \(m \leq n\). So if \(|A| \leq |B|\) and \(|B| \leq |A|\) then \(m \leq n\) and \(n \leq m\), so \(m = n\) and thus \(|A| = |B|\).

For infinite sets, it’s rather more complicated. Why?

Well the first hint that it’s complicated is that you can remove elements from an infinite set without changing its size. In our proof that sizes of finite sets corresponded to natural numbers we relied on the fact that if you removed an element from a set of size \(n\) you got a set of size \(n – 1\). This no longer works.

As an example, consider the set \(\mathbb{N}^+ = \mathbb{N} \setminus \{0\}\). The function \(f(n) = n + 1\) is a bijection \(f : \mathbb{N} \to \mathbb{N}^+\). In fact, any infinite subset of the natural numbers has the same size as the natural numbers (sketch proof: Let \(f(n)\) be the nth smallest element of the set).
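To make that sketch proof concrete, here’s a small Python illustration for subsets we can test membership in (the helper name is mine):

```python
from itertools import count, islice

def enumerate_subset(predicate):
    """Yield the elements of {n in N : predicate(n)} in increasing order.

    If the subset is infinite, the map (n -> nth value yielded) is the
    bijection from the sketch proof above.
    """
    return (k for k in count() if predicate(k))

evens = enumerate_subset(lambda k: k % 2 == 0)
assert list(islice(evens, 5)) == [0, 2, 4, 6, 8]  # here f(n) = 2n
```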

So things are not quite so well behaved. But they are well enough behaved:

Theorem: If \(|A| \leq |B|\) and \(|B| \leq |A|\) then \(|A| = |B|\).

Warning: This proof is quite long and involved compared to what has come so far. I’ve done my best to break it down and make it as comprehensible as possible, but you might want to take a break and go get a cup of tea or something if your brain is feeling full.

Proof: Rephrasing the statement, we have injective functions \(f : A \to B\) and \(g : B \to A\) and we want to construct a bijection \(h : A \to B\).

The idea behind our construction is “simple”: We will piece together a mix of \(f\) and \(g^{-1}\) (the latter being only defined on the subset of \(A\) which \(g\) hits).

Pick \(H \subseteq g(B)\) (the notation \(g(B)\) means the set of points in \(A\) which are \(g(b)\) for some \(b \in B\)).

Define \(h(a) = g^{-1}(a)\) if \(a \in H\) else \(h(a) = f(a)\).

We want to find a condition on \(H\) such that \(h\) is a bijection.

First, note that if \(h\) is injective then \(h(U \setminus V) = h(U) \setminus h(V)\) (it’s always going to be a superset of \(h(U) \setminus h(V)\), and if there were some element of \(h(V)\) in \(h(U \setminus V)\) then we’d be able to find \(u \in U \setminus V\) and \(v \in V\) with \(h(u) = h(v)\), contradicting injectivity).

So if \(h\) is a bijection then we have \(h(A \setminus H) = h(A) \setminus h(H) = B \setminus h(H)\). Rearranging, \(H = h^{-1}(B \setminus h(A \setminus H))\). But \(h(A \setminus H) = f(A \setminus H)\) by the definition of \(h\), and \(h^{-1}(b) = g(b)\) for \(b \not\in f(A \setminus H)\), so we have \(H = g(B \setminus f(A \setminus H))\).

This turns out to be a sufficient condition to make \(h\) a bijection. Why?

What we’re essentially saying here is that we’re carefully splitting \(A\) and \(B\) into two parts in such a way that we can use \(f\) and \(g\) to form a bijection between one of each of these parts on either side. We then combine these together to get a bijection between the whole thing.

Lets see that this works. Suppose \(H\) satisfies this condition:

\(h\) is injective: Take \(x, y \in A\) with \(x \neq y\). If \(x, y \in H\) or \(x, y \in A \setminus H\) then we have \(h(x) \neq h(y)\) because each of \(g^{-1}\) and \(f\) is injective. So assume \(x \in H\), \(y \in A \setminus H\) and \(h(x) = h(y)\). Then \(h(x) = g^{-1}(x)\) and \(h(y) = f(y)\). So \(g^{-1}(x) = f(y)\) and thus \(x = g(f(y))\). But \(x \in H = g(B \setminus f(A \setminus H))\), so \(x = g(b)\) for some \(b \not\in f(A \setminus H)\), and in particular \(b \neq f(y)\). This would violate the injectivity of \(g\).

\(h\) is surjective: If \(b \not\in f(A \setminus H)\) then \(g(b) \in H\) by assumption, therefore \(h(g(b)) = b\). And if \(b \in f(A \setminus H)\) then \(b = f(a) = h(a)\) for some \(a \in A \setminus H\). Hence surjectivity.

So we’ve found a condition that if we manage to find exactly the right \(H\) will give us a bijection between \(A\) and \(B\). How are we going to find such an \(H\)?

We’ll do this by considering the function \(S : \mathcal{P}(A) \to \mathcal{P}(A)\) defined by \(S(T) = g(B \setminus f(A \setminus T))\). We then want to find a fixed point of \(S\), which will give us the desired \(H\).

First: Note that if \(R \subseteq T\) then \(S(R) \subseteq S(T)\). This is because \(A \setminus R \supseteq A \setminus T\), so \(f(A \setminus R) \supseteq f(A \setminus T)\), so \(B \setminus f(A \setminus R) \subseteq B \setminus f(A \setminus T)\), so \(g(B \setminus f(A \setminus R)) \subseteq g(B \setminus f(A \setminus T))\). We call this being order-preserving.

Our goal will be to find the “largest possible” fixed point.

So first we want to consider a set of candidates for fixed points which we know to be non-empty: Let \(\mathcal{F} = \{ T \subseteq A : T \subseteq S(T) \}\). This set is non-empty because \(\emptyset \in \mathcal{F}\). Further it must contain every fixed point (if there are any) because if \(T = S(T)\) then certainly \(T \subseteq S(T)\). We’re going to show that there’s a largest element of this set and that it must be a fixed point.

Note that if \(T \in \mathcal{F}\) then \(S(T) \in \mathcal{F}\). This is because of the above property: If \(T \subseteq S(T)\) then \(S(T) \subseteq S(S(T))\) by the order preserving property. Therefore \(S(T) \in \mathcal{F}\).

Now let \(H = \bigcup \mathcal{F}\).

Claim: this \(H\) is a fixed point of \(S\).

Proof of claim: We first must show that \(H \in \mathcal{F}\).

Suppose \(T \in \mathcal{F}\). Then \(T \subseteq H\). Therefore \(S(T) \subseteq S(H)\). But \(T \subseteq S(T)\), so \(T \subseteq S(H)\). So \(H\) is a union of sets each of which is a subset of \(S(H)\), so we must have \(H \subseteq S(H)\). So \(H \in \mathcal{F}\).

But we have shown that if \(H \in \mathcal{F}\) then \(S(H) \in \mathcal{F}\). Therefore \(S(H) \subseteq H\) and thus \(H = S(H)\).

One way to think of this proof is that we’re starting with the empty set and repeatedly iterating \(S\) on it: because at every point our current set has the property that \(T \subseteq S(T)\), this will always result in growing the set. Eventually this set must stop growing, and that gives us a fixed point.

You actually can prove it that way, but it turns out you need a lot of machinery to make sense of “eventually”, so this proof with unions is nicer.

QED

This is called the Cantor–Bernstein–Schroeder theorem. When I did my first course in set theory the proof came with a big disclaimer saying “Don’t worry, you’re not expected to understand this”. In fairness, the provided proof was I think much more complicated than this one (this one steals a trick from a more advanced version of the proof and basically inlines the theorem it uses into the proof. I’ve previously written about the real version of this). Still, I’d recommend that if you didn’t understand that you might want to reread it a few times to see if you can. It’s worth doing.

Or, you know, not. Your call really.

So we’ve now shown that “is at least as large as” behaves sensibly in the sense that two sets can’t be at least as large as each other but not be the same size.

There are two important questions we haven’t answered:

  1. Given any two sets, is one at least as large as the other? i.e. can we have sets which we just can’t compare the sizes of?
  2. Are all infinite sets the same size?

The answer to the former I will defer to a later post (which, given my track record of saying such things, may or may not get written).

The latter though, we can answer right now!

Theorem: \(|A| < |\mathcal{P}(A)|\). i.e. the power set of \(A\) is strictly bigger than \(A\).

Proof:

First let’s get out of the way that \(|A| \leq |\mathcal{P}(A)|\). The function \(f(a) = \{a\}\) is an injection into the power set.

So now we just need to show that they’re not equal in size. In order to do that, we’ll employ something that looks a lot like one of the classic paradoxes of set theory.

Let \(f : A \to \mathcal{P}(A)\) be any function. We’ll try to construct a point in \(\mathcal{P}(A)\) that it can’t hit.

You know Russell’s paradox? The set of all sets that don’t contain themselves? Let’s build one of those.

Let \(R = \{x \in A : x \not\in f(x) \}\)

OK, it’s close anyway.

Now suppose \(f(y) = R\).

Is \(y \in R\)? If yes, it contradicts the definition of \(R\) because \(y \in f(y)\). If no, it should be in \(R\) because \(R\) is the set of elements not contained in their image under \(f\).

Therefore there cannot exist such a \(y\) and \(f\) is not surjective. Hence there is no bijection \(f : A \to \mathcal{P}(A)\), and thus \(|A| < |\mathcal{P}(A)|\). QED
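The diagonal trick can look like sleight of hand, so here’s a brute-force sanity check in Python: for a small finite set we try every function \(f : A \to \mathcal{P}(A)\) and confirm that \(R\) is never hit. (For finite sets the theorem just says \(2^n > n\), but the mechanism is the same.)

```python
from itertools import combinations, product

A = [0, 1, 2]
# All 8 subsets of A.
subsets = [frozenset(c)
           for r in range(len(A) + 1)
           for c in combinations(A, r)]

# For every one of the 8^3 functions f : A -> P(A), the diagonal set
# R = {x : x not in f(x)} is missed by f, so f is not surjective.
for images in product(subsets, repeat=len(A)):
    f = dict(zip(A, images))
    R = frozenset(x for x in A if x not in f[x])
    assert R not in f.values()
```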

Note that we can keep iterating this: It’s not just the case that there are different sizes of infinite set. It’s the case that there are arbitrarily large sizes of infinite set.

I promised that at the end of this you’d understand what it means that \(\aleph_1 \leq 2^{\aleph_0}\). This was perhaps ambitious of me, but I can at least provide you with a sketch of what it means:

First, note that if \(|A| = |B|\) then \(|\mathcal{P}(A)| = |\mathcal{P}(B)|\) because we can turn a bijection between \(A\) and \(B\) into a bijection between \(\mathcal{P}(A)\) and \(\mathcal{P}(B)\) by applying the bijection to the individual elements of the set.

This sorta justifies the notation \(|\mathcal{P}(A)| = 2^{|A|}\). The exponentiation probably doesn’t make any sense right now. There is a general notion of exponentiation of cardinals, but it’s not very important right now, so let’s not worry about it and just say that it’s a convenient notation because it holds true for finite sets (there are \(2^n\) elements in the power set of a set of \(n\) elements).

The last piece you then need to make sense of the statement is that when a set is the same size as \(\mathbb{N}\) we write \(|A| = \aleph_0\).

This means that as a special case of our last theorem we can write \(\aleph_0 < 2^{\aleph_0}\).

But what’s \(\aleph_1\)?

Well. The idea is that \(\aleph_0\) is the smallest size of infinity and that \(\aleph_1\) is the second smallest size of infinity. Note that we don’t currently know that this statement makes sense (it turns out it does, but I haven’t proven nearly enough in this post to show that).

The continuum hypothesis is that there is no set \(A\) with \(\aleph_0 < |A| < 2^{\aleph_0}\).

Rephrasing this, that means that the continuum hypothesis is \(\aleph_1 = 2^{\aleph_0}\). i.e. that \(2^{\aleph_0}\) is the smallest size of infinity larger than \(\aleph_0\).

The answer to the continuum hypothesis turns out to be “It’s complicated”. I confess I never got to the point where I really understood the proofs that it was complicated – I can sketch them out in layman speak, but not enough to really recreate the details.

But hopefully I can get you to the point where all the details I’ve elided here at least make sense.

In a later post that is. In the meantime, feedback as to whether this one made any sense to anyone who didn’t already understand the material would be useful to calibrate what that later post contains.


Bayesian reasoners shouldn’t believe logical arguments

Advance warning: If you’re familiar with Bayesian reasoning it is unlikely this post contains anything new unless I’m making novel mistakes, which is totally possible.

Second advance warning: If taken too seriously, this article may be hazardous to your health.

Let me present you with a hypothetical, abstracted, argument:

Me: C
You: Not C!
Me: B?
You: *shrug*
Me: A?
You: Yes
Me: A implies B?
You: Yes?
Me: B implies C?
You: … Yes
Me: Therefore C?
You: C. :-(

Does this seem like a likely scenario to you? We have had a disagreement. I have presented a logical argument from shared premises for my side of the disagreement. You have accepted that argument and changed your position.

Yeah, it sounds pretty implausible to me too. A more likely response from you at the end is:

You: Not C!

I will of course find this highly irrational and be irritated by your response.

…unless you’re a Bayesian reasoner, in which case you are behaving entirely correctly, and I’ll give you a free pass.

Wait, what?

Let’s start with a simplified example with only two propositions.

Suppose you have propositions \(A\) and \(B\), which you believe with probabilities \(a\) and \(b\) respectively. You currently believe these two to be independent, so \(P(A \wedge B) = ab\)

Now, suppose I come along and convince you that \(A \implies B\) is true (I’ll call this proposition \(I\)). What is your new probability for \(B\)?

Well, by Bayes rule, \(P(B|I) = \frac{P(B \wedge I)}{P(I)} = P(B) \frac{P(I|B)}{P(I)}\)

\(I = (A \implies B) = \neg\left( A \wedge \neg B\right)\). So \(P(I) = 1 - a(1 - b)\).

\(P(I|B) = 1\) because everything implies a true proposition. Therefore \(P(B|I) = \frac{b}{1 - a(1 - b)}\).

This is a slightly gross formula. Note however that it does have the obviously desirable property that your belief in \(B\) goes up, or at least stays the same. Let’s quickly check it with some numbers.

| \(a\) | \(b\) | \(P(B \mid I)\) |
| ----- | ----- | --------------- |
| 0.100 | 0.100 | 0.110 |
| 0.100 | 0.500 | 0.526 |
| 0.100 | 0.900 | 0.909 |
| 0.500 | 0.100 | 0.182 |
| 0.500 | 0.500 | 0.667 |
| 0.500 | 0.900 | 0.947 |
| 0.900 | 0.100 | 0.526 |
| 0.900 | 0.500 | 0.909 |
| 0.900 | 0.900 | 0.989 |

These look pretty plausible. Our beliefs do not seem to change to an unrealistic degree, but we have provided significant evidence in favour of \(B\).
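That table was generated with a few lines of code, which you can tweak to try other priors (a quick sketch, with a function name of my own):

```python
def posterior_b(a, b):
    """P(B | A => B) when A and B start out independent with P(A) = a, P(B) = b."""
    return b / (1 - a * (1 - b))

for a in (0.1, 0.5, 0.9):
    for b in (0.1, 0.5, 0.9):
        print(f"{a:.3f} {b:.3f} {posterior_b(a, b):.3f}")
```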

But as a good Bayesian reasoner, you shouldn’t assign probabilities 0 or 1 to things. Certainty is poisonous to good probability updates. So when I came along and convinced you that \(A \implies B\), you really shouldn’t have believed me completely. Instead you should have assigned some probability \(r\) to it. So what happens now?

Well we know what the probability of \(B\) given \(I\) is, but what is the probability given \(\neg I\)? Well \(\neg I = \neg (A \implies B) = A \wedge \neg B\), so \(P(B|\neg I) = 0\). The implication can only be false if \(B\) is (because everything implies a true statement).

This means that your posterior probability for \(B\) should be \(r P(B|I)\). So \(r\) is essentially a factor slowing your update process.

Note that because your posterior belief in \(B\) is \(b \frac{r}{P(I)}\), as long as my claim that \(A \implies B\) is at least as convincing as your prior belief in it (i.e. \(r \geq P(I)\)), my argument will increase your belief in \(B\).

Now let’s suppose that you are in fact entirely convinced beforehand that \(A\) and that \(\neg B\), and my argument entirely convinces you that \(A \implies B\).

Of course, we don’t believe in certainty. Things you are entirely convinced of may prove to be false. Suppose now that in the past you have noticed that when you’re entirely convinced of something, you’re right with about probability \(p\). Let’s be over-optimistic and say that \(p\) is somewhere in the 0.9 range.

What should your posterior probability for \(B\) now be? We have \(b = 1 - p\) and \(a = r = p\). Then your posterior probability for \(B\) is \(r P(B | I) = p \frac{1 - p}{1 - p(1 - (1 - p))} = p \frac{1 - p}{1 - p^2} = \frac{p}{p+1} = 1 - \frac{1}{p+1}\).

You know what the interesting thing about this is? The interesting thing is that it’s always less than half. A perfectly convincing argument that a thing you completely believe in implies a thing you completely disbelieve in should never do more than create a state of complete uncertainty in your mind.

It turns out that reasonable degrees of certainty get pretty close to that too. If you’re right about things you’re certain about with probability 0.9 then your posterior probability for \(B\) should be 0.47. If you’re only right with probability 0.7 then it should be \(0.41\). Of course, if you’re only right about that often then \(0.41\) isn’t all that far from your threshold for certainty in the negative result.

In conclusion: If you believe A and not B, and I convince you that A implies B, you should not now go away and believe B. Instead you should be confused, with a bias towards still assuming not B, until you’ve resolved this.

Now, let’s go one step further to our original example. We are instead arguing about \(C\), and my argument proceeds via an intermediary \(B\). Your prior is that \(A\), \(B\) and \(C\) are all independent. You are certain that \(A\), certain that \(\neg C\) and have no opinion on \(B\) (i.e. you believe it with probability \(\frac{1}{2}\)).

I now provide you with a p-convincing argument that \(A \implies B\). What is your posterior probability for \(B\)?

Well, plugging it into our previous formula we get \(b' = p \frac{b}{1 - p(1 - b)} = \frac{p}{2 - p}\). Again, checking against some numbers, if \(p = 0.9\) then \(b' \approx 0.82\), which seems reasonable.

Suppose now that I provide you p-convincing evidence that \(B \implies C\). What’s your posterior for \(C\)?

Well, again with the previous formula, only replacing \(a\) with \(b'\) and \(b\) with \(c\), we have

\[\begin{align*}
c' &= \frac{p c}{1 - b'(1 - c)} \\
&= \frac{p(1-p)}{1 - \frac{p^2}{2 - p}} \\
&= \frac{p(1-p)(2 - p)}{2 - p - p^2}\\
\end{align*}\]

This isn’t a nice formula, but we can plug numbers in (note that \(2 - p - p^2 = (1 - p)(p + 2)\), so in this scenario it simplifies to \(c' = \frac{p(2-p)}{p+2}\)). Suppose your certainties are 0.9. Then your posterior is \(c' \approx 0.34\). You’re no longer certain that \(C\) is false, but you’re still pretty convinced despite the fact that I’ve just presented you with an apparently water-tight argument to the contrary. This result is pretty robust with respect to your degree of certainty, too. As \(p \to 1\) it tends to \(\frac{1}{3}\), and for \(p = \frac{1}{2}\) (i.e. you’re wrong half the time when you’re certain!) we get \(c' = 0.3\).
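And here’s a quick numerical check of those claims (the function name is mine):

```python
def posterior_c(p):
    """Posterior for C after p-convincing arguments for A => B and B => C,
    starting from P(A) = p, P(B) = 1/2, P(C) = 1 - p."""
    b2 = p / (2 - p)                    # posterior for B from the first argument
    c = 1 - p
    return p * c / (1 - b2 * (1 - c))   # posterior for C from the second

assert abs(posterior_c(0.9) - 0.34) < 0.005
assert abs(posterior_c(0.5) - 0.3) < 1e-9
assert abs(posterior_c(1 - 1e-9) - 1 / 3) < 1e-6   # tends to 1/3 as p -> 1
```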

In conclusion: An apparently water-tight logical argument that goes from a single premise you believe in to a premise you disbelieve in via something you have no opinion on should not substantially update your beliefs, even if it casts some doubt on them.

Of course, if you’re a Bayesian reasoner, this post is an argument that starts from a premise you believe in, goes via something you have no opinion on, and concludes something you likely don’t believe in. Therefore it shouldn’t change your beliefs very much.


Collaborative coin flipping

This post is based on an answer from Paul Crowley to a question I asked on Twitter. I’m writing it up mostly for my notes and because I had a follow up question about it I wanted to answer.

You and I want to decide something by coin toss. But there’s a problem. You see, I don’t trust your coin and you don’t trust my coin.

How do we resolve this issue?

Simple: We each flip our coins. If the results come up the same, call that heads. If the results come up different, call that tails.

Theorem: If at least one of us is using a fair coin, the result is fair.

Proof:

Without loss of generality, suppose I have a fair coin.

\(P(\mathrm{Heads}) = P(\mathrm{Heads} \mid \text{you flip tails}) P(\text{you flip tails}) + P(\mathrm{Heads} \mid \text{you flip heads}) P(\text{you flip heads})\)

But each of these conditional probabilities is \(\frac{1}{2}\), because they’re just the probability that I flip tails or heads respectively. So

\(P(\mathrm{Heads}) = \frac{1}{2}\left(P(\text{you flip tails}) + P(\text{you flip heads})\right) = \frac{1}{2}\) as desired.

QED

Simple enough.
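If you’d rather check by simulation than by algebra, here’s a quick Monte Carlo sketch (the function is my own illustration):

```python
import random

def xor_flip(p, q, trials=10**6):
    """Estimate P(Heads) for the protocol: my coin lands heads with
    probability p, yours with probability q, and we call the round
    Heads exactly when the two results agree."""
    agree = sum((random.random() < p) == (random.random() < q)
                for _ in range(trials))
    return agree / trials

print(xor_flip(0.5, 0.9))  # one fair coin is enough: ~0.5
print(xor_flip(0.9, 0.9))  # both cheating towards heads: ~0.82
```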

The follow up question I wanted to ask though: What happens if we both cheat?

Suppose I’m flipping heads with probability \(p\) and you’re flipping heads with probability \(q\). Then \(P(\mathrm{Heads}) = pq + (1 - p)(1 - q)\).

Suppose I want to maximize the probability of getting heads. Let \(f(p) = pq + (1 - p)(1 - q) = 1 - q + (2q - 1)p\). If I don’t know the value of \(q\) (or at least which side of \(\frac{1}{2}\) it’s on) there’s not much I can do here: any choice of \(p\) I make might hurt my odds rather than help them. If I do know whether \(q > \frac{1}{2}\) or not, then I can choose \(p \in \{0,1\}\) to get \(P(\mathrm{Heads}) = \max(q, 1 - q)\). It’s unclear to me exactly what the game theory here is, but it does suggest that if you think I am able to predict your value of \(q\) then your best choice is to use a fair coin.


When does one normal distribution dominate another?

Continuing the study of the dominance relationship I defined previously (and only I care about) I thought to ask the question “When does one normal random variable dominate another?”. The answer is very easy to work out, but I found it surprising until I actually did the maths.

Theorem: Let \(A \sim \mathrm{Norm}(\mu_1, \sigma_1^2)\), \(B \sim \mathrm{Norm}(\mu_2, \sigma_2^2)\). Then \(A \preceq B\) iff \(\mu_1 \leq \mu_2\) and \(\sigma_1 = \sigma_2\).

Note the equality in the second part: given two normal distributions with different variances, neither will dominate the other.

Proof:

Let \(G(t) = P(Z \geq t)\) where \(Z \sim \mathrm{Norm}(0,1)\). Then \(P(A \geq t) = G(\frac{t-\mu_1}{\sigma_1})\), \(P(B \geq t) = G(\frac{t-\mu_2}{\sigma_2})\). \(G\) is strictly decreasing, so \(P(A \geq t) \leq P(B \geq t)\) iff \(\frac{t-\mu_1}{\sigma_1} \geq \frac{t-\mu_2}{\sigma_2}\) iff \(\left(\frac{1}{\sigma_1} - \frac{1}{\sigma_2}\right) t \geq \frac{\mu_1}{\sigma_1} - \frac{\mu_2}{\sigma_2}\).

Because the left hand side is linear in \(t\), this can only be satisfied for all \(t\) if the coefficient is 0, i.e. if \(\sigma_1 = \sigma_2\). In this case it is satisfied iff \(0 \geq \frac{\mu_1}{\sigma_1} - \frac{\mu_2}{\sigma_2} = \frac{\mu_1 - \mu_2}{\sigma_1}\), i.e. iff \(\mu_1 \leq \mu_2\). Running this backwards, if \(\mu_1 \leq \mu_2\) and \(\sigma_1 = \sigma_2\) then this inequality is always satisfied and thus \(A \preceq B\).

QED

Like I said, very straightforward algebra, but a little surprising as a result. I wasn’t thinking about the lower tail, so I expected there to be cases where a distribution with lower mean and lower standard deviation was dominated, but it turns out not.
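Here’s a quick numerical check of both halves of the theorem, using the survival function \(G\) built from the complementary error function (a sketch; the `sf` helper is my own):

```python
from math import erfc, sqrt

def sf(t, mu, sigma):
    """P(X >= t) for X ~ Norm(mu, sigma^2)."""
    return 0.5 * erfc((t - mu) / (sigma * sqrt(2)))

ts = [-10.0, -2.0, 0.0, 2.0, 10.0]

# Equal variances, mu_1 <= mu_2: P(A >= t) <= P(B >= t) everywhere.
assert all(sf(t, 0, 1) <= sf(t, 1, 1) for t in ts)

# Unequal variances: the survival curves cross, so neither dominates.
diffs = [sf(t, 0, 1) - sf(t, 0, 2) for t in ts]
assert min(diffs) < 0 < max(diffs)
```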
