Category Archives: Numbers are hard

Test-Case Selection and Choice Theory

Attention conservation notice: Honestly you probably don’t care about this post unless the title sounds really intriguing to you. I’m not sure this matters at all.

At its most abstract, the test case selection problem is a function \(T : \mathcal{P}(A) \setminus \{\emptyset\} \to A\) such that \(T(U) \in U\). i.e. we have some set \(A\) of possible test cases, a bug that occurs in some non-empty subset of the test cases, and we want to select a test case to represent the bug. Call such a \(T\) a choice function.

(The axiom of choice states that choice functions always exist, but our sets of test cases are actually always finite, so we don’t need that here).

This elides some complexity in that:

  1. Our test case selection may be non-deterministic. In this case we can (by fixing the set of random choices made) treat this as a choice-function valued random variable, so it still reduces to this.
  2. The test-case selection might fail. We can ignore this problem by assuming that it will succeed eventually (assuming it doesn’t do something silly) and just running it until it does. For some methods this may take rather a long time and we might have to worry about implementation details like the heat death of the universe, but ¯\_(ツ)_/¯

This is actually a coincidence of naming for the most part, but choice functions are interestingly related to choice theory: The problem of expressing a preference between alternatives, and describing axioms of “rationality” for doing so.

One interesting axiom of rationality is that of contraction consistency. Contraction consistency is the requirement that if \(T(A) \in B \subseteq A\) then \(T(B) = T(A)\). i.e. if you picked \(T(A)\) as the best element of \(A\), removing elements from \(A\) that aren’t \(T(A)\) shouldn’t change your opinion!

On the face of it this seems reasonable, but it actually imposes a very strong restriction on what \(T\) can look like.

Theorem: If \(T\) is contraction consistent then \(T(A) = \min\limits_\prec A\) for some total order \(\prec\).


Define \(a \prec b\) if \(a = T(\{a, b\})\). This is antisymmetric because the set \(\{a, b\}\) doesn’t depend on the order of \(a\) and \(b\), reflexive because \(T(\{a\}) \in \{a\}\), and total because \(T(\{a, b\}) \in \{a, b\}\), so either \(a \prec b\) or \(b \prec a\).

So to show that it’s a total order we now need to show that it’s transitive. Suppose \(a \prec b\) and \(b \prec c\). Then \(T(\{a, b, c\}) = a\): If it were \(c\) then this would violate contraction consistency when considering \(\{b, c\}\), and if it were \(b\) then it would violate it when considering \(\{a, b\}\). Now, by contraction consistency, \(T(\{a, c\}) = T(\{a, b, c\}) = a\), so \(a \prec c\).

Now suppose \(b \in A\). By contraction consistency, \(T(\{T(A), b\}) = T(A)\). Therefore \(T(A) \prec b\), and so \(T(A) = \min\limits_\prec A\) as desired.
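The theorem can be sanity-checked by brute force on a small finite set. The following is only a sketch: the function names and the `median_choice` counterexample are made up for illustration and are not part of the argument above.

```python
from itertools import combinations


def nonempty_subsets(universe):
    """Yield every non-empty subset of `universe` as a frozenset."""
    items = list(universe)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_contraction_consistent(choose, universe):
    """Check that T(A) in B and B a subset of A implies T(B) == T(A)."""
    subsets = list(nonempty_subsets(universe))
    for a in subsets:
        picked = choose(a)
        for b in subsets:
            if b <= a and picked in b and choose(b) != picked:
                return False
    return True


def agrees_with_pairwise_minimum(choose, universe):
    """Check that T(A) beats every b in A under a < b iff T({a, b}) == a."""
    for a in nonempty_subsets(universe):
        picked = choose(a)
        if any(choose(frozenset({picked, b})) != picked for b in a):
            return False
    return True


def median_choice(subset):
    """Hypothetical choice function that is NOT contraction consistent."""
    ordered = sorted(subset)
    return ordered[len(ordered) // 2]


print(is_contraction_consistent(min, range(5)))            # min over the usual order
print(agrees_with_pairwise_minimum(min, range(5)))
print(is_contraction_consistent(median_choice, range(5)))  # fails on {1, 2, 3} vs {2, 3}
```

The check is exponential in the size of the universe, so it is only useful for toy examples.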


What this means in practice is that the only contraction consistent test-case selection methods must be equivalent to bounded exhaustive enumeration: You have some total ordering over your test cases, and in order to return a test case \(a \in A\) you must have verified that \(b \not\in A\) for every \(b \prec a\). This is potentially very expensive if all you have is membership queries for \(A\)! If you have more structure on \(A\) (e.g. because of symbolic execution) then you could potentially rule out this happening without actually performing those membership queries.
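To make the cost concrete, here is a minimal sketch of selection by enumeration when all we have is a membership query. The names `select_by_enumeration` and `is_member`, and the example bug set, are assumptions for illustration.

```python
def select_by_enumeration(ordered_cases, is_member):
    """Return the first case (in the given order) that is in the bug set,
    together with the number of membership queries that were needed."""
    queries = 0
    for case in ordered_cases:
        queries += 1
        if is_member(case):
            return case, queries
    raise ValueError("no test case in the enumeration triggers the bug")


# Hypothetical bug: triggered by multiples of 7 that are at least 500.
case, queries = select_by_enumeration(
    range(1000), lambda n: n >= 500 and n % 7 == 0
)
print(case, queries)
```

Every candidate that precedes the answer in the order costs one query, which is exactly the expense described above.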

If you adopt the classic approach of generating a random test case then running test-case reduction on it, you will typically require substantially fewer membership queries for the set, but the \(\prec\) relationship in the above proof may not even be transitive!

Say, for example, that we have \(A = \mathbb{N}\), and our test case reduction algorithm consists of iterating the operations \(n \to n / 2\) and \(n \to n - 1\) to a fixed point as long as they remain in the set. So for example if the bug occurs exactly on the set of even numbers and we start from \(8\) then we will go \(8 \to 4 \to 2\), having tried \(7, 3, 1\) and concluded they weren’t in the set. If we’d started from \(10\) though we’d have become stuck because both \(10 - 1\) and \(10 / 2\) are odd. This is then an example of intransitivity, because we have \(5 \prec 10\), and \(4 \prec 5\), but \(4 \not\prec 10\), and this intransitivity is in large part responsible for our failure to find a global minimum.
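The reducer in this example is small enough to write down. This is a sketch under two assumptions that the text does not pin down: \(n / 2\) is taken to be integer division, and halving is tried before decrementing.

```python
def reduce_test_case(start, is_member):
    """Greedily apply n -> n // 2 and n -> n - 1 while the result stays in the set."""
    current = start
    improved = True
    while improved:
        improved = False
        for candidate in (current // 2, current - 1):
            # Only accept strictly smaller natural numbers that still trigger the bug.
            if 0 <= candidate < current and is_member(candidate):
                current = candidate
                improved = True
                break
    return current


def is_even(n):
    return n % 2 == 0


print(reduce_test_case(8, is_even))   # 8 -> 4 -> 2, then stuck
print(reduce_test_case(10, is_even))  # stuck immediately: 5 and 9 are both odd
```

Neither starting point reaches the smallest even number, which is the local-minimum behaviour the paragraph above describes.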

(Note that it would be perfectly possible to have \(\prec\) be transitive and have \(T\) not be contraction-consistent – just make \(T\) do something different whenever \(|A| > 3\). Contraction consistency implies the transitivity of \(\prec\), but we still needed the full consistency to show that \(T(A)\) was the \(\prec\)-minimum).

This is roughly equivalent to the observation that classic choice theory only really works for logically omniscient agents: Test-case reduction is actually a boundedly rational agent that is unable (or unwilling) to run exponential time algorithms, while a logically omniscient agent is perfectly happy to do that and considers it a free action.

There is also an interesting topological perspective that allows us to firm up what we mean by \(T\) being “black box”, but I think I’ll leave that for another time.

This entry was posted in Numbers are hard, programming on by .

Times to exhaust finite distributions when sampling with replacement

Due to reasons a maths problem has been utterly hijacking my brain since Friday. I think I now have it solved to my satisfaction. I’m morally certain this is nothing new and has been solved a thousand times before, but I couldn’t  figure out the right keywords to Google and each time I posed it clearly enough to ask for help I realised a new avenue to explore and so ended up solving it before I had to.

The problem boils down to this:

Suppose I have some random variable \(X\), taking values \(1, \ldots, n\) with probabilities \(P(X = i) = p_i > 0\).

Update: steve mcc points out that the uniform version of this is normally called the Coupon Collector’s Problem. I have definitely heard that before but could not for the life of me remember it. Given that term it’s easy to find a bunch of prior art on this. I’ve yet to do a review. I still found the specific details interesting and it was fun to work on, but it’s up to you if you want to read this given that…

Now suppose I have infinitely many independent \(X_i\) each with the same distribution as \(X\).

How long do we expect it to take until we’ve seen every value \(1, \ldots, n\)? i.e. if \(T\) is a random variable whose value is the first \(i\) such that for each \(1 \leq k \leq n\) there is some \(j \leq i\) with \(X_j = k\), what is \(E(T)\)?
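For experimenting with the problem, \(T\) is easy to simulate. This is a minimal Monte Carlo sketch; the function names, trial count and seed are arbitrary choices for illustration.

```python
import random


def draws_until_all_seen(p, rng):
    """Sample X_1, X_2, ... with P(X = i) = p[i - 1] and return T,
    the number of draws needed to see every value at least once."""
    values = range(1, len(p) + 1)
    seen = set()
    draws = 0
    while len(seen) < len(p):
        seen.add(rng.choices(values, weights=p)[0])
        draws += 1
    return draws


def estimate_expected_time(p, trials=10_000, seed=0):
    """Average T over many independent runs to estimate E(T)."""
    rng = random.Random(seed)
    return sum(draws_until_all_seen(p, rng) for _ in range(trials)) / trials


print(estimate_expected_time([0.25] * 4))            # uniform on 4 values
print(estimate_expected_time([0.7, 0.1, 0.1, 0.1]))  # concentrated on one value
```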

I don’t have an exact calculation for \(E(T)\) and my experiments on special cases suggest that there is no nice formula. I nevertheless consider the problem solved to my satisfaction by the following three results:

  1. \(E(T) \geq n H(n)\), where \(H(n) = 1 + \frac{1}{2} + \ldots + \frac{1}{n}\) is the nth harmonic number, and this bound is achieved when \(X\) is the uniform distribution.
  2. \(E(T) \leq \sum \frac{1}{p_i}\), and this bound is achievable in the limit (and possibly exactly but I don’t think so) by some distributions I will describe below.
  3. \(E(T) \geq \frac{n^2 - 1}{4 E(X)}\). This bound is not especially tight but is good enough for asymptotics.

In particular, if \(X\) comes from a family of distributions in which \(n\) is allowed to vary and \(E(X)\) remains bounded, \(E(T)\) is asymptotically at least quadratic.
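For small \(n\) the three bounds can be checked numerically. The sketch below computes \(E(T)\) with the standard max-min (inclusion-exclusion) identity \(E(T) = \sum_{\emptyset \neq S} (-1)^{|S| + 1} / \sum_{i \in S} p_i\), which is not used anywhere in this post and has \(2^n - 1\) terms, so it is only practical for tiny examples. All function names are made up for illustration.

```python
from fractions import Fraction
from itertools import combinations


def expected_cover_time(p):
    """Exact E(T) via inclusion-exclusion over non-empty subsets of values."""
    total = Fraction(0)
    for size in range(1, len(p) + 1):
        sign = 1 if size % 2 == 1 else -1
        for subset in combinations(p, size):
            total += sign / sum(subset)
    return total


def harmonic(n):
    """The nth harmonic number H(n) as an exact fraction."""
    return sum(Fraction(1, k) for k in range(1, n + 1))


def bounds(p):
    """Return (n H(n), sum of 1/p_i, (n^2 - 1) / (4 E(X))) for the distribution p."""
    n = len(p)
    mean = sum(i * pi for i, pi in enumerate(p, start=1))
    return n * harmonic(n), sum(1 / pi for pi in p), Fraction(n * n - 1, 4) / mean


skewed = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
lower_uniform, upper, lower_mean = bounds(skewed)
exact = expected_cover_time(skewed)
print(float(lower_mean), float(lower_uniform), float(exact), float(upper))
```

Using `Fraction` keeps the alternating sum exact, which avoids cancellation errors when the probabilities are very uneven.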


Suppose \(f(n)\) is a lower bound on \(E(T)\). Suppose we draw \(i\) for our first draw. Then we may reduce exhausting \(\{1, \ldots, n\}\) to exhausting the other values, which takes at least \(f(n - 1)\) draws of a value other than \(i\). But each such draw takes in expectation \(\frac{1}{1 - p_i}\) draws of \(X\), as the time to draw a value other than \(i\) is a geometric distribution with parameter \(1 - p_i\).

Therefore \(E(T) \geq 1 + f(n - 1)\sum \frac{p_i}{1 - p_i}\). This sum is minimized when the \(p_i\) all have the same value, which must be \(\frac{1}{n}\). So by substituting in, we have \(E(T) \geq 1 + \frac{n}{n - 1} f(n - 1)\), with equality when \(X\) is uniform. Thus \(f(n) = 1 + \frac{n}{n - 1} f(n - 1)\). \(n H(n) = n (H(n - 1) + \frac{1}{n}) = 1 + n H(n - 1) = 1 + \frac{n}{n - 1} (n - 1) H(n - 1)\), and thus \(n H(n)\) is the solution to this recurrence relationship.

Thus \(E(T) \geq n H(n)\) and this bound is tight when \(X\) is uniform.

To see the upper bound, consider the following process: Let \(S_0 = 0\) and let \(S_k\) be the first \(i > S_{k - 1}\) with \(X_i = k\). Then certainly \(S_n \geq T\) as by that point we must have seen each value, and so \(E(T) \leq E(S_n)\). But \(S_k - S_{k - 1}\) is a geometric distribution with parameter \(p_k\), so \(E(S_k - S_{k - 1}) = \frac{1}{p_k}\). Thus by summing up the terms we have \(E(S_n) = \sum \frac{1}{p_k}\) as desired.

To see that this is tight is a bit of a mess and I’m going to omit the calculations. The family of distributions that demonstrates it are as follows: Let \(p_n = \epsilon^{n - 1}\) and \(p_i = \epsilon^{i - 1}(1 - \epsilon)\) for \(i < n\) (The intuition here is that each \(i\) captures \(1 - \epsilon\) worth of the remaining probability). Then both the bound and \(E(T)\) end up as \(\epsilon^{1 - n} + o(\epsilon^{1 - n})\), so as \(\epsilon \to 0\) their ratio converges to \(1\).

The final bound is the most interesting one, and I think captures a lot of the intuition that \(E(T)\) is minimized by being “more uniform”. Because the value of \(E(T)\) is invariant under permutations of \(\{1, \ldots, n\}\), we can reorder the values such that \(p_{i + 1} \leq p_i\).

For distributions satisfying this constraint, \(E(X)\) is strictly maximized by the uniform distribution (that is, the maximum value is obtained on the uniform distribution and any other distribution attains a strictly smaller value).

To prove the bound, first observe that if \(p = P(X \geq i)\) then \(E(T) \geq \frac{n - i}{p}\) (in fact \(E(T) \geq \frac{(n - i) H(n)}{p}\), but the calculations become messier with that factor in so I chose to drop the \(H(n)\) term).

The reason is that to finish we must draw all \(n - i\) values which are greater than or equal to \(i\), and if we only do that with probability \(p\) then it takes an expected number of times equal to at least \(\frac{1}{p}\) to draw each, because \(1 - p\) worth of draws end up drawing \(< i\).

But we can then use the result that \(P(X \geq i) \leq \frac{E(X)}{i}\) (Markov’s inequality, which applies because \(X \geq 0\)). Thus by combining these we have \(E(T) \geq \frac{i (n - i)}{E(X)}\).

\(i\) is arbitrary, so we can choose it to maximize this expression. If \(n\) is even it is maximized by \(i = \frac{n}{2}\), giving \(\frac{n^2}{4}\); if \(n = 2k + 1\) it is maximized by \(i = k\), giving \(k (k + 1) = \frac{n^2 - 1}{4}\). Combining these cases we get a maximized value of at least \(\frac{n^2 - 1}{4}\). Plugging this into our original bound, we get the desired result.

You can get more bounds by considering the same argument with \(E(X^k)\) for arbitrary \(k\). The general bound is actually that \(E(T) \geq n^{k + 1} \frac{(1 + k^{-1})^k}{(k + 1) E(X^k)} + O(n^k)\), but the details are messy so I’ve omitted it here. These bounds are still interesting, as when \(E(X^k)\) is small this indicates that \(X\) is very tightly concentrated around small values of \(i\), which causes discovering all of the values to take a very long time. As we saw with the example for demonstrating that the \(\sum \frac{1}{p_i}\) bound was tight, it can even be exponential in \(n\)!

You can also get similar bounds for exponential variates. e.g. \(E(T) \geq \frac{e^{s(n - 1)}}{E(e^{sX})}\), so if \(s\) is small enough that \(E(e^{sX})\) is bounded as \(n\) varies then we will see this sort of exponential growth (you can get slightly tighter bounds if \(s \geq \frac{1}{n}\) but it’s probably not worth bothering).

If you found this post interesting, may I highly recommend “Birthday paradox, coupon collectors, caching algorithms and self-organizing search.“, which I found only after steve mcc put me on to the right search terms. It doesn’t contain these bounds, so is in that sense orthogonal, but it has some really cool general techniques.

So what’s this all about?

I was interested in the question of how well random testing can do at exploring its space. In this model each \(X = i\) represents the random tester finding a distinct example up to equivalence.

Typical experience is that random testing “plateaus” – its rate of discoveries goes down over time – and that seems to be more or less backed by this model: \(E(T)\) is the amount of time it takes to explore its feasible search space, and this grows superlinearly in \(n\).

In the case where you manage to be fully uniform over inequivalent states, this isn’t too bad – a logarithmic slow down is pretty manageable – but if there is any sort of concentration around some common point (which will be \(i = 1\) after reordering), it is likely that finding new examples becomes much more expensive over time as each new example is discovered.

In order to say how applicable this is we’d need to do some sort of studying of what random testers actually do, and we’d need some sort of equivalence oracle which is basically impossible, so for the moment I don’t have any real way of actually applying this result to the problem that motivated it, but it is at least suggestive.


Reality is a countably infinite Sierpiński cube

I was thinking out loud on Twitter about what weird beliefs I hold, after realising (more or less as I was writing it) that my philosophical positions are practically banal (at least to anyone who has thought about these issues a bit, whether or not they agree with me).

I came up with a couple, but probably the most interesting (if very very niche) that I thought of is that one true and accurate mathematical model of reality is a closed, connected, subset of the countably infinite Sierpiński cube.

I consider this opinion to be not that weird and, more importantly, obviously correct. I’m aware that it is a niche opinion, but hear me out.

Before we start, a quick note on the nature of reality. I am being deliberately imprecise about what I mean by “reality” here, and basically mean “any physical system”. This could be “life the universe and everything” and we are attempting to solve physics, or it could be some toy restricted physical system of interest and we are trying to nail down its behaviour. This post applies equally well to any physical system we want to be able to model.

Consider an experiment. Let’s pretend we can have deterministic experiments for convenience – you can easily work around the impossibility by making countably infinitely many copies of the experiment and considering each of them to be the answer you got the nth time you ran the experiment.

Also for simplicity we’ll assume that experiments can only have one of two outcomes (this is no loss of generality as long as experiments can only have finitely many outcomes – you just consider the finitely many experiments of the form “Was the outcome X?” – and if they have infinitely many outcomes you still need to ultimately make a finite classification of the result and so can consider the experiment composed with that classification).

There are three sensible possible outcomes you could have here:

  • Yes
  • No
  • I don’t know, maybe?

Physical experiments are inherently imprecise – things go wrong in your experiment, in your setup, in just about every bloody thing, so a set of experiments whose outcome will give you total certainty is implausible and we can ignore it.

Which leaves us with experiments where one of the answers is maybe. It doesn’t matter which answer the other one is (we can always just invert the question).

So we’ve run an experiment and got an answer. What does that tell us about the true state of reality?

Well whatever reality is we must have some notion of “an approximate region” – all of our observation of reality is imprecise, so there must be some notion of precision to make sense of that.

So reality is a topological space.

What does the result of a physical experiment tell us about the state of reality?

Well if the answer is “maybe” it doesn’t tell us anything. Literally any point in reality could be mapped to “maybe”.

But if the answer is yes then this should tell us only imprecisely where we are in reality. i.e. the set of points that map to yes must be an open set.

So an experiment is a function from reality to {yes, maybe}. The set of points mapping to yes must be an open set.

And what this means is that experiments are continuous functions to the set {yes, maybe} endowed with the Sierpiński topology. The set {yes} is open, and the whole set and the empty set are open, but nothing else is.
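On a finite toy space this continuity condition is a one-line check. The sketch below is only an illustration: the three-point space, its topology and the example experiments are all made up.

```python
YES, MAYBE = "yes", "maybe"

# Open sets of the Sierpinski space on {yes, maybe}. Note that {maybe} is NOT open.
SIERPINSKI_OPENS = {frozenset(), frozenset({YES}), frozenset({YES, MAYBE})}


def is_continuous_experiment(experiment, points, open_sets):
    """An experiment is continuous iff the set of points answering yes is open.
    (The preimages of the empty set and of the whole space are always open.)"""
    yes_region = frozenset(x for x in points if experiment(x) == YES)
    return yes_region in open_sets


# Hypothetical model of reality: three states with a nested topology.
points = {0, 1, 2}
open_sets = {frozenset(), frozenset({0}), frozenset({0, 1}), frozenset({0, 1, 2})}

print(is_continuous_experiment(lambda x: YES if x == 0 else MAYBE, points, open_sets))
print(is_continuous_experiment(lambda x: YES if x == 2 else MAYBE, points, open_sets))
print(is_continuous_experiment(lambda x: MAYBE, points, open_sets))
```

The second experiment fails because answering yes would pin the state down to \(\{2\}\), which is more precision than this toy topology allows.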

Now let’s postulate that if two states of reality give exactly the same answer on every single experiment, they’re the same state of reality. This is true in the same sense that existing is the thing that reality does – a difference that makes no difference might as well be treated as if it is no difference.

So what we have is the following:

  1. Any state of reality is a point in the cube \(S^E\) where \(E\) is the set of available experiments and \(S = \{\mathrm{yes}, \mathrm{maybe}\}\).
  2. All of the coordinate functions are continuous functions when \(S\) is endowed with the Sierpinski topology.

This is almost enough to show that reality can be modelled as a subset of the Sierpiński cube, but not quite: There are many topologies compatible with this – reality could have the discrete topology.

But we are finite beings. And what that means is that any given point in time we can have observed the outcome of at most finitely many experiments.

Each of these experiments determines where we are only in the open set of some coordinate in our cube, thus the set that the experiments have determined us to be in is an intersection of finitely many open sets in the product topology on that cube, and thus is open in that topology.

Therefore the set of states of reality that we know we are in is always an open set in the product topology. So this is the “natural” topology on reality.

So reality is a subset of a Sierpiński cube. We now have to answer two questions to get the rest of the way:

  • How many dimensions does the cube have?
  • What sort of subset is it?

The first one is easy: The set of experiments we can perform is definitely infinite (we can repeat a single experiment arbitrarily many times). It’s also definitely countable, because any experiment we can perform is one we can describe (and two experiments are distinct only up to our ability to describe that distinction), and there are only countably many sentences.

So reality is a subset of the countably infinite dimensional Sierpiński cube.

What sort of subset?

Well that’s harder, and my arguments for it are less convincing.

It’s probably not usually the whole set. It’s unlikely that reality contains a state that is just constantly maybe.

It might as well be a closed set, because if it’s not we can’t tell – there is no physical experiment we can perform that will determine that a point in the closure of reality is not in reality, and it would be aesthetically and philosophically displeasing to have aphysical states of reality that are approximated arbitrarily well.

In most cases it’s usually going to be a connected set. Why? Well, because you’re “in” some state of reality, and you might as well restrict yourself to the path component of that state – if you can’t continuously deform from where you are to another state, that state probably isn’t interesting to you even if it in some sense exists.

Is it an uncountable subset of the Sierpinski cube? I don’t know, maybe. Depends on what you’re modelling.

Anyway, so there you have it. Reality is a closed, connected, subset of the countably infinite dimensional Sierpiński cube.

What are the philosophical implications?

Well, one obvious philosophical implication is that reality is compact, path connected, and second countable, but may not be Hausdorff.

(You should imagine a very very straight face as I delivered that line)

More seriously, the big implication for me is on how we model physical systems. We don’t have to model physical systems as the Sierpiński cube. Indeed we usually won’t want to – it’s not very friendly to work with – but whatever model we choose for our physical systems should have a continuous function (or, really, a family of continuous functions to take into account the fact that we fudged the non-determinism of our experiments) from it to the relevant Sierpiński cube for the physical system under question.

Another thing worth noting is that the argument is more interesting than the conclusion, and in particular the specific embedding is more important than the fact that an embedding exists. In fact every second countable T0 topological space embeds in the Sierpiński cube, so the conclusion boils down to the fact that reality is a T0, second countable, compact, and connected (path connected really) topological space (which are more or less the assumptions we used!).

But I think the specific choice of embedding matters more than that, as does the fact that the coordinates correspond to natural experiments we can run.

And, importantly, any decision we make based on that model needs to factor through that function. Decisions are based on a finite set of experiments, and anything that requires us to be able to know our model to more precision than the topology of reality allows us to is aphysical, and should be avoided.

Determinism is topologically impossible

I’ve been doing some work on topological models of decision making recently (I do weird things for fun) and a result popped out of it that I was very surprised by, despite it being essentially just a restatement of some elementary definitions in topology.

The result is this: Under many common models of reality, there are no non-trivial deterministic experiments we can perform even if the underlying reality is deterministic.

The conclusion follows from two very simple assumptions (either of which may be wrong but both of which are very physically plausible).

  1. We are interested in some model of reality as a connected topological space \(X\) (e.g. \(\mathbb{R}^n\), some Hilbert space of operators, whatever).
  2. No experimental outcome can give us infinite precision about that model. i.e. any experimental outcome only tells us where we are up to membership of some open subset of \(X\).

Under this model, regardless of the underlying physical laws, any fully deterministic experiment tells us nothing about the underlying reality.

What does this mean?

Let \(X\) be our model of reality and let \(\mathcal{O}\) be some set of experimental outcomes. A deterministic experiment is some function \(f: X \to \mathcal{O}\).

By our finite precision assumption each of the sets \(U_o = \{x \in X: f(x) = o\}\) is open. But if \(f(x) = o\) and \(o \neq o'\) then \(f(x) \neq o'\) so \(x \not\in U_{o'}\). Therefore they’re disjoint.

But certainly \(x \in U_{f(x)}\), so they also cover \(X\).

But we assumed that \(X\) is connected. So we can’t cover it by disjoint non-empty open sets. Therefore at most one of these sets is non-empty, and thus \(X = U_o\) for some \(o\). i.e. \(f\) constantly takes the value \(o\) and as a result tells us nothing about where we are in \(X\).
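The same argument can be checked exhaustively on finite spaces. This sketch uses two made-up topologies (one connected, one discrete) and enumerates every function whose outcome sets \(U_o\) are all open; the names are illustrative only.

```python
from itertools import product


def is_deterministic_experiment(f, points, open_sets):
    """Finite-precision assumption: the preimage U_o of every outcome is open."""
    outcomes = {f[x] for x in points}
    return all(
        frozenset(x for x in points if f[x] == o) in open_sets for o in outcomes
    )


def deterministic_experiments(points, open_sets, outcomes):
    """Enumerate all functions points -> outcomes that satisfy the assumption."""
    points = list(points)
    for values in product(outcomes, repeat=len(points)):
        f = dict(zip(points, values))
        if is_deterministic_experiment(f, points, open_sets):
            yield f


# Hypothetical connected space: every non-empty open set contains the point 0.
connected_opens = {frozenset(), frozenset({0}), frozenset({0, 1}), frozenset({0, 1, 2})}
# Hypothetical disconnected space: the discrete topology on two points.
discrete_opens = {frozenset(), frozenset({0}), frozenset({1}), frozenset({0, 1})}

print(list(deterministic_experiments([0, 1, 2], connected_opens, ["a", "b"])))
print(list(deterministic_experiments([0, 1], discrete_opens, ["a", "b"])))
```

On the connected space only the constant functions survive, while on the discrete space every function does.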

Obviously this is a slightly toy model, and the conclusion is practically baked into the premise, so it might not map to reality that closely.

But how could it fail to do so?

One way it can’t fail to do so is that the underlying reality might “really” be disconnected. That doesn’t matter, because it’s not a result about the underlying reality, it’s a result about models of reality, and most of our models of reality are connected regardless of whether the underlying reality is. But certainly if our model is somehow disconnected (e.g. we live in some simulation by a cellular automaton) then this result doesn’t apply.

It could also fail because we have access to experiments that grant us infinite precision. That would be weird, and certainly doesn’t correspond to any sort of experiment I know about – mostly the thing we measure reality with is other reality, which tends to put a bound on how precise we can be.

It could also fail to be interesting in some cases. For example if our purpose is to measure a mathematical constant that we’re not sure how to calculate then we want the result of our experiment to be a constant function (but note that this is only for mathematical constants. Physical constants that vary depending on where in the space of our model we are don’t get this get out clause).

There are also classes of experiments that don’t fall into this construction: For example, it might be that \(\mathcal{O}\) itself has some topology on it, our experiments are actually continuous functions into \(\mathcal{O}\), and that we can’t actually observe which point we get in \(\mathcal{O}\), only its value up to some open set. Indeed, the experiments we’ve already considered are the special case where \(\mathcal{O}\) is discrete. The problem with this is that then \(f(X)\) is a connected subset of \(\mathcal{O}\), so we’ve just recursed to the problem of determining where we are in \(\mathcal{O}\)!

You can also have experiments that are deterministic whenever they work but tell you nothing when they fail. So for example you could have an experiment that returns \(1\) or \(0\), and whenever it returns \(1\) you know you’re in some open set \(U\), but when it returns \(0\) you might or might not be in \(U\), you have no idea. This corresponds to the above case of \(\mathcal{O}\) having a topology, where we let \(\mathcal{O}\) be the Sierpiński space. This works by giving up on the idea that \(0\) and \(1\) are “distinguishable” elements of the output space – under this topology, the set \(\{0\}\) is not open, and so the set \(U_0\) need not be, and the connectivity argument falls apart.

And finally, and most interestingly, our experiment might just not be defined everywhere.

Consider a two parameter model of reality. e.g. our parameters are the mass of a neutron and the mass of a proton (I know these vary because binding energy or something, but let’s pretend they don’t for simplicity of example). So our model space is \((0, \infty)^2\) – a model which is certainly connected, and it’s extremely plausible that we cannot determine each value to more than finite precision. Call these parameters \(u\) and \(v\).

We want an experiment to determine whether protons are more massive than neutrons.

This is “easy”. We perform the following sequence of experiments: We measure each of \(u\) and \(v\) to within a value of \(\frac{1}{n}\). If \(|u - v| > \frac{2}{n}\) then we know their masses precisely enough to answer the question and can stop and return the answer. If not, we increase \(n\) and try again.

Or, more abstractly, we know that the sets \(u > v\) and \(u < v\) are open subsets of our model, so we just return whichever one we’re in.

These work fine, except for the pesky case where \(u = v\) – protons and neutrons are equally massive. In that case our first series of experiments never terminates and our second one has no answer to return.
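The first procedure can be sketched in a few lines. The `measure` function here is a hypothetical instrument that rounds to the nearest multiple of \(\frac{1}{n}\) (so its reading is within \(\frac{1}{n}\) of the truth), and `max_n` is an artificial cut-off added only so that the \(u = v\) case returns instead of looping forever.

```python
from fractions import Fraction


def measure(true_value, n):
    """Hypothetical instrument: a reading within 1/n of the true value
    (here the nearest multiple of 1/n)."""
    return Fraction(round(true_value * n), n)


def compare_masses(u, v, max_n):
    """Measure at increasing precision until the readings are far enough apart
    to be sure of the order. Returns None if no answer is found by max_n."""
    for n in range(1, max_n + 1):
        a, b = measure(u, n), measure(v, n)
        if abs(a - b) > Fraction(2, n):
            return "u > v" if a > b else "u < v"
    return None


print(compare_masses(Fraction(3), Fraction(1), 100))
print(compare_masses(Fraction(1), Fraction(11, 10), 100))
print(compare_masses(Fraction(1), Fraction(1), 1000))  # equal values: never decided
```

The closer \(u\) and \(v\) are, the larger \(n\) has to get before the loop can stop, and at \(u = v\) no finite \(n\) is enough.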

So we have deterministic experiments (assuming we can actually deterministically measure things to that precision, which is probably false but I’m prepared to pretend we can for the sake of the example) that give us the answer we want, but it only works in a subset of our model: The quarter plane with the diagonal removed, which is no longer a connected set!

Fundamentally, this is a result about boundaries in our models of reality – any attempt to create a deterministic experiment will run into a set like the above plane: Suppose we had a deterministic experiment which was defined only on some subset of \(X\). Then we could find some \(o\) with \(U_o\) a non-empty proper subset of \(X\). Then the set \(\overline{U_o} \cap U_o^c\) where the closure of \(U_o\) meets its complement (which is non-empty because \(X\) is connected) is a boundary like the diagonal above – on one side of it we know that \(f\) returns \(o\). On the other side we know that it doesn’t return \(o\), but in the middle at the boundary it is impossible for us to tell.

What are the implications?

Well, practically, not much. Nobody believed that any of the experiments we’re currently performing are fully deterministic anyway.

But philosophically this is interesting to me for a couple of reasons:

  1. I for one was very surprised that such a trivial topological result had such a non-trivial physical interpretation.
  2. The idea that non-determinism was some intrinsic property of measurement and not a consequence of underlying physical non-determinism is not one that had ever previously occurred to me.
  3. We need to be very careful about boundaries in our models of reality, because we often can’t really tell if we’re on them or not.
  4. It may in fact be desirable to assume that all interesting quantities are never equal unless we have a deep theoretical reason to believe them to be equal, which largely lets us avoid this problem except when our theory is wrong.

(As per usual, if you like this sort of thing, vote with your wallet and support my writing on Patreon! I mean, you’ll get weird maths posts either way, but you’ll get more weird maths posts, and access to upcoming drafts, if you do).

Convergence of alternating sums

Colin Beveridge tweeted about the following Formula for Pi the other day:

\(\pi = 1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} - \frac{1}{5} + \frac{1}{6} + \ldots\)

With the exact formula being:

After the first two terms, the signs are determined as follows: If the denominator is a prime of the form 4m – 1, the sign is positive; if the denominator is a prime of the form 4m + 1, the sign is negative; for composite numbers, the sign is equal to the product of the signs of its factors.

This is due to Euler.
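The sign rule is straightforward to compute by trial division. This is a sketch with made-up function names; it reads the rule as applying to prime factors counted with multiplicity, with the factor 2 contributing a positive sign.

```python
def euler_sign(n):
    """Sign of the 1/n term: flip once for every prime factor of the form 4m + 1,
    counted with multiplicity. Factors of 2 and primes 4m - 1 leave it unchanged."""
    sign = 1
    d = 2
    while d * d <= n:
        while n % d == 0:
            if d % 4 == 1:
                sign = -sign
            n //= d
        d += 1
    if n > 1 and n % 4 == 1:  # leftover prime factor
        sign = -sign
    return sign


def euler_partial_sum(terms):
    """Partial sum of the series over denominators 1..terms."""
    return sum(euler_sign(n) / n for n in range(1, terms + 1))


print([euler_sign(n) for n in range(1, 11)])
print(euler_partial_sum(100_000))
```

The series converges slowly, so the partial sums only approach \(\pi\) gradually.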

(If you just want to see a proof of this, skip to near the end of this post)

I opined that it wasn’t that surprising, because for any real number \(x\) you can get a sequence of alternating signs for this sum which converge to it as follows:

Inductively define \(\epsilon_n\) as follows: \(\epsilon_{n+1}\) is \(1\) if \(\sum\limits_{i \leq n} \frac{\epsilon_i}{i} < x\), else \(\epsilon_{n+1} = -1\).

After the first time the partial sums \(s_n = \sum\limits_{i \leq n} \frac{\epsilon_i}{i}\) cross \(x\), you must have \(|s_n - x| \leq \frac{1}{n}\), so the sum converges to \(x\).
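The greedy construction is easy to run. The following sketch uses floating point, so the \(\frac{1}{n}\) bound only holds up to rounding error, and the function name is made up for illustration.

```python
import math


def greedy_signs(x, terms):
    """Choose each sign to step towards x: +1 while the partial sum is below x,
    -1 otherwise. Returns the list of signs and the final partial sum."""
    signs = []
    partial = 0.0
    for n in range(1, terms + 1):
        sign = 1 if partial < x else -1
        partial += sign / n
        signs.append(sign)
    return signs, partial


signs, partial = greedy_signs(math.pi, 10_000)
print(signs[:20])
print(partial, abs(partial - math.pi))
```

For \(x = \pi\) the first thirteen signs are all positive, since the harmonic sums need that long to pass \(\pi\), and from then on the partial sums stay within \(\frac{1}{n}\) of the target.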

There are many inequivalent sums that lead to \(x\) too. e.g. You can make the convergence arbitrarily slow if you like: If you have some sequence of non-negative numbers \(b_n \to 0\) then you can construct a sequence \(\epsilon_n\) as above where instead of changing the sign every time you cross \(x\), you change the sign to negative when you cross \(x + b_n\), then back to positive when you cross \(x – b_{n+1}\), etc.

There’s nothing special about \(\frac{1}{n}\) here either – all that matters is that it’s a non-negative sequence tending to zero whose sum does not converge.

But there is something special about \(\frac{1}{n}\) here that means we can probably expect a particularly rich set of “natural” examples from it due to a slightly stronger condition it satisfies: It’s in \(l^2\). That is, \(\sum\frac{1}{n^2}\) converges (specifically to \(\frac{\pi^2}{6}\), but that doesn’t matter for now).

Why does this matter?

Well it turns out to be interesting for the following reason: If \(a_n\) is a sequence in \(l^2\), and \(B_1, \ldots B_n, \ldots\) are independent fair Bernoulli random variables, then \(\sum (-1)^{B_n} a_n\) converges to a finite value with probability 1. That is, if we just randomly pick each sign then the result will converge to something.

This follows from the following stronger theorem: Let \(A_i\) be a sequence of independent random variables with \(E(A_i) = 0\) and \(\sum E(A_i^2) < \infty\). Then \(\sum A_i\) converges to a finite value with probability one.

First a thing to note: Because of Kolmogorov’s zero-one law, such a sequence will converge with probability \(0\) or \(1\) – no in between values are possible. I think this makes the result significantly less surprising (though we won’t actually use this fact in the proof).

The key idea of the proof is this: We’ll use the bound on the tail behaviour that convergence gives us to look at the \(\lim\inf\) and the \(\lim\sup\) (the limit inferior and limit superior) of the sums of the random variables and show that these are equal almost surely. This implies that the limit exists and is equal to their common value (which might be infinite).

And the way we’ll do that is we’ll consider the possibility that they are well separated: Let \(D_{a, b} = \{ \lim\inf S_n < a < b < \lim\sup S_n\}\), where \(S_n = \sum\limits_{i \leq n} A_i\). Then \(\{\lim\inf S_n < \lim\sup S_n\} = \bigcup\limits_{a, b \in \mathbb{Q}} D_{a, b}\), which is a countable union, so if each \(D_{a, b}\) has probability zero then so does the whole set, and the limit exists.

So let’s now establish an upper bound on the probability of lying in \(D_{a, b}\). Pick any \(N\). If \(D_{a, b}\) occurs then the sums lie below \(a\) and above \(b\) (not at the same time) infinitely often after \(N\) because of the definitions of limit inferior and superior. So we can find some \(N < m < n\) with \(S_m < a\) and \(S_n > b\).

But then \(\sum\limits_{i=m}^n A_i > b – a\), and hence \(\sum\limits_{i=m}^n |A_i| > b – a\).

By the Cauchy-Schwarz inequality, for any \(t_1, \ldots t_n \geq 0\) we have \(n \sum t_i = (1, \ldots, 1) \cdot (t_1, \ldots, t_n) \leq ||(1, \ldots, 1)|| ||(t_1, \ldots, t_n)|| = \sqrt{n} (\sum t_i^2)^{\frac{1}{2}}\), so \(\sum t_i^2 \geq n (\sum t_i)^2\). Edit: This step is wrong. The result is still true because of deeper theory than I go into here (it comes from martingale theory) and this step may be recoverable. I’ll try to fix it later. Thanks to @sl2c for pointing this out.

In particular applied to the above this means that  \(\sum\limits_{i=m}^n |A_i|^2 \geq (n – m) (\sum |A_i|)^2 \geq (n – m)(b – a)^2 \geq (b – a)^2 \).

But \(\sum\limits_{i=m}^n |A_i|^2 \leq \sum\limits_{i=N}^\infty |A_i|^2\). So we must have \(\sum\limits_{i=N}^\infty |A_i|^2 \geq (b – a)^2\).

But for any non-negative random variable \(X\) and any \(x > 0\), \(P(X \geq x) \leq \frac{E(X)}{x}\) (Markov’s inequality). This means we must have \(P(D_{a, b}) \leq P\left(\sum\limits_{i=N}^\infty |A_i|^2 \geq (b - a)^2\right) \leq \frac{E(\sum\limits_{i=N}^\infty |A_i|^2)}{(b - a)^2} = \frac{\sum\limits_{i=N}^\infty E(|A_i|^2)}{(b - a)^2} \).

But \(N\) was arbitrary, and we know \(\sum\limits_{i=N}^\infty E(|A_i|^2) < \infty\), so as \(N \to \infty\) the right hand side tends to zero. Therefore \(P(D_{a, b}) = 0\) as desired and the result is almost proved.

Note that we’ve not used the independence of the \(A_i\) so far! It’s only necessary to prove that the sums converge to a finite value. To see that we need some sort of condition for that, consider the following: Let \(X\) be a random variable that takes the values \(1, -1\) with equal probability. Let \(A_i = \frac{X}{i}\). Then \(E(A_i) = 0\) and \(\sum E(A_i^2) < \infty\), but \(\sum A_i\) takes the values \(\infty, -\infty\) with equal probability (and no other values).

But with independence this can’t happen for the following reason: For any random variable we have \(E(X) \leq E(X^2)^{\frac{1}{2}}\) (because \(0 \leq \mathrm{Var}(X) = E(X^2) - E(X)^2\), though it also follows from other less elementary results). Applying this to \(|S_n|\), we have \(E(|S_n|) \leq E(S_n^2)^{\frac{1}{2}}\). But \(E(S_n^2) = \sum\limits_{i, j \leq n} E(A_i A_j) = \sum\limits_{i \leq n} E(A_i^2)\), because \(A_i, A_j\) are independent for \(i \neq j\), so \(E(A_i A_j) = E(A_i) E(A_j) = 0\).

This means that, by Fatou’s lemma, \(E(|\sum A_i|) \leq \liminf E(|S_n|) \leq \sqrt{\sum E(A_i^2)} < \infty\). Thus, necessarily, \(P(|\sum A_i| = \infty) = 0\).
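The role of independence in the second moment can be checked exactly for the coin-tossing case \(A_i = \pm \frac{1}{i}\) by enumerating every sign pattern. The sketch below is a small illustration with hypothetical helper names; it compares independent signs with the shared-sign counterexample \(A_i = \frac{X}{i}\).

```python
from fractions import Fraction
from itertools import product

def second_moment_independent(n):
    """E(S_n^2) for independent fair signs: average over all 2^n sign patterns."""
    total = Fraction(0)
    for signs in product((1, -1), repeat=n):
        s = sum(Fraction(e, i) for i, e in enumerate(signs, start=1))
        total += s * s
    return total / 2**n

def second_moment_shared_sign(n):
    """E(S_n^2) when every term uses the same sign X, i.e. A_i = X / i."""
    h = sum(Fraction(1, i) for i in range(1, n + 1))
    return h * h  # S_n = X * H_n and X^2 = 1

n = 10
sum_of_squares = sum(Fraction(1, i * i) for i in range(1, n + 1))
print(second_moment_independent(n) == sum_of_squares)
print(float(second_moment_shared_sign(n)), float(sum_of_squares))
```

With independent signs the cross terms cancel and the second moment is exactly \(\sum \frac{1}{i^2}\), which stays bounded. With a shared sign it is the square of the harmonic number, which grows without bound as \(n\) increases.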

And now the result is proved.

The proper setting for the above result is really the theory of Martingales, and the proof I gave above is mostly a modified proof of Doob’s Second Martingale convergence theorem, where I’ve used a stronger hypothesis to bound the difference in the inferior and the superior limits.

In fact, for the special case of our coin tossing a stronger statement is true: Suppose \(a_n\) is a sequence of positive numbers such that \(\sum a_n = \infty\) and \(\sum a_n^2 < \infty\). Let \(A_n\) be a sequence of independent random variables with \(A_n = \pm a_n\), taking each value with equal probability. Then for any interval \(a < b\), \(P(a < \sum A_n < b) > 0\).

To see this, write \((a, b) = (x – \epsilon, x + \epsilon)\) by letting \(x = \frac{a + b}{2}\) and \(\epsilon = \frac{b – a}{2}\).

Run the process above for constructing a sequence converging to \(x\) until we have \(N\) such that our choices so far have led us to \(|\sum_{i \leq N} A_i - x| < \frac{\epsilon}{2}\) and \(\sum_{i > N} a_i^2 < \frac{\epsilon^2}{8}\). The initial sequence of choices that led us here happens with probability \(2^{-N}\), and the condition on the sum of the tail guarantees via Chebyshev’s inequality that \(P(|\sum_{i > N} A_i| \geq \frac{\epsilon}{2}) \leq \frac{\epsilon^2 / 8}{(\epsilon / 2)^2} = \frac{1}{2}\), i.e. \(P(|\sum_{i > N} A_i| < \frac{\epsilon}{2}) \geq \frac{1}{2}\). The tail is independent of the first \(N\) choices, so we have

\(P(|\sum A_i - x| < \epsilon) \geq P(|\sum\limits_{i \leq N} A_i - x| < \frac{\epsilon}{2}) P(|\sum\limits_{i > N} A_i| < \frac{\epsilon}{2}) \geq 2^{-N-1} > 0\).

So any value is not only possible but at least somewhat probable.

But this is a true continuous distribution, in the sense that \(P(\sum A_i = u) = 0\) for any value \(u\). To see this, consider the values taken only on a subset of indices. Let \(A(T) = \sum\limits_{i \in T} A_i\). Then \(A(T)\) and \(A(\mathbb{N} \setminus T)\) are independent, so if \(A = \sum A_i\) has atoms then both of \(A(T)\) and  \(A(\mathbb{N} \setminus T)\) must also. But if we pick \(T\) to be some subsequence \(t_n\) where \(a_{t_n} < 2^{-n}\) then all assignments of signs produce a unique result, so the probability of achieving any given value is zero. (Thanks to Robert Israel for this proof).

But still, \(\pi\) is a bit of an outlier: The variance of this distribution is of course \(\sum \frac{1}{n^2} = \frac{\pi^2}{6}\), so \(\pi\) is \(\sqrt{6} \approx 2.45\) standard deviations out. I don’t have an exact estimate of how probable that is for this distribution, but again by Chebyshev’s inequality we know that we can’t get a result at least this extreme more than a sixth of the time.
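A quick Monte Carlo run gives a feel for these numbers. This is a rough sketch under stated assumptions: the sum is truncated at 1000 terms, 2000 samples are drawn, and the seed is fixed, none of which come from the original argument.

```python
import math
import random

def sample_sum(rng, n_terms):
    """One draw of sum_{n <= n_terms} (+-1) / n with fair independent signs."""
    return sum((1 if rng.random() < 0.5 else -1) / n for n in range(1, n_terms + 1))

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_sum(rng, 1000) for _ in range(2000)]

mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
# Fraction of samples at least as extreme as pi in absolute value.
extreme = sum(1 for s in samples if abs(s) >= math.pi) / len(samples)

print(variance, math.pi ** 2 / 6)
print(extreme, 1 / 6)
```

The sample variance should land near \(\frac{\pi^2}{6} \approx 1.645\), and the fraction of samples with absolute value at least \(\pi\) should sit below the Chebyshev bound of \(\frac{1}{6}\).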

So why do we get this particular value here?

Well it turns out to be an unsurprising result for another reason, which is that it comes from a general technique for producing interesting infinite sums by starting from interesting products.

Suppose we have \(\sum f(n) \frac{1}{n}\) where \(f(n)\) is some (completely) multiplicative function, i.e. \(f(mn) = f(m) f(n)\) for all \(m, n\). Then we can write this as \(\prod (1 - f(p_i) \frac{1}{p_i})^{-1}\), with \(p_i\) being the i’th prime. There are issues of convergence we should worry about here, but in the finest Eulerian style we won’t. Both sides converging is probably a sufficient condition. A lot of this can probably be fixed by introducing a \(t\) parameter, assuming \(|t| < 1\) and taking the limit \(t \to 1\) of \(\sum t^n f(n) \frac{1}{n}\) and  \(\prod (1 - t f(p_i) \frac{1}{p_i})^{-1}\), then because mumble mumble analytic continuation the two limits are equal. I haven’t checked the details, so I won’t do this here and will just follow Euler’s convention of assuming infinity isn’t a problem.

To prove this, we’ll consider the following sequence of infinite sums, \(P_n = \sum\limits_{k \in R_n} f(k) \frac{1}{k} \), where \(R_n\) is the set of positive integers not divisible by any of the first \(n\) primes (so \(R_0 = \mathbb{N}\)).

If we split \(R_n\) up into the set of numbers which are a multiple of \(p_{n+1}\) and those that are not, we get the recurrence relationship that \(P_n = \frac{f(p_{n+1})}{p_{n+1}} P_n + P_{n+1}\), by just taking out a single factor of \(\frac{f(p_{n+1})}{p_{n+1}} \) from the components of the sum over the values that are divisible by \(p_{n+1}\). So \((1 - \frac{f(p_{n+1})}{p_{n+1}} ) P_n = P_{n+1}\).

Iterating this, we get \(P_0 \prod\limits_{i=1}^\infty (1 - \frac{f(p_i)}{p_i}) = \lim\limits_{n \to \infty} P_n = 1\), because in the limit the only number left in \(R_n\) is \(1\), and \(f(1) = 1\).

i.e. \(\sum f(n) \frac{1}{n} = P_0 = \prod\limits_{i=1}^\infty (1 - \frac{f(p_i)}{p_i})^{-1} \) as desired.
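As a concrete instance of the splitting step, here is the very first case, with \(p_1 = 2\), written out. It only uses \(f(2m) = f(2) f(m)\) and, as everywhere in this argument, ignores questions of convergence.

```latex
\begin{align*}
P_0 &= \sum_{n \geq 1} \frac{f(n)}{n}
     = \sum_{m \geq 1} \frac{f(2m)}{2m} + \sum_{n \text{ odd}} \frac{f(n)}{n} \\
    &= \frac{f(2)}{2} \sum_{m \geq 1} \frac{f(m)}{m} + P_1
     = \frac{f(2)}{2} P_0 + P_1 .
\end{align*}
```

Rearranging gives \((1 - \frac{f(2)}{2}) P_0 = P_1\), and every later step has the same shape with the next prime in place of \(2\).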

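We can also go from products to sums: If we have \(\prod (1 - b_i )^{-1} \) then we can expand each factor as a geometric series \(1 + b_i + b_i^2 + \ldots\) and write the product as \(\sum S_n \), where \(S_n\) is the sum of all products of \(n\) of the \(b_i\) (allowing repetitions, and not counting reorderings separately), with \(S_0 = 1\).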

If we then have \(b_i = f(p_i) \frac{1}{p_i}\), this becomes \(\prod (1 - f(p_i) \frac{1}{p_i})^{-1}\), the products all become unique (by unique factorisation), and we have \(\sum \frac{f(n)}{n}\) again, where we extend \(f\) to all numbers by multiplying its values on their prime factors.

Using these tools, proving the sequence result for \(\pi\) becomes a matter of (somewhat tedious) computation.

We start with \(\frac{\pi}{4} = \arctan(1) = 1 - \frac{1}{3} + \frac{1}{5} - \ldots\) using the standard power series for arctan. This can be written in the above form with \(f(n) = 0\) if \(n\) is even, \(f(n) = -1\) if \(n \equiv 3 \pmod 4\) and \(f(n) = 1\) if \(n \equiv 1 \pmod 4\). So we have:

\(\frac{\pi}{4} = \prod (1 – f(p_i) \frac{1}{p_i})^{-1}\).

We can do the same with the standard sum \(\frac{\pi^2}{6} = \sum \frac{1}{n^2} = \prod (1 – \frac{1}{p_i^2})^{-1}\) (because \(n \to \frac{1}{n}\) is certainly multiplicative).
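Both product formulas can be checked numerically by truncating them at a finite list of primes. The sketch below is a sanity check and not a proof; the cutoff of 100,000 and the helper names are assumptions made for the example.

```python
import math

def primes_up_to(limit):
    """Sieve of Eratosthenes."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    for p in range(2, math.isqrt(limit) + 1):
        if is_prime[p]:
            for m in range(p * p, limit + 1, p):
                is_prime[m] = False
    return [p for p, flag in enumerate(is_prime) if flag]

def f(n):
    """0 for even n, +1 for n = 1 mod 4, -1 for n = 3 mod 4."""
    return (0, 1, 0, -1)[n % 4]

primes = primes_up_to(100_000)
# Truncated product for pi/4 (the factor for p = 2 is 1 because f(2) = 0).
leibniz_product = math.prod(1 / (1 - f(p) / p) for p in primes)
# Truncated product for pi^2/6.
basel_product = math.prod(1 / (1 - 1 / p**2) for p in primes)

print(leibniz_product, math.pi / 4)
print(basel_product, math.pi ** 2 / 6)
```

The second product converges quickly. The first is only conditionally convergent, so its truncations wobble around \(\frac{\pi}{4}\) and approach it much more slowly.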

Now, divide the second product by the first and we get:

\(\frac{2}{3} \pi = \frac{1}{1 – \frac{1}{2^2}} \prod\limits_{p_i > 2} \left( \frac{1 – \frac{1}{p_i^2}}{1 – f(p_i) \frac{1}{p_i}} \right)^{-1}\)

The term outside the product comes from the fact that there’s no \(2\) term in our product for \(\frac{\pi}{4}\) and is equal to \(\frac{4}{3}\), so rearranging we get \(\frac{\pi}{2} = \prod\limits_{p_i > 2} \ldots\).

The term inside the product actually splits up fairly nicely, because we can factor \(1 - \frac{1}{p_i^2} = (1 - \frac{1}{p_i})(1 + \frac{1}{p_i})\). Since \(f(p_i) = \pm 1\) for odd primes, the bottom is one of these two terms, so the fraction reduces to whichever of the two the bottom is not, i.e. to \(1 + f(p_i) \frac{1}{p_i}\).

So from this we conclude that

\(\frac{\pi}{2} = \prod_{p_i > 2} (1 + f(p_i) \frac{1}{p_i})^{-1}\), or \(\pi = \frac{1}{1 - \frac{1}{2}}  \prod_{p_i > 2} (1 + f(p_i) \frac{1}{p_i})^{-1}\).

If we define \(g\) as \(g(2) = 1\), \(g(p) = -f(p)\) for \(p\) an odd prime, and extend to composite numbers multiplicatively, this then becomes  \(\pi = \prod (1 - g(p_i) \frac{1}{p_i})^{-1} = \sum g(n) \frac{1}{n}\), which was our desired sum.
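To see the resulting series in action, the signs \(g(n)\) can be computed with a smallest-prime-factor sieve and the partial sums compared against \(\pi\). This is a hedged sketch: the helper names and the cutoff of 200,000 terms are choices made for the example, and the series converges slowly, so only rough agreement should be expected.

```python
import math

def smallest_prime_factors(limit):
    """spf[n] is the smallest prime dividing n, for 2 <= n <= limit."""
    spf = list(range(limit + 1))
    for p in range(2, math.isqrt(limit) + 1):
        if spf[p] == p:  # p is prime
            for m in range(p * p, limit + 1, p):
                if spf[m] == m:
                    spf[m] = p
    return spf

def euler_signs(limit):
    """g(2) = 1, g(p) = +1 for p = 3 mod 4, -1 for p = 1 mod 4, extended multiplicatively."""
    spf = smallest_prime_factors(limit)
    g = [0] * (limit + 1)
    g[1] = 1
    for n in range(2, limit + 1):
        p = spf[n]
        g_p = 1 if p == 2 or p % 4 == 3 else -1
        g[n] = g_p * g[n // p]  # peel off one prime factor at a time
    return g

limit = 200_000
g = euler_signs(limit)
partial = sum(g[n] / n for n in range(1, limit + 1))
print(g[1:11])
print(partial, math.pi)
```

The first ten signs reproduce the pattern from the start of the post: everything positive except the terms for \(5\) and \(10\).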

This proof more or less follows Euler’s, via the translation by Springer (which I borrowed off a friend and would honestly recommend you do the same rather than paying more than £100 for). The details are a bit different mostly because it was the only way I could follow them – in particular he tends to eschew the detailed algebra and just manipulate examples – and unlike him I feel guilty about the lack of precise analysis of convergence, but it’s certainly derived from his.

The product representation also allows us to strengthen our original result: Not only can any number be the limit of the sum by choosing signs appropriately, for any positive number we can insist that the signs are multiplicative in the sense that the sign assigned to \(mn\) is the product of the signs assigned to \(m\) and \(n\). (The restriction to positive numbers is because every factor in the product is positive.)

The reason for this is that \(\sum \frac{1}{p_i}\) diverges. As a result, \(\prod\limits_{p_i > n} (1 + \frac{1}{p_i}) = \infty\) and \(\prod\limits_{p_i > n} (1 - \frac{1}{p_i}) = 0\). This means we can just repeat our original construction with the product instead of the sum: Assign positive signs as long as the partial product is less than the desired result, assign negative ones as long as it’s greater. The result is a product converging to the desired value, which we can then turn into a sum converging to the desired value with multiplicative signs.
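The product version of the greedy construction can be sketched the same way as the sum version. The target value of \(2\), the cutoff of 10,000 and the function names below are assumptions for illustration; the target has to be positive for this to work.

```python
import math

def primes_up_to(limit):
    """Sieve of Eratosthenes."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    for p in range(2, math.isqrt(limit) + 1):
        if is_prime[p]:
            for m in range(p * p, limit + 1, p):
                is_prime[m] = False
    return [p for p, flag in enumerate(is_prime) if flag]

def greedy_prime_signs(target, limit):
    """Choose a sign for each prime so that prod (1 - sign/p)^(-1) tracks target."""
    signs = {}
    partial = 1.0
    for p in primes_up_to(limit):
        # A positive sign gives a factor above 1, a negative sign a factor below 1.
        sign = 1 if partial < target else -1
        signs[p] = sign
        partial /= 1 - sign / p
    return signs, partial

signs, partial = greedy_prime_signs(2.0, 10_000)
print(partial)
```

Extending the chosen signs multiplicatively to all positive integers then gives the signs for the corresponding sum, in the same way that \(g\) was built from its values on the primes.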

So that more or less completes the journey down this particular rabbit hole: You can get any number by manipulating the signs, you should expect most choices of numbers to be at least reasonably plausible, and we have some useful machinery for picking good choices of signs for leading to specific numbers.

Whether or not it’s still surprising is up to you.

(This post turned out to be a ridiculous amount of work to write. If you want more like it, or just want to say thanks, I do have a Patreon for this blog. Any donations you want to send my way would be warmly appreciated)
