Category Archives: Numbers are hard

Some simple theorems about dominance of probability distributions

This is a post about some very simple results about a particular partial order on probability distributions. It came about because I was talking with Paul Crowley about whether I believed in there being a valid total ordering on probability distributions which says that one is strictly better than the other. My answer was that I did not. After some additional reflection I realised that there was a partial ordering I considered valid. After some additional additional reflection I realised it had a quite nice structure. This is a write up of the things I noticed about it. It’s all quite easy to prove, so the results are probably more interesting than the proofs themselves.

Let \(X, Y\) be natural number valued random variables.

Say \(X \preceq Y\) if \(P(X \geq n) \leq P(Y \geq n)\) for all \(n\). \(X \prec Y\) if \(X \preceq Y\) and \(X \neq Y\) (as distributions). Read this as “Y dominates X” or “X is dominated by Y”. The idea is simply that we always expect \(Y\) to produce better results than \(X\), if we read larger as better.

\(\preceq\) is a partial order because it's the intersection over \(n\) of the preorders given by \(P(X \geq n) \leq P(Y \geq n)\): if all of these tail probabilities agree then the two distributions are equal, which gives antisymmetry.

For \(p \in [0,1]\), let \(L(p, A, B)\) be the random variable which consists of drawing from \(A\) with probability \(p\) and from \(B\) with probability \(1 - p\).

Theorem: If \(A, B \preceq C\) then \(L(p, A, B) \preceq C\). Similarly if \(C \preceq A, B\) then \(C \preceq L(p, A, B)\).

Proof: \(P(L(p,A,B) \geq n) = p P(A \geq n) + (1 - p) P(B \geq n) \leq p P(C \geq n) + (1 - p) P(C \geq n) = P(C \geq n)\) as desired. Similarly the other way.
QED
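
To make the order concrete, here's a minimal sketch in Python for distributions with finite support, checking both dominance and the mixture theorem numerically. The function names (tail, dominates, mixture) and the example distributions are made up here purely for illustration.

```python
# Dominance and mixtures for finite-support distributions given as {value: probability}.
from fractions import Fraction


def tail(dist, n):
    """P(X >= n) for a distribution given as a {value: probability} dict."""
    return sum(p for k, p in dist.items() if k >= n)


def dominates(y, x):
    """True if X is dominated by Y, i.e. P(X >= n) <= P(Y >= n) for every n."""
    points = set(x) | set(y)
    return all(tail(x, n) <= tail(y, n) for n in points)


def mixture(p, a, b):
    """Draw from a with probability p and from b with probability 1 - p."""
    values = set(a) | set(b)
    return {v: p * a.get(v, 0) + (1 - p) * b.get(v, 0) for v in values}


A = {0: Fraction(1, 2), 2: Fraction(1, 2)}
B = {1: Fraction(1)}
C = {1: Fraction(1, 2), 3: Fraction(1, 2)}

assert dominates(C, A) and dominates(C, B)
# As the theorem says, every mixture of A and B is dominated by C too.
for i in range(11):
    assert dominates(C, mixture(Fraction(i, 10), A, B))
```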

Lemma: \(E(X) = \sum\limits_{n=1}^\infty P(X \geq n)\)
Proof: Let \(p_n = P(X = n)\).

\[\sum\limits_{n = 1}^\infty P(X \geq n) = \sum\limits_{n=1}^\infty \sum\limits_{m=n}^\infty p_m\]

A term \(p_k\) appears \(k\) times in the right hand side – it is present in the inner sum precisely when \(n \leq k\) and not otherwise. So we can rearrange the sum (the terms are all non-negative so this is valid without worrying about convergence issues) to get

\[\sum\limits_{n = 1}^\infty P(X \geq n) = \sum\limits_{n=1}^\infty n p_n = \sum\limits_{n=0}^\infty n p_n = E(X)\]
QED
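
As a quick sanity check of the lemma, here's a small Python calculation on a made-up distribution with finite support; both sides come out to \(\frac{7}{8}\).

```python
# Check E(X) = sum_{n >= 1} P(X >= n) on a small made-up distribution.
from fractions import Fraction

dist = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 8)}

expectation = sum(k * p for k, p in dist.items())
tail_sum = sum(
    sum(p for k, p in dist.items() if k >= n)
    for n in range(1, max(dist) + 1)
)
assert expectation == tail_sum == Fraction(7, 8)
```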

Theorem: If \(X \prec Y\) then \(E(X) < E(Y)\) (provided \(E(Y)\) is finite; if both expectations are infinite we only get \(\leq\)).

Proof: Suppose \(X \prec Y\). Then \(P(X \geq n) < P(Y \geq n)\) for some \(n\), as it's always \(\leq\) and if it were always equal then the distributions would be the same, which they're not by hypothesis. Therefore \(\sum P(X \geq n) < \sum P(Y \geq n)\), because all the other terms are \(\leq\). But according to the lemma this inequality is \(E(X) < E(Y)\) as desired.
QED

Lemma: Let \(x_n\) be a monotonic decreasing real-valued sequence converging to 0 such that \(x_0 = 1\). Then there is a random variable \(X\) with \(P(X \geq n) = x_n\).

Proof: Let \(p_n = x_n - x_{n+1}\). Because \(x_n\) is monotonic decreasing we have \(p_n \geq 0\). Also, \(p_n \leq x_n \leq x_0 = 1\). Further \(\sum\limits_{k=0}^n p_k = x_0 - x_1 + x_1 - x_2 + \ldots + x_n - x_{n+1} = 1 - x_{n+1}\). So \(\sum\limits_{k=0}^\infty p_k = \lim\limits_{n \to \infty} 1 - x_{n+1} = 1\). So \(p_n\) is a probability distribution. Let \(X\) be a random variable such that \(P(X = n) = p_n\). Then \(P(X \geq n) = \sum\limits_{k = n}^\infty p_k = \sum\limits_{k = n}^\infty (x_k - x_{k+1}) = x_n - x_{n+1} + x_{n+1} - x_{n+2} + \ldots = x_n\).
QED

Theorem: The set of natural number valued random variables with \(\preceq\) forms a lattice. i.e. for random variables \(X, Y\) there is a random variable \(X \wedge Y\) such that \(Z \preceq X, Y\) iff \(Z \preceq X \wedge Y\). Similarly there is \(X \vee Y\) such that \(X, Y \preceq Z\) iff \(X \vee Y \preceq Z\).

Proof: To get \(X \wedge Y\) take the sequence \(n \to \mathrm{min}(P(X \geq n), P(Y \geq n))\). This satisfies the conditions of the lemma, so we can find a random variable \(Z\) such that this is \(P(Z \geq n)\). By the definition of min this satisfies the desired properties. Similarly with max to get \(X \vee Y\).
QED
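
The meet construction is also easy to play with numerically. Here's a rough sketch for finite-support distributions: take the pointwise minimum of the tail probabilities and difference it back into a probability mass function exactly as in the lemma (the join is the same with max in place of min). The names and example distributions are mine.

```python
# The wedge (meet) of two finite-support distributions via their tail sequences.
from fractions import Fraction


def tails(dist, limit):
    """The tail sequence P(X >= n) for n = 0..limit."""
    return [sum(p for k, p in dist.items() if k >= n) for n in range(limit + 1)]


def from_tails(xs):
    """Recover a pmf from a decreasing tail sequence, via p_n = x_n - x_{n+1}."""
    xs = list(xs) + [Fraction(0)]
    return {n: xs[n] - xs[n + 1] for n in range(len(xs) - 1) if xs[n] > xs[n + 1]}


def meet(x, y):
    """X wedge Y: the distribution whose tails are the pointwise minima."""
    limit = max(max(x), max(y)) + 1
    return from_tails([min(a, b) for a, b in zip(tails(x, limit), tails(y, limit))])


X = {0: Fraction(1, 2), 3: Fraction(1, 2)}
Y = {1: Fraction(1, 2), 2: Fraction(1, 2)}

# Neither X nor Y dominates the other, but their meet (values 0 and 2, each with
# probability 1/2) is dominated by both.
print(meet(X, Y))
```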


More on infinitary decision strategies

I’ve mostly lost interest in further studying the infinitary version of the problem. The simple conclusion seems to be that the structure of the uncountable case is significantly more complicated than the structure of the set of measures on a sigma-algebra, so there’s probably not much more classification one can do. Here are some brief notes on examples:

Lemma: Every set admits a subset consistent strategy
Proof: Every set can be well-ordered. Let \(s(U)(V) = 1\) if \(\mathrm{min}(U) \in V\), else \(0\).

Theorem: Every probability measure \(m\) on \(S\) can be extended to a subset consistent strategy. That is, there is some subset consistent strategy with \(s(S) = m\).
Proof:

Let \(s'\) be a subset consistent strategy on \(S\) (one exists by the lemma above). Define \(s(U)(V) = \frac{m(U \cap V)}{m(U)}\) if \(m(U) > 0\), and \(s(U)(V) = s'(U)(V)\) otherwise.

That this works is a simple matter of case checking.
Let \(U \subseteq V \subseteq W\). We want to show \(s(W)(U) = s(W)(V) s(V)(U)\).

If \(m(W) = 0\) then \(s = s'\) on all the sets in question, so this follows from \(s'\) being subset consistent. Else if \(m(V) = 0\) then \(s(W)(V) = s(W)(U) = 0\) so both sides are 0. Else \(s(W)(U) = \frac{m(U)}{m(W)} = \frac{m(V)}{m(W)} \frac{m(U)}{m(V)} = s(W)(V) s(V)(U)\) as desired.

QED
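
For a finite \(S\) the whole construction can be checked by brute force. Here's a rough Python sketch (the names are mine, "subset consistent" is as defined in the post further down this page, and the fallback is the well-ordering strategy from the lemma above):

```python
# Brute-force check of subset consistency, s(W)(U) == s(W)(V) * s(V)(U) for U ⊆ V ⊆ W,
# for the construction on a small finite S. The measure m and all names are made up here.
from fractions import Fraction
from itertools import combinations

S = (0, 1, 2, 3)
m = {0: Fraction(1, 2), 1: Fraction(1, 2), 2: Fraction(0), 3: Fraction(0)}  # s(S) = m


def subsets(xs):
    return [frozenset(c) for r in range(len(xs) + 1) for c in combinations(xs, r)]


def fallback(u, v):
    """The well-ordering strategy: all mass on the least element of u."""
    return Fraction(1) if min(u) in v else Fraction(0)


def s(u, v):
    mu = sum(m[x] for x in u)
    if mu > 0:
        return sum(m[x] for x in u & v) / mu
    return fallback(u, v)


for w in subsets(S):
    for v in subsets(w):
        for u in subsets(v):
            if u and v and w:
                assert s(w, u) == s(w, v) * s(v, u)
```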

In fact every subset consistent strategy may be decomposed nearly this way. The only caveat is that the residual subset consistent strategy \(s'\) ends up being defined on the set \(S' = \{x : m(\{x\}) = 0\}\) rather than on all of \(S\).

As proof, just take \(m = s(S)\) and \(s' = s|_{S'}\).

However, somewhat self-evidently, this operation isn't idempotent: because in the construction we could have taken \(s'\) to be just about anything we wanted, we can do this again and again.

Here’s an interesting example: Let \(A\) be a well ordered set. Let \(S = \bigcup_{a \in A} I_a\) where each \(I_a\) is a disjoint copy of the unit interval \([0,1]\). Additionally, fix some well-ordering of \(S\). Give this the \(\sigma\)-algebra generated by the measurable sets of each \(I_a\).

Lemma: Every measurable set in this \(\sigma\)-algebra intersects either countably many or cocountably many of the \(I_a\).
Proof: Consider the collection of sets which either intersect only countably many of the \(I_a\) or wholly contain all but countably many of them. This collection contains the generating sets and is closed under countable unions and complements, so it contains every measurable set, and every set in it satisfies the lemma.

Define \(s(U)(V) = \frac{m(U \cap I_a \cap V)}{m(U \cap I_a)}\), where \(m\) is Lebesgue measure and \(a\) is minimal such that \(m(I_a \cap U) > 0\). If there is no such \(a\) then instead use the well-ordering’s strategy.

Claim: If \(A = \omega_2\) then given this \(s\) there is no well-ordered sequence of decompositions as above that gives \(s\).

Proof: Well, I haven’t really proved it to be honest. It has plausibility though. Here’s my handwavey reasoning:

Each decomposition step in this example just knocks off the first remaining interval, so you can't decompose it in fewer than \(\omega_2\) steps. But the set remaining at the \(\omega_1\)th step misses \(\omega_1\) of the \(I_a\) and meets the rest, so the set of indices it intersects is neither countable nor cocountable, contradicting the lemma.

QED

I’ve done some additional work on the partial orders on sets that such strategies define, but it doesn’t seem to go anywhere very fruitful. I’m going to consider this problem closed for now.


An infinitary extension to the previous theorem about random variables

This is an attempt at an infinitary version of my previous post on finite strategies.

Let \(S\) be some set with a sigma-algebra on it such that points are measurable. Let \(\mathcal{M}\) be the set of measurable subsets. A strategy is a function \(s\) from \(\mathcal{M}\) to the set of probability measures on \((S, \mathcal{M})\) such that \(s(A)(A) = 1\).

A strategy \(s\) is subset consistent if whenever \(U, V, W\) are measurable sets with \(U \subseteq V \subseteq W\) we have \(s(W)(U) = s(W)(V) s(V)(U)\).

It’s reasonably easy to see this reduces to the previous version if \(S\) is finite and the sigma-algebra is \(P(S)\). While random variables are replaced by measures, there’s not really much difference between them.

Now fix s as a subset consistent strategy.

We can convert this to the version we had before to get the relations \(\prec\) and \(\sim\) on \(S\). The finiteness of \(S\) was not used to derive any of the properties of these.

Further when we pass to the quotient we get a totally ordered set \(T\) and a partition \(\{L_t : t \in T\}\) such that if \(x \in L_s, y \in L_t\) then \(x \prec y\) iff \(s < t\). Everything we did after that point was reliant on the finiteness of \(S\), so we'll have to try harder here.

Theorem: \(T\) is the reverse of a well-ordered relation. i.e. every non-empty subset has a maximal element.

Proof: Note that we need only show that every countable non-empty subset has a maximal element, as given a set with no maximal element we can construct a strictly increasing sequence in it \(x_1 < x_2 < \ldots\), which would be a countable subset with no maximal element.

Suppose now we have some countable subset \(A \subseteq T\). Pick \(U \subseteq S\) such that \(U\) chooses one element from each of \(\{L_a : a \in A\}\). Then \(s(U)(\{x\}) \neq 0\) for some \(x \in U\), as otherwise \(s(U)(U) = 0\) (because \(U\) is countable). Then it can't be the case that \(x \prec y\) for any \(y \in U\). Let \(x \in L_t\). Then \(t\) must be a maximal element of \(A\).
QED

Theorem: \(L_t\) is countable.

Proof: First note that for any countable \(A \subseteq L_t\) it must be the case that for all \(x \in A\), \(s(A)(\{x\}) > 0\), for if not then they would all have to be \(0\) (because otherwise any zero ones would be \(\prec\) the non-zero ones), and thus we would have \(s(A)(A) = 0\) by countable additivity.

Now suppose \(L_t\) is uncountable. We can find at least \(\aleph_1\) distinct elements in it. Let \((x_a : a < \omega_1)\) be a sequence of length \(\omega_1\) consisting of distinct elements of \(L_t\). Let \(X_a = \{x_b : b \leq a\}\) and consider the sequence \(p_a = s(X_a)(\{x_0\})\).

Claim: If \(c < d\) then \(p_c > p_d\).
Proof: By subset consistency, \(p_d = s(X_d)(X_c) s(X_c)(\{x_0\}) \leq (1 - s(X_d)(\{x_d\})) p_c < p_c\).

So we have constructed a strictly decreasing sequence of real numbers of length \(\omega_1\). But this is impossible, for standard set theoretic reasons.
QED

This gives us a theorem very akin to our finitary version:

Theorem: Let \(s\) be a subset consistent strategy on \(S\). There is a reverse well ordered set \(T\), a partition \(\{L_t : t \in T\}\) and a weight function \(w : S \to (0, \infty)\) such that for countable \(U \subseteq S\), \(s(U)\) is determined as follows: Let \(t\) be the largest element of \(T\) such that \(L_t \cap U \neq \emptyset\), and let \(U' = L_t \cap U\). Then \(s(U)(V) = \frac{\sum\limits_{x \in U' \cap V} w(x)}{\sum\limits_{x \in U'} w(x)}\).

Proof: We've essentially done all the work for this. Define \(w(x) = s(L_t)(\{x\})\) where \(x \in L_t\). We need the countability of \(U\) only to get that \(s(U)(V) = \sum\limits_{x \in V} s(U)(\{x\})\), after which everything proceeds as in the finite case.
QED

So the behaviour for countable sets is extremely analogous to the finite case. How do things work for uncountable sets? I currently have no idea at all. I don't currently have any examples where the behaviour for uncountable sets is different, but I'd be moderately surprised if there weren't some. If anything, I'd expect it to be possible to build an example out of any probability measure.
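
To make the countable-set behaviour concrete, here's a small Python sketch of the representation in the theorem above, for a finite \(U\). The particular levels and weights are invented purely for illustration.

```python
# s(U) under the representation theorem: all the mass sits on the highest level
# that U meets, distributed proportionally to the weights w.
from fractions import Fraction

levels = {"a": 2, "b": 2, "c": 1, "d": 0}           # which L_t each point lies in (bigger = higher)
w = {"a": Fraction(1), "b": Fraction(3), "c": Fraction(1), "d": Fraction(1)}


def s(u, v):
    top = max(levels[x] for x in u)
    u_top = {x for x in u if levels[x] == top}       # U' = L_t ∩ U for the largest such t
    return sum(w[x] for x in u_top & v) / sum(w[x] for x in u_top)


print(s({"a", "b", "c"}, {"a"}))   # 1/4: a and b share the top level, weighted 1 : 3
print(s({"c", "d"}, {"c"}))        # 1: c alone is on the top level of this U
```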


A possible more principled approach to generation and simplification in hypothesis

Currently the way generation and simplification in Hypothesis work is very ad hoc and undisciplined. The API is spread across various places, and the actual behaviour is very under-specified.

There are two operations exposed, produce and simplify. produce takes a size parameter and produces values of the specified type of about the provided size, whereas simplify takes a value of the specified type and produces a generator over simplified variants of it.

The meaning of these terms is deliberately undefined: there is no specified meaning for “about that size” or “simplified variants”.

This post is an attempt to sketch out some ideas about how this could become better specified. I’m just going to use maths rather than Python here – some of the beginnings of this have made it into code, but it’s no more than a starting point right now.

Let \(S\) be a set of values we’ll call our state space. We will call a search tactic for \(S\) a triple:

\[\begin{align*}
\mathrm{complexity} & : S \to [0, \infty) \\
\mathrm{simplify} & : S \to [S]^{< \omega}\\
\mathrm{produce} & : [0, \infty) \to \mathrm{RV}(S) \\
\end{align*}\]

Where \(\mathrm{RV}(S)\) is the set of random variables taking values in \(S\) and \([S]^{<\omega}\) is the set of finite sequences taking values in \(S\). These operations should satisfy the following properties:

  1. \(\mathrm{h}(\mathrm{produce}(x)) = \min(x, \log(|S|))\)
  2. \(x \to \mathbb{E}(\mathrm{complexity}(\mathrm{produce}(x)))\) should be monotone increasing in \(x\)
  3. \(\mathrm{complexity}(y) \leq \mathrm{complexity}(x)\) for \(y \in \mathrm{simplify}(x)\)

Where \(\mathrm{h}\) is the entropy function.

In general it would be nice if the distribution of \(\mathrm{produce}(x)\) minimized the expected complexity, but I think that would be too restrictive.

The idea here is that the entropy of the distribution is a good measure of how spread out the search space is – a low entropy distribution will be very concentrated, whereas a high entropy distribution will be very spread out. This makes it a good “size” function. The requirement that the expected complexity be monotone increasing captures the idea of the search space spreading out, and the requirement that simplification not increase the complexity captures the idea of moving towards the values more like what you generated at low size.
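
Although this post deliberately sticks to maths rather than Python, here is roughly what the triple might look like as a Python interface. This is a hypothetical sketch, not Hypothesis's actual API; all of the names here are made up.

```python
# A hypothetical rendering of the (complexity, simplify, produce) triple as an interface.
from abc import ABC, abstractmethod
from typing import Generic, List, TypeVar

T = TypeVar("T")


class SearchTactic(ABC, Generic[T]):
    @abstractmethod
    def complexity(self, value: T) -> float:
        """A non-negative measure of how complicated a value is."""

    @abstractmethod
    def simplify(self, value: T) -> List[T]:
        """Finitely many variants, none more complex than the original."""

    @abstractmethod
    def produce(self, entropy: float) -> T:
        """Draw a random value; the distribution should have entropy
        min(entropy, log |S|) and expected complexity increasing in entropy."""
```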

Here are some examples of how you might produce search strategies:

A search strategy for positive real numbers could be:

\[\begin{align*}
\mathrm{complexity}(x) & = x \\
\mathrm{simplify}(x) & = x, \ldots, x - n \mbox{ for } n < x\\
\mathrm{produce}(h) & = \mathrm{Exp}(e^h - 1) \\
\end{align*}\]

The exponential distribution seems to be a nice choice because it's a maximum entropy distribution for a given expectation, but I don't really know if it's a minimal choice of expectation for a fixed entropy.

Another example. Given search strategies for \(S\) and \(T\) we can produce a search strategy for \(S \times T\) as follows:

\[\begin{align*}
\mathrm{complexity}(x, y) & = \mathrm{complexity}_S(x) + \mathrm{complexity}_T(y)\\
\mathrm{simplify}(x, y) & = [(a, b) : a \in \mathrm{simplify}_S(x), b \in \mathrm{simplify}_T(y)] \\
\mathrm{produce}(h) & = (\mathrm{produce}_S(\frac{1}{2}h), \mathrm{produce}_T(\frac{1}{2}h)) \\
\end{align*}\]

The first two should be self-explanatory. The produce function works because the entropy of a product of independent variables is the sum of the entropies of its components. It might also be potentially interesting to distribute the entropy less uniformly through the components, but this is probably simplest.

You can also generate a search strategy for sequences: given a search strategy for the natural numbers \(\mathbb{N}\) and a search strategy for \(S\), we can generate a search strategy for \([S]^{< \omega}\). If we define the complexity of a sequence as the sum of the complexities of its components, we can define its simplifications as coordinatewise simplifications of subsequences. The produce is the only hard one to define. Its definition goes as follows:

We will generate the length as a random variable \(N = \mathrm{produce}_{\mathbb{N}}(i)\) for some entropy \(i\). We will then allocate a total entropy \(j\) to the sequence of length \(N\). So the coordinates will be generated as \(x_i = \mathrm{produce}_S(\frac{j}{N})\). Let \(T\) be the random variable of the sequence produced. The value of \(N\) is completely specified by \(T\), so \(h(T) = h(T,N)\). We can then use conditional entropy to calculate this: we have that \(h(T | N = n) = j\) because we allocated the entropy equally between each of the coordinates.

So \(h(T) = h(N) + h(T | N) = i + j\)

So we can allocate the entropy between coordinates and length as we wish – either an equal split, or biasing in favour of longer sequences with simpler coordinates, or of shorter sequences with more complex coordinates.
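
Here's a hedged Python sketch of the first two worked examples, following the hypothetical SearchTactic interface from above. Reading \(\mathrm{Exp}(e^h - 1)\) as "an exponential with mean \(e^h - 1\)" and splitting the entropy evenly are my interpretation rather than a canonical implementation; the sequence strategy would follow the same pattern, drawing a length from a natural number tactic and then giving each coordinate an equal share of the remaining entropy.

```python
# Illustrative search tactics for positive reals and for pairs. Names and
# parameterisations are my own reading of the post, not a canonical implementation.
import math
import random


class PositiveReals:
    """A search tactic for positive reals."""

    def complexity(self, value):
        return value

    def simplify(self, value):
        # x - n for natural numbers n < x (including x itself, for n = 0).
        return [value - n for n in range(math.ceil(value))]

    def produce(self, entropy):
        # Reading Exp(e^h - 1) as "an exponential with mean e^h - 1"; at entropy 0
        # this degenerates to the constant 0.
        mean = math.exp(entropy) - 1.0
        return random.expovariate(1.0 / mean) if mean > 0 else 0.0


class Product:
    """A search tactic for pairs, built from tactics for each component."""

    def __init__(self, left, right):
        self.left, self.right = left, right

    def complexity(self, value):
        x, y = value
        return self.left.complexity(x) + self.right.complexity(y)

    def simplify(self, value):
        x, y = value
        return [(a, b) for a in self.left.simplify(x) for b in self.right.simplify(y)]

    def produce(self, entropy):
        # The entropy of an independent pair is the sum of the component entropies,
        # so an even split hits the target.
        return (self.left.produce(entropy / 2), self.right.produce(entropy / 2))


pairs = Product(PositiveReals(), PositiveReals())
print(pairs.produce(3.0), pairs.simplify((2.5, 1.2))[:3])
```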

Anyway, those are some worked examples. It seems to work reasonably well, and is more pleasantly principled than the current ad hoc approach. It remains to be seen how well it works in practice.


Nontransitive dice are unsurprising

Every now and then someone finds out that nontransitive dice exist and is surprised by this fact. I in turn am always kinda surprised that their existence is surprising.

Why? Well, basically I don’t expect things to be transitive. Transitivity is a very nice property. When the universe hands you a transitive (total) property it’s basically saying “Hey, here. You’re in luck. The decision problem on this is easy, so you can spend your time worrying about other things”. The universe is a bit of a dick, so I don’t really expect it to do that very often.

So what is transitivity? Transitivity basically says “If A is better than B and B is better than C then A is better than C”. So something is nontransitive if you can find A, B and C where A is better than B, B is better than C and C is better than A.

In particular, nontransitive dice are a set of dice A,B,C such that if you roll pairs of them and the one with the highest score wins, A will consistently beat B which will consistently beat C which will consistently beat A.

They’re quite neat, but their existence shouldn’t be surprising. This is basically a post about how you might go about constructing non-transitive dice if all you had was an inclination that they might exist. Hopefully by the time you get to the end of it you should find the fact that they exist as unsurprising as I do.

We’re going to start with something apparently unrelated: An example of non-transitivity which is especially dear to my heart. Condorcet’s Paradox.

What is Condorcet’s Paradox? It’s the following:

Suppose you run a voting system in which each voter ranks each candidate in order of preference. That is, a vote looks something like “A,B,C” to mean “I prefer A to both B and C and I prefer B to C”.

Given votes cast like this, it is possible to find yourself in a situation where the majority of people prefer A to B, B to C and C to A (where “prefer” here means “ranked higher”). i.e. even though each individual vote gave a transitive ordering of the candidates, the majority preference does not.

How is this possible? Well, let’s see if we can construct a simple example.

Well, first let’s convince ourselves that this is impossible with only two voters:

With only two voters, the only way to have a majority preference between two candidates is for both voters to agree on it.

So if A is preferred by the majority to B and B is preferred by the majority to C then both voters have ranked A ahead of B and B ahead of C. This means that both voters have also ranked A ahead of C, and so A is preferred to C.

So it’s impossible with two voters. Can we do it with three?

Well, in the same way as with two voters, if we have a unanimous preference between two candidates this isn’t going to work. So if this can possibly work, for each pair we need to have two voters ranking it one way and one ranking it the other. So let’s start with how the votes should look for A and B.

A,B
A,B
B,A

Now we need to figure out a way to insert C into the rankings in order to make this work.

We need to put C after B twice. If we do this for both the A,B votes then we’ll have A and B both preferred to C, which we don’t want, but we have to do it for one of them, which means we need one vote which is “A,B,C”. We also need to do it for the third vote, so this must look like either “B,C,A” or “B,A,C”.

We’re now down to few enough possibilities that we can just try things and see if they work. Let’s try B,C,A.

So we have two votes:

A,B,C
B,C,A

And one which is either A,C,B or C,A,B (because we need C to be before B in this ranking). We want C to be preferred to A in order to get non-transitivity, and they’re currently tied, so we need to have C before A, so this has to be C,A,B.

So our votes are now:

A,B,C
B,C,A
C,A,B

Our tallies now look like:

A is preferred to B because votes 1 and 3 put it before B.
B is preferred to C because votes 1 and 2 put it before C.
C is preferred to A because votes 2 and 3 put it before A.

So we do indeed have A preferred to B preferred to C preferred to A.
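
If you want to double-check those tallies, a few lines of Python will do it (the ballot encoding here is mine):

```python
# Count, for each ordered pair, how many of the three ballots rank the first
# candidate ahead of the second.
from itertools import permutations

ballots = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]

for x, y in permutations("ABC", 2):
    wins = sum(1 for b in ballots if b.index(x) < b.index(y))
    if wins >= 2:
        print(f"{x} is preferred to {y} by {wins} of 3 voters")
# Prints: A over B, B over C, C over A -- a cycle.
```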

I regard Condorcet’s Paradox as basically the protoexample of intransitivity: I find that 9 times out of 10 if you have an intransitive relation you can construct an example that looks like Condorcet’s Paradox.

Let’s see how to do that with our dice:

We’re going to construct dice which basically look like voters.

Specifically each of our dice will have opposite sides marked with the same value, so they can take only three distinct values. We will label these values “Low, Medium and High”. Values may be in the range 1-9, with low values being 1-3, medium being 4-6 and high being 7-9.

The idea of our construction is that any high roll will beat any medium or low roll, etc. Further, each die rolls low, medium or high with equal probability, so between any two dice the cross-category match-ups are symmetric. The only thing that gives one die an edge over another is therefore which one wins when they land in the same category. These categories low, medium and high will be our voters 1, 2 and 3, and one die will beat another more often than not if and only if the majority of voters rank it more highly.

So we want:

Low: A,B,C
Medium: B,C,A
High: C,A,B

Thus A will beat B if we roll low or high. B will beat C if we roll low or medium. C will beat A if we roll medium or high.

Conveniently, we have constructed our categories with three values each, so all we need to do is assign each of those values to a different die to get the ranking.

So:

A takes the values 3,4 and 8 (best low, worst medium, middle high)
B takes the values 2,6,7 (middle low, best medium, worst high)
C takes the values 1,5,9 (worst low, middle medium, best high)

And we get the desired result: On a category tie, A will beat B two thirds of the time, B will beat C two thirds of the time and C will beat A two thirds of the time.

We didn’t deliberately try for it, but it’s also worth noting that these all have the same expected score: 3 + 4 + 8 = 2 + 6 + 7 = 1 + 5 + 9 = 15. So they all have an expected score of 5.

As a side note, the actual victory chances are much lower than this: A tie will only happen a third of the time, and the rest of the time each die wins exactly half the time, so the real advantage is \(\frac{1}{3} \times \frac{2}{3} + \frac{2}{3} \times \frac{1}{2} = \frac{5}{9}\). So they will win more than half the time (just shy of 56% of the time), but not by much.
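
That \(\frac{5}{9}\) figure is easy to confirm by enumerating the nine equally likely pairs of values in each match-up:

```python
# Each die shows one of its three values uniformly at random, so enumerating the
# 3 x 3 = 9 equally likely pairs gives the exact win probability for each match-up.
from fractions import Fraction
from itertools import product

dice = {"A": [3, 4, 8], "B": [2, 6, 7], "C": [1, 5, 9]}

for winner, loser in [("A", "B"), ("B", "C"), ("C", "A")]:
    wins = sum(1 for a, b in product(dice[winner], dice[loser]) if a > b)
    print(winner, "beats", loser, "with probability", Fraction(wins, 9))
# Each line prints 5/9.
```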

So there we have it. Non-transitive dice. This isn’t the only way you can get them – you can make it work with only the normal values 1 to 6 for example – but it’s probably the simplest example to understand, and now that we’ve shown that it’s possible the rest is just optimising for specific details.
