Category Archives: Numbers are hard

Another example of a compact convex set

At the end of my last post on compact convex sets I claimed that every point in one was a countable sum of extreme points. I later realised that I’d made a stronger claim than the theorem actually justifies, and thus that it was probably false. As such I owe all 5 of you who read my maths posts a counter-example.

One reason it took a little figuring out is that I’m pretty sure the result is true for normed spaces. I haven’t worked out the details, but basically I think you can use a sequence of finite sums approximating it to within \(2^{-n}\) to build a countable sum equalling it exactly. I may bother to figure out the details, but I’m about 90% sure they will work.
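To spell out the skeleton of that idea (unchecked, so treat it as a sketch rather than a proof): the convex hull of the extreme points is dense in the set, so we can pick finite convex combinations \(y_n\) of extreme points with \(\|x - y_n\| \leq 2^{-n}\), and then

\[x = y_1 + \sum_{n=1}^{\infty} (y_{n+1} - y_n), \qquad \|y_{n+1} - y_n\| \leq \|y_{n+1} - x\| + \|x - y_n\| \leq 3 \cdot 2^{-(n+1)},\]

with the series converging absolutely; the remaining work is to reorganise it into a single countable sum of extreme points with the right coefficients.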

However the claim is false for locally convex (Hausdorff) topological vector spaces. Here’s an example:

Let \(V = \mathbb{R}^{2^{\aleph_0}}\) with the product topology. The basic open sets in this are convex, so it’s a locally convex topological vector space.

Let \(A\) be the set of non-negative vectors such that \(x_a + x_b \leq 1\) for all \(a \neq b\). That is, we’ve “cut off the corners” of the unit cube. This is a closed subset of the cube \([0, 1]^{2^{\aleph_0}}\), which is compact by Tychonoff’s theorem, and therefore \(A\) is itself compact. It’s an intersection of convex sets, so it is also convex.

The extreme points of this are then the origin and all the unit vectors \(e_a\). Therefore any countable sum of extreme points can only be non-zero on countably many coordinates (because convergence is pointwise). But the vector which is uniformly \(\frac{1}{4}\) everywhere is a member of \(A\) which is non-zero at uncountably many coordinates, and thus cannot be a countable sum of extreme points.


Some examples of compact convex sets

Due to reasons (connected to incomplete von Neumann-Morgenstern orders if you must know) I’ve been thinking about the Krein-Milman theorem recently, and I realised that I had some misconceptions in my head that were obviously false when I thought about them for five minutes. Here are some examples to illustrate its boundaries.

The Krein-Milman theorem says that every compact convex subset of a locally convex (Hausdorff) topological vector space is the closed convex hull of its extreme points.

Consider \(\mathbb{R}^n\) with the usual Euclidean norm. Then the unit ball \(\overline{B}(0, 1)\) is a compact convex set with infinitely many extreme points.

The same is true of the \(l^p\) norm on \(\mathbb{R}^n\) for any \(1 < p < \infty\), because such spaces are strictly convex.

Strict convexity fails for \(p = 1\) (whose extreme points are just the points \(\pm e_i\)) and for \(l^\infty\) (whose extreme points are the points with every \(x_i = \pm 1\)); both of these unit balls have only finitely many extreme points.
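As a throwaway sanity check on those counts, here’s a small Python sketch (illustrative only) enumerating the extreme points of both unit balls in \(\mathbb{R}^3\):

    from itertools import product

    n = 3

    # l^1 ball: the 2n signed unit vectors +/- e_i.
    l1_extreme = [tuple(s if j == i else 0 for j in range(n))
                  for i in range(n) for s in (1, -1)]

    # l^infinity ball: the 2^n vectors with every coordinate +/- 1.
    linf_extreme = list(product((1, -1), repeat=n))

    print(len(l1_extreme), len(linf_extreme))  # 6 8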

Moreover you cannot drop the “closed” part. Not every point is a convex combination of finitely many extreme points:

Consider \(A \subseteq l^\infty\) defined by \(A = \{x : |x_n| \leq \frac{1}{n} \text{ for all } n\}\). This is convex (because it’s a product of convex sets) and compact (it’s closed, hence complete, and you can construct \(\epsilon\)-nets explicitly). The extreme points are then exactly the points with \(|x_n| = \frac{1}{n}\) for every \(n\).

Let \(y = \sum\limits_{i=1}^k \lambda_i x^{(i)}\) be a finite convex combination of extreme points \(x^{(1)}, \ldots, x^{(k)}\). Then for each \(n\), \(n y_n\) can only take the at most \(2^k\) (and in particular finitely many) values \(\sum\limits_{i=1}^k \pm \lambda_i\).

Consider \(y_n = 2^{-n}\). Then \(y \in A\), but \(n y_n\) takes infinitely many values, and thus \(y\) cannot be a combination of finitely many extreme points of \(A\).
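As a quick numerical illustration of that last step (just a sketch; the real content is the counting argument above):

    from fractions import Fraction

    # y_n = 2^{-n} lies in A, since 2^{-n} <= 1/n for all n >= 1 ...
    assert all(Fraction(1, 2 ** n) <= Fraction(1, n) for n in range(1, 50))

    # ... but n * y_n = n / 2^n is strictly decreasing for n >= 2,
    # so it takes a different value for every such n.
    values = {n * Fraction(1, 2 ** n) for n in range(2, 50)}
    print(len(values))  # 48: one distinct value per n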

Note that it is true that every point in a compact convex set is a sum \(\sum\limits_{n=0}^\infty p_n x_n\) where \(\sum p_n = 1\) and the \(x_n\) are extreme points; you just need to allow the sum to be infinite.

Edit: Actually I think it’s not, though I don’t yet have a counterexample. In general you may need a full integral over a probability measure on the extreme points. I’m going to have a think about how to construct such a measure.


Correlation does not imply correlation

By now you’ve probably seen a graph like this:

[Graph: one of tylervigen.com’s spurious correlations]

(source)

The site tylervigen.com finds spurious and entertaining correlations. It’s pretty good. If you haven’t seen it I encourage you to go check it out.

Chances are if you’ve already seen it then you’ve seen it linked with some caption like “Reminder that correlation does not imply causation”.

I wish this wasn’t the take-home message that people were deriving from this, for two main reasons:

Firstly, I just don’t think this message needs pushing any harder. Most of the people who can usefully receive it either already understand stats well enough that they don’t need it or already understand stats badly enough that every time someone posts a paper they smugly assert “Ah, but correlation doesn’t imply causation” and then pat themselves on the back and say “Well done, me. You did a science. Good job. You deserve your cookie” (Sorry. This is one of the many peeves in my menagerie of pets). “Correlation doesn’t imply causation” has more or less reached saturation, and that saturation point is far larger than is actually useful.

But secondly, there’s something more interesting going on here.

Some of these correlations are very much of the classic common causal variable sort: Neither causes the other, but both are caused by the same thing. E.g. in the graph above, what we have is two variables which are both increasing over time, so they correlate. This is unsurprising. (If you’re wondering why they track so well on the graph, that’s another interesting thing demonstrated here: the two variables aren’t on the same scale. They only track that closely because the scales were adjusted to fit. This doesn’t actually affect the correlation, but it makes its display more convincing.) More of them are like this:

[Graph: age of Miss America vs. murders by steam, hot vapours and hot objects]

(source)

Sure, the tracking is not quite as perfect as before, but it’s still remarkably good.

Clearly people are inspired by Miss America to commit hot steamy murder.

Except, well, that’s probably not what’s really going on. I mean, I’m not ruling out the possibility that there is some underlying common causal factor here, but it doesn’t seem very likely. What seems more likely is that this correlation has zero predictive power and will actually fall apart if more data points are added. All that produced this graph is that the dice happened to roll the right way – the fact that the graphs have tracked each other for all 11 data points is nothing more than a coincidence.

“But David,” you cry, “that seems incredibly unlikely. How could such a perfect relationship occur by chance?”

I’m glad you asked, suspiciously convenient anonymous blog commenter. I’m glad you asked.

The simple answer is that this happens in two ways. First, we only have 11 data points, which means that you can get quite high correlations purely by chance. The standard critical value for a correlation which will occur by chance no more than one time in 100 with this many data points is 0.684. This is based on a model that doesn’t actually hold here, as it assumes normality of the data, which is unlikely to hold for all these variables, but it illustrates the point. And regardless of your model, a correlation that would hold by chance only about one time in a hundred is a pretty respectable-looking correlation.
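If you want to reproduce that figure, the usual recipe converts a Student-t critical value into a correlation threshold via \(r = t / \sqrt{(n - 2) + t^2}\). Here’s a quick sketch using scipy (assuming, since this is what recovers the quoted number, that it’s the one-tailed 1% value):

    from math import sqrt
    from scipy.stats import t

    n = 11                    # data points
    df = n - 2                # degrees of freedom for a correlation test
    t_crit = t.ppf(0.99, df)  # one-tailed 1% critical value of Student's t
    r_crit = t_crit / sqrt(df + t_crit ** 2)
    print(round(r_crit, 3))   # ~0.685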

Still, one time in a hundred is a pretty big coincidence, right?

Well… it is if you’re not looking for it. The thing about things which happen one time in a hundred is that they’re surprising if you look at a single thing and it happens, but they’re not very surprising when you look at 100 things and one of them exhibits it.

And when looking at correlations there are a lot more than a hundred things you’re looking at, because you’re not just looking at the different variables, you’re looking at pairs of them, and there are a lot more pairs than variables. So if there were e.g. 100 different variables then there would be 4950 pairs of variables (the formula is \(\frac{n(n-1)}{2}\): there are \(n\) choices for the first variable and \(n - 1\) choices for the other because it has to be a different one. This overcounts by a factor of two because the pairs are unordered, so you can get each pair two ways depending on which you pick first; dividing by 2 gives the answer).

This means that if you were looking for correlations that only occur one time in a hundred, you’d expect to find about 50 of them in 100 variables. And every time you double the number of variables, this goes up by a factor of (nearly) 4 – so if you had 200 variables you’d find 200 correlations, etc. In fact, the site seems to have about 3500 variables (based on some html scraping) so you’d expect it to find about 61,000 different significant correlations. Suddenly the fact that it was able to mine through all those variables to find things which track so well doesn’t seem very surprising, does it?
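To put concrete numbers to this (a back-of-envelope sketch that treats every pair as an independent one-in-a-hundred shot, which isn’t exactly right but gets the order of magnitude):

    from math import comb

    for n_vars in (100, 200, 3500):
        pairs = comb(n_vars, 2)             # n(n-1)/2 unordered pairs
        print(n_vars, pairs, pairs * 0.01)  # expected chance "hits" at the 1% level

    # 100  variables ->    4950 pairs, ~50 spurious correlations
    # 200  variables ->   19900 pairs, ~199
    # 3500 variables -> 6123250 pairs, ~61232, i.e. about 61,000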

Which is the message I’d like people to derive from this site. Not that correlation doesn’t imply causation, but that there will always appear to be correlations if you look hard enough, even where none really exist, and you shouldn’t expect that the correlations you find that way will actually predict anything about future values. This is an object lesson about the dangers of running too many experiments and not considering that most of the positive results will be false positives, not about the conclusions you draw from those results.
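And if you’d rather watch it happen than take the arithmetic on faith, here’s a small simulation (illustrative only; real-world time series are if anything worse behaved than these random walks): generate a few hundred trending series with no relationship whatsoever and go hunting.

    import numpy as np

    rng = np.random.default_rng(0)
    n_points, n_series = 11, 300

    # Independent random walks: trending series with no real relationship.
    walks = rng.normal(size=(n_series, n_points)).cumsum(axis=1)

    corr = np.corrcoef(walks)                # all pairwise correlations
    upper = np.abs(corr[np.triu_indices(n_series, k=1)])
    print(upper.max())                       # the best "discovery" on offer
    print((upper > 0.684).mean())            # fraction clearing the threshold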


Locally compact Hausdorff spaces and the Riesz Representation Theorem

So I’m trying to get my head around the proof of the Riesz Representation Theorem (I’ve mostly got it; I’m just trying to nail down the details). As part of doing this I was trying to figure out the role of the assumption that the space is locally compact Hausdorff: The proof mostly seems to go through with just normality (and maybe paracompactness?).

(Note: Asking questions like this is the mathematics equivalent of my “asking small questions” approach to learning.)

Eventually I realised where it was hiding. It’s not actually in the existence part of the proof; it’s in the uniqueness part: If the space is not locally compact then you can’t cover enough points with functions of compact support, and thus there will be a large chunk of the space that your functions just ignore, on which you can use whatever measure you like.

More detailed proof: Let \(x\) be a point with no compact neighbourhood. Then every function \(f\) of compact support has \(f(x) = 0\) as otherwise the support of \(f\) would be a compact neighbourhood of \(x\). Therefore the measure which assigns a mass of 1 to \(x\) is indistinguishable from the \(0\) measure by integrating against functions of compact support. QED
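In symbols: writing \(\delta_x\) for the unit point mass at \(x\), every \(f\) of compact support satisfies

\[\int f \, d\delta_x = f(x) = 0 = \int f \, d0,\]

so \(\delta_x\) and the zero measure give the same functional on the functions of compact support, and uniqueness fails.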

This led me to think about the structure of locally compact subsets of topological spaces. In particular I noticed the following:

Theorem: Let \(X\) be a regular topological space. Then there is a maximal open set \(A \subseteq X\) such that \(A\) is locally compact in the subspace topology.

Proof:

Let \(A\) be the set of points with a compact neighbourhood (that is, points \(x\) for which there is an open \(U \ni x\) with \(\overline{U}\) compact).

Then certainly every locally compact open subset of \(X\) is contained in \(A\): Let \(B\) be such a subset and let \(x \in B\). Then there exists \(x \in U \subseteq B\) with \(\overline{U} \subseteq B\) compact (because the closure is compact it doesn’t matter whether we mean closure in \(B\) or in \(X\)). Thus by definition of \(A\), \(x \in A\).

So we need only show that \(A\) is open and locally compact. Openness is immediate: if \(x \in U\) with \(\overline{U}\) compact, then \(U\) is such a neighbourhood for every one of its points, so \(U \subseteq A\).

Suppose \(x \in A\). Then because \(X\) is regular and \(A^c\) is closed, we have open sets \(T, V\) with \(x \in T\), \(A^c \subseteq V\) and \(T \cap V = \emptyset\).

Now suppose \(x \in U\) with \(\overline{U}\) compact (such a \(U\) exists by definition of \(A\)). Then \(x \in U \cap T\). But \(\overline{U \cap T} \subseteq \overline{U}\), so it is a closed subset of a compact set and thus compact. Further, \(\overline{U \cap T} \subseteq V^c \subseteq A\). Hence \(x\) has an open neighbourhood whose compact closure is contained in \(A\). Thus \(A\) is locally compact in the subspace topology.

QED

So essentially \(A\) is the set of points you can distinguish with functions of compact support, right?

Well. Almost.

It turns out to be relatively easy to find an example where there is a function of compact support whose support is not contained in \(A\). In order to do this we just need to construct an example where \(\overline{A}\) is compact.

To do this we’ll glue together my two favourite examples of a locally compact space and a non locally compact space. Let \(X = \mathbb{N} \cup l^\infty\) (a disjoint union as sets; in order to distinguish the two zeros, write \(\tau\) for the 0 of \(l^\infty\)).

We will give this the topology generated by the following basic open sets:

  1. \(\{n\}\) for \(n \in \mathbb{N}\)
  2. \(B(x, \epsilon)\) for \(x \in l^\infty\) with \(x \neq \tau\) and \(\epsilon > 0\)
  3. \([n, \infty) \cup B(\tau, \epsilon)\) for \(n \in \mathbb{N}\) and \(\epsilon > 0\)

where \(B(x, \epsilon)\) is the \(l^\infty\) ball.

So essentially we’re gluing together these two spaces by treating the \(0\) of \(l^\infty\) as the “point at infinity” in the one point compactification of \(\mathbb{N}\).

Then in this case \(A = \mathbb{N}\): \(\mathbb{N}\) is a locally compact open subset of \(X\) and any point \(x \not\in \mathbb{N}\) has no compact neighbourhoods (because no open subset of \(l^\infty\) has compact closure). But \(\overline{A} = \mathbb{N} \cup \{\tau\}\) which is homeomorphic to the one point compactification of \(\mathbb{N}\) and thus compact.

This then gives us a function of compact support whose support is not contained in \(A\): Let \(f(n) = \frac{1}{n}\) for \(n \in \mathbb{N}\) and \(f(x) = 0\) for \(x \in l^\infty\) (continuity at \(\tau\) is exactly the statement that \(\frac{1}{n} \to 0\)). Then \(f\)’s support is \(\overline{A}\), which is compact, and so \(f\) has compact support.

(Note that we could have arranged for \(\overline{A}\) to be an arbitrary compactification of \(A\) using a similar construction: Take the compactification and glue a distinct copy of \(l^\infty\) to each point at infinity)

In general the functions of compact support on \(X\) restrict to functions on \(A\) which vanish at infinity (that is, for every \(\epsilon > 0\), the set \(\{x : |f(x)| \geq \epsilon\}\) is compact).

Proof: Let \(\epsilon > 0\). Then \(\{x \in X : |f(x)| \geq \epsilon\} \subseteq \mathrm{supp}(f)\) so is a closed subset of a compact space and thus compact. We thus only need to show that it is a subset of \(A\) to prove the result.

But it is a subset of \(\{x \in X : |f(x)| > \frac{1}{2}\epsilon \}\) which is an open set whose closure is contained in  \(\{x \in X : |f(x)| \geq \frac{1}{2}\epsilon \}\), which is compact. Every open set with compact closure is a subset of \(A\), so  \(\{x \in X : |f(x)| \geq \epsilon\} \subseteq A\) as desired. Thus \(f|_A\) vanishes at infinity.

QED

Is the converse true? It turns out not. The following is true however:

Given \(f : A \to \mathbb{R}\) vanishing at infinity we can extend it to a continuous function \(f: X \to \mathbb{R}\) with support contained in \(\overline{A}\).

The obvious (and only possible) definition is to extend it with \(f(x) = 0\) for \(x \not\in A\). Does this work?

For \(y \not\in A\) and \(\epsilon > 0\), the set \(U = \{x : |f(x)| \geq \epsilon\}^c\) is open (the set being complemented is compact, hence closed) and contains \(y\), and for \(u \in U\) we have \(f(u) \in B(0, \epsilon)\). So \(f\) is continuous at \(y\) as desired.

QED


The problem is that in general there’s no reason to expect \(\overline{A}\) to be compact. Consider for example taking \(\mathbb{N}\) and \(l^\infty\) and not joining them together at all, just treating \(X\) as their disjoint union. Then \(A = \overline{A} = \mathbb{N}\) and the extension of the function above does not have compact support.

So in general we have \(C_c(A) \subseteq C_c(X) \subseteq C_0(A)\), and it’s possible to have either or both of these inclusions be equalities (to get both you just choose \(X\) to be any locally compact space so that \(A = X\)). I’m not sure it’s possible to say more about it than that.


Paying attention to sigma-algebras

So as part of my new resolution to start reading the books on my shelves, I recently read through Probability with Martingales.

I’d be lying if I said I fully understood all the material: It’s quite dense, and my ability to read mathematics has atrophied a lot (I’m now doing a reread of Rudin to refresh my memory). But there’s one very basic point that stuck out as genuinely interesting to me.

When introducing measure theory, it’s common to treat sigma-algebras as an annoying detail you have to suffer through in order to get to the good stuff. They’re that family of sets which, annoyingly, isn’t the whole power set. And we would have gotten away with it, if it weren’t for that pesky axiom of choice.

In Probability with Martingales this is not the treatment they are given. The sigma-algebras are a first-class part of the theory: You’re not just interested in the largest sigma-algebra you can get; you care quite a lot about the structure of different families of sigma-algebras. In particular you are very interested in sub-sigma-algebras.

Why?

Well. If I may briefly read too much into the fact that elements of a sigma algebra are called measurable sets… what are we measuring them with?

It turns out that there’s a pretty natural interpretation of sub-sigma-algebras in terms of measurable functions: If you have a sigma-algebra \(\mathcal{G}\) on \(X\) and a family of measurable functions \(\{f_\alpha : X \to Y_\alpha : \alpha \in A \}\) then you can look at the smallest sigma-algebra \(\sigma(f_\alpha) \subseteq \mathcal{G}\) for which all these functions are still measurable. This is essentially the collection of measurable sets which we can observe by asking questions only about these functions.
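Here’s a toy finite sketch of this in Python (purely illustrative: on a finite set, \(\sigma(f)\) is just all unions of the atoms \(f^{-1}(\{y\})\)):

    from itertools import combinations

    def generated_sigma_algebra(points, f):
        """Sigma-algebra on a finite set generated by a single function f:
        all possible unions of the atoms f^{-1}({y})."""
        atoms = {}
        for x in points:
            atoms.setdefault(f(x), set()).add(x)
        atoms = list(atoms.values())
        sigma = set()
        for r in range(len(atoms) + 1):
            for combo in combinations(atoms, r):
                sigma.add(frozenset().union(*combo))
        return sigma

    X = {1, 2, 3, 4, 5, 6}    # a die roll
    parity = lambda x: x % 2  # we can only ask whether the roll was even or odd
    print(sorted(map(sorted, generated_sigma_algebra(X, parity))))
    # [[], [1, 2, 3, 4, 5, 6], [1, 3, 5], [2, 4, 6]]

The only events visible through parity are “even”, “odd”, everything and nothing; the individual outcomes have been erased.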

It turns out that every sub-sigma-algebra can be realised this way, but the proof is disappointing: Given \(\mathcal{F} \subseteq \mathcal{G}\) you just consider the identity function \(\iota: (X, \mathcal{G}) \to (X, \mathcal{F})\), and \(\mathcal{F}\) is the sigma-algebra generated by this function.

One interesting special case of this is sequential random processes. Suppose we have a sequence of random variables \(X_1, \ldots, X_n, \ldots\) (not necessarily independent, identically distributed, or even taking values in the same set). Our underlying space then captures an entire infinite chain of random variables stretching into the future. But we are finite beings and can only actually look at what has happened so far. This gives us a nested sequence of sigma-algebras \(\mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_n \subseteq \ldots\) where \(\mathcal{F}_n = \sigma(X_1, \ldots, X_n)\) is the collection of things we can measure at time \(n\).
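Continuing the toy sketch above (reusing generated_sigma_algebra), the filtration for three coin flips looks like this, where at time \(n\) we can only resolve events depending on the first \(n\) flips:

    from itertools import product

    omega = list(product((0, 1), repeat=3))  # sample space for three coin flips

    # F_n = sigma(X_1, ..., X_n), generated by the map "first n flips".
    filtration = [generated_sigma_algebra(omega, lambda w, n=n: w[:n])
                  for n in range(4)]

    print([len(F) for F in filtration])  # [2, 4, 16, 256]
    assert all(filtration[n] <= filtration[n + 1] for n in range(3))  # nested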

One of the reasons this is interesting is that a lot of things we would naturally pose in terms of random variables can instead be posed in terms of sigma-algebras. This tends to very naturally erase any difference between single random variables and families of random variables. E.g. you can talk about independence of sigma-algebras (\(\mathcal{G}\) and \(\mathcal{H}\) are independent iff \(\mu(G \cap H) = \mu(G) \mu(H)\) for all \(G \in \mathcal{G}, H \in \mathcal{H}\)), and two families of random variables are independent if and only if the sigma-algebras they generate are independent.

A more abstract reason it’s interesting is that it’s quite nice to see the sigma-algebras play a front and center role as opposed to this annoyance we want to forget about. I think it makes the theory richer and more coherent to do it this way.
