Category Archives: Numbers are hard

Topological compactness is an induction principle

Compactness is one of the most important concepts in point-set topology, but when you first come across it, it might not be obvious why. Paul Crowley asked me yesterday to do a follow-up to my post about continuous functions explaining compactness, so I spent some time this morning thinking about how one might make it seem more intuitive.

I’m not sure this is that post, but it’s an interesting insight I had while trying to figure that post out. Another more comprehensive one may follow.

There are a variety of induction principles. The best known one is that if you prove that something is true for 0, and that if it’s true for n then it’s true for n + 1, then it must be true for all natural numbers. This is a special case of the induction principle for well-ordered sets: in a well-ordered set, if p being true for all y < x implies that p is true for x, then p is true for all x.

The common thread is that induction allows you to conclude that something is true for the larger case by showing it’s true for smaller pieces and that the set of things for which it is true is closed under some operation.

I noticed this morning that there is a topological induction principle that’s fairly straightforwardly equivalent to compactness. What it essentially says is that compactness is the property that lets you conclude that things that are true locally are true globally.

Theorem: A topological space \(X\) is compact if and only if for every property of sets \(p\) satisfying:

  1. For all \(x \in X\) there is some open set \(U\) with \(x \in U\) and \(p(U)\), and
  2. if \(p(A_1), \ldots, p(A_n)\) then \(p(\bigcup A_i)\),

we have \(p(X)\).

Proof:

First assume \(X\) is compact. By property 1, the open sets \(U\) with \(p(U)\) form an open cover of \(X\). By compactness it has a finite subcover, i.e. we can find \(U_1, \ldots, U_n\) such that \(p(U_i)\) and \(\bigcup U_i = X\). Thus by property 2, \(p(X)\).

Now assume the induction principle holds. Let \(\mathcal{U}\) be an open cover of \(X\). Let \(p(A)\) be the property that \(A\) is covered by finitely many elements of \(\mathcal{U}\). This satisfies the requisite properties: every point is contained in at least one element of \(\mathcal{U}\), which suffices for property 1, and a finite union of sets, each covered by finitely many elements of \(\mathcal{U}\), is itself covered by finitely many elements, so property 2 holds. Therefore by the induction principle \(X\) is also covered by finitely many elements of \(\mathcal{U}\), i.e. \(\mathcal{U}\) has a finite subcover. Therefore \(X\) is compact.

QED
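
To see the second half of the proof as an algorithm, here is a toy mechanisation (my own sketch, not from the original argument) on a finite space, where compactness is automatic. Here \(p(A)\) is “\(A\) is covered by finitely many elements of the cover”, tracked by an explicit witness set of indices; the space and cover are made up for the example.

```python
# Run the reverse direction of the proof mechanically on a finite space.
def finite_subcover(X, cover):
    """Extract a finite subcover of X, following the induction argument."""
    witnesses = set()
    for x in X:
        # Property 1: pick a cover element containing x; it witnesses p for itself.
        witnesses.add(next(i for i, U in enumerate(cover) if x in U))
    # Property 2: a finite union of witnessed sets is witnessed by the
    # union of the witness sets.
    covered = set().union(*(cover[i] for i in witnesses))
    assert X <= covered  # the conclusion of the induction: p(X)
    return [cover[i] for i in sorted(witnesses)]

X = set(range(10))
cover = [set(range(k, k + 3)) for k in range(10)]  # overlapping "opens"
print(finite_subcover(X, cover))
```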

Note the restriction to finite unions. If we’re allowed to take arbitrary unions then this is just true of all topological spaces.

I feel like there’s probably a similar induction principle lurking in the equivalent formulation in terms of intersections of closed sets. Perhaps something that looks more like a recursive function definition? I’m not totally sure.

For an example of how you could use this, here is a reframing of the proofs of two classic results:

Theorem: A compact metric space is bounded.

Proof: Let \(X\) be a compact metric space. A finite union of bounded sets is bounded, and the open ball \(B(x, 1)\) is a bounded open set containing \(x\). Therefore by topological induction \(X\) must also be bounded. QED

Theorem: A compact subset of a Hausdorff topological space is closed.

Proof: Let \(X\) be Hausdorff and \(Y \subseteq X\) be compact. Let \(x \in X \setminus Y\).

Let \(p(A)\) be true if \(x\) is not in the closure of \(A\) considered as a subset of \(X\).

Because \(X\) is Hausdorff, for every \(y \in Y\) we can find disjoint open sets \(U, V\) with \(y \in U\) and \(x \in V\). Then \(\overline{U} \subseteq V^c \subseteq X \setminus \{x\}\), and thus \(p(U)\). (Strictly, property 1 asks for sets open in \(Y\); use \(U \cap Y\), and note that \(p(U)\) implies \(p(U \cap Y)\) since \(\overline{U \cap Y} \subseteq \overline{U}\).) The closure of a finite union of sets is the union of the closures, so if \(p(A_1), \ldots, p(A_n)\) then \(p(\bigcup A_i)\). Therefore \(p(Y)\), i.e. \(x\) is not in the closure of \(Y\). But \(x\) was an arbitrary point not in \(Y\), so \(X \setminus Y\) is open and \(Y\) is closed. QED.

These aren’t really very different from the normal proofs. Virtually identical even. I think they come out slightly more nicely this way, but there’s not much in it. Hopefully it helps clear up some of the intuition around compactness though.


Continuous functions are those which preserve approximate measurements

I like the point-set topology definition of continuous function. It’s elegant, generalises well, and I think puts a bunch of things on firmer foundations than epsilon-delta definitions.

But it is also confusing to some people. Why open sets? And why is it the pre-image of an open set under a continuous function that must be open, rather than the image?

One way to fix this is to start with different but equivalent definitions of topological spaces. This is fine, but it’s a little unsatisfying. The open set formulation is widely used because it’s quite powerful, so it would be nice to be able to make intuitive sense of it. Additionally, the same sort of definition crops up elsewhere – e.g. a measurable function is one where the pre-image of a measurable set is measurable.

So I’d like to give you some intuition as to why open sets make sense, and why, given that intuition, the definition of a continuous function is the “obvious” one.

I suggest that the intuitive concept you should attach to an open set is that of an approximate measurement.

What does this mean?

Well, first let me pin down what I mean by the words individually.

A “measurement” does not here mean something like “this rod is exactly 1.23 meters long”. “This rod is less than a mile long” or “this rod is between 1 and 2 meters long” are also measurements. “The length of this rod is no more than 100 times its diameter” is also a measurement. A measurement in this case is anything that helps you pin down the range of possible objects.

And “approximate” does not mean “I guessed”. It means “you do not need to know the exact value arbitrarily well in order to validate this measurement”. You can easily validate that the rod is between 1 and 2 meters long with a tape measure. You can’t validate that it’s exactly 1.23 meters long with a tape measure (but you can validate that it’s not).

An approximate measurement is, more or less, one where you only need a finite amount of information to validate it.

Note that you might need an infinite amount of information to refute it. Suppose I tell you that the rod is less than one meter long, and it turns out that the rod is exactly one meter long, down to such a subquantum scale that it turns out we’re all living in a simulation of a platonic euclidean universe. Then you need to measure its length infinitely precisely in order to tell me I’m wrong – even if you measure it down to the nearest micron, it might be half a micron short of one meter.

So this is our intuitive and imprecise definition of an open set: An open set is one where for any member of the set we can prove that it’s a member of that set with a finite amount of information.

This is of course nonsense. How does this give rise to different topologies? And what constitutes information?

What questions we are allowed to ask is what determines our topology. They don’t need to actually correspond to any notion of finiteness (for example we could define the discrete topology by permitting all of the questions “Is it this point?”), but many classic ones do: e.g. you only need to evaluate a real number to a finite number of decimal places to prove that it’s in an open set.

Essentially these two resolve themselves together: Topologies correspond to different sorts of questions we can ask, and then “finite amount of information” just means that for every member we can prove that it’s a member by only asking a finite number of those questions.

This intuition corresponds nicely to the topology axioms: you need 0 questions to determine that a member of the whole set is a member of the whole set, and the empty set satisfies the property vacuously. If you have an arbitrary union \(\bigcup U_i\) then for \(x \in \bigcup U_i\) we have \(x \in U_j\) for some \(j\), and you only need a finite set of questions to prove that. If \(x \in U \cap V\) then you can take the finite proof that \(x \in U\) and the finite proof that \(x \in V\) and combine them, which still gives a finite set of questions.

You can make all this formal and get yet another characterisation of topological spaces but it’s not very interesting and ends up mostly corresponding to existing notions.

With that notion of approximate measurements hand waved, we can now hand wave our notion of a continuous function:

If you apply a continuous function to some input and make an approximate measurement of the result, this gives you an approximate measurement of the input.

So for example if we just consider the length of a rod and make an approximate measurement of that, this gives us an approximate measurement of the whole rod: it still constrains the space of possible objects in a way that takes only finitely many questions to verify.

And this is precisely what “the preimage of an open set is open” means: If we make some measurement \(V\) and constrain \(f(x) \in V\) then this precisely corresponds to \(x \in f^{-1}(V)\). So “an approximate measurement of the result of a continuous function gives an approximate measurement of its input” is exactly “The preimage of an open set under a continuous function is open”.
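
As a minimal concrete illustration of that correspondence (my own sketch; the function and interval here are arbitrary choices):

```python
# "Constrain f(x) to lie in V" is exactly "x lies in the preimage of V":
# membership in the preimage is tested by applying f and measuring the output.
def in_preimage(f, V_contains, x):
    return V_contains(f(x))

f = lambda x: x * x
V = lambda y: 1 < y < 4        # the open interval (1, 4)
print(in_preimage(f, V, 1.5))  # True:  the preimage is (-2, -1) union (1, 2)
print(in_preimage(f, V, 3.0))  # False: f(3) = 9 is not in (1, 4)
```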

But why does that match what we would intuitively think of as “continuity”?

Well, in some cases it doesn’t really, but that’s OK. For examples where we have more intuition about what continuous should mean it matches quite nicely:

Consider e.g. \(f\) with \(f(0) = 1\) and \(f(x) = 0\) otherwise. Now consider the measurement \(f(x) > \frac{1}{2}\). In order to know whether this holds for \(x\) we’re back in the “this rod is exactly one meter long” territory – no matter how precisely you measure \(x\) it might be just a bit closer to zero than that but still non-zero.
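
We can make this concrete with a toy model (my own sketch, not from the post): suppose we only ever learn \(x\) through finite-precision readings, i.e. intervals containing it. An approximate measurement is a test that some finite reading settles. The open interval test below is settled at the very first reading; the measurement \(f(x) > \frac{1}{2}\), with \(x = 0\), never is.

```python
# We know x only through approx(k): an interval of width about 10**-k
# containing x. A test is an approximate measurement (for this x) if some
# finite-precision reading settles it.
def certify(test, approx, max_digits=50):
    """Return True/False once some reading settles the test, None otherwise."""
    for k in range(1, max_digits + 1):
        verdict = test(*approx(k))
        if verdict is not None:
            return verdict
    return None

approx_zero = lambda k: (-10.0**-k, 10.0**-k)  # x is exactly 0

def in_unit_interval(lo, hi):
    """Certify or refute x in (-1, 1) from an interval around x."""
    if -1 < lo and hi < 1:
        return True
    if hi <= -1 or lo >= 1:
        return False
    return None

def f_above_half(lo, hi):
    """f(x) > 1/2 where f(0) = 1 and f(x) = 0 otherwise: true iff x == 0."""
    if lo == hi == 0:
        return True
    if not (lo <= 0 <= hi):
        return False
    return None

print(certify(in_unit_interval, approx_zero))  # True: settled at k = 1
print(certify(f_above_half, approx_zero))      # None: no finite reading settles it
```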

This works in more generality: at any point of discontinuity \(x\) you will find open sets for which you need to know a point \(y\) arbitrarily well – well enough to distinguish it from \(x\) – in order to determine membership.

Note also that an approximate measurement of the input to a continuous function does not give you an approximate measurement of the output. Consider e.g. the constant function \(f(x) = 1\). Then given some open set \(U\), in order to determine if \(y \in f(U)\) we need to test if \(y = 1\). This requires knowing infinitely many decimal places of \(y\) and thus is not an approximate measurement.

Anyway, that’s enough hand waving. I don’t know if this actually clears things up for anyone (I figured this representation out long after I’d already internalized the rules of topology), but hopefully it’s given a different perspective on it.


Asymptotic behaviour of max(Z_n)

More on high variance strategies.

I was wondering what the asymptotic behaviour of \(\max(Z_n)\) – the maximum of \(n\) independent standard normal variables \(Z_1, \ldots, Z_n\) – was. It seems “obvious” that it should grow without bound, but it turns out that it grows really very slowly: for \(n = 10^{10}\) it’s still typically less than 7.
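
As a quick numeric check of that claim (my own sketch, not part of the original post): the maximum of \(n\) independent standard normals has cdf \(F(x)^n\), so its median is the point whose upper tail probability is \(1 - 2^{-1/n}\), which we can evaluate directly:

```python
# Median m of the max of n iid standard normals: F(m)^n = 1/2.
import numpy as np
from scipy.stats import norm

for k in [2, 4, 6, 8, 10]:
    tail = -np.expm1(np.log(0.5) / 10**k)  # 1 - 2**(-1/n), computed stably
    print(f"n = 10^{k}: median of the maximum is about {norm.isf(tail):.3f}")
```

At \(n = 10^{10}\) this gives a median a little under 6.5.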

I thought I would quickly verify that it does actually grow without bound.

Let \(T_n = \max Z_n\). Then \(E(T_n) = E(T_n | T_n < 0) P(T_n < 0) + E(T_n | T_n > 0) P(T_n > 0)\). But \(E(T_n | T_n < 0) \geq -1\) (because \(T_n \geq Z_1\) and, by independence, \(E(Z_1 | T_n < 0) = E(Z_1 | Z_1 < 0) = -\sqrt{2/\pi} > -1\)) and \(P(T_n < 0) = 2^{-n}\), so \(E(T_n) \geq (1 - 2^{-n}) E(T_n|T_n > 0) - 2^{-n}\). So we need only concern ourselves with the positive behaviour.

Let \(S_n\) be a random variable with the distribution of \(T_n | T_n > 0\).

Consider \(t > 0\). We want to show that for sufficiently large \(n\), \(E(S_n) > t\).

Now for any \(s\), \(E(S_n) \geq s P(S_n \geq s)\), because \(S_n\) is always positive. So let \(s = 2t\). Now \(E(S_n) \geq 2t P(S_n \geq 2t)\).

But for \(s > 0\), \(P(S_n \geq s) = P(T_n \geq s) / P(T_n > 0) \geq P(T_n \geq s) = 1 - F(s)^n\), where \(F\) is the cdf of the standard normal distribution. So \(E(S_n) \geq 2t (1 - F(2t)^n)\).

But if \(n \geq \frac{\log(\frac{1}{2})}{\log(F(2t))}\) then \(F(2t)^n \leq \frac{1}{2}\) and so \(E(S_n) \geq 2t (1 - \frac{1}{2}) = t\).

Thus \(E(S_n)\) grows without bound and thus so does \(E(T_n)\).
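
Here is a quick Monte Carlo sanity check of that final bound (my own sketch; the values of \(n\) and \(t\) are arbitrary choices):

```python
# Check E(S_n) >= 2t(1 - F(2t)^n) by simulation for one n and t.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, t, trials = 100, 1.0, 100_000

T = rng.standard_normal((trials, n)).max(axis=1)   # samples of T_n
S = T[T > 0]                                       # samples of S_n
bound = 2 * t * -np.expm1(n * norm.logcdf(2 * t))  # 2t(1 - F(2t)^n)
print(f"E(S_n) is about {S.mean():.3f}, bound is {bound:.3f}")
```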

This could be tightened up a bit to get the asymptotic behaviour of the growth, but I’m not that clear on what the asymptotic behaviour of \(F(t)\) is so I haven’t worked through the details yet.


Another take on high variance strategies

So I was thinking about my high variance strategies post and I realised that there was a case I hadn’t considered which is kinda important.

Which is that often you’re not very interested in how good your best solution is, only that it’s at least this good for some “this good” bar. e.g. you don’t care how cured cancer is as long as it’s pretty cured, you don’t care how many votes you got as long as it’s enough to win the election, etc.

So for these circumstances, maximizing the expected value is just not very interesting. What you want to do is maximize \(P(R \geq t)\) for some threshold \(t\). The strategies for this look quite different.

Firstly: if you can ensure that \(\mu > t\), the optimal strategy is basically to do that and then make the variance as low as you can.

For the case where you can’t do that, the region in which each of variance and mean is the better thing to increase becomes more complicated.

Let \(F\) be the cumulative distribution function of your standardised distribution (this can be normal but it doesn’t matter for this). Then, with \(n\) people drawing independently, the best result \(R\) satisfies \(P(R \geq t) = 1 - F(\frac{t - \mu}{\sigma})^n\). This is what we want to maximize.

But really what we’re interested in for this question is whether mean or variance is more useful, so we’ll only look at local maximization. Because this probability is monotonically decreasing in \(g(\mu, \sigma) = \frac{t - \mu}{\sigma}\), we can just minimize that.

\(\frac{\partial}{\partial \mu} g = -\frac{1}{\sigma}\)

\(\frac{\partial}{\partial \sigma} g = -\frac{t - \mu}{\sigma^2}\)

So what we’re interested in is the region where increasing \(\sigma\) will decrease \(g\) faster than increasing \(\mu\) will, i.e. we want the region where

\(-\frac{t - \mu}{\sigma^2} < -\frac{1}{\sigma}\)

or equivalently

\(t - \mu > \sigma\)

i.e. \(t > \mu + \sigma\)

That’s a surprisingly neat result. So basically the conclusion is that if you’re pretty close to achieving your bound (within one standard deviation of it) then you’re better off increasing the mean to get closer to that bound. If on the other hand you’re really far away you’re much better off raising the variance hoping that someone gets lucky.

Interestingly unlike maximizing the expected value this doesn’t depend at all on the number of people. More people increases your chance of someone getting lucky and achieving the goal, but it doesn’t change how you maximize that chance.
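
As a numeric sanity check of the criterion (my own sketch; the threshold \(t\), the value of \(n\) and the two \((\mu, \sigma)\) points are arbitrary choices), we can compare finite-difference gradients of the full probability:

```python
# Compare the marginal benefit of raising mu vs. raising sigma for
# P(best of n draws from N(mu, sigma^2) >= t) = 1 - F((t - mu)/sigma)^n.
import numpy as np
from scipy.stats import norm

def p_hit(mu, sigma, t, n):
    return -np.expm1(n * norm.logcdf((t - mu) / sigma))

t, n, eps = 10.0, 10, 1e-6
for mu, sigma in [(9.5, 1.0), (5.0, 1.0)]:  # within one sigma vs. far away
    d_mu = (p_hit(mu + eps, sigma, t, n) - p_hit(mu, sigma, t, n)) / eps
    d_sig = (p_hit(mu, sigma + eps, t, n) - p_hit(mu, sigma, t, n)) / eps
    better = "mean" if d_mu > d_sig else "variance"
    print(f"mu = {mu}: dP/dmu = {d_mu:.3g}, dP/dsigma = {d_sig:.3g}; raise the {better}")
```

The comparison flips exactly where \(t - \mu\) crosses \(\sigma\), and, as noted above, which side wins doesn’t depend on \(n\).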


Why mathematics makes you better at programming (and so does everything else)

There’s been a bunch of discussion on whether mathematics is programming, whether you need to learn mathematics to program, etc. on Twitter recently. Some of it has been quite heated. So naturally I thought I’d wade right into the hornet’s nest because I’ve got no sense of self-preservation.

There are essentially 4 levels of question going on here:

  1. Is programming basically mathematics?
  2. Do you need mathematics to be a good programmer?
  3. Is there a useful mathematics of programming?
  4. Can learning mathematics make you a better programmer?

The answers to these questions are respectively no, no, yes but you probably don’t care most of the time and yes.

Cool, well, that was easy. Are we done?

Not just yet. I’m going to spend some time unpacking the last of these because it’s the most interesting: Does learning mathematics make you a better programmer?

Well, yes. Of course it does.

So does learning to write. So does learning to cook. So does learning about philosophy of ethics, or about feminism. Learning another human language almost certainly helps you to be a better programmer. Your programming will probably improve if you learn Judo, or knitting, or tap dancing.

I’d go as far as to say that it is unusual to find a pair of skills where learning one will not in some way improve your ability in the other.

Part of this is just that whenever you learn a thing, you’re also learning about learning. You learn what works for you and what doesn’t, and the next time you need to learn something you’ll be that much better at it.

But there’s more to it than that, right? Surely mathematics teaches you more about how to program than just about how to learn things.

Well, yes, but it’s not because by learning mathematics you’re necessarily learning anything directly applicable to programming. Sure, you might be, and you might even be doing the sort of programming to which that maths is directly applicable. But even if you’re not, learning maths will still make you a better programmer than, say, reading Derrida (probably. I haven’t read Derrida so I might be wrong) or learning to cook (almost certainly. I have learned to cook, and mathematics was way more useful. Your brain might work differently than mine).

Why?

Well, in much the same way that there is a sort of generalised skill of learning that you can improve just by learning things, there are all sorts of other generalised skills. For want of a better word (if you have a better word, do tell) I call them “microskills”. They’re mostly not things that people think of as skills – it’s things like breaking a problem down into parts, abstract thinking, learning to ask the right questions, perseverance when stuck, learning when not to persevere when stuck, and thousands upon thousands of others.

People tend not to think of these as skills because they think of them as innate abilities. Stuff you’re just good at or bad at and you’ve got to cope with that. And I mean, it’s certainly possible that you can be innately good or bad at these (and I may be foolish enough to touch this debate but I’m not touching that one with a barge pole), but regardless of innate ability they all share a characteristic that makes them skills: they get better if you practice them.

Almost every skill you learn will depend on some of these microskills. Which ones is not at all fixed: people compensate by trading one off against another all the time. You might have a less good working memory, so you practice your organisation skills and keep meticulous notes. You might be easily distractible and lack the ability to methodically push through a problem, but make up for it with creative flashes of insight (yes, you can learn to have creative flashes of insight; no, creativity is not a fixed thing that you can never get any better at).

But the mix is at least biased: different skills will encourage you to draw on different microskills. Mathematics will draw heavily on analytical ones – abstract problem solving, hypothesis generation, breaking a problem down into smaller parts, etc. All of these are things that are super useful in programming. So by learning mathematics you’ve developed all these microskills that will later come in super handy while programming.

But, of course, you can develop the same microskills when programming, right?

Wellll….

OK, no. I’m going to go with yes. You will absolutely develop all of the relevant microskills in programming. Learning to be good at programming by doing programming is 100% a viable strategy and lots of people successfully follow it.

The thing is, you won’t necessarily develop them to the same degree. Some you will develop more, some you will develop less. And this isn’t necessarily because the ones you would develop less in programming wouldn’t be useful to be better at.

There is essentially one key thing required for learning to get better at something (HORRIBLE OVERSIMPLIFICATION ALERT): Feedback

Feedback is how clearly and strongly you discover whether you did well or badly at a thing. If you can just try something and immediately find out if it worked or failed, that’s great feedback. If you try something and 3 months later you get a positive result that might be the result of that something (or it might be the result of something else that happened in the intervening 3 months), well… not so much.

Feedback is essentially how you get better, because it lets you know that you are getting better, and that’s how you direct your learning. As you get better at a thing, the amount of feedback you tend to get degrades. This isn’t necessarily because the benefit you’re getting from improving is decreasing (though it often is); it’s often because the benefit is moving further down the line, or into things that are harder to quantify (for example, the benefits of “a sense of good taste” in programming, beyond a basic avoidance of complete spaghetti code, are often not immediately apparent until you’ve sat atop a mountain of technical debt).

And this is where learning other skills can be super useful, because they provide different feedback loops. Sometimes they even let you develop microskills that appeared completely irrelevant and then, once you had them, turned out to be super useful (silly example: strength training to develop core strength. Totally inapplicable to programming, right? Well, yeah, except that it turns out that not being distracted by lower back pain while you program is quite handy). But most of the time it’s just that the feedback cycle on them is different and the incentive structure for developing that skill is different – mathematics will lean more on some microskills than programming does and less on others, and similarly it will provide better feedback for some, worse for others.

I could provide some specific examples of which things I think mathematics will help you develop better, but I won’t. It’s very hard to judge these things, and people can argue endlessly about which is better for what. But the specifics don’t matter: Basically all forms of abstract thinking are useful for programming, and the only way that we could reasonably believe that mathematics and programming could provide exactly the same set of feedback loops and incentive structures as each other would be if we believed that they were the same thing.

And that would be silly.
