As a Martian, we’ve long ago licked this “singularity” thing, so I’m logically omniscient, but I’m also bored. I decide to invade Earth, mostly for the lulz. I land my flying saucer, pull out my ray gun, and prepare to get invading.

I come across two earthlings in heated debate. They turn to me and look excited.

“Ah, excellent. A Martian! You can serve as an impartial adjudicator to our debate.”

I am, naturally, perplexed.

“Uh, you do know I’m here to invade you right? I have this ray gun?”

“Yes yes. It’s very impressive, but this is much more important.”

“More important than the fact that I’m planning to enslave you all and force you to obey my every whim?”

“Sure, sure. That’s just *politics.* We don’t care about that. We’re too concerned with matters of real import. We want to know what you think about the *sportsball tournament*.”

“The what?”

“Ah ha! Excellent. A truly impartial adviser! Tell me, what do you think the chances of the Las Venturas Bandits winning the sportsball tournament are?”

Obviously I have no idea about sportsball. I’m a Martian! We don’t play such things. Still, from a position of complete ignorance I must presume that each team is equally likely to win, so I need only find out how many teams there are to determine my probabilities. So I ask them.

“Oh no. We couldn’t tell you *that*. If we give you information about sportsball you’d no longer be an impartial judge.”

Their reasoning is odd, but I shall for now adhere to their quaint local customs and apply the obvious prior of \(\frac{1}{n(n+1)}\) on the number of sportsball teams there could be (I leave it as an exercise to the reader why this would be obvious. If you’re logically omniscient like me you already know, if not it should be a simple calculation). From there, the answer is a straightforward calculation.

“Well, naturally the probability is \(\frac{\pi^2}{6} - 1\), or about \(0.645\) if you want to be all numerical about it.”

“Gosh, so confident! But what, then, is the chance of the Beaneaters winning?”

The answer is equally immediate: About \(0.290\).

“Interesting, interesting. But what, pray tell, is the chance of the Jersey Boomers winning?”

“0.185, naturally.”

At this point the quieter of the two is starting to look slightly confused, but the loud one keeps talking.

“Truly Martians are very insightful. Can you tell me, how likely do you think it is that the Swedish Meatballs will win?”

“Oh, about 0.135, I would say.”

At this point the one who has been quiet can no longer contain himself.

“I say! These probabilities you’re giving us don’t make any sense. I’ve been counting, and they add up to much more than one! You’re at 1.25 already!”

“Well, naturally. I had to update my probabilities as you gave me new information. At this point the probability of *any* individual one of the teams you’ve mentioned winning is about 0.135, with about a 0.459 chance that some other team you haven’t mentioned yet wins”.

“But we didn’t give you any information! The whole point was you were supposed to be impartial!”

“Of course you did. You gave me a lower bound on how many teams there were.”

At this point they both turn bright red and start shouting again at each other about how they’ve ruined everything. Thankfully, this gives me the insight I needed to calculate a truly accurate probability for who will win this game.

“Excuse me?”, I interrupt. They both turn to me.

“I’ve now understood the problem more thoroughly and can tell you the chances of each of your teams winning”

“Really? Do tell”.

They both look very eager to have the debate resolved.

So I shoot them.

And then I conquer the planet and ban the whole silly game of sportsball.

The answer is of course, zero. When you make me play silly guessing games, *everyone* loses.

What’s going on here?

Well, the Martian’s prior doesn’t matter really. If you must know, it’s based on a hierarchical Bayesian model where the number of teams is geometric, but the geometric parameter is also unknown and is distributed uniformly on \([0, 1]\). Really this example would work with almost any reasonable prior though (say, any proper prior which assigns non-zero probability to every positive integer). I just wanted something concrete so I could put numbers to it.

So we have a prior on the number of teams. Call this number of teams \(T\). The probability of a given team winning given \(T = n\) is \(\frac{1}{n}\), so the probability of a single named team winning is \(E(\frac{1}{T})\).

But we don’t have a single named team. We’ve been asked a series of questions, each of which tells us the existence of a named team. Therefore our answer to the nth question is not \(E(\frac{1}{T})\) but instead \(E(\frac{1}{T} | T \geq n)\), because we know at that point that there must be at least \(n\) named teams.

For our particular example this is always decreasing, but even in general it must eventually head towards zero, because \(E(\frac{1}{T} | T \geq n) \leq \frac{1}{n}\). So as time progresses and we become aware of the existence of more and more teams, the probability we assign to any given one of them winning must go down.
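For anyone who isn’t logically omniscient, the Martian’s numbers are easy to check numerically. Here’s a quick sketch under the prior \(P(T = t) = \frac{1}{t(t+1)}\); the cutoff for the truncated sum is arbitrary, and the tail terms decay like \(t^{-3}\) so the truncation error is negligible:

```python
def p_named_wins(n, cutoff=200_000):
    """P(a given named team wins | T >= n) under the prior P(T = t) = 1/(t(t+1)).

    E(1/T | T >= n) = sum_{t>=n} 1/(t^2 (t+1)) divided by P(T >= n),
    and P(T >= n) = sum_{t>=n} 1/(t(t+1)) telescopes to exactly 1/n.
    """
    numerator = sum(1 / (t * t * (t + 1)) for t in range(n, cutoff))
    return n * numerator

# Reproduces the numbers from the dialogue:
for n in range(1, 5):
    print(n, f"{p_named_wins(n):.3f}")
# 1 0.645
# 2 0.290
# 3 0.185
# 4 0.135
```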

This is a neat example because it demonstrates how if you insist on Bayesian rather than Knightian uncertainty here, the mere asking of questions can itself be a source of information which causes you to update your probabilities. As we are asked questions we are given an idea of the scope of possibilities available, and this forces us to adjust our updates. This happens even though we are logically omniscient – the questions are not forcing us to consider new possibilities, they are genuinely providing us with new information about the world.

(It is of course possible that the questioners are making names up – say by some ludicrous procedure like just picking names at random from Wikipedia’s list of fictional sports teams – but you could include a term in your probability distribution for that if you wished. I didn’t because that would have made the example harder to follow)


(The above is a hopefully accurate but very simplified summary. Feel free to tell me if you think it’s unfair and I will update it).

The VNM axioms are generally presumed to apply to “logically omniscient” beings, i.e. entities who have fully thought through all the implications of their beliefs and can instantly answer any purely logical question. This post is purely about what happens when you try to apply them to non-logically-omniscient beings, which is all we have access to when implementing decisions on actual people. It could be regarded as exploring the boundaries of just how logically omniscient you have to actually be in order to use these axioms in certain contexts.

One of the problems with applying this process to us non-logically-omniscient beings is that it ignores the fact that these questions have to be asked in actual time – it switches between an instantaneous notion of preference and the Dutch book sequence, which happens in time, during which you might update your preferences (in the Dutch book context this doesn’t even require a failure of omniscience, because you’re getting new information – namely that someone is prepared to make this bet – but I’m going to focus on the failure of logical omniscience).

In particular it ignores that merely asking the question can change your preferences. I have spotted a nice example of this:

You: “Would you like $100?”

Me: *reaches for money*

You: Orrrrr…. you can have TWO HUNDRED dollars, but you’ll only get it if this set of 1000 boolean formulae is simultaneously satisfiable.

Me: Yes I am totally going to do an NP-hard problem for the possibility of an extra $100. Give me the easy money.

Alternatively:

You: “Would you like $100?”

Me: *reaches for money*

You: Orrrrr…. you can have TWO HUNDRED dollars, but you’ll only get it if this set of 1000 boolean formulae is satisfied by this set of assignments.

Me: Hmm. That’s just about worth it. *does the calculation*. Oh, cool. Yes that’s totally a satisfaction. I’ll take $200 please.

You: Cool. Now for the second question. $100 or $200 if (the same boolean problem) is satisfiable?

Me: … welll…. I’ve literally just seen a satisfaction for this, so I’ll take the $200 please.

This is essentially transitivity at work: It is clear that the answer to the question “Is this satisfiable?” is true if the answer to “Is this a satisfaction?” is ever true, but you have to find the *right* instance of the latter class of question in order to make this implication. Essentially, requiring transitivity forces you to turn a reasonable amount of computation into an unreasonable amount of computation if you want all your preferences to be consistent in advance. Otherwise you introduce path dependence into the value function you get out of a sequence of questions.
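To make the asymmetry concrete: checking a proposed assignment against a set of boolean formulae is a few lines of linear-time code, while deciding satisfiability in general is NP-complete. A minimal sketch – the formula here is invented for illustration, encoded DIMACS-style as signed variable indices:

```python
def satisfies(clauses, assignment):
    """Check a truth assignment against a CNF formula in linear time.

    Each clause is a list of signed integers: 3 means x3 must be true,
    -3 means x3 must be false. A clause holds if any of its literals
    does; the formula holds if every clause does.
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

formula = [[1, 2], [-1, 3], [-2, -3]]
print(satisfies(formula, {1: True, 2: False, 3: True}))   # True
print(satisfies(formula, {1: False, 2: False, 3: True}))  # False: first clause fails
```

Seeing that first `True` is exactly the witness that makes the later “is it satisfiable?” question cheap after the fact, even though answering it cold would have meant searching the whole assignment space.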

If we insisted that you maintain the same preference order over all time and that it be transitive, then had we asked the two questions in the other order, consistency would have forced you to skip the calculation and just take the $100. A decision to always take the $100 is of course perfectly consistent, but then you’re losing out in some cases where you could have turned an easy profit.

Note: This example extracted from my answer to this question I asked on the cstheory stack exchange, which I probably wouldn’t have spotted without the answers from Emil and Denis coming first.

One of its major points is the existence of SPRs (statistical prediction rules) for a lot of tasks – simple linear rules that add up a weighted sum of various easy-to-measure statistics and provide a simple binary decision based on that score. For many tasks, SPRs not only out-perform human experts, they out-perform human experts who have access to the result of the SPR. That is, when someone looks at the result of the SPR and decides to defect from its decision, they’re wrong more often than they’re right.

I have no trouble believing this, but I have *some* trouble believing that this is true even after people spend time specifically trying to learn how to outsmart the SPR. In particular, I also recently read *How to Measure Anything*, and one of the points it makes is that it’s actually relatively easy to train people to give calibrated probability estimates (e.g. I am 90% sure this value lies within this range). It seems that if you can teach people to calibrate like that, you should also be able to teach them to calibrate on their defection from SPRs.
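An SPR of the kind described is tiny as code goes – just a weighted sum and a cutoff. A hypothetical sketch; the feature names, weights, and threshold below are invented for illustration, not taken from any real instrument:

```python
def spr(features, weights, threshold):
    """A statistical prediction rule: weighted sum of easy-to-measure
    statistics, thresholded into a binary decision."""
    score = sum(w * features[name] for name, w in weights.items())
    return score >= threshold

# Invented weights for a parole-style prediction (illustrative only).
weights = {"prior_offenses": 1.0, "age_at_release": -0.05, "stable_employment": -0.5}
applicant = {"prior_offenses": 2, "age_at_release": 40, "stable_employment": 1}
print(spr(applicant, weights, threshold=0.0))  # score = 2 - 2 - 0.5 = -0.5 -> False
```

The expert’s job in the studies is then exactly this: look at that one bit of output plus the raw case, and decide whether to overrule it.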

I was thinking about how you might design an experiment to test this and I accidentally designed a computer game instead. It’s in some ways pretty similar to Papers Please with different incentive structures and a twist.

The idea is this: You are a psychologist employed by the justice system. Your job is to interview potential parolees and decide whether they are likely to re-offend. You have an SPR which tells you what it thinks, and it’s right about 70% of the time. You then get to conduct a short, several-minute interview with them, at the end of which you have to state whether you believe they are likely to re-offend.

If you say they are unlikely to re-offend they will get released. At some point later – maybe immediately, maybe quite a while down the line – you may discover that they have done something violent and your recommendation was less than stellar.

Your goal is to not get fired. More accurately, your goal is to last long enough without getting fired that this job looks good on your CV and you can get the hell out of there and go do something you don’t hate.

There are three things that will get you fired:

- Releasing too many people who turn out to be violent. The more violent they are the worse this is – someone who gets into a bar fight won’t count against you much, someone who shoots up a crowded public area will basically get you instant-fired. Also, cases where the rule predicted violence and you disagreed will be judged extra harshly.
- Not releasing enough people. Prisons are crowded, y’know, and we need to be shown to be fair and impartial. To start with you’ll be pretty free on this front but as the game goes on you’ll be under pressure to release more people.
- Agreeing with the rule too often. The rule’s accuracy and the game’s loss thresholds should be calibrated so that you will do pretty well with the above conditions if you always follow the rule – you’ll release enough people and most of them won’t be *very* violent. You’ll probably last quite a while, but at some point your boss will notice that you’re adding literally no value over this really much cheaper rule they could be using instead and decide they might as well just do that.

The game is a nice mix of an educational game about calibration and learning and a game which forces you to confront the fact that its perverse incentives are making you behave like a terrible person. At the end you should probably also get a score which is the number of people who would have been just fine but whom you cruelly kept locked up.

My major concern would be that it might turn out that you *can’t* learn to do better than the SPR, but I feel like this is relatively easily soluble in a game, at least by introducing a couple of hidden variables that the SPR doesn’t have access to but are relatively easy for you to deduce, or by making the decision boundary more complex than a simple linear rule can easily adapt to.

In my last post I outlined a ludicrously over-simplified model for why you might want to consider high variance strategies.

I’ve been thinking over some of the modelling assumptions and wondering whether it could be made a bit less over-simplified. The only ones that are obviously easy to weaken are the assumptions on the distribution shape.

Here’s an example that shows you need *some* assumptions on the distribution shape. Consider a standard distribution \(Z\) with \(P(Z = 1) = P(Z = -1) = \frac{1}{2}\) and suppose we can choose strategies of the form \(\mu + \sigma Z\). Note that \(E(Z) = 0\) and \(\mathrm{Var}(Z) = 1\) so these really are the mean and standard deviation of our distributions.

But \(E(\max\limits_{1 \leq k \leq n} Z_k) = 1 \cdot (1 - 2^{-n}) - 2^{-n} = 1 - 2^{1 - n}\) (because the maximum takes the value \(-1\) only if all of the individual values are \(-1\), which happens with probability \(2^{-n}\)). So \(E(\max\limits_{1 \leq k \leq n} (\mu + \sigma Z_k)) = \mu + (1 - 2^{1 - n})\sigma\). Since \(1 - 2^{1 - n} < 1\), you’re always better off raising \(\mu\) rather than \(\sigma\).
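The \(1 - 2^{1-n}\) formula is easy to sanity-check by simulation (sample size chosen so the Monte Carlo noise is well below the gaps we care about):

```python
import random

def mc_max_rademacher(n, trials=200_000):
    """Monte Carlo estimate of E(max of n independent fair +/-1 draws)."""
    total = sum(
        max(random.choice((-1, 1)) for _ in range(n))
        for _ in range(trials)
    )
    return total / trials

for n in (1, 2, 3, 4):
    exact = 1 - 2 ** (1 - n)
    print(n, exact, round(mc_max_rademacher(n), 3))
```

The estimates should land within a couple of hundredths of \(0, 0.5, 0.75, 0.875\) – always strictly below \(1\), which is the whole point of the example.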

The interesting feature of this example is that \(P(X_k \leq \mu + \sigma) = 1\). If this happens then it will always be the case that \(E(\mathrm{max}(X_k) ) \leq \mu + \sigma\) so there’s no real benefit to raising \(\sigma\) instead of \(\mu\) (note: It’s conceivable that there’s some complicated dependency on \(\mu\) as a parameter, but I’m just going to assume that \(\mu\) is purely positional and not worry about that).

You only need to go slightly beyond that to show that for some sufficiently large group you’ll always eventually be better off raising \(\sigma\) rather than \(\mu\).

Suppose all our strategies are drawn from some distribution \(X = \mu + Z^\sigma\) with \(E(Z^\sigma) = 0\). The only dependency on \(\sigma\) that we care about is that \(P(Z^\sigma \geq (1 + \epsilon)\sigma) \geq p\) for fixed \(\epsilon > 0\), \(0 < p < 1\), and all \(\sigma > 0\) (this is trivially satisfied by the normal distribution, for example).

Then we have \(E(\max\limits_{1 \leq k \leq n} X_k) = \mu + E(\max\limits_{1 \leq k \leq n} Z^\sigma_k)\).

So we now just want to find a lower bound on \(E(T_n)\), where \(T_n = \max\limits_{1 \leq k \leq n} Z^\sigma_k\). We’ll split \(T_n\) up as three variables. Let \(T_n = U_n + V_n + W_n\) where \(U_n = T_n \mathbb{1}_{T_n \leq 0}\), \(V_n = T_n \mathbb{1}_{0 < T_n < (1 + \epsilon) \sigma }\) and \(W_n = T_n \mathbb{1}_{(1 + \epsilon) \sigma \leq T_n }\).

Because \(V_n \geq 0\) and \(W_n \geq (1 + \epsilon) \sigma \mathbb{1}_{(1 + \epsilon) \sigma \leq T_n}\), this gives us the lower bound \(E(T_n) \geq E(U_n) + (1 + \epsilon)\sigma P(T_n \geq (1 + \epsilon) \sigma) \geq E(U_n) + (1 + \epsilon)\sigma (1 - (1 - p)^n)\), since the maximum falls short of \((1 + \epsilon)\sigma\) only if every \(Z^\sigma_k\) does, which happens with probability at most \((1 - p)^n\).

We now just need to bound \(U_n\) below. But \(U_n \geq U_1 \mathbb{1}_{Z^\sigma_k \leq 0, k \geq 2}\), and these two random variables are independent, so \(E(U_n) \geq E(U_1) P(Z^\sigma \leq 0)^{n - 1}\). Therefore \(E(T_n) \geq E(U_1) P(Z^\sigma \leq 0)^{n - 1} + (1 + \epsilon)\sigma (1 - (1 - p)^n)\). As \(n\) grows, the first term tends to zero and the second tends to \((1 + \epsilon)\sigma\).

This lower bound lets us show a much less pretty version of our last result:

Given a strategy \((\mu, \sigma)\) being employed by \(n\) people, and given some increase \(a\) which could go to either \(\mu\) or \(\sigma\), there exists some sufficiently large \(m\) such that for \(m\) people, changing the strategy to \((\mu, \sigma + a)\) would beat changing it to \((\mu + a, \sigma)\).

Yeah, that phrasing is kinda gross to me too.

Note though that if we go back to the previous case where \(\sigma\) is just a scaling parameter and we’re just dropping the normality assumption, we can use our lower bound on \(E(T_n)\) to find some \(n\) for which \(E(T_n | \sigma = 1) > 1\), and for all \(m \geq n\) it will be beneficial to increase \(\sigma\) rather than \(\mu\).

Note by the way the crucial role of \(\epsilon\). I think if you consider a distribution that takes with equal probability the values \(\sigma, -\sigma, \sigma + 2^{-\sigma}, -\sigma - 2^{-\sigma}\) (note that \(\sigma\) is not the standard deviation here) then it’s not actually helpful to raise \(\sigma\) instead of \(\mu\), even though \(P(Z^\sigma > \sigma) = \frac{1}{4}\). I have not bothered to work out the details.

Let’s adopt an overly simplistic mathematical model and say that how “good” the result of a strategy is is determined by a normal distribution \(N(\mu, \sigma^2)\).

Let’s further simplify and say that all we’re actually interested in to determine whether a strategy is good or not is the expected value of this goodness.

So how should we choose the best strategy?

Well we pick the one with the highest value of \(\mu\). The \(\mu\) parameter is exactly the expected value.

Now, suppose there are multiple people all working independently at the same task. What strategy should they adopt? Should they just all adopt the individually best strategy?

Well… it depends.

Specifically it depends on how these values combine.

If they just add together (e.g. say you’re doing something whose output is measured in quality-adjusted life years and there are more than enough people you should be helping that you’re not meaningfully going to start interfering with each other) then yes, you should all adopt the same strategy. The sum of independent normal random variables is another normal random variable with expectation equal to the sum of the expectations of its components.

But not everything is additive. For example, suppose you’re all trying to come up with a new type of invention, or to write the best essay, or anything where you can simply take the best result and replicate it as many times as you want.

What strategy should you adopt then?

Suppose that each of \(N\) people adopts a normally distributed strategy as above. Say the results of person \(n\) are \(X_n\). Then the overall result is \(R = \max\limits_{1 \leq n \leq N} X_n\).

By scaling, the expected value of this is \(\mathbb{E}(R) = \mu + \sigma \mathbb{E}(\max\limits_{1 \leq n \leq N} Z_n)\) where the \(Z_n\) are standard normal random variables.

And it turns out that for \(N \geq 4\), \(\mathbb{E}(\max\limits_{1 \leq n \leq N} Z_n) > 1\). So for groups of at least 4 people, increasing the standard deviation actually improves the best result more than increasing the mean by the same amount does. This shouldn’t be terribly surprising – at 4 people there’s very nearly a 50% chance that someone will land at least one standard deviation above the mean.
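The crossover at \(N = 4\) is easy to confirm by simulation – the expected maximum of \(N\) standard normals is roughly \(0.56\) for two people, \(0.85\) for three, and just over \(1.03\) for four. A Monte Carlo sketch:

```python
import random

def mc_expected_max_normal(n, trials=200_000):
    """Monte Carlo estimate of E(max of n independent standard normals)."""
    total = sum(
        max(random.gauss(0.0, 1.0) for _ in range(n))
        for _ in range(trials)
    )
    return total / trials

for n in (2, 3, 4, 5):
    print(n, round(mc_expected_max_normal(n), 3))
# n = 4 is the first point where the estimate exceeds 1
```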

So for groups of more than a couple people who are working independently and whose results can simply be replicated, the optimal strategy is not in fact to optimise your local strategy but to pursue higher variance ones. You should still try to produce good strategies of course, but by pursuing higher variance at some cost of optimality you may significantly improve the results.

Obviously this model is overly simplistic – quality isn’t scalar, it’s probably not normally distributed, and independence can be hard to achieve in practice, but I hope it is at least suggestive.
