Category Archives: Uncategorized

Value functions lead to non-transitive preferences

Update: After a conversation with Paul Crowley I am somewhat rethinking my position here. I don’t think (as he does) that what I have written is obviously wrong, but I’m at least convinced it is less obviously right. Possibly more to follow later, possibly not.

So there’s a thing called the Von Neumann–Morgenstern utility theorem, which implies that if you have a bunch of states of the world and you have a transitive preference amongst lotteries amongst those states of the world (i.e. we end up in state \(A_k\) with probability \(p_k\)), plus some other assumptions about that preference, then there is a real valued utility function such that you prefer one lottery to another if its expected utility is higher.

I was thinking about this and I realised something. If you have a real valued utility function over states of the world then your decision procedure amongst lotteries should not be expected utility and the actual decision procedure you should take is as follows: Given lotteries A and B, you should prefer the lottery that you expect to give you a higher value.

Read what I said carefully.

Expect to give you a higher value is not the same relationship as give you a higher expected value.

We want to choose \(A\) over \(B\) if \(P(f(A) > f(B)) > P(f(B) > f(A))\) (where \(f\) is our value function). This is not the same relationship as \(\mathbb{E}(f(A)) > \mathbb{E}(f(B))\). Moreover, this relationship is non-transitive!

(Note: It may reasonably be argued that this is not true, and that where the expected values differ you should still choose the one with the higher expected value, even if it is less likely to come out ahead. However I think this doesn’t matter, because we can then construct examples where the expected values are all equal, and replace this with the decision procedure “Pick the highest expected value if there is one, else pick the one which maximizes this probability”)
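To make the distinction concrete, here is a small sketch (the lotteries are my own illustrative numbers, not from the post): one lottery can have the lower expected value and still be the one you should “expect to give you a higher value”.

```python
from fractions import Fraction as F

# Each lottery is a list of (value, probability) pairs.
A = [(F(2), F(1))]                          # 2 with certainty
B = [(F(0), F(9, 10)), (F(100), F(1, 10))]  # usually 0, occasionally 100

def expected(lottery):
    return sum(v * p for v, p in lottery)

def prob_beats(x, y):
    # P(an independent draw from x exceeds one from y)
    return sum(px * py for vx, px in x for vy, py in y if vx > vy)
```

Here expected(B) = 10 beats expected(A) = 2, yet prob_beats(A, B) = 9/10: nine times out of ten, A gives you the higher value.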

Don’t believe me? The proof is very simple.

Pick nine outcomes \(A_1, \ldots, A_9\) with distinct values (if you have fewer than nine distinct values, replace some outcomes with lotteries to give distinct expected values – but you probably don’t have fewer: e.g. I suspect you assign distinct values to “N people die” for quite a wide range of N)

Now use those nine outcomes to construct non-transitive dice, putting one outcome on each side.

Rolling these non-transitive dice then gives a set of three lotteries over outcomes such that the preference relationship between them under your value function is non-transitive.
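The construction can be checked directly. Here is one classic set of non-transitive dice (the nine face values standing in for the outcomes \(A_1, \ldots, A_9\), ordered by your value function):

```python
from fractions import Fraction
from itertools import product

# Each die has three distinct values, each appearing on two faces.
X = [2, 2, 4, 4, 9, 9]
Y = [1, 1, 6, 6, 8, 8]
Z = [3, 3, 5, 5, 7, 7]

def p_beats(a, b):
    # probability that a roll of die a exceeds a roll of die b
    wins = sum(1 for i, j in product(a, b) if i > j)
    return Fraction(wins, len(a) * len(b))
```

Each of p_beats(X, Y), p_beats(Y, Z) and p_beats(Z, X) is 5/9 > 1/2, so the “more likely to be better” relation cycles: X ≻ Y ≻ Z ≻ X.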

This entry was posted in Uncategorized on by .

Book review: What The Best College Teachers Do

Someone told me something a while ago. I wish I remembered who it was. I think it was someone on IRC – probably in #haskell on Freenode or #math on EFNet.

Teaching is impossible. All you can do is help people to learn.

At the time I likely dismissed it as trite and meaningless, because I have a tendency to do that that’s taken me a long time to rein in, but it was a good turn of phrase so it stuck with me, and over time I’ve come to realise just how true it is.

What The Best College Teachers Do, by Ken Bain, is a book about how you can help people to learn.

It starts from the novel perspective that rather than trying to fit education to some preconceived notion about how things should work, we should instead look at what actually happens in practice and see what did and didn’t work.

I tend to disagree with people. Even when I think they’re mostly right there’s usually something I think they’ve not thought through properly, or that is simpler or more complicated than they think. It’s nothing personal, it’s just a thing that happens.

That’s not what’s going to happen here. I pretty much agree with everything in this book. And not in an “Oh, well, yes, that’s obvious” sort of way. It’s more the excited jumping up and down sort of way, where you’re shouting “This. Yes. This. What they are saying is exactly what I wanted to say, only properly thought out and way better put” – except where that’s usually about one particular turn of phrase or short essay, this is an entire book. Very little of the information in it was surprising per se, but it’s the lack of surprise you get from having a whole bunch of things you knew but didn’t know you knew suddenly click together into a coherent framework.

If I had to draw some highlights from the book other than “Go read the entire thing”, they’d be as follows:

  • Attempting to pour knowledge into people’s brains simply doesn’t work
  • Very few people have learned to learn or to engage critically with a subject, because there’s no class for it and you’re just expected to sort of pick it up (or, worse, it’s not considered important for you to pick it up). This is such an essential skill that in order to teach someone anything you must also teach them this
  • Students have built up mental models of how things work which may be wrong and may need to be unlearned. It is important to make sure that they are not just learning to parrot knowledge but are learning a new way of thinking about the world.
  • The best way for a student to learn a subject is for them to be interested and engaged with it
  • The best way to get someone interested is to connect it to questions and subjects they care about and to use it to study those questions
  • The easiest way to kill someone’s interest is to destroy their confidence in their ability
  • You can bolster people’s confidence by setting high but achievable standards and maintaining unwavering confidence that they can achieve them. This is also a good way to avoid problems of stereotype threat, where someone is reminded that “people like them” aren’t “supposed to” be good at this subject and does worse as a result
  • Treat your students as human beings and engage with them as equals.

The above is only a random, paraphrased collection of some of the best bits of the book. Seriously, if you have any interest in the subject of how to teach people, go read the whole thing.

This entry was posted in Books, Uncategorized on by .

A bugfix for my queuing business model

I’ve been thinking about the queuing model of business I posted the other day and I’ve realised it has a bit of a flaw that I’d missed because it’s not really relevant in the original medical context I came up with this in. Fortunately, some thinking around this flaw has caused me to come up with a model that I think is actually better, even for the original medical application.

The problem comes in two parts, but really they’re the same core issue: Your highest paying customer can always hog resources (this was much less of a problem for the medical case because you’re naturally limited in terms of how many resources you can actually consume as an individual).

Consider the following two problem cases:

  • Your top customer wants to transcode 1000 videos
  • Your top customer is asking for the meeting room every day

In the first case your customer is spamming the queue with a large number of jobs. Each job takes up a position at the head of the queue, and they all have to be cleared before any jobs from other customers can be.

In the second case they are not filling the queue at any point, but the fact that they are repeatedly queue jumping means they’re getting a highly disproportionate share of the resources.

So how do we fix this?

The basic idea remains the same: You have a FIFO queue and a second queue which allows people who pay more to get processed sooner. However we make two modifications:

  1. The second queue is a queue of customers, not jobs. Whenever a customer’s turn comes up, their earliest submitted job gets processed.
  2. The second queue isn’t really a queue at all. It’s a probability distribution.

The second point might require some elaboration…

The idea is simple: If we have three customers with scores 2, 1 and 1 respectively, we want to give the first customer 50% of the resources, but no more than 50% of the resources. The second two should split the remainder equally.

So what do we do? We run a lottery! (yaaaay).

When we want to process a paying customer, we pick a customer at random with probability proportional to their score. So someone with twice as high a score gets picked twice as often.

And, yeah, that’s pretty much it.
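A minimal sketch of the lottery (names and numbers are mine): pick a customer at random, weighted by score, so the scores 2, 1, 1 from the example above give the first customer half the draws.

```python
import random

def pick_customer(scores, rng=random):
    """Pick a paying customer with probability proportional to their score."""
    customers = list(scores)
    weights = [scores[c] for c in customers]
    return rng.choices(customers, weights=weights, k=1)[0]
```

With scores = {"a": 2, "b": 1, "c": 1}, customer "a" wins about 50% of the lotteries, and "b" and "c" about 25% each.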

As well as being fairer and less prone to pissing your customers off, this also has a really nice business benefit: people always have an incentive to pay you more. In the old model there was very little benefit to increasing the amount you pay unless you increased it by a lot. In the new model every penny you add increases your share of the priority resources, and thus is potentially of benefit to you.

This entry was posted in Uncategorized on by .

A possible more principled approach to generation and simplification in hypothesis

Currently the way generation and simplification work in hypothesis is very ad hoc and undisciplined. The API is spread across various places, and the actual behaviour is very under-specified.

There are two operations exposed, produce and simplify. produce takes a size parameter and produces values of the specified type of about the provided size, whereas simplify takes a value of the specified type and produces a generator over simplified variants of it.

The meaning of these terms is deliberately undefined: there is no specified meaning for “about that size” or “simplified variants”.

This post is an attempt to sketch out some ideas about how this could become better specified. I’m just going to use maths rather than Python here – some of the beginnings of this have made it into code, but it’s no more than a starting point right now.

Let \(S\) be a set of values we’ll call our state space. We will call a search tactic for \(S\) a triple:

\[\begin{align*}
\mathrm{complexity} & : S \to [0, \infty) \\
\mathrm{simplify} & : S \to [S]^{< \omega} \\
\mathrm{produce} & : [0, \infty) \to \mathrm{RV}(S) \\
\end{align*}\]

Where \(\mathrm{RV}(S)\) is the set of random variables taking values in \(S\) and \([S]^{<\omega}\) is the set of finite sequences taking values in \(S\). These operations should satisfy the following properties:

  1. \(\mathrm{h}(\mathrm{produce}(x)) = \min(x, \log(|S|))\), where \(\mathrm{h}\) is the entropy function
  2. \(x \to \mathbb{E}(\mathrm{complexity}(\mathrm{produce}(x)))\) should be monotone increasing in \(x\)
  3. \(\mathrm{complexity}(y) \leq \mathrm{complexity}(x)\) for \(y \in \mathrm{simplify}(x)\)

In general it would be nice if the distribution of \(\mathrm{produce}(x)\) minimized the expected complexity, but I think that would be too restrictive.

The idea here is that the entropy of the distribution is a good measure of how spread out the search space is – a low entropy distribution will be very concentrated, whereas a high entropy distribution will be very spread out. This makes it a good “size” function. The requirement that the expected complexity be monotone increasing captures the idea of the search space spreading out, and the requirement that simplification not increase the complexity captures the idea of moving towards values more like what you generate at low size.
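To pin down the shape of the API, here is one possible way to package a search tactic in Python (a sketch of mine; hypothesis does not currently expose anything like this):

```python
from dataclasses import dataclass
from typing import Callable, List, TypeVar

S = TypeVar("S")

@dataclass(frozen=True)
class SearchTactic:
    # complexity : S -> [0, inf)
    complexity: Callable[[S], float]
    # simplify : S -> finite list of S, none more complex than the input
    simplify: Callable[[S], List[S]]
    # produce : entropy budget h in [0, inf) -> a random value of S
    produce: Callable[[float], S]
```

The three properties above would then be contracts on any concrete instance rather than anything enforced by the type.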

Here are some examples of how you might produce search strategies:

A search strategy for positive real numbers could be:

\[\begin{align*}
\mathrm{complexity}(x) & = x \\
\mathrm{simplify}(x) & = x, \ldots, x - n \mbox{ for } n < x \\
\mathrm{produce}(h) & = \mathrm{Exp}(e^h - 1) \\
\end{align*}\]

The exponential distribution seems a nice choice because it is a maximum entropy distribution for a given expectation, but I don’t really know if it is a minimal choice of expectation for a fixed entropy.

Another example: given search strategies for \(S\) and \(T\) we can produce a search strategy for \(S \times T\) as follows:

\[\begin{align*}
\mathrm{complexity}(x, y) & = \mathrm{complexity}_S(x) + \mathrm{complexity}_T(y) \\
\mathrm{simplify}(x, y) & = [(a, b) : a \in \mathrm{simplify}_S(x), b \in \mathrm{simplify}_T(y)] \\
\mathrm{produce}(h) & = (\mathrm{produce}_S(\tfrac{1}{2}h), \mathrm{produce}_T(\tfrac{1}{2}h)) \\
\end{align*}\]

The first two should be self-explanatory. The produce function works because the entropy of a pair of independent variables is the sum of the entropies of its components. It might also be interesting to distribute the entropy less uniformly between the components, but this is probably simplest.

You can also generate a search strategy for sequences: given a search strategy for the natural numbers \(\mathbb{N}\) and a search strategy for \(S\), we can generate a search strategy for \([S]^{< \omega}\). If we define the complexity of a sequence as the sum of the complexities of its components, we can define its simplifications as coordinatewise simplifications of subsequences. \(\mathrm{produce}\) is the only hard one to define. It goes as follows: we generate the length as a random variable \(N = \mathrm{produce}_{\mathbb{N}}(i)\) for some entropy \(i\), then allocate a total entropy \(j\) to the coordinates, so each coordinate is generated as \(x_k = \mathrm{produce}_S(\frac{j}{N})\). Let \(T\) be the random variable of the sequence produced. The value of \(N\) is completely determined by \(T\), so \(h(T) = h(T, N)\).
We can then use conditional entropy to calculate the total: we have \(h(T \mid N = n) = j\) because we allocated the entropy equally between the coordinates.

So \(h(T) = h(N) + h(T \mid N) = i + j\)

So we can allocate the entropy between length and coordinates as we wish: either an equal split, or biasing in favour of longer sequences with simpler coordinates, or shorter sequences with more complex coordinates.
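The worked examples above can be sketched in Python. To be clear about what is mine rather than the post’s: reading \(e^h - 1\) as the mean of the exponential, restricting simplify to unit steps, and the helper names produce_length / produce_elem are all my assumptions.

```python
import math
import random

# --- positive reals tactic ---

def complexity(x):
    return x

def simplify(x):
    # step down towards zero in unit decrements: x - 1, x - 2, ...
    return [x - n for n in range(1, int(x) + 1)]

def produce(h, rng=random):
    # exponential with mean exp(h) - 1; degenerate at h = 0
    mean = math.expm1(h)
    return rng.expovariate(1.0 / mean) if mean > 0 else 0.0

# --- sequence producer: entropy i on the length, j split over coordinates ---

def produce_sequence(i, j, produce_length, produce_elem):
    # total entropy is then h(N) + h(T | N) = i + j
    n = produce_length(i)
    return [produce_elem(j / n) for _ in range(n)] if n > 0 else []
```

Plugging in deterministic stubs for produce_length and produce_elem shows the equal split: a length-3 sequence with budget j = 6 gives each coordinate entropy 2.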

Anyway, those are some worked examples. It seems to work reasonably well, and is more pleasantly principled than the current ad hoc approach. It remains to be seen how well it works in practice.

This entry was posted in Hypothesis, Numbers are hard, programming, Uncategorized on by .

Another new hypothesis feature: flags

This has the virtue of being the first hypothesis feature that isn’t blatantly stolen, excuse me, “ported”, from Scalacheck.

Instead it’s blatantly stolen from a post by John Regehr about Csmith.

Update: John Regehr pointed out in the comments that I was misattributing this idea to him, and it was in fact due to Alex Groce.

What’s the idea?

The idea is basically to change the distribution of the search inputs. Rather than exploring the whole space more or less uniformly at each size, each probe into the space is more focused. This produces examples which are in some sense more “interesting”.

How does this work?

The idea is that in any given run of the producers we have a set of flags which may be turned on and off. Let me show an example:

from math import log
from random import random

@produces(int)
def produce_int(self, size):
    # geometric distribution with mean roughly equal to size
    p = 1.0 / (size + 1)
    n = int(log(random()) / log(1 - p))
    if self.is_flag_enabled("allow_negative_numbers", size):
        if random() <= 0.5:
            n = -n
    return n

So on some runs we will be allowed to produce negative numbers, on some we won’t.
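One plausible way to implement the per-run flag choice (a sketch of mine – the post describes the real internals as “quite gross”, so this is certainly not them):

```python
import random

class FlagSet:
    """Each run of the producers fixes a random subset of enabled flags;
    every producer call in that run then sees the same choice."""

    def __init__(self, flags, rng=random):
        # each flag is independently on or off for the whole run
        self.enabled = {f for f in flags if rng.random() < 0.5}

    def is_flag_enabled(self, flag, size=None):
        # size accepted for parity with produce_int above, unused here
        return flag in self.enabled
```

A fresh FlagSet per run is all that is needed to get runs where, say, negative numbers never appear.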

This means that we can, for example, produce very long lists of only positive integers because negative integers were turned off for this run. Previously a list would be all-positive with probability \(2^{-n}\) (where \(n\) is the length of the list), so for example the following would have been unfalsifiable:

from hypothesis import assume, falsify
 
def long_lists_contain_negative_elements(xs):
    assume(len(xs) >= 50)
    assert any((a < 0 for a in xs))
    return True
 
falsify(long_lists_contain_negative_elements, [int])

But with the new flag based code it’s easy to falsify.

The example that actually motivated this was the stateful testing one in the previous post. If we add a “clear” operation to it and allow that as a test step then we become unable to find falsifying examples because the clear step stomps on things too regularly to ever generate long enough examples.

This is also illustrative of why this is useful in general: Bugs will typically be exhibited by the interaction of one or two features. When we’re looking at all the features we may not exercise the interaction of those two enough to actually test them properly, but when we selectively disable features we can study pairwise interactions much more thoroughly.

Currently the way this is implemented internally is quite gross. I wouldn’t look at it if I were you. I’m mostly focused on hypothesis externals right now, and will think about how to better redesign its insides at a later date.

Update: This feature has been temporarily removed due to me being really unhappy with the implementation and it getting in the way of some internals I’m refactoring at the moment. It will be back before the next released version of hypothesis.

This entry was posted in Hypothesis, Uncategorized on by .