This is a heuristic for how to choose what steps to take during a debugging or problem solving process. I think it is too labourious to calculate in practice, but it might be interesting for either machine learning or for some sort of interactive problem solving system which talks you through the steps (indeed my friend Dave and I have talked about building such a thing. We’ve tentatively settled on the name Roboduck).

I also suspect I’m reinventing somebody’s wheel here, but I’m not sure whose. If nothing else it pretty strongly resembles some of the maths you would use for building decision trees.

It started after I read John Regehr’s how to debug. In it he illustrates the debugging process with what are basically probability distributions over likely causes of bugs.

One particularly interesting paragraph is about how you choose what experiments to perform in order to eliminate hypotheses:

Hopefully the pattern is now clear: based on our intuition about where the bug lies, we design an experiment that will reject roughly 50% of the possible bugs. Of course I’m oversimplifying a bit. First, experiments don’t always have yes/no results. Second, it may be preferable to run an easy experiment that rejects 15% of the possible bugs rather than a difficult one that rejects exactly 50% of them. There’s no wrong way to do this as long as we rapidly home in on the bug.

It occurred to me that, although there was no wrong way to do it, clearly some ways of doing it were better than others, and we can actually make this empirically precise.

Basically, when you have your probability distribution about what you think the state of the world is, it is possible to assign a number that almost literally means “How confused am I?” (actually it means “How much additional information do I expect to need to figure out what’s going on if I proceed optimally?”). The Shannon entropy.

Your goal is to get the entropy of your current set of hypotheses down to 0 (I know what is going on).

This means that a fairly sensible way to proceed is to maximizes the rate of change of entropy (it isn’t necessarily correct because this is a purely local search which you might be able to leap out of by finding a better starting point, but it should be pretty good).

So, how do you choose which experiment to perform? You look at the expected decrease in entropy and divide it by the time it will take you to perform the experiment.

That’s pretty much it, but lets do some worked examples.

Suppose we have \(N\) equally likely possibilities, and an experiment will let us determine if it’s one of \(k\) of them. That is, if the experiment gives one answer we will have \(k\) equally likely possibilities, and if it gives the other we will have \(N – k\) equally likely possibilities. Let’s write \(p = \frac{k}{N}\) for convenience.

Our current entropy is \(\log(N)\) (all logs will be base 2).

We expect to get the first answer with probability \(p\), and the second answer with probability \(1 – p\).

For the first answer we have entropy \(\log(k)\). For the second we have entropy \(\log(N – k\).

So the expected entropy of the result is

\[

\begin{align*}

E(h) &= \log(N) – p \log(k) – (1 – p) \log(N – k) \\

&= p(\log(N) – \log(k)) + (1 – p)(\log(N) – \log(N – k)) \\

&= -p \log(p) – (1 – p) \log(1 – p)\\

\end{align*}

\]

Which it turns out is the negative of the entropy of the experimental result (that is, our experimental result is a random variable which takes one value with value \(p\) and the other with value \(1 – p\) and this is the entropy of such a variable).

This is a nice result which is unfortunately completely false in general. In order to see this, consider running an experiment where we flip a coin and observe the outcome: This experiment has entropy \(log(2)\), which is at least as high as the entropy of the preceding experiment, but it gives us precisely 0 information.

Lets continue with our example and use it to analyze the example in the quote:

Suppose we have an experiment that would let us eliminate 50% of cases. This gives us an expected information gain of \(1\). Now suppose we have an experiment that would let us eliminate 15% or 85% of cases. This has an information gain of \(-\log(0.15) * 0.15 + -\log(0.85) * 0.85 \approx 0.43\). So the former experiment is about \(2.37\) times as good as the latter experiment, and should be preferred unless you can perform the latter experiment at least \(2.37\) times as quickly.

Pingback: Best of drmaciver.com | David R. MacIver