David R. MacIver's Blog: A possible more principled approach to generation and simplification in hypothesis

A possible more principled approach to generation and simplification in hypothesis

20 March 2013

Currently the way generation and simplification in hypothesis work are very ad hoc and undisciplined. The API is spread across various places, and the actual behaviour is very under-specified.

There are two perations exposed, produce and simplify. produce takes a size parameter and produces values of the specifed type of about the provided size, whileas simplify takes a value of the specified type and produces a generator over simplified variants of it.

The meaning of these terms is explicit meaning of these terms is deliberately undefined: There is no specified meaning for “about that size” or “simplified variants”.

This post is an attempt to sketch out some ideas about how this could become better specified. I’m just going to use maths rather than Python here - some of the beginnings of this has made it into code, but it’s no more than a starting point right now.

Let \(S\) be a set of values we’ll call our state space. We will call a search tactic for \(S\) a triple:

\[\begin{align*}\ \mathrm{complexity} & : S \to [0, \infty) \\\ \mathrm{simplify} & : S \to [S]^{< \omega}\\\ \mathrm{produce} & : [0, \infty) \to \mathrm{RV}(S) \\\ \end{align*}\]

Where \(\mathrm{RV}(S)\) is the set of random variables taking values in \(S\) and \([S]^{<\omega}\) is the set of finite sequences taking values in \(S\).

These operations should satisfy the following properties:

\(\mathrm{h}(\mathrm{produce}(x)) = \min(x, \log(|S|))\)
\(x \to \mathbb{E}(\mathrm{complexity}(\mathrm{produce}(x)))\) should be monotone increasing in X
\(\mathrm{complexity}(y) \leq \mathrm{complexity}(x)\) for \(y \in \mathrm{simplify}(x)\)

In general it would be nice if the distribution of \(\mathrm{produce}(x)\) minimized the expected complexity, but I think that would be too restrictive.

Where \(\mathrm{h}\) is the entropy function.

The idea here is that the entropy of the distribution is a good measure of how spread out the search space is - a low entropy distribution will be very concentrated, whileas a high entropy distribution will be very spread out. This makes it a good “size” function. The requirement that the expected complexity be monotone increasing captures the idea of the search space spreading out, and the requirement that simplification not increase the complexity captures the idea of moving towards the values more like what you generated at low size.

Here are some examples of how you might produce search strategies:

A search strategy for positive real numbers could be:

\[\begin{align*}\ \mathrm{complexity}(x) & = x \\\ \mathrm{simplify}(x) & = x, \ldots, x - n \mbox{ for } n < x\\\ \mathrm{produce}(h) & = \mathrm{Exp}(e^h - 1) \\\ \end{align*}\]

The exponential distribution seems to be a nice choice because it’s a maximum entropy distribution for a given expectation, but I don’t really know if it’s a minimal choice of expectation for a fixed entropy.

Another example. Given search strategies for \(S\) and \(T\) we can produce a search strategy for \(S \times T\) as follows:

\[\begin{align*}\ \mathrm{complexity}(x, y) & = \mathrm{complexity}_S(x) + \mathrm{complexity}_T(y)\\\ \mathrm{simplify}(x, y) & = [(a, b) : a \in \mathrm{simplify}_S(x), b \in \mathrm{simplify}_T(y)] \\\ \mathrm{produce}(h) & = (\mathrm{produce}_S(\frac{1}{2}h), \mathrm{produce}_T(\frac{1}{2}h)) \\\ \end{align*}\]

The first two should be self-explanatory. The produce function works because the entropy of a product of independent variables is the sum of the entropy of its components. It might also be potentially interesting to distribute the entropy less uniformly through the components, but this is probably simplest.

You can also generate a search strategy for sequences given a search strategy for natural numbers \(\mathbb{N}\) and a search strategy for \(S\) we can generate a search strategy for \([S]^{< \omega}\):

If we define the complexity of a sequence as the sum of the complexities of its components, we can define its simplifications as coordinatewise simplifications of subsequences. The produce is the only hard one to define. Its definition goes as follows:

We will generate length as a random variable \(N = \mathrm{produce}_{\mathbb{N}}(i)\) for some entropy \(i\). We will then allocate a total entropy \(j\) to the sequence of length \(N\). So the coordinates will be generated as \(x_i = \mathrm{produce}_S(\frac{j}{N})\). Let \(T\) be the random variable of the sequence produced.

The value of \(N\) is completely specified by \(S\), so \(h(S) = h(S,N)\). We can then use conditional entropy to calculate this: We have that \(h(S | N = n) = j\) because we allocated the entropy equally between each of the coordinates.

So \(h(S) = h(N) + h(S | N) = i + j\)

So we can allocate the entropy between coordinates and length as we wish - either an equal split, or biasing in favour of shorter sequences with more complex coordinates or short sequences with complex coordinates.

Anyway, those are some worked examples. It seems to work reasonably well, and is more pleasantly principled than the current ad hoc approach. It remains to be seen how well it works in practice.

Comments

The horror lurking at the heart of the new hypothesis | David R. MacIver on 2013-03-23 18:07:22:

[...] I’m much happier with the new design than I was with the old one. It takes the idea of the search strategy I mentioned in my previous post and makes it the fundamental building block of the library. This [...]