This has the virtue of being the first hypothesis feature that isn’t blatantly stolen, excuse me, “ported”, from ScalaCheck.
Instead it’s blatantly stolen from a post by John Regehr about Csmith.
Update: John Regehr pointed out in the comments that I was misattributing this idea to him, and it was in fact due to Alex Groce.
What’s the idea?
The idea is basically to change the distribution of the search inputs. Rather than having every probe explore the whole space more or less uniformly at a given size, each probe into the space is more focused. This produces examples which are in some sense more “interesting”.
How does this work?
The idea is that in any given run of the producers we have a set of flags which may be turned on and off. Let me show an example:
@produces(int)
def produce_int(self, size):
    p = 1.0 / (size + 1)
    n = int(log(random()) / log(1 - p))
    if self.is_flag_enabled("allow_negative_numbers", size):
        if random() <= 0.5:
            n = -n
    return n
So on some runs we will be allowed to produce negative numbers, on some we won’t.
This means that we can, for example, produce very long lists of only positive integers because negative integers were turned off for this run. Previously a list would be all-positive with probability \(2^{-n}\) (where \(n\) is the length of the list), so for example the following would have been unfalsifiable:
from hypothesis import assume, falsify

def long_lists_contain_negative_elements(xs):
    assume(len(xs) >= 50)
    assert any((a < 0 for a in xs))
    return True

falsify(long_lists_contain_negative_elements, [int])
But with the new flag-based code it’s easy to falsify.
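To make the effect concrete, here’s a rough standalone sketch of the sort of thing going on (this isn’t hypothesis’s actual code: produce_list, the explicit flags argument and the fixed length are all just for illustration):

from math import log
from random import random

def produce_int(size, flags):
    # Same geometric-ish distribution as the producer above, but the
    # negation step can only fire if the flag is on for this run.
    p = 1.0 / (size + 1)
    n = int(log(random()) / log(1 - p))
    if "allow_negative_numbers" in flags and random() <= 0.5:
        n = -n
    return n

def produce_list(length, size, flags):
    # The whole list shares one flag assignment, so a run where
    # "allow_negative_numbers" is off never contains a negative element,
    # no matter how long the list is.
    return [produce_int(size, flags) for _ in range(length)]

# One run with the flag off, one with it on.
print(produce_list(50, 10, flags=set()))
print(produce_list(50, 10, flags={"allow_negative_numbers"}))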
The example that actually motivated this was the stateful testing one in the previous post. If we add a “clear” operation to it and allow that as a test step then we become unable to find falsifying examples because the clear step stomps on things too regularly to ever generate long enough examples.
This is also illustrative of why this is useful in general: Bugs will typically be exhibited by the interaction of one or two features. When we’re looking at all the features we may not exercise the interaction of those two enough to actually test them properly, but when we selectively disable features we can study pairwise interactions much more thoroughly.
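Purely as an illustration of why that helps (made-up feature names, nothing to do with hypothesis’s internals): if each run enables a random subset of the available features, every pair of features gets plenty of runs where both are on while the rest are often off:

from itertools import combinations
from random import random

FEATURES = ["negative_numbers", "duplicates", "clear_step", "large_values"]

def pick_flags(enable_probability=0.5):
    # Each run independently enables a random subset of the features.
    return {f for f in FEATURES if random() <= enable_probability}

pair_runs = {pair: 0 for pair in combinations(FEATURES, 2)}
for _ in range(1000):
    flags = pick_flags()
    enabled_in_order = [f for f in FEATURES if f in flags]
    for pair in combinations(enabled_in_order, 2):
        pair_runs[pair] += 1

# Each pair of features ends up enabled together in roughly a quarter of
# the runs, frequently with the other features switched off.
for pair, count in sorted(pair_runs.items()):
    print(pair, count)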
Currently the way this is implemented internally is quite gross. I wouldn’t look at it if I were you. I’m mostly focused on hypothesis externals right now, and will think about how to better redesign its insides at a later date.
Update: This feature has been temporarily removed due to me being really unhappy with the implementation and it getting in the way of some internals I’m refactoring at the moment. It will be back before the next released version of hypothesis.
Awesome! And random testers are allowed to be gross :).
Yeah, but right now it’s gross in a way that probably makes it harder to use and modify than it should be rather than just algorithmically gross. :-)
One thing you might find interesting, by the way, is how this feature interacts with the search strategy.
When searching, hypothesis has a “size” parameter that roughly controls the complexity of the search space (it’s ad hoc, exactly what it means is unspecified and it’s interpreted differently for different types, but basically low size parameters should produce a distribution which is tightly clustered around some point). The search starts with the size parameter low, raises it until it finds a counterexample, then lowers it until it stops finding counterexamples.
This size parameter also controls the number of flags that are turned on: At low size parameters most are turned off, at high size parameters most are turned on.
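As a sketch of what that coupling could look like (invented names and numbers, not the real implementation), you can make each flag’s chance of being enabled grow with the size parameter:

from random import random

ALL_FLAGS = ["allow_negative_numbers", "allow_duplicates", "allow_clear_step"]

def enabled_flags(size, scale=10.0):
    # Hypothetical rule: the chance of enabling each flag climbs from near
    # 0 at small sizes towards 1 as the size parameter grows.
    enable_probability = size / (size + scale)
    return {flag for flag in ALL_FLAGS if random() <= enable_probability}

for size in (1, 5, 25, 100):
    print(size, sorted(enabled_flags(size)))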
Sounds excellent. Also I should add that this was all Alex’s idea in the first place.