This has the virtue of being the first hypothesis feature that isn’t blatantly stolen, excuse me, “ported”, from Scalacheck.
Instead it’s blatantly stolen from a post by John Regehr about Csmith.
Update: John Regehr pointed out in the comments that I was misattributing this idea to him, and it was in fact due to Alex Groce.
What’s the idea?
The idea is basically to change the distribution of the search inputs. Rather than every run exploring the whole space more or less uniformly at each size, each probe into the space is more focused. This produces examples which are in some sense more “interesting”.
How does this work?
In any given run of the producers we have a set of flags which may be turned on or off. Let me show an example:
```python
from math import log
from random import random

@produces(int)
def produce_int(self, size):
    # Draw from a geometric distribution whose mean grows with size.
    p = 1.0 / (size + 1)
    n = int(log(random()) / log(1 - p))
    # Only some runs are allowed to produce negative numbers at all.
    if self.is_flag_enabled("allow_negative_numbers", size):
        if random() <= 0.5:
            n = -n
    return n
```
So on some runs we will be allowed to produce negative numbers, and on others we won’t.
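To make the per-run part concrete, here is a minimal sketch of how such flags could behave. The FlagSet class and the fifty-fifty choice below are purely my own illustration, not hypothesis’s actual internals:

```python
from random import random

class FlagSet(object):
    """Hypothetical sketch: decides each flag once per run and remembers it."""

    def __init__(self):
        self.decisions = {}

    def is_flag_enabled(self, name, size):
        # Decide the flag the first time it is asked about, then stick with
        # that answer for the rest of the run, so every value produced in
        # this run sees the same setting. The 50/50 split is an assumption.
        if name not in self.decisions:
            self.decisions[name] = random() <= 0.5
        return self.decisions[name]
```

The important property is that the decision happens per run rather than per value, so within any one run the behaviour is consistent while different runs see different combinations of flags.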
This means that we can, for example, produce very long lists of only positive integers on runs where negative numbers have been turned off. Previously a list would be all-positive with probability \(2^{-n}\) (where \(n\) is the length of the list), so for example the following would have been unfalsifiable:
```python
from hypothesis import assume, falsify

def long_lists_contain_negative_elements(xs):
    assume(len(xs) >= 50)
    assert any(a < 0 for a in xs)
    return True

falsify(long_lists_contain_negative_elements, [int])
```
But with the new flag-based code it’s easy to falsify.
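Some rough arithmetic shows why. The one-in-two figure for a flag being off is an assumption I’m making purely for illustration, not something hypothesis promises:

```python
# Back-of-the-envelope comparison (my own illustration, not hypothesis code).

# Without flags: each element is negated with probability 1/2 independently,
# so a 50-element list comes out all-positive with probability 2**-50.
p_all_positive_without_flags = 0.5 ** 50   # ~8.9e-16, effectively never

# With flags: assume allow_negative_numbers is off on about half of all runs.
# On those runs *every* list is all-positive, regardless of its length.
p_all_positive_with_flags = 0.5

print(p_all_positive_without_flags)
print(p_all_positive_with_flags)
```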
The example that actually motivated this was the stateful testing one in the previous post. If we add a “clear” operation to it and allow that as a test step then we become unable to find falsifying examples because the clear step stomps on things too regularly to ever generate long enough examples.
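To put a number on “too regularly”, suppose clear is one of four equally likely test steps; the specific figures here are mine, chosen only to show the scale of the problem:

```python
# Hypothetical numbers: four equally likely operations, one of which is "clear".
p_step_is_not_clear = 3.0 / 4.0

# Probability that a 20-step program never clears the structure, which is
# roughly what you need for enough state to build up to hit a bug.
print(p_step_is_not_clear ** 20)   # ~0.003, so nearly every run gets wiped
```

With a flag that can switch clear off for a whole run, a reasonable fraction of runs build long clear-free programs and the interesting state has a chance to accumulate.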
This also illustrates why the feature is useful in general: bugs are typically exhibited by the interaction of only one or two features. When we’re looking at all the features at once we may not exercise that particular interaction enough to actually test it properly, but when we selectively disable features we can study pairwise interactions much more thoroughly.
Currently the way this is implemented internally is quite gross. I wouldn’t look at it if I were you. I’m mostly focused on hypothesis externals right now, and will think about how to better redesign its insides at a later date.
Update: This feature has been temporarily removed due to me being really unhappy with the implementation and it getting in the way of some internals I’m refactoring at the moment. It will be back before the next released version of hypothesis.