A pathological example for test-case reduction

This is an example I cut from a paper I am currently writing about test-case reduction, because I needed to save space and nobody except me actually cares about the algorithmic complexity of test-case reduction.

Suppose you have the following test case:

from hypothesis import given, strategies as st
 
K = 64
 
INTS = st.integers(0, 2 ** K - 1)
 
@given(INTS, INTS)
def test_are_not_too_far_apart(m, n):
    assert abs(m - n) > 1

If this test ever fails, with initial values \(m\) and \(n\), and reduction can replace an int with a smaller one but can’t change two ints at once, then reduction will take at least \(\frac{m + n}{2}\) test executions: to keep the test failing, \(|m - n| \le 1\) must hold after every reduction, so the fastest possible path is to repeatedly lower the larger of \(m\) and \(n\) to one below the smaller. Each step therefore reduces \(m + n\) by at most \(2\), and the whole reduction takes at least \(\frac{m + n}{2}\) steps.
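As a sketch of that counting argument: assuming a reducer that greedily lowers the larger of the two values as far as the failing condition \(|m - n| \le 1\) allows (the function here is illustrative, not Hypothesis’s actual shrinker), the number of successful reductions can be counted directly:

```python
def greedy_reduction_steps(m, n):
    """Count the reductions on the fastest path, assuming each step
    may lower only one of the two ints, values stay >= 0, and the
    test must keep failing, i.e. abs(m - n) <= 1 after every step."""
    steps = 0
    while (m, n) != (0, 0):
        # Lower the larger value as far as the failing condition
        # allows: one below the smaller value, but not below zero.
        if m >= n:
            m = max(n - 1, 0)
        else:
            n = max(m - 1, 0)
        steps += 1
    return steps

# Each step reduces m + n by at most 2, so even this fastest
# path needs at least (m + n) / 2 executions.
print(greedy_reduction_steps(10, 10))  # 11 steps for m + n == 20
```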

(Side note: you can get lucky with Hypothesis and trigger some special cases that make this test reduce faster. If you happen to have \(m = n\) then it can reduce the two together, but if you start with \(m = n \pm 1\) then that step will currently never trigger, because the two values are never duplicates at its entry point. Hypothesis will also actually find this bug immediately, because it tries both examples set to zero up front. Trivial modifications to the test avoid these problems, but I’m going to ignore them here.)

The interesting thing about this from a Hypothesis point of view is that \(m + n\) is potentially exponential in \(K\), while the data size is only linear in \(K\), so Hypothesis’s test-case reduction has exponential complexity here (which doesn’t really cause a performance problem in practice, because the number of successful reductions gets capped, but does cause an example quality problem, because the full reduction then never runs). But this isn’t specifically a Hypothesis problem – I’m morally certain every current property-based testing library’s test-case reduction is exponential on this example (except for ones that haven’t implemented reduction at all), possibly modulo one or two patches to avoid trivial special cases like always trying zero first.
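To put rough numbers on the exponential claim – this is just the arithmetic of the \(\frac{m + n}{2}\) bound above, using the \(K = 64\) from the example test:

```python
K = 64  # the bit width from the example test above

# The encoded data size grows linearly in K (two K-bit integers) ...
data_bits = 2 * K  # 128 bits

# ... but the (m + n) / 2 lower bound on test executions is
# exponential in K when m and n are near the top of the range.
worst_case = (2 * (2 ** K - 1)) // 2  # == 2**K - 1

print(data_bits, worst_case)
```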

Another way to get around this is to almost never trigger this test case with large values in the first place: property-based testing libraries will usually only generate an example like this with very small values. But it’s easy for that not to be the case – mutational property-based testing libraries like Hypothesis or Crowbar can in principle find this example for large \(m\) and \(n\) fairly easily (Hypothesis currently doesn’t; I haven’t tried with Crowbar). You could also easily trigger it with distributions that special-case large values.

One thing I want to emphasise is that regardless of the specific nature of the example and our workarounds for it, this sort of problem is inherent. It’s easy to make patches that avoid this particular example (especially in Hypothesis which has no problem making simultaneous changes to \(m\) and \(n\)).

But if you fix this one, I can just construct another that is tailored to break whatever heuristic you came up with. Test-case reduction is a local search method in an exponentially large space, and this sort of problem is just what happens when you block off all of the large changes your local search method tries to make but still allow some of the small ones.

You can basically keep dangling the carrot in front of the test-case reducer – “maybe after this reduction you’ll be done” – and you can keep doing that indefinitely because of the size of the space. Pathological examples like this are not weird special cases; if anything, the weird thing is that most examples are not pathological like this.

My suspicion is that we don’t see this problem cropping up much in practice for a few reasons:

  1. Existing property-based testing libraries are very biased towards finding small examples in the first place, so even when we hit cases with pathological complexity, \(n\) is so small that it doesn’t actually matter.
  2. This sort of boundary case relies on what are essentially “weird coincidences” in the test case. They happen when small local changes unlock other small local changes that were previously blocked. This requires subtle dependency between different parts of the test case, and where that subtle dependency exists, we almost never find it. Thus the fact that we are not hitting exponential slow-downs in our test-case reduction on a regular basis may actually be a sign that there are entire classes of bug that we are just never finding, because the probability of hitting the required dependency combination is too low.
  3. It may also be that bugs just typically do not tend to have that kind of sensitive dependency. My suspicion is that this is not the case, given the prevalence of off-by-one errors.
  4. It may also be that people are hitting this sort of problem in practice and aren’t telling us because they don’t care that much about the performance of test case reduction or example quality.
This entry was posted in Hypothesis, Python.