Reevaluating some testing philosophy

Over the past year or so I’ve started to have serious doubts about some of my previous attitudes on testing. I still think it’s good, I still don’t particularly believe in TDD, but I also think some of my previous approaches and opinions are a bit misguided.

This is about one such case that I encountered today, which is causing me to re-evaluate some things. I previously held the following two opinions strongly:

  1. You should test to the public API rather than having your tests depend on internals.
  2. Randomized quickcheck style testing is awesome. Although you probably want to turn failing quickcheck tests into deterministic tests, and sometimes it’s worth writing tests that are hard to write in this manner, quickcheck will probably do a better job of testing your code than you will.

These two stances turn out to be in conflict. It’s not impossible to reconcile them, but it requires a significant amount of work. I think that amount of work might be worth doing because it makes your code better, but it’s worth keeping an eye on.

As I’ve mentioned previously, my testing strategy for intmap is as follows:

  1. I have a randomized test case generator that models the behaviour the library should have and generates test cases to check that it does.
  2. I have a set of deterministic tests that run before the randomized ones. The main benefit of these is they’re reproducible and a hell of a lot faster. They’re mostly extracted from failing randomized test cases.
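As a rough illustration of that two-part strategy, here is a minimal Python sketch. It is not intmap's actual test suite: `SortedListMap` is a hypothetical stand-in for the code under test, a plain `dict` plays the role of the model, and the operation names are made up for the example. Test cases are plain lists of operations, so a failing random case can be pasted in as a deterministic one.

```python
import bisect
import random


class SortedListMap:
    """Hypothetical code under test: an int -> value map kept in sorted lists."""

    def __init__(self):
        self.keys = []
        self.values = []

    def insert(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value  # overwrite an existing key
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def delete(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            del self.keys[i]
            del self.values[i]

    def items(self):
        return list(zip(self.keys, self.values))


def check(ops):
    """Replay ops against the implementation and a dict model, comparing after each step."""
    impl, model = SortedListMap(), {}
    for op, key, value in ops:
        if op == "insert":
            impl.insert(key, value)
            model[key] = value
        else:
            impl.delete(key)
            model.pop(key, None)
        assert impl.items() == sorted(model.items()), ops


def random_ops(rng, length=50):
    """Generate a random test case; small key range so that keys collide often."""
    return [
        (rng.choice(["insert", "delete"]), rng.randint(0, 20), rng.randint(0, 1000))
        for _ in range(length)
    ]


# Deterministic cases run first: fast and reproducible.
check([("insert", 1, 10), ("insert", 1, 11), ("delete", 1, 0)])

# Randomized cases run afterwards; the seed identifies any failing case.
for seed in range(200):
    check(random_ops(random.Random(seed)))
```

Keeping the test case as data rather than as a sequence of calls is the design choice that makes the "extract a failing random case into a deterministic test" step cheap.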

Today I was writing some exceedingly trivial performance tests so that I could do some profiling and testing of whether some of the optimisations I was performing were actually wins (at least in these benchmarks the answer is that some of them really are, some of them really aren’t). Then I wrote a new benchmark and it segfaulted. Given that the test coverage was supposed to be pretty comprehensive, and the test suite was passing, this was pretty disappointing.

How did this happen?

Well the proximate cause of the segfault was that my allocator code was super buggy because I’ve never written an allocator before and it turns out that writing allocators is harder than I thought. If you’re interested in the specifics, here is the commit that fixes it. But why didn’t the tests catch it?

Randomized testing essentially relies on two things in order to work.

  1. It has a not-tiny probability of triggering any given bug.
  2. It runs enough times that this not-tiny probability is inflated to a significant probability of triggering the bug.

What does this end probability of triggering a bug look like? Let’s do some maths!

Define:

  • q(b) is the probability of a single run triggering bug b
  • t is the average amount of time it takes for a run
  • T is the total amount of time we have to devote to running randomized tests
  • p(b) is the probability of a set of test runs finding the bug.

If we have T time and each run takes t, then we make approximately \(\frac{T}{t}\) runs (this isn’t really right, but assume low variance in the length of each run). Then the probability of finding this bug is \(p(b) = 1 - (1 - q(b))^{\frac{T}{t}} \approx q(b) \frac{T}{t}\), where the approximation holds as long as \(q(b) \frac{T}{t}\) is small.

This formula basically shows the two reasons why testing only the public API is sometimes going to produce much worse results than testing internal APIs.

The first is simple: In most cases (and very much in this one), testing the public API is going to be slower than just testing the internal API. Why? Well, because it’s doing a lot of stuff that isn’t in the internal API. Not doing stuff is faster than doing stuff. If the difference isn’t huge, this doesn’t matter, but in my case I was doing a phenomenal amount of book-keeping and unrelated stuff, so the number of iterations I was performing was much lower than it would have been if I’d just been testing the allocator directly.
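To put rough numbers on this, here is a small Python sketch of the formula above. The values chosen for q(b), T and t are made up purely for illustration; they are not measurements from intmap.

```python
def detection_probability(q, total_time, time_per_run):
    """p(b) = 1 - (1 - q(b))**(T/t), assuming independent runs of equal length."""
    runs = total_time / time_per_run
    return 1.0 - (1.0 - q) ** runs


def linear_approximation(q, total_time, time_per_run):
    """q(b) * T/t, which is only a good estimate while the result is small."""
    return q * total_time / time_per_run


# Illustrative assumptions: a one-in-a-million bug and a ten minute budget.
q = 1e-6
T = 600.0

# Going through the public API: 10ms per run, so about 60,000 runs.
slow = detection_probability(q, T, 0.01)

# Hitting the internal API directly: 0.1ms per run, so about 6,000,000 runs.
fast = detection_probability(q, T, 0.0001)

print(f"public API:   {slow:.4f}")
print(f"internal API: {fast:.4f}")
```

With these made-up numbers the slow runs find the bug roughly 6% of the time, while the fast runs find it more than 99% of the time, even though q(b) is identical in both cases.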

The second is somewhat more subtle: Testing the public API may substantially reduce \(q(b)\). If it reduces it to 0 and your testing coverage is good enough that it can trigger any conceivable usage of the public API, who cares? A bug in the internal API that can never be triggered by the public API is a non-issue. The danger case is when it reduces it to something small enough that the bug probably won’t be caught in your testing, because things which aren’t caught in testing but are not impossible will almost certainly happen in production. In the most harmless case, your users will basically fuzz-test your API for you by throwing data you never expected at it; in the most harmful case, your users are actively adversarial and are looking for exploits.

How does this happen?

Essentially it happens because \(q(b)\) depends intimately on both b and the shape of your distribution. The space of all valid examples is effectively infinite (in reality it’s limited by computer memory, but it’s conceptually infinite), which means that it’s impossible to have a uniform distribution, which means that your distribution is going to be peaky – it’s going to cluster around certain smaller regions with high probability.

In fact, this peakiness is not just inevitable but desirable, because some regions are going to be more buggy than others: If you’ve tested what happens on a few thousand random integers between 0 and a billion, testing a few thousand more random ones is probably not going to be very useful. But you probably want to make sure you test 0 and 1 too, because they’re boundary cases and thus more likely to trigger bugs.

So this is what you basically want to do with randomized testing: Arrange it so that you have lots of peaks in places that are more likely to trigger bugs. Most randomized testing does this by basically just generating tests that cluster around edge cases with relatively high probability.
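A deliberately peaky generator might look something like the following Python sketch. The particular boundary values and the 20% edge probability are assumptions picked for the example, not anything taken from a real quickcheck implementation.

```python
import random

# Hypothetical boundary values for a signed 32-bit integer input.
INT_MIN = -2**31
INT_MAX = 2**31 - 1
BOUNDARY_VALUES = [0, 1, -1, INT_MIN, INT_MAX]


def peaky_int(rng, edge_probability=0.2):
    """Mostly uniform over the 32-bit range, with extra weight on boundary values."""
    if rng.random() < edge_probability:
        return rng.choice(BOUNDARY_VALUES)
    return rng.randint(INT_MIN, INT_MAX)


rng = random.Random(0)
samples = [peaky_int(rng) for _ in range(10000)]

# A uniform draw over 2**32 values would almost never produce 0;
# the peaky generator produces it several hundred times.
print(samples.count(0))
```

The uniform part still matters, because it is what catches the bugs you did not think to put a peak on.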

The problem is that edge cases in your public API don’t necessarily translate into edge cases in your private API. In my case, I was doing lots of intmap unions and intersections, which is really good for triggering edge cases in the basic intmap logic, but this was mostly just translating into really very dull uses of the allocator – it rarely created a new pool and mostly just shuffled stuff back and forth from the free list.

If I had been testing the allocator directly then I would have tuned a test generator that exercised it more thoroughly – by not restricting myself to the sort of allocation patterns I can easily generate from intmaps I would have found these interesting bugs much sooner.
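For a sense of what testing an allocator directly could look like, here is a Python sketch. `PoolAllocator` is a toy stand-in written for this example and has nothing to do with the real intmap allocator, which is not shown in this post. The point is the generator: with a tiny pool size and a bias towards allocating, it forces new pools to be created constantly instead of just recycling the free list.

```python
import random


class PoolAllocator:
    """Toy pooled allocator: hands out integer slot ids, growing one pool at a time."""

    def __init__(self, pool_size):
        self.pool_size = pool_size
        self.pools_created = 0
        self.free_list = []

    def _grow(self):
        # Add a fresh pool of slots to the free list.
        start = self.pools_created * self.pool_size
        self.free_list.extend(range(start, start + self.pool_size))
        self.pools_created += 1

    def alloc(self):
        if not self.free_list:
            self._grow()
        return self.free_list.pop()

    def free(self, slot):
        self.free_list.append(slot)


def run_allocator_test(seed, steps=1000, pool_size=4, alloc_bias=0.7):
    """Random alloc/free sequence checked against a model of the live slots."""
    rng = random.Random(seed)
    allocator = PoolAllocator(pool_size)
    live = set()  # model: slots currently handed out
    for _ in range(steps):
        if not live or rng.random() < alloc_bias:
            slot = allocator.alloc()
            assert slot not in live, "slot handed out twice"
            live.add(slot)
        else:
            slot = rng.choice(sorted(live))
            live.remove(slot)
            allocator.free(slot)
    return allocator.pools_created


for seed in range(20):
    pools = run_allocator_test(seed)
    assert pools > 10  # the run really did exercise pool creation
```

Checking `pools_created` at the end is a cheap way of confirming that the generator is actually reaching the part of the code it was tuned for.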

In the short-term I’ve solved this by simply writing some deterministic tests for exercising the allocator a bit better.

In the long-term though I think the solution is clear: the allocator needs to be treated in every way as if it were a public API. It may not really be public – its intent and optimisations are so tailored I’m not expecting it to be useful to anyone else – but any bugs lurking in it are going to eventually make their way into the public API, and if I don’t test it directly, hard-to-trigger ones are just going to lurk undiscovered until the worst possible moment.

Fortunately I’d already factored out the pool code into its own thing. I hadn’t done this for any especially compelling reasons – it’s just that the code was already getting quite long and I wanted to break it up into separate files – but it’s going to be very useful now. Because this is the sort of thing you need to do in order to reconcile my original two beliefs: factor any code you want to test on its own out into its own library. This is generally a good design principle anyway.

Does this mean that the two principles are compatible after all as long as you’re writing good code in the first place? Well… kinda. But only if you define “good code” as “code that doesn’t have any internal-only APIs”. At this point the first principle is satisfied vacuously – you’re not testing your internal APIs because you don’t have any. I’m not sure that’s wrong, but it feels a bit extreme, and I think it only works because I’ve changed the definition of what an internal API looks like.


2 thoughts on “Reevaluating some testing philosophy”

  1. Franklin Chen

    I really enjoyed your report. It inspired me to revisit some issues that I’d put aside for over a decade concerning the notions of “public” and “private” in APIs, programming languages, and just the way we think about systems and access. I think public and private should be thought of much more in relative terms.

  2. pozorvlak

    Reductio ad absurdum: if testing the internals was never a good idea, we’d only write system-level tests for our programs and never unit tests. I’ve worked at shops that did this; it’s a really bad idea, not least because it makes your test suite far too frickin’ slow.
