Category Archives: Hypothesis

Revising some thoughts on test driven development

Epistemic status: Still thinking this through. This is a collection of thoughts, not an advocacy piece.

I’ve previously been pretty against TDD. It is possible that this has always been based on a straw understanding of what TDD is supposed to be for, but if so that is a straw understanding shared by a number of people who have tried to sell me on it.

I am currently moving towards a more nuanced position of “I still don’t think TDD is especially useful in most cases but there are some cases where it’s really amazingly helpful”.

Part of the source of my dislike of TDD has, I think, come from underlying philosophical differences. A test suite has two major purposes:

  1. It helps you prevent bugs in your code
  2. It acts as executable documentation for your code

As far as I am concerned, the important one is the first. People who think their test suite is a good substitute for documentation are wrong. Your code is not self-documenting. If you haven’t written actual for reals documentation, your code is not documented no matter how good your test suite is.

And my belief has always been and remains that TDD is actively harmful for using a test suite as the first purpose. Good testing is adversarial, and the number one obstacle to good testing (other than “not testing in the first place”) is encoding the same assumptions in your tests as in your code. TDD couples writing the tests and the code so closely that you can’t help but encode the same assumptions in them, even if it forces you to think about those assumptions more clearly.

I am aware of the counter-argument that TDD is good because it ensures your code is better tested than it otherwise would be. I consider this to be true but irrelevant, because mandating 100% coverage has the same property but forces you to maintain a significantly higher standard of testing.

So if TDD is harmful for the purpose of testing that matters, it must be at best useless and at worst harmful, right?

As far as I’m concerned, right. If your goal is a well tested code base, TDD is not a tool that I believe will help you get there. Use coverage instead.

But it turns out that there are benefits to TDD that have absolutely nothing to do with testing. If you think of TDD purely as a tool of thought for design, rather than as a testing practice, then it can be quite useful. You then still have to ensure your code is well tested, but as long as you don’t pretend that TDD gets you there, there’s nothing stopping you from using it along the way.

Using tests to drive the design of your API lets you treat the computer as an external brain, and provides you with a tool of thought that forces you to think about how people will use your code and design it accordingly.

The way I finally arrived at realising this was via two related design tools I have recently found very useful:

  1. Writing documentation and fixing the bits that embarrassed you when you had to explain them
  2. Making liberal use of aspirational examples. Starting a design from “Wouldn’t it be great if this worked?” and seeing if you can make it work.

TDD turns out to be a great way of combining both of these things in an executable (and thus checkable) format:

  1. Tests as executable documentation may not be a valid substitute for real documentation, but they happily fill the same niche of embarrassing you when your API is terrible, by forcing you to look at how people will use it.
  2. A test is literally an executable aspirational example. You start from “Wouldn’t it be great if this test passed?” and then write code to make the test pass.

When designing new segments of API where I’ve got the details roughly together in my head but am not quite clear on the specifics of how they should all fit together, I’ve found that sketching them out as tests can be very clarifying. This results in a workflow that looks close to, but not exactly like, classic TDD.

The workflow in question is as follows:

As per classic TDD, work is centered around features. For example, if I was designing a database API, the following might be features:

  1. I can connect to the database
  2. I can close a connection to the database
  3. I can create tables in the database
  4. I can insert data into the database
  5. I can select data from the database

Most of these are likely to be a single function. Some of them are probably two or three. The point is that as with classic TDD you’re focusing on features not functions. I think this is a bit more coarse grained than advocated by TDD, but I’ve no idea how TDD as she is spoken differs from TDD as described.

Working on an individual feature involves the following:

  1. Start from code: define all the types and functions you think you’ll need for this feature. Every function should just raise some error. InvalidArgument or similar is a good one, but any fatal condition you can reasonably expect to happen when calling that function is fine. If there is really no possible way a function could raise an exception, return some default value like 0, “” or None.
  2. Write lots of tests, not just one, for all the cases where those functions should fail. Most of these tests should pass immediately, because e.g. they’re asserting that your function raises InvalidArgument when you don’t specify a database to connect to, and your stub considers all arguments to be invalid, so the test is fine! (There’s a short sketch of steps 1 and 2 below.)
  3. For any error condition tests that do not currently pass, modify the code to make them pass. This may require you to flesh out some of your types so as to have actual data.
  4. Now write some tests that shouldn’t error. Again, cover a reasonable range of cases. The goal is to sketch out a whole bunch of example uses of your API.
  5. Now develop until those tests pass. Any edge cases you spot along the way should immediately get their own test.
  6. Now take a long hard look at the tests to see which bits of the API usage are annoying and clunky. Improve the API until it does not embarrass you. This may and probably will require you to revise earlier stages as well, and that’s fine.
  7. If you’re feeling virtuous (I’m often not and leave this to the end) run coverage now and add more tests until you reach 100%. You may find this requires you to change the API and return to step 5.

Apply this to each feature in turn, then apply a final pass of steps 6 and 7 to the thing as a whole.
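
To make steps 1 and 2 concrete, here’s a minimal sketch for the hypothetical database API above. InvalidArgument is the sort of error class I mean; the connect function and the use of pytest are illustrative choices of mine, not part of any particular library:

import pytest


class InvalidArgument(Exception):
    """The error the stub functions raise for arguments they can't yet handle."""


def connect(path=None):
    # Step 1: a stub that considers every call invalid until real logic exists.
    raise InvalidArgument("no database specified")


def test_connect_requires_a_database():
    # Step 2: an error condition test. It passes against the stub, and it
    # should keep passing once connect() is actually implemented.
    with pytest.raises(InvalidArgument):
        connect()


def test_connect_rejects_an_empty_path():
    with pytest.raises(InvalidArgument):
        connect(path="")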

This isn’t very different from a classic TDD workflow. I think the units are more coarsely grained, and the emphasis on testing error conditions first means that you tend to start with tests which are passing and act as constraints that they should stay passing, rather than tests that are failing and which act as drivers to make them pass, but it’s, say, no more than a standard deviation out from what I would expect normal TDD practice to look like.

The emphasis on error conditions is a personal idiosyncrasy. Where by personal idiosyncrasy I mean that I am entirely correct, everyone else is entirely wrong, and for the love of god people, think about your error conditions, please. Starting from a point of “Where should this break?” forces you to think about the edge cases in your design up front, and acts as something of a counterbalance to imagining only the virtuous path and missing bugs that happen when people stray slightly off it as a result.

So far this approach has proven quite helpful in the cases where I’ve used it. I’m definitely not using this for all development, and I wouldn’t want to, but it’s been quite helpful where I need to design a new API from scratch and the requirements are vague enough that it helps to have a tool to help me think them through.


The war cannot be won, yet still we must fight our little battles

I’ve only half jokingly referred to 2015 as the year I declare war on the software industry.

You could make a David and Goliath analogy here, but the thing is that instead of my felling the giant with my plucky shepherd’s weapon, what’s actually going to happen is that Goliath is going to put his hand on my forehead, hold me out at arm’s length, and laugh as I struggle ineffectually against his vastly superior strength and size.

But maybe, just maybe, if I struggle hard enough and push with all my might, I can get Goliath to take a single step back.

Hypothesis was my opening volley, as an attempt to raise the benchmark for quality – if I make it easy enough to find your bugs, maybe you’ll fix them?

As an opening volley, it’s a pretty weak one. Most of the reason software is bad isn’t that it’s too hard to write tests; it’s social – people are so conditioned to bad software at this point that it’s just not that much of a problem to release broken software into the world, because people will use it anyway.

But maybe if we make it easier for the people who care to write quality software on time and on budget, we can start to change the norms. If you can choose between two equally shiny and feature-full pieces of software except that one actually works properly, perhaps you’ll start to care more about software quality, and if your customers start to desert you for the software that actually works, maybe you’ll really start to care about software quality.

Hypothesis alone will never achieve this, but each tool gives Goliath a nudge in the right direction.

Then, tired of the burden of free labour we put on people as an industry, I wrote “it’s OK for your open source software to be a bit shitty”.

Will it change minds? Maybe a few. The responses on the corresponding reddit thread were really a lot higher quality than I would generally expect of reddit. It certainly seemed to help a whole bunch of people who were concerned about the quality of their own open source work, and hopefully it has given people a weapon to defend themselves when dealing with people who feel entitled to their free labour.

Then I wrote the Two Day Manifesto, an attempt to attack the problem from the other end. If the problem is that companies are built on free labour, maybe they could contribute some paid labour back?

Probably not, but again maybe we can nudge Goliath in the right direction.

Because ultimately all of these efforts are mostly there in the hope that I’ll find just the right way to push the giant, or that I will manage to push at the same time as enough other people, and that maybe he will take just a single step back.

And then someone else, or more likely some other group, will make him take another step back.

And over time he will retreat to the horizon, and we will follow, still pushing.

The horizon retreats ever into the distance, but we can look over our shoulders and see how much territory we’ve reclaimed from the giant, and we will redouble our efforts and push further.

And maybe, maybe, one day we will circle the world, and come back to where we started, and he will meet our forces on the other side and discover he no longer has anywhere to go.

But that day is far in the future. We will never see it, nor will the next generation. We will not win this war, and perhaps no one ever will.

But we’re still going to push.


Honey I shrunk the clones: List simplification in Hypothesis

Simplification in Hypothesis is something of a vanity feature for me. I spend significantly more development effort on it than it strictly warrants, because I like nice examples. In reality it doesn’t really matter if the simplest value of a list of integers with at least three duplicates is [0, 0, 0] or [-37, -37, -37] but it matters to me because the latter makes me look bad. (To me. Nobody else cares I suspect)

So I was really pleased that I got simultaneous simplification in as a feature for Hypothesis 1.0. If a list contains a bunch of duplicate values (which, thanks to templating, is easy to check for – all templates are hashable and comparable for equality), then before trying to simplify them individually, Hypothesis tries to simplify them all at once as a batch.

As well as solving the above ugly example, this turns out to be really good for performance when it fires. Even if your test case has nothing to do with duplication, a lot of the time there will be elements in the list whose value fundamentally doesn’t matter. e.g. imagine that all that is needed to trigger a failure is a list of more than 100 elements. The individual elements don’t matter at all. If we happen to have produced an example where we have a list of 100 elements of the same value, we can simplify this 100x as fast as if we had to simplify every individual element.
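
As a sketch of that scenario, using the same 1.0-era API as the dedupe example later in this archive (the test body and the threshold of 100 are purely illustrative):

from hypothesis import given, strategy

# Only the length of the list matters here, so if the generated example happens
# to contain lots of duplicates they can all be simplified in one batch.
@given(strategy([int]))
def test_lists_are_never_this_long(xs):
    assert len(xs) <= 100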

I’m doing a bunch of work on simplification at the moment, and as a result I have lots of tests for example quality of the form “In this circumstance, Hypothesis should produce an example that looks like this”. Some of them had a tendency to get into really pathological performance problems because they’re in exactly this scenario: They have to do a lot of individual simplifications of values in order to find an optimal solution, and this takes a long time. For example, I have a test that says that the list must contain at least 70 elements of size at least ten. This example was deliberately constructed to make some of the most powerful existing simplification passes in Hypothesis cry, but it ended up wreaking havoc on basically all the passes. The goal is that you should get a list of 70 copies of 10 out at the end, and the simplification run should not take more than 5 seconds.

This obviously is going to work well with simultaneous simplification: if the list contains a set of duplicated values greater than 10, simultaneous simplification will move them all to 10 at once.

Unfortunately the chances of getting an interesting amount of duplication for integers larger than 10 are pretty low, so this rarely fires and we usually have to fall back to individual simplification, which ends up taking ages (I don’t actually know how long – I’ve got a hard 5 second timeout on the test that it usually hits, but eyeballing it, it looked like it got about halfway in that time).

So the question is this: Can we design another simplification pass that is designed to deliberately put the list into a state where simultaneous simplification can fire?

On the face of it the answer is obviously yes: You can just change a bunch of elements in the list into duplicates. Then if any of those are also falsifying examples we end up in a state where we can apply simultaneous simplification and race to the finish line.

Naturally there’s a wrinkle. The problem is that simplification in Hypothesis should be designed to make progress towards a goal. Loops aren’t a problem, but things which cause unbounded paths in the simplify graph will cause it to spend a vast amount of time not doing anything very useful until it gets bored of this simplification, declares enough is enough, and gives you whatever it’s got at the time (a lesson I should probably learn from my own creation).

Or, to put it more directly: how can we tell whether a list of cloned elements is actually simpler? If we have [1, 2, 3, 4, 5, 6, 7, 8, 9999999999999999] we’d really much rather clone 1 all over the place than 9999999999999999.

The solution is to extend SearchStrategy with yet another bloody method (it has a default, fortunately), which allows it to test whether one of two templates should be considered strictly simpler than the other. This is a strict partial order, so x is not strictly simpler than x, and it needn’t be the case that for any x and y one of the two is simpler. In general this is intended as a heuristic, so being fast matters more than being highly accurate.

For the particular case of integers the rule is that every positive number is simpler than every negative number, and otherwise a number is simpler if it’s closer to zero.
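
Written out as a standalone function (the real thing is a method on SearchStrategy operating on templates, and treating zero as positive is my reading of the rule):

def strictly_simpler(x, y):
    # Is the integer x strictly simpler than the integer y?
    if x == y:
        return False
    # Every positive number (and zero) is simpler than every negative number.
    if (x >= 0) != (y >= 0):
        return x >= 0
    # Within the same sign, closer to zero is simpler.
    return abs(x) < abs(y)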

We can now use this to implement a cloning strategy which always makes progress towards a simpler list (or produces nothing):

  1. Pick a random element of the list. Call this the original.
  2. Find the indices of every element in the list for which the original is strictly simpler.
  3. If that set is empty, nothing to do here.
  4. Otherwise pick a random subset of those indices (the current choice is that we pick an integer uniformly at random between 1 and the size of the list and then pick that many of those indices. This is not a terribly scientifically chosen approach but seems to work well).
  5. Replace the element at each index we selected with a clone of the original.
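
Putting those steps together, here’s a rough sketch of the pass, operating directly on a plain Python list and reusing the strictly_simpler sketch from above. The real implementation works on templates rather than raw values, and I’m just using the random module here for illustration:

import random


def clone_simpler_elements(xs, strictly_simpler):
    xs = list(xs)
    if not xs:
        return xs
    # 1. Pick a random element of the list to act as the original.
    original = random.choice(xs)
    # 2. Find the indices of every element the original is strictly simpler than.
    candidates = [i for i, x in enumerate(xs) if strictly_simpler(original, x)]
    # 3. If that set is empty, there's nothing to do here.
    if not candidates:
        return xs
    # 4. Pick a random subset: an integer uniformly between 1 and len(xs),
    #    capped at the number of candidate indices actually available.
    n = min(random.randint(1, len(xs)), len(candidates))
    chosen = random.sample(candidates, n)
    # 5. Replace the element at each chosen index with a clone of the original.
    for i in chosen:
        xs[i] = original
    return xs

On the [1, 2, 3, 4, 5, 6, 7, 8, 9999999999999999] example above, a lucky pick of 1 as the original lets this replace 9999999999999999 (and anything else 1 beats) with a clone of 1 in a single step, after which simultaneous simplification can take over.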

Empirically this seems to do a very good job of solving the particular example it was designed to solve and does not harm the remaining cases too much (I also have an example which deliberately throws away duplicates. Believe it or not this sped that example up).

It’s unclear how useful this is in real practical tests, but at the very least it shouldn’t hurt, and it appears that it will often help. The list generation code is fairly fundamental, in that most other collection generation code (and later on, stateful testing) is built on it, so it’s worth putting in this sort of effort.


Hypothesis 1.0 is out

The following is the release announcement I sent to the testing-in-python mailing list and the main Python mailing list:

Hypothesis is a Python library for turning unit tests into generative tests, covering a far wider range of cases than you can manually. Rather than just testing for the things you already know about, Hypothesis goes out and actively hunts for bugs in your code. It usually finds them, and when it does it gives you simple and easy to read examples to demonstrate.

Hypothesis is based on Quickcheck (https://wiki.haskell.org/Introduction_to_QuickCheck2) but is designed to have a naturally Pythonic API and integrate well with Python testing libraries.

It’s easy to use, extremely solid, and probably more devious than you are at finding edge cases.

The 1.0 release of Hypothesis has a stable and well documented public API. It’s more than ready for you to use and it’s easy to get started.

Full documentation is available at http://hypothesis.readthedocs.org/en/latest/, or if you prefer you can skip straight to the quick start guide: http://hypothesis.readthedocs.org/en/latest/quickstart.html

It seems to be a hit – less on the mailing list, but the link to it is doing the rounds of the internet and has the dubious honour of being the first time in ages I’ve been on the front page of Hacker News.


A sketch of a new simplify API for Hypothesis

One of the decisions I’m quite happy I made is that the SearchStrategy API in Hypothesis is not public for 1.0. This gives me some freedom to break it when I figure out ways to make it better.

Which is good, because I’ve already figured out a way to break it to make it better.

In my fuzzer for yapf, I realised an interesting thing: The minimization happens in two phases, first line based, then character based. Once character based simplification has started we stop bothering with line based simplification altogether: Although in principle character based simplification could get us back into a state where line based simplification would be useful again, in practice this happens so rarely that it’s basically never worth going back to line based simplification.

Meanwhile, back in Hypothesis land, I was staring at the following quality test this morning:

from hypothesis import given, assume, strategy

def dedupe(xs):
    result = []
    for x in xs:
        if x not in result:
            result.append(x)
    return result

@given(strategy([bytes]).map(dedupe))
def is_sorted(xs):
    assume(len(xs) >= 10)
    assert sorted(xs) == xs

This is an example that is designed to fail and mostly exists to test the quality of the minimization: What’s the simplest list of byte strings it can find such that:

  • The list only has distinct elements
  • The list has at least 10 elements
  • The list is not sorted

It’s not hard to find such an example – generally speaking Hypothesis will find an example within the first couple of throws of the dice – but producing a good example is harder, which is what this test looks for (In the actual test we make a bunch of assertions about the quality of the result we got at the end).

The problem is that sometimes this can get into a bad state where it takes an extraordinarily long time to minimize.

The problem is basically this: Every time it successfully minimizes an example, the simplification process starts again from scratch on the new example. This means that it tries all the usual simplifications for deleting elements from the list. These are a complete waste of time: the list is already down at the minimum of 10 elements, so removing any element violates the size assumption. But every time we shrink an element (and there are 10 elements to shrink) we do it all again.

I was originally thinking of trying to fix this in terms of making simplify skip simplifications which fail to produce examples too often, but that’s not really very satisfactory and requires a lot of delicate state tracking. It might be worth it in some cases, but I think it probably wouldn’t be.

Then I realised that this problem fits entirely into the approach I mentioned using for yapf, and we can use it to fix this problem.

Right now Hypothesis essentially follows the classic Quickcheck API, where there is a function shrink which takes a value and provides a generator over simpler versions of it. Simplified, the Hypothesis example finding algorithm looks like:

def find_minimal(x, f):
    for s in shrink(x):
        if f(s):
            return find_minimal(s, f)
    return x

(that’s not actual Hypothesis code. The real implementation is more complex in several ways and doesn’t use recursion because Python, but it’s the idea of it).

The idea is to instead change this so that there are multiple shrinkers, each behaving like our shrink function above, and then we apply each shrinker in turn, never returning to an earlier one. This would be something akin to:

def minimize_with_shrinker(x, f, shrinker):
    for s in shrinker(x):
        if f(s):
            return minimize_with_shrinker(s, f, shrinker)
    return x

def find_minimal(x, f):
    for shrinker in shrinkers:
        x = minimize_with_shrinker(x, f, shrinker)
    return x

This lets us fix the list example by splitting up shrinking into two phases: First it tries to shrink by removing elements from the list. Then it tries to shrink by shrinking individual elements. Once it starts shrinking elements, the size of the list is fixed – it’s already found the smallest list it’s going to.
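
In terms of the algorithm above, the two list phases might be sketched as two shrinkers applied in that order (shrink_int is a throwaway per-element shrinker of my own, just to keep the sketch self-contained):

def shrink_int(x):
    # A throwaway per-element shrinker: move towards zero.
    if x:
        yield 0
        if x // 2 not in (0, x):
            yield x // 2


def remove_elements(xs):
    # Phase one: try dropping each element in turn.
    for i in range(len(xs)):
        yield xs[:i] + xs[i + 1:]


def shrink_elements(xs):
    # Phase two: the length is now fixed, so only shrink individual elements.
    for i, x in enumerate(xs):
        for smaller in shrink_int(x):
            yield xs[:i] + [smaller] + xs[i + 1:]


shrinkers = [remove_elements, shrink_elements]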

This does impact the quality of the examples a little bit. If necessary that could be fixed by iterating to a fixed point: We run the above algorithm, then run it again, and keep going until it no longer changes the result. A simpler version would just be to run it twice. I don’t plan to do this unless I find the results from a single pass are much worse than I expect them to be. The goal here isn’t always to find the single simplest example, only to find a simple enough example in a short amount of time. I think the quality will be entirely acceptable, and the time taken should compare very favourably with the existing implementation.
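
The fixed point version would then just be a loop around find_minimal from above (a sketch; it assumes examples compare equal when nothing has changed):

def find_minimal_to_fixed_point(x, f):
    # Keep re-running the full sequence of shrinkers until a pass makes no progress.
    while True:
        shrunk = find_minimal(x, f)
        if shrunk == x:
            return x
        x = shrunk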
