A whirlwind tour of the Hypothesis build

I’m so used to it at this point that I occasionally forget that one of the innovative things about Hypothesis is its ludicrously complicated build setup. Some of it is almost certainly better than current common practice, so I thought I’d do a quick post covering some of the highlights.

  1. Everything is run on Travis, but I don’t use Travis’s Python support. Instead I manage a set of Python builds myself using pyenv’s installer code (I don’t actually use pyenv other than the installer). This is partly because I have OSX builders turned on (thanks, Travis!) and those don’t support Python, but it also ensures that I have an up-to-date release of every Python version.
  2. I also have a rather less elaborate version of the build running on Appveyor to do some basic Windows testing.
  3. On Travis, the entire process is driven by a combination of tox and make.
  4. I use Travis caching quite heavily to ensure that managing my own Python installs is fast – most builds do not need to install Python because they have it installed from a previous run. The same is also true of building numpy wheels (though I’ve discovered recently I missed a bit).
  5. I only run coverage on one build step, which runs on a single Python version (3.5, currently). This runs a fast subset of the tests for coverage then fails the build if it gets less than 100% coverage. This excludes my compatibility layer (compat.py) and a handful of other lines marked with pragma. The assumption is that if I’ve got coverage on 3.5 that’s probably telling me enough and isn’t worth the build time of trying on other Python versions (this probably does mean I have less than 100% coverage on Python 2, but so far this has never caused a problem).
  6. Note: Hypothesis lives in a src/ directory so that it’s deliberately not on the path when running tox. This is the correct thing to do, because it means you’re definitely testing the installed version of the library. It also means you need some custom coverage config.
  7. I have a custom script to enforce a standardized header for all my Python files. The script is a bit hacky, but it works quite well.
  8. I have a check-format build step. This applies a number of formatting operations, including pyformat, isort, and the aforementioned header script. It then runs git diff --exit-code to assert that these formatting operations did not make any changes to the code. It also runs flake8 (this could reasonably be part of a different step).
  9. For each dependency (all optional in Hypothesis’s case) I have a separate tox step that tests every supported version of that dependency other than the latest (each minor version if I trust them, each patch version if I don’t). These all run on a single version of Python (3.5, currently). The same tests also run against the latest version of each dependency in the per-Python builders. I don’t run non-latest versions against every Python, in an attempt to keep the combinatorics under control.
  10. With this many tests of a randomized API that has a timeout built into it, it’s hard to avoid some flakiness – previously there were a few tests that would harmlessly fail on maybe one builder every couple of runs. This added up to a quite unreliable build. I’ve recently been using the flaky library to mitigate that – a handful of tests are decorated with @flaky to allow them to be rerun if they fail. This has been invaluable for getting a reliably green build.
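To make items 7 and 8 a little more concrete, here is a minimal sketch of the idea behind a header-enforcing script and the "did formatting change anything?" check. It is not the actual Hypothesis script: the header text and the names apply_header and needs_reformat are made up for illustration. It also assumes the existing header is simply the leading run of comment and blank lines, which means it would clobber a shebang line.

```python
# Placeholder header: the real wording used by Hypothesis is not shown in this post.
HEADER = """\
# coding=utf-8
#
# This file is part of an example project.
# (Hypothetical header text, for illustration only.)
"""


def apply_header(source: str, header: str = HEADER) -> str:
    """Return source with its leading comment block replaced by header."""
    lines = source.splitlines()
    # Skip the existing leading run of comment lines and blank lines.
    start = 0
    while start < len(lines) and (
        lines[start].startswith("#") or not lines[start].strip()
    ):
        start += 1
    body = "\n".join(lines[start:])
    result = header.rstrip("\n") + "\n\n" + body
    # Normalize to exactly one trailing newline.
    return result.rstrip("\n") + "\n"


def needs_reformat(source: str, header: str = HEADER) -> bool:
    """True if applying the header would change the file.

    This mirrors the idea of running the formatters and then checking
    for a diff: a non-empty diff means the build should fail.
    """
    return apply_header(source, header) != source
```

The important property is that the operation is idempotent: applying it to an already-formatted file is a no-op, which is what lets a "format, then check for a diff" build step pass on clean code.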

The overall result is a bit ridiculous. My current Travis build time is about two and a half hours, varying with the time of day (Travis is slower during US working hours). The actual wait time is less than that because of parallel builders, but it’s still not short.

I think it’s mostly worth it though. The overall results give me an amazing amount of confidence in the code. Hypothesis definitely isn’t bug-free (bug-free code basically doesn’t exist outside of safety-critical industries), but it’s generally regression-free code and I tend to find more bugs than Hypothesis users do.
