New improved development experience for Hypothesis

As part of my drive to make Hypothesis more of a community project, one of the things I need to do is to ensure it’s easy for new people to pick up, and easy to use for people whose environment differs from mine.

There are a few consistent sources of issues people have with Hypothesis development:

  1. It requires a lot of different versions of Python to be available. I use pyenv heavily, so this hasn’t been a major problem for me, but other people don’t, so they’re less likely to have, say, both Python 3.5 and Python 3.4 installed (some build tasks require one, some the other).
  2. A full test run of Hypothesis takes a very long time. If you don’t parallelise it, it’s in the region of 2 hours.
  3. Some of the build steps are very counterintuitive in their behaviour – e.g. “tox -e lint” runs a mix of linting and formatting operations and then errors if you have a git diff. This is perfectly reasonable behaviour when running on CI, but there’s no separate way of getting the formatter to fix your code for you.

Part of the problem in 3 is that tox is a test runner, not a general task runner, and there was no good unified interface to the different tasks you might reasonably want to run.

So I’ve introduced a new unified system which provides a much better developer experience, gives a single interface to all of the normal Hypothesis development tasks, and automates a lot of the issues around managing different versions of Python. Better yet, it’s based on a program which is widely deployed on most developers’ machines, so there’s no bootstrapping issue.

I am, of course, talking about a Makefile.

No, this isn’t some sort of sick joke.

Make is actually pretty great for development automation: It runs shell commands, checks if things are up to date, and expresses dependencies well. It does have some weird syntactic quirks, and writing portable shell isn’t exactly straightforward, but as an end user it’s a pleasure to use.

In particular, because the Makefile can handle installing all of the relevant pythons for you (I shell out to pyenv’s build plugin for this, but don’t otherwise use pyenv), the juggling-many-pythons problem goes away.

Other highlights:

  • ‘make check-fast’ for running a fast subset of the tests
  • ‘make format’ for reformatting your code to the Hypothesis style
  • ‘make check-django’ and ‘make check-pytest’ for testing their respective integrations (there’s also ‘make check-nose’ for checking Hypothesis works under nose, and I never giggle when typing that at all).

You can see the full Makefile here, and CONTRIBUTING.rst documents some of the other common operations.

Here’s an asciinema of it in action:


In praise of incremental approaches to software quality

Do you know what the best design decision I made in Hypothesis was?

It wasn’t well thought out. It wasn’t anything clever or theoretically interesting. It was done purely because I was lazy, not for any principled reason. It’s still the best design decision.

It’s this: Hypothesis does not have its own test runner. It just uses whichever one you’re already using.

This turns out to be a great idea, because it means that there’s basically zero cost to using Hypothesis: Install an extra library, write some code that uses @given to feed some data to a function. Done. Now you’re using Hypothesis.
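
For example, a complete Hypothesis test can look like this (the test itself is made up for illustration, but @given and the strategies module are Hypothesis’s real API, and it runs under whatever test runner you already use):

    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_sorting_is_idempotent(xs):
        # Hypothesis calls this with many generated lists of integers,
        # shrinking any failing example it finds to a minimal one.
        assert sorted(sorted(xs)) == sorted(xs)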

As far as difficult sells go, this barely even registers.

Compare this to, say, adding static analysis to your process (I’ve never got around to adding pylint to Hypothesis, and I care about this stuff, because it gives too many false positives and has never quite been worth the up-front investment), or rewriting everything in Haskell.

It’s certainly possible that both of those would produce enough of a return on investment from their correctness benefits to be worth the up-front cost, but unless you’re really confident that they will, and can afford to take the time to do it right now, they’re things that you can put off basically indefinitely.

Hypothesis on the other hand is so cheap to get started with that you can start benefiting almost immediately, and then incremental amounts more work will produce incremental amounts more improvement.

This turns out to be incredibly valuable if what you actually want is to produce better software.

One of the responses to the economics of software correctness post is that doing everything right up front can be cheaper than the long-term cost of the unprincipled hacks that people often do instead.

This is certainly true. If you are lucky enough to find yourself in a position where you have the option of doing everything right up front, I recommend you make a reasonable effort towards doing so. You’ll probably fail (getting everything right is really rather hard), but in failing you will probably still end up in a better position than if you hadn’t tried.

But, well, this is rarely the situation in which we find ourselves. Even if we got everything right at the beginning, at some point in the project we were rushed, or made a decision that later turned out to be a bad call, or in some other way failed to be perfect, because nobody can be perfect all the time.

If this is not the situation you find yourself in and you are working on a mature, high quality project that has consistently got most things right throughout its lifespan and has fixed the things it does not, good for you. Say hi to the unicorns for me while you’re there.

Most of us are not only not in that situation, but that situation is almost entirely unreachable. Saying that if you had done everything right at the beginning you wouldn’t have this problem is just another way of saying it sucks to be you.

So what we need are not tools that work great if you use them from the beginning, or tools that work great if you embrace them whole-heartedly, but tools that make it easy to get started and easy to move in the right direction.

This means that “rewrite it in Haskell” is out, but “we can write this microservice in Haskell” might be in. It means “Always have 100% branch coverage” is out, but “merges must never decrease branch coverage” is in. Failing the build if you have any static analysis warnings is out, but failing a merge if the touched lines have any static analysis failures is in.

Incremental quality improvements allow you to move your project to a higher-quality state without the huge up-front investment that getting everything right from the beginning requires, and they don’t punish you for not having got everything right before you had the tool to help you do so.

This is of course often a harder problem, but I think it’s one that’s important to solve. If we want higher quality software we don’t get to live in a fantasy world where everything magically works, we have instead to think about how to move from the world we live in to a reachable world in which everything works better, and we need to think about what the steps along the way look like.


The economics of software correctness

This post is loosely based on the first half of my “Finding more bugs with less work” talk for PyCon UK.

You have probably never written a significant piece of correct software.

That’s not a value judgement. It’s certainly not a criticism of your competence. I can say with almost complete confidence that every non-trivial piece of software I have written contains at least one bug. You might have written small libraries that are essentially bug free, but the chances of you having written whole programs which are bug free are tantamount to zero.

I don’t even mean this in some pedantic academic sense. I’m talking about behaviour where if someone spotted it and pointed it out to you, you would probably admit that it’s a bug. It might even be a bug that you cared about.

Why is this?

Well, let’s start with why it’s not: It’s not because we don’t know how to write correct software. We’ve known how to write software that is more or less correct (or at least vastly closer to correct than the norm) for a while now. If you look at the NASA development process, they’re pretty much doing it.

Also, if you look at the NASA development process you will pretty much conclude that we can’t do that. It’s orders of magnitude more work than we ever put into software development. It’s process heavy, laborious, and does not adapt well to changing requirements or tight deadlines.

The problem is not that we don’t know how to write correct software. The problem is that correct software is too expensive.

And “too expensive” doesn’t mean “It will knock 10% off our profit margins, we couldn’t possibly do that”. It means “if our software cost this much to make, nobody would be willing to pay a price we could afford to sell it at”. It may also mean “If our software took this long to make then someone else will release a competing product two years earlier than us, everyone will use that, and when ours comes along nobody will be interested in using it”.

(“sell” and “release” here can mean a variety of things. It can mean that terribly unfashionable behaviour where people give you money and you give them a license to your software. It can mean subscriptions. It can mean ad space. It can even mean paid work. I’m just going to keep saying sell and release).

NASA can do it because when they introduce a software bug they potentially lose some combination of billions of dollars, years of work and many lives. When that’s the cost of a bug, spending that much time and money on correctness seems like a great deal. Safety-critical industries like medical technology and aviation can do it for similar reasons (buggy medical technology kills people, and you don’t want your engines power-cycling themselves mid-flight).

The rest of us aren’t writing safety critical software, and as a result people aren’t willing to pay for that level of correctness.

So the result is that we write software with bugs in it, and we adopt a much cheaper software testing methodology: We ship it and see what happens. Inevitably some user will find a bug in our software. Probably many users will find many bugs in our software.

And this means that we’re turning our users into our QA department.

Which, to be clear, is fine. Users have stated the price that they’re willing to pay, and that price does not include correctness, so they’re getting software that is not correct. I think we all feel bad about shipping buggy software, so let me emphasise this here: Buggy software is not a moral failing. The option to ship correct software is simply not on the table, so why on earth should we feel bad about not taking it?

But in another sense, turning our users into a QA department is a terrible idea.

Why? Because users are not actually good at QA. QA is a complicated professional skill which very few people can do well. Even skilled developers often don’t know how to write a good bug report. How can we possibly expect our users to?

The result is long and frustrating conversations with users in which you try to determine whether what they’re seeing is actually a bug or a misunderstanding (although treating misunderstandings as bugs is a good idea too), trying to figure out what the actual bug is, etc. It’s a time consuming process which ends up annoying the user and taking up a lot of expensive time from developers and customer support.

And that’s of course if the users tell you at all. Some users will just try your software, decide it doesn’t work, and go away without ever saying anything to you. This is particularly bad for software where you can’t easily tell who is using it.

Also, some of our users are actually adversaries. They’re not only not going to tell you about bugs they find, they’re going to actively try to keep you from finding out because they’re using it to steal money and/or data from you.

So this is the problem with shipping buggy software: Bugs found by users are more expensive than bugs found before a user sees them. Bugs found by users may result in lost users, lost time and theft. These all hurt the bottom line.

At the same time, your users are a lot more effective at finding bugs than you are, due to sheer numbers if nothing else, and as we’ve established it’s basically impossible to ship fully correct software, so we end up choosing some acceptable defect rate in the middle. This is generally determined by the point at which it is more expensive to find the next bug yourself than it is to let your users find it. At any higher or lower defect rate you could just adjust your development process and make more money, and companies like making money, so if they’re competently run they will generally do the things that cause them to do so.
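
To spell that equilibrium out (my notation, nothing from the original post): write c_find(n) for the cost of finding and fixing your n-th bug in-house, and c_user(n) for the expected cost of letting users hit it. A competently run business keeps hunting bugs while that’s the cheaper option, and ships at the first point where it isn’t:

    \text{fix the next bug while } c_{\mathrm{find}}(n) < c_{\mathrm{user}}(n),
    \qquad
    n^{*} = \min\{\, n : c_{\mathrm{find}}(n) \ge c_{\mathrm{user}}(n) \,\}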

This means that there are only two viable ways to improve software quality:

  1. Make users angrier about bugs
  2. Make it cheaper to find bugs

I think making users angrier about bugs is a good idea and I wish people cared more about software quality, but as a business plan it’s a bit of a rubbish one. It creates higher quality software by making it more expensive to write software.

Making it cheaper to find bugs though… that’s a good one, because it increases the quality of the software by increasing your profit margins. Literally everyone wins: The developers win, the users win, the business’s owners win.

And so this is the lever we get to pull to change the world: If you want better software, make or find tools that reduce the effort of finding bugs.

Obviously I think Hypothesis is an example of this, but it’s neither the only one nor the only one you need. Better monitoring is another. Code review processes. Static analysis. Improved communication. There are many more.

But one thing that won’t improve your ability to find bugs is feeling bad about yourself and trying really hard to write correct software then feeling guilty when you fail. This seems to be the current standard, and it’s deeply counter-productive. You can’t fix systemic issues with individual action, and the only way to ship better software is to change the economics to make it viable to do so.

Edit to add: In this piece, Itamar points out that another way of making it cheaper to find bugs is to reduce the cost of when your users do find them. I think this is an excellent point which I didn’t adequately cover here, though I don’t think it changes my basic point.


Mergeable compressed lists of integers

Alexander Shorin’s work on more configurable unicode generation in Hypothesis has to do some interesting slicing of ranges of unicode categories. In particular, supporting both generation and shrinking required either two distinct representations of the data or something clever. Fortunately I’d figured out the details of the sort of data structure that would let you do the clever thing a while ago, so it was just a matter of putting the pieces together.

The result is an interesting purely functional data structure based on Okasaki and Gill’s “Fast Mergeable Integer Maps”. I’m not totally sure we’ll end up using it, but the data structure is still interesting in its own right.

The original data structure, which is the basis of Data.IntMap in Haskell, is essentially a patricia trie treating fixed size machine words as strings of 0s and 1s (effectively a crit-bit trie). It’s used for implementing immutable mappings of integers with fast operations on them (O(log(n)) insert, good expected complexity on union).
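
To make the “strings of 0s and 1s” idea concrete: the core operation of the patricia trie is finding the most significant bit on which two keys disagree, which becomes the branch point. A sketch in Python (my illustration of the standard big-endian variant, assuming keys fit in 64 bits – not code from Hypothesis or Data.IntMap):

    def branch_mask(a, b):
        # Isolate the most significant bit at which a and b differ; keys
        # with that bit clear belong in the left subtree, set in the right.
        x = a ^ b
        # Smear the highest set bit downwards...
        x |= x >> 1
        x |= x >> 2
        x |= x >> 4
        x |= x >> 8
        x |= x >> 16
        x |= x >> 32
        # ...then keep only that bit.
        return x ^ (x >> 1)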

With some small twists on the data structure you can do some interesting things with it.

  1. Ditch the values (i.e. we’re just representing sets)
  2. Instead of tips being a single key, tips are a range of keys start <= x < end.
  3. Split nodes are annotated with their size and the smallest interval [start, end) containing them (see the sketch below).
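
Concretely, the resulting node types might look something like this (an illustrative Python sketch of the twists above, not the actual Hypothesis code):

    from dataclasses import dataclass
    from typing import Union

    @dataclass(frozen=True)
    class Interval:
        # A tip: the contiguous range of keys start <= x < end.
        start: int
        end: int

        @property
        def size(self):
            return self.end - self.start

    @dataclass(frozen=True)
    class Split:
        # An internal node, annotated with its total size and with the
        # smallest interval [start, end) containing all of its elements.
        start: int
        end: int
        size: int
        left: "Node"
        right: "Node"

    Node = Union[Interval, Split]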

When using this to represent sets of unicode letters this is extremely helpful – most of the time we’re just removing one or two categories, or restricting the range, which results in a relatively small number of intervals covering a very large number of codepoints.

Let T be the number of intervals and W the word size. The data structure has the following nice properties:

  1. Getting the size of a set is O(1) (because everything is size annotated or can have its size calculated with a single arithmetic operation)
  2. Indexing to an element in sorted order is O(log(T)), because you can use the size annotations to index directly into the tree – when indexing into a split node, check the size of the left and right subtrees and choose which one to recurse into (see the sketch after this list).
  3. The tree can automatically collapse to intervals in many cases, because a split node is equivalent to an interval if end = start + size, which is a cheap O(1) check
  4. Boolean operations are generally O(min(W, T)), like with the standard IntSet (except with intervals instead of values)
  5. Range restriction is O(log(T)).
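
The size-guided indexing from point 2 then takes only a few lines (again an illustrative sketch building on the node types above; it assumes the left subtree holds the smaller keys, as in a big-endian patricia trie):

    def index(node, i):
        # Return the i-th smallest element of the set rooted at node,
        # assuming 0 <= i < node.size; each step descends one level,
        # so this is O(log(T)).
        if isinstance(node, Interval):
            return node.start + i
        if i < node.left.size:
            return index(node.left, i)
        return index(node.right, i - node.left.size)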

Note that it isn’t necessarily the case that a tree containing the intervals [x, y) and [y, z) will compress them into the single interval [x, z), because their common parent might be further up the tree.

An extension I have considered but not implemented is storing very small subtrees as arrays, in order to flatten the structure and reduce indirection.

In particular, the efficient indexing is very useful for both simplification and generation, and the fact that merging is efficient means that we can keep two representations around: one for each permitted category (which helps give a better distribution when generating) and one for the full range (which makes it much easier to simplify appropriately).

Here is an implementation in Python. It’s not as fast as I’d like, but it’s not unreasonably slow. A C implementation would probably be a nice thing to have and would not be too difficult to write (no, really – I’ve actually got a C implementation of something similar lying around), but it wouldn’t actually be usable for inclusion in Hypothesis, because I don’t want to add a C dependency to Hypothesis just for this.



Future directions for Hypothesis

There’s something going on in the Hypothesis project right now: There are currently three high-quality pull requests open from people who aren’t me, adding new functionality.

Additionally, Alexander Shorin (author of the characters strategy pull request) has a CouchDB-backed implementation of the Hypothesis example database, which I am encouraging him to try to merge into core.

When I did my big mic drop post, it was very unclear whether this was going to happen. One possible outcome of feature freeze was simply that Hypothesis would stabilize at its current level of functionality, except when I occasionally couldn’t resist the urge to add a feature.

I’m really glad it has, though. There’s a vast quantity of things I could do with Hypothesis, and particularly around data generation and integrations the work is more or less infinitely parallelisable and doesn’t require any deep knowledge of Hypothesis itself, so getting other people involved is great, and I’m very grateful to everyone who has submitted work so far.

And I’d like to take this forward, so I’ve updated the documentation and generally made the situation more explicit:

Firstly, it now says in the documentation that I do not do unpaid Hypothesis feature development. I will happily take sponsorship for new features; for everything else, I will absolutely help you every step of the way in designing and writing the feature, but it’s up to the community to actually drive the work.

Secondly, I’ve now labelled all the enhancements that I think are accessible for someone else to work on. Some of these are large-ish, and people will need me (or, eventually, someone else!) to lend a hand with them, but I think they all have the benefit of being relatively self-contained and approachable without requiring too deep an understanding of Hypothesis.

Will this work? Only time (and effort) will tell, but I think the current set of pull requests demonstrates that it can work, and the general level of interest I see from most people I introduce Hypothesis to seems to indicate that it’s got a pretty good fighting chance.
