Category Archives: programming

The economics of software correctness

This post is loosely based on the first half of my “Finding more bugs with less work” talk for PyCon UK.

You have probably never written a significant piece of correct software.

That’s not a value judgement. It’s certainly not a criticism of your competence. I can say with almost complete confidence that every non-trivial piece of software I have written contains at least one bug. You might have written small libraries that are essentially bug free, but the chances that any whole program you have written is bug free are tantamount to zero.

I don’t even mean this in some pedantic academic sense. I’m talking about behaviour where, if someone spotted it and pointed it out to you, you would probably admit that it’s a bug. It might even be a bug that you cared about.

Why is this?

Well, let’s start with why it’s not: It’s not because we don’t know how to write correct software. We’ve known how to write software that is more or less correct (or at least vastly closer to correct than the norm) for a while now. If you look at the NASA development process, they’re pretty much doing it.

Also, if you look at the NASA development process you will pretty much conclude that we can’t do that. It’s orders of magnitude more work than we ever put into software development. It’s process heavy, laborious, and does not adapt well to changing requirements or tight deadlines.

The problem is not that we don’t know how to write correct software. The problem is that correct software is too expensive.

And “too expensive” doesn’t mean “It will knock 10% off our profit margins, we couldn’t possibly do that”. It means “if our software cost this much to make, nobody would be willing to pay a price we could afford to sell it at”. It may also mean “If our software took this long to make then someone else will release a competing product two years earlier than us, everyone will use that, and when ours comes along nobody will be interested in using it”.

(“sell” and “release” here can mean a variety of things. It can mean that terribly unfashionable behaviour where people give you money and you give them a license to your software. It can mean subscriptions. It can mean ad space. It can even mean paid work. I’m just going to keep saying sell and release).

NASA can do it because when they introduce a software bug they potentially lose some combination of billions of dollars, years of work and many lives. When that’s the cost of a bug, spending that much time and money on correctness seems like a great deal. Safety critical industries like medical technology and aviation can do it for similar reasons (buggy medical technology kills people, and you don’t want your engines power-cycling themselves mid-flight).

The rest of us aren’t writing safety critical software, and as a result people aren’t willing to pay for that level of correctness.

So the result is that we write software with bugs in it, and we adopt a much cheaper software testing methodology: We ship it and see what happens. Inevitably some user will find a bug in our software. Probably many users will find many bugs in our software.

And this means that we’re turning our users into our QA department.

Which, to be clear, is fine. Users have stated the price that they’re willing to pay, and that price does not include correctness, so they’re getting software that is not correct. I think we all feel bad about shipping buggy software, so let me emphasise this here: Buggy software is not a moral failing. The option to ship correct software is simply not on the table, so why on earth should we feel bad about not taking it?

But in another sense, turning our users into a QA department is a terrible idea.

Why? Because users are not actually good at QA. QA is a complicated professional skill which very few people can do well. Even skilled developers often don’t know how to write a good bug report. How can we possibly expect our users to?

The result is long and frustrating conversations with users in which you try to determine whether what they’re seeing is actually a bug or a misunderstanding (although treating misunderstandings as bugs is a good idea too), to figure out what the actual bug is, and so on. It’s a time-consuming process which ends up annoying the user and taking up a lot of expensive time from developers and customer support.

And that’s of course if the users tell you at all. Some users will just try your software, decide it doesn’t work, and go away without ever saying anything to you. This is particularly bad for software where you can’t easily tell who is using it.

Also, some of our users are actually adversaries. They’re not only not going to tell you about bugs they find, they’re going to actively try to keep you from finding out, because they’re using them to steal money and/or data from you.

So this is the problem with shipping buggy software: Bugs found by users are more expensive than bugs found before a user sees them. Bugs found by users may result in lost users, lost time and theft. These all hurt the bottom line.

At the same time, your users are a lot more effective at finding bugs than you are, due to sheer numbers if nothing else, and as we’ve established it’s basically impossible to ship fully correct software, so we end up choosing some acceptable defect rate in the middle. This is generally determined by the point at which it becomes more expensive to find the next bug yourself than to let your users find it. At any higher or lower defect rate you could adjust your development process and make more money, and companies like making money, so competently run ones will generally do the things that cause them to make more of it.

This means that there are only two viable ways to improve software quality:

  1. Make users angrier about bugs
  2. Make it cheaper to find bugs

I think making users angrier about bugs is a good idea and I wish people cared more about software quality, but as a business plan it’s a bit of a rubbish one. It creates higher quality software by making it more expensive to write software.

Making it cheaper to find bugs though… that’s a good one, because it increases the quality of the software by increasing your profit margins. Literally everyone wins: The developers win, the users win, the business’s owners win.

And so this is the lever we get to pull to change the world: If you want better software, make or find tools that reduce the effort of finding bugs.

Obviously I think Hypothesis is an example of this, but it’s neither the only one nor the only one you need. Better monitoring is another. Code review processes. Static analysis. Improved communication. There are many more.

But one thing that won’t improve your ability to find bugs is feeling bad about yourself and trying really hard to write correct software then feeling guilty when you fail. This seems to be the current standard, and it’s deeply counter-productive. You can’t fix systemic issues with individual action, and the only way to ship better software is to change the economics to make it viable to do so.

Edit to add: In this piece, Itamar points out that another way of making it cheaper to find bugs is to reduce the cost of when your users do find them. I think this is an excellent point which I didn’t adequately cover here, though I don’t think it changes my basic point.


Mergeable compressed lists of integers

Alexander Shorin’s work on more configurable unicode generation in Hypothesis has to do some interesting slicing of ranges of unicode categories. In particular, doing both generation and shrinking either required two distinct representations of the data or something clever. Fortunately I’d figured out the details of the sort of data structure that would let you do the clever thing a while ago, so it was just a matter of putting the pieces together.

The result is an interesting purely functional data structure based on Okasaki and Gill’s “Fast Mergeable Integer Maps”. I’m not totally sure we’ll end up using it, but the data structure is still interesting in its own right.

The original data structure, which is the basis of Data.IntMap in Haskell, is essentially a patricia trie treating fixed size machine words as strings of 0s and 1s (effectively a crit-bit trie). It’s used for implementing immutable mappings of integers with fast operations on them (O(log(n)) insert, good expected complexity on union).

With some small twists on the data structure you can do some interesting things with it.

  1. Ditch the values (i.e. we’re just representing sets)
  2. Instead of tips being a single key, tips are a range of keys start <= x < end.
  3. Split nodes are annotated with their size and the smallest interval [start, end) containing them.
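
To make these twists concrete, here is a minimal sketch of the node shapes in Python. The names (Interval, Split, make_split) are mine for illustration, not taken from the actual implementation:

    from collections import namedtuple

    # Leaf: a contiguous run of keys start <= x < end.
    Interval = namedtuple("Interval", ["start", "end"])

    # Internal node, annotated with its total size and with the smallest
    # interval [start, end) containing every key beneath it.
    Split = namedtuple("Split", ["left", "right", "size", "start", "end"])

    def size(node):
        if isinstance(node, Interval):
            return node.end - node.start
        return node.size

    def make_split(left, right):
        """Join two subtrees (all keys in left below all keys in right),
        collapsing to a single Interval when they form one gap-free run."""
        start, end = left.start, right.end
        total = size(left) + size(right)
        if end - start == total:  # the cheap O(1) collapse check
            return Interval(start, end)
        return Split(left, right, total, start, end)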

This is extremely helpful when using it to represent sets of unicode letters – most of the time we’re just removing one or two categories, or restricting the range, which results in a relatively small number of intervals covering a very large number of codepoints.

Let T be the number of intervals and W the word size. The data structure has the following nice properties:

  1. Getting the size of a set is O(1) (because everything is size annotated or can have its size calculated with a single arithmetic operation)
  2. Indexing to an element in sorted order is O(log(T)), because you can use the size annotations of nodes to index directly into the tree – when indexing a split node, check the size of the left and right subtrees and choose which one to recurse to (see the sketch after this list).
  3. The tree can automatically collapse to intervals in many cases, because a split node is equivalent to an interval if end = start + size, which is a cheap O(1) check.
  4. Boolean operations are generally O(min(W, T)), like with the standard IntSet (except with intervals instead of values)
  5. Range restriction is O(log(T)).
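
Continuing the sketch above, here is what the O(log(T)) indexing of property 2 looks like (illustrative rather than the real code; it relies on the size helper and node shapes defined earlier):

    def index(node, i):
        """Return the i-th smallest element of the set rooted at node."""
        assert 0 <= i < size(node)
        while isinstance(node, Split):
            left_size = size(node.left)
            if i < left_size:
                node = node.left  # target is in the left subtree
            else:
                i -= left_size    # skip over the left subtree
                node = node.right
        return node.start + i     # node is now an Interval leaf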

Note that it isn’t necessarily the case that a tree with intervals [x, y) and [y, z) in it will compress this into the interval [x, z) because their common parent might be further up the tree.

An extension I have considered but not implemented is that you could potentially store very small subtrees as arrays in order to flatten it out and reduce indirection.

In particular the efficient indexing is very useful for both simplification and generation, and the fact that merging efficiently is possible means that we can keep two representations around: One for each permitted category (which helps give a better distribution when generating) and one for the full range (which makes it much easier to simplify appropriately).

Here is an implementation in Python. It’s not as fast as I’d like, but it’s not unreasonably slow. A C implementation would probably be a nice thing to have and is not too difficult to do (no, really. I’ve actually got a C implementation of something similar lying around), but wouldn’t actually be useful for the use case of inclusion in Hypothesis because I don’t want to add a C dependency to Hypothesis just for this.



Finding more bugs with less work

I was at PyCon UK this weekend, which was a great conference and I will definitely be attending next year.

Among the things that occurred at this conference is that I gave my talk, “Finding more bugs with less work”. The video is up, and you can see the slides here.

I may do a transcript at some point (like I did for my django talk), but I haven’t yet.


Soliciting advice: Bindings, Conjecture, and error handling

Edit: I think I have been talked into a significantly simpler system than the one described here, which simply uses error codes plus some custom hooks to make this work. I’m leaving this up for posterity and am still interested in advice, but don’t worry: I’ve already been talked out of using setjmp or implementing my own exception handling system.

I’m working on some rudimentary Python bindings to Conjecture and running into a bit of a problem: I’d like it to be possible to run Conjecture without forking, but I’m really struggling to come up with an error handling interface that works for this.

In Conjecture’s current design, any of the data generation functions can abort the process, and are run in a subprocess to guard against that. For testing C this makes absolute sense: It’s clean, easy to use, and there are so many things that can go wrong in a C program that will crash the process that your C testing really has to be resilient against the process crashing anyway so you might as well take advantage of that.

For Python, this is a bit sub-optimal. It would be really nice to be able to run Conjecture tests purely in process, just looking for exceptions. os.fork() has to do a bunch of things which makes it much slower than just using C forking straight off (and the program behaves really weirdly when you send it a signal if you try to use the native fork function), and it’s also just a bit unnecessary for 90% of what you do with Python testing.

It would also be good to support a fork free mode so that Conjecture can eventually work on Windows (right now it’s very unixy).

Note: I don’t need forkless mode to handle crashes that are not caused by an explicit call into the conjecture API. conjecture_reject and conjecture_fail (which doesn’t exist right now but could) will explicitly abort the test, but other things that cause a crash are allowed to just crash the process in forkless mode.

So the problem is basically how to combine these interfaces, and everything I come up with seems to turn into “Now we design an exception system…”

Here is the least objectionable plan I have so far. It requires a lot of drudge work on my part, but this should mostly be invisible to the end user (“Doing the drudge work so you don’t have to” is practically my motto of good library design).

Step 1: For each draw_* function in the API, add a second draw_*_checked function which has exactly the same signature. This does a setjmp, followed by a call to the underlying draw_* function. If that function aborts, it does a longjmp back to the setjmp, sets an is_aborted flag, and returns some default value. Bindings must always call the _checked version of the function, then check conjecture_is_aborted() and convert it into a language appropriate error condition.

Note: It is a usage error to call one checked function from another and this will result in your crashing the process. Don’t do that. These are intended to be entry points to the API, not something that you should use in defining data generators.
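
From the binding side, the intended pattern would then look something like this (a hypothetical sketch: lib stands for some FFI handle to the C library, and the exact names and signatures are placeholders, since the Python API doesn’t exist yet):

    class ConjectureAborted(Exception):
        """Python-side stand-in for an aborted test."""

    def draw_uint64(lib, context):
        # Call the checked variant, then turn an abort into an exception.
        value = lib.conjecture_draw_uint64_checked(context)
        if lib.conjecture_is_aborted(context):
            raise ConjectureAborted()
        return value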

Step 2: Define a “test runner” interface. This takes a test and some associated data, runs it, and returns one of three states: Passing test, failing test, or rejected test. The forking-based interface then becomes a single test runner. Another one using techniques similar to the checked interface is possible. Bindings libraries should write their own – e.g. a Python one would catch all exceptions and convert them into an appropriate response.
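
For instance, the Python bindings’ runner might be little more than this (again a sketch, reusing the ConjectureAborted exception from above):

    from enum import Enum

    class TestResult(Enum):
        PASSING = "passing"
        FAILING = "failing"
        REJECTED = "rejected"

    def python_test_runner(test, data):
        """Run one test, mapping Python exceptions onto the three states."""
        try:
            test(data)
        except ConjectureAborted:
            return TestResult.REJECTED
        except Exception:
            return TestResult.FAILING
        return TestResult.PASSING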

Step 3: Define a cleanup API. This lets you register a void (*cleanup)(void *data) function and some data to pass to it, which may get called right before aborting. In the “crash the process” model it is not required to be called, and it will not get called if your process otherwise exits abnormally. Note: This changes the memory ownership model of all data generation. Data returned to you from generators is no longer owned by you and you may not free it.
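
On the Python side that registration might be wired up with a ctypes callback, roughly like this (conjecture_add_cleanup is an invented entry point for the sketch; only the void (*)(void *) shape comes from the description above):

    import ctypes

    # Matches the C signature: void (*cleanup)(void *data).
    CLEANUP_FUNC = ctypes.CFUNCTYPE(None, ctypes.c_void_p)

    def register_cleanup(lib, func, data):
        callback = CLEANUP_FUNC(func)
        lib.conjecture_add_cleanup(callback, data)
        # The callback object must be kept alive for as long as the C
        # library might call it, or ctypes will garbage collect it.
        return callback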

I think this satisfies the requirements of being easy to use from both C and other languages, but I’m a little worried that I’m not so much reinventing the wheel as trying to get from point A to point B without even having heard of these wheel things and so I invented the pogo stick instead. Can anyone who has more familiarity with writing C libraries designed to be usable from both C and other languages offer me some advice and/or (heh) pointers?


Conjecture WIP: Simplification isn’t simple

Advance note: Read the Conjecture design document first or this post won’t make sense.

I’ve been doing some prototyping of Conjecture in a private fork of Hypothesis, mostly to take advantage of Hypothesis’s fairly brutal test suite and comprehensive range of examples. I figure if I can make the Conjecture approach pass the Hypothesis test suite then that’s a pretty good indication that the idea works.

Spoiler alert: I can’t currently make Conjecture pass the Hypothesis test suite.

Fortunately, the way I can’t make Conjecture pass the Hypothesis test suite isn’t the end of the world. The problem is basically one of performance. For small examples in a lot of cases Conjecture’s approach performs as well as or better than Hypothesis. For large examples it’s just too slow: It gets the right answer eventually, but it takes an order of magnitude more time than Hypothesis does.

The reason for this is pretty simple: the underlying data buffers that Conjecture ends up generating for an example of any reasonable size are quite large (tens of kilobytes), and this is going to remain the case even after simplification. So any structure-unaware binary simplification process is going to be a bit stuck about what to do. It’ll find the right answer eventually, but there are a lot of things to try.

I think the answer is going to have to be to make it structure aware. I haven’t hit on exactly the right approach for this yet, but the minimal abstraction that I think will work is to give it an idea of nesting: Examples can call conjecture_example_start() when they begin and conjecture_example_end() when they end, and this gives us nested ranges of where examples live in our data stream, which should give us useful hints about where to start. For example, one of the best simplifications is to try deleting whole ranges of data, and another good one is to try zeroing whole ranges. Unfortunately there are a lot of ranges in a buffer (O(n^2) of them, in fact) and it’s very hard to pick the good ones. Focusing on ranges that snap to example boundaries would help a lot here.
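
To illustrate, a boundary-aware deletion pass might look something like this sketch (all names invented; boundaries stands for the list of (start, end) pairs recorded by the conjecture_example_start()/conjecture_example_end() calls):

    def delete_example_ranges(buffer, boundaries, still_interesting):
        """Try deleting whole examples from the buffer, largest first,
        keeping the first deletion that preserves the interesting
        behaviour (e.g. the test still failing)."""
        # Sorting by start - end tries the largest ranges first.
        for start, end in sorted(boundaries, key=lambda b: b[0] - b[1]):
            candidate = buffer[:start] + buffer[end:]
            if still_interesting(candidate):
                return candidate  # the caller restarts from this smaller buffer
        return buffer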

It also potentially gives us a way to deal with the accidental dependency problem, as we can look for pieces that have moved from one example to another in the course of simplification as likely candidates for deletion. This might be too hard to keep track of to be worth it though, as accidental dependencies haven’t proven to be a major problem in practice.

In the course of thinking about these ideas I also came up with another idea that I initially thought was ridiculous, but now I’m not so sure – it might actually be a good one: prioritizing examples that print less data, as a mitigation for cases where simplicity of the underlying buffer isn’t always reflected by simplicity of the example.

The way this would work is that we would still maintain our example search exactly as normal, but we would also maintain a “favoured” example, which is the example with the smallest number of bytes printed, with ties being broken by simplicity of underlying buffer (i.e. if we shrink to an example which prints the same number of bytes we always replace).

The reason I think this is a good idea is that most of what we’re focusing on in examples is readability, and size of example is a pretty good proxy for that – large examples are hard to read, small examples are easy to read. It’s not a perfect proxy for it, which is why we still want the underlying simplification process, but I think it’s a good tool for mitigating that process’s potential flaws.
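
Mechanically the bookkeeping for this is tiny – something like the following sketch (names invented, using the length of the printed representation as the measure):

    favoured = None  # (printed_size, buffer) for the best example so far

    def consider(buffer, printed):
        """Update the favoured example. A tie on printed size always
        replaces, since later buffers in the search are simpler."""
        global favoured
        if favoured is None or len(printed) <= favoured[0]:
            favoured = (len(printed), buffer)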

Another thing that would be useful to investigate is whether a smarter search algorithm would help here, e.g. simulated annealing. The local search approach was a great idea when you had a fairly abstract data representation and needed a simple API that was easy to extend to new things, but when what you’ve got is a concrete binary representation it’s probably not the best search algorithm.

Basically, all of this is harder than I thought. It’s not intractably hard, and for Conjecture itself, which is C, as long as you’re running it on pretty fast tests you can basically brute-force through the problem (although I need to update its simplification). I still think the idea is a great one, but great ideas also need hard work, and that’s yet to come.
