# Category Archives: programming

I’ve argued before that most software is broken because of economic reasons. Software which is broken because there is no incentive to ship good software is going to stay broken until we manage to change those incentives. Without that  there will be no budget for quality, and nothing you can do is going to fix it.

But suppose you’re in the slightly happier place where you do have budget for quality? What then? What can you do to make sure you’re actually spending that budget effectively and getting the best software you could be getting out of it?

I don’t have an easy answer to that, and I suspect none exists, but I’ve been doing this software thing for long enough now that I’ve picked up some things that seem to help quality without hurting (and ideally helping) productivity. I thought it would be worth writing them down.

Many of them will be obvious or uncontroversial, but if you’re already doing all of them then your team is probably doing very well.

This is all based somewhat on anecdote and conjecture, and it’s all coloured by my personal focuses and biases, so some of it is bound to be wrong. However I’m pretty sure it’s more right than wrong and that the net effect would be strongly positive.

### Attitude

If you do not care about developing quality software you will not get quality software no matter what your tools and processes are designed to give you.

This isn’t just about your developers either. If you do not reward the behaviour that is required to produce quality software, you will not get quality software. People can read their managers’ minds, and if you say you want quality software but reward people for pushing out barely functioning rubbish, people are smart enough to figure out you don’t really mean that.

Estimated cost: Impossible to buy, but hopefully if you’re reading this article you’re already there. If you’re embedded in a larger context that isn’t, try creating little islands of good behaviour and see if you can bring other people around to your way of thinking.

Estimated benefit: On it’s own, only low to moderate – intent doesn’t do much without the ability to act – but it is the necessary precursor to everything else.

### Automated Testing

Obviously I have quite a few thoughts about automated testing, so this section gets a lot of sub headings.

#### Continuous Integration

You need to be running automated tests in some sort of CI server that checks every ostensibly releasable piece of software and checks whether it passes the tests.

If you’re not doing this, just stop reading this article and go set it up right now, because it’s fundamental. Add a test that just fires up your website and requests the home page (or some equivalent if you’re not writing a website). You’ve just taken the first and most important step on the road from throwing some crap over the wall and seeing if anyone on the other side complains about it landing on them to actual functional software development.

Estimated cost: Hopefully at worst a couple days initial outlay to get this set up, then small to none ongoing.

Estimated benefit: Look, just go do this already. It will give you a significant quality and productivity boost.

#### Local Automated Testing

You need to be able to run a specific test (and ideally the whole test suite) against your local changes.

It doesn’t really matter if it runs actually on your local computer, but it does matter that it runs fast. Fast feedback loops while you work are incredibly important. In many ways the length of time to run a single test against my local changes is the biggest predictor of my productivity on a project.

Ideally you need to  be able to select a coherent group of tests (all tests in this file, all tests with this tag) and run just those tests. Even better, you should be able to run just the subset of whatever tests you ask to run that failed last time. If you’re using Python, I recommend py.test as supporting these features. If you’re currently using unittest you can probably just start using it as an external runner without any changes to your code.

Estimated cost: Depends on available tooling and project. For some projects it may be prohibitively difficult (e.g. if your project requires an entire hadoop cluster to run the code you’re working on), but for most it should be cheap to free.

Estimated benefit: Similarly “look, just go do this already” if you can’t run a single test locally. More specific improvements will give you a modest improvement in productivity and maybe some improvement in quality if they make you more likely to write good tests, which they probably will.

#### Regression Testing

The only specific type of automated testing that I believe that absolutely everybody should be doing is regression testing: If you find a bug in production, write a test that detects that bug before you try to fix the bug. Ideally write two tests: One that is as general as possible, one that is as specific as possible. Call them an integration and a unit test if that’s your thing.

This isn’t just a quality improvement, it’s a productivity improvement. Trying to fix bugs without a reproducible example of that bug is just going to waste your time, and writing a test is the best way to get a reproducible example.

Estimated cost: Zero, assuming you have local testing set up. This is just what you should be doing when fixing bugs because it’s better than the other ways of fixing bugs – it will result in faster development and actually fixed bugs.

Estimated benefit: Depends how much time you spend fixing bugs already, but it will make that process faster and will help ensure you don’t have to repeat the process. It will probably improve quality somewhat by virtue of preventing regressions and also ensuring that buggier areas of the code are better tested.

#### Code Coverage

You should be tracking code coverage. Code coverage is how you know code is tested. Code being tested is how you know that it is maybe not completely broken.

It’s OK to have untested code. A lot of code isn’t very important, or is difficult enough to test that it’s not worth the effort, or some combination of the two.

But if you’re not tracking code coverage then you don’t know which parts of your code you have decided you are OK with being broken.

People obsess a lot about code coverage as a percentage, and that’s understandable given that’s the easiest thing to get out of it, but in many ways it’s the least important part of code coverage. Even the percentage broken down by file is more interesting that, but really the annotated view of your code is the most important part because it tells you which parts of your system are not tested.

My favourite way to use code coverage is to insist on 100% code coverage for anything that is not explicitly annotated as not requiring coverage, which makes it very visible in the code if something is untested. Ideally every pragma to skip coverage would also have a comment with it explaining why, but I’m not very good about that.

As a transitional step to get there, I recommend using something like diff-cover or coveralls which let you set up a ratcheting rule in your build that prevents you from decreasing the amount of code coverage.

Estimated cost: If your language has good tooling for coverage, maybe a couple hours to set up. Long-term, essentially free.

Estimated benefit: On its own, small, but it can be a large part of shifting to a culture of good testing, which will have a modest to large effect.

#### Property-based Testing

Property-based testing is very good at shifting the cost-benefit ratio of testing, because it somewhat reduces the effort to write what is effectively a larger number of tests and increases the number of defects those tests will find.

Estimated cost: If you’re using a language with a good property based testing tool, about 15 minutes to install the package and write your first test. After that, free to negative. If you’re not, estimated cost to pay me to port Hypothesis to a new language is around £15-20k.

Estimated benefit: You will find a lot more bugs. Whether this results in a quality improvement depends on whether you actually care about fixing those bugs. If you do, you’ll see a modest to large quality improvement. You should also see a small to modest productivity improvement if you’re spending a lot of time on testing already.

### Manual Testing

At a previous job I found a bug in almost every piece of code I reviewed. I used a shocking and complicated advanced technique to do this: I fired up the application with the code change and tried out the feature.

Manual testing is very underrated. You don’t have to have a dedicated QA professional on your team to do it (though I suspect it helps a lot if you do), but new features should have a certain amount of exploratory manual testing done by someone who didn’t develop them – whether it’s another developer, a customer service person, whatever. This will find both actual bugs and also give you a better idea of its usability.

And then if they do find bugs those bugs should turn into automated regression tests.

Estimated cost: It involves people doing stuff on an ongoing basis, so it’s on the high side because people are expensive, but it doesn’t have to be that high to get a significant benefit to it. You can probably do quite well with half an hour of testing of a feature that took days to develop. This may also require infrastructure changes to make it easy to do that can have varying levels of cost and difficulty, but worst case scenario you can do it on the live system.

Estimated benefit: You will almost certainly get a moderate quality improvement out of doing this.

### Version Control

You need to be using a version control system with good branching and merging. This is one of the few pieces of advice where my recommendation requires making a really large change to your existing workflow.

I hope that it’s relatively uncontroversial that you should be using version control (not everybody is!). Ideally you should be using good version control. I don’t really care if you use git, mercurial, fossil, darcs, whatever. We could get into a heated argument about which is better but it’s mostly narcissism of small differences at this point.

But you should probably move off SVN if you’re still on it and you should definitely move off CVS if you’re still on it. If you’re using visual source safe you have my sympathies.

The reason is simple: If you’re working on a team of more than one person, you need to be able to incorporate each other’s changes easily, and you need to be able to do that without trashing your own work. If you can’t, you’re going to end up wasting a lot of your time.

Estimated cost: Too project dependent to say. Importer tools are pretty good, but the real obstacle is always going to be the ecosystem you’ve built around the tools. At best you’re going to have a bad few weeks or months while people get used to the new system.

Estimated benefit: Moderate to large. Many classes of problems will just go away and you will end up with a much more productive team who find it much easier to collaborate.

### Monorepos

It’s tempting to split your projects into lots of small repos for libraries and services, but it’s almost always a bad idea. It significantly constrains your ability to refactor across the boundary and makes coordinating changes to different parts of the code much harder in almost every way, especially with most standard tooling.

If you’re already doing this, this is easy. Just don’t change.

If you’re not, just start by either creating or designating an existing repository as the monorepo and gradually move the contents of other repos into it as and when convenient.

The only exception where you probably need to avoid this is specific projects you’re open sourcing, but even then it might be worth developing them in the monorepo with some sort of external repo.

Estimated costs: Too project dependent to say, but can be easily amortised over time.

Estimated benefits: Every time you do something that would have required touching two repos at once, your life will be slightly easier because you are not paying coordination costs. Depends on how frequent that is, but experience suggests it’s at least a modest improvement.

### Static Analysis

I do not know what the right amount of static analysis is, but I’m essentially certain that it’s not none. I would not be surprised to learn that the right amount was quite high and includes a type system of some sort, but I don’t know. However even very dynamic languages admit some amount of static analysis and there are usually tools for it that are worth using.

I largely don’t think of this as a quality thing though. It’s much more a productivity improvement.  Unless you are using a language that actively tries to sabotage you (e.g. C, JavaScript), or you have a really atypically good static analysis system that does much more work than the ones I’m used to (I’m actually not aware of any of these that aren’t for C and/or C++ except for languages with actual advanced type systems), static analysis is probably not going to catch bugs much more effectively than a similar level of good testing.

But what it does do is catch those bugs sooner and localise them better. This significantly improves the feedback loop of development and stops you wasting time debugging silly mistakes.

There are two places that static analysis is particularly useful:

1. In your editor. I use syntastic because I started using vim a decade ago and haven’t figured out how to quit yet, but your favourite editor and/or IDE will likely have something similar (e.g. The Other Text Editor has flycheck). This is a really good way of integrating lightweight static analysis into your workflow without having to make any major changes.
2. In CI. The ideal number of static analysis errors in your project is zero (this is true even when the static analysis system has false positives in my opinion, with the occasional judicious use of ‘ignore this line’ pragmas), but you can use the same tricks as with code coverage to ratchet them down to zero from wherever you’re starting.

Most languages will have at least a basic linting tool you can use, and with compiled languages the compiler probably has warning flags you can turn on. Both are good sources of static analysis that shouldn’t require too much effort to get started with.

Estimated cost: To use it in your editor, low (you can probably get it set up in 10 minutes). To use it in your CI, higher but still not substantial. However depending on the tool it may require some tuning to get it usable, which can take longer.

Estimated benefit: Depends on the tool and the language, but I think you’ll get a modest productivity boost from incorporating static analysis and may get a modest to large quality boost depending on the language (in Python I don’t think you’ll get much of a quality benefit. In C I think you’ll get a huge one even with just compiler warnings).

### Production Error Monitoring

You should have some sort of system that logs all errors in production to something more interactive than a log file sitting on a server somewhere. If you’re running software locally on end users’ computers this may be a bit more involved and should require end user consent, but if you’re writing a web application we’re all used to being pervasively spied on in everything we do anyway so who cares?

I’ve used and like Sentry for this. There are other options, but I don’t have a strong opinion about it.

Estimated cost: Depends on setup, but getting started with sentry is easy and it doesn’t cost a particularly large amount per month (or you can host the open source version for the cost of a server).

Estimated benefit: Much better visibility of how broken your software is in production is the best thing for making your software less broken in production. It will also speed up your debugging process a lot when you do have production errors to fix, so it’s probably a net win in productivity too if you spend much time debugging production errors (and you probably do).

### Code Review

I think all projects with > 1 person on them should put all code changes through code review.

I also think this will not significantly reduce the number of bugs in your software. Code review is just not a cost effective bug finding tool compared to almost anything else.

What it will do is ensure two things:

1. At least one other person understands this code. This is useful both for bus factor and because it ensures that you have written code that at least one other person can understand.
2. At least one other person thinks that shipping this code is a good idea. This is good both for cross checking but also because it forces you to sit down and think about what you ship. This is quite important: Fast feedback loops are good for development, but slow feedback loops for shipping make you pause and think about it.

Estimated cost: You need to get a code review system set up, which is a modest investment and may be trivial. I can’t really recommend anything in this space as the only things I’ve used for this are Github and proprietary internal systems. Once you’ve got that, the ongoing cost is actually quite high because it requires the intervention of an actual human being on each change.

Estimated benefit: It’s hard to say. I have never been part of a code review process that I didn’t think was worth it, but I don’t have a good way of backing that up with measurements. It also depends a lot on the team – this is a good way of dealing with people with different levels of experience and conscientiousness.

### Auto formatting and style checking

Code review is great, but it has one massive failure mode. Consider Wadler’s law:

In any language design, the total time spent discussing a feature in this list is proportional to two raised to the power of its position.
0. Semantics
1. Syntax
2. Lexical syntax

Basically the same thing will happen with code review. People will spend endless time arguing about style checking, layout, etc.

This stuff matters a bit, but it doesn’t matter a lot, and the back and forth of code review is relatively expensive.

Fortunately computers are good at handling it. Just use an auto-formatter plus a style checker. Enforce that these are applied (style checking is technically a subset of static analysis but it’s a really boring subset and there’s not that much overlap in tools).

In Python land I currently use pyformat and isort for auto-formatting and flake8 for style checking. I would like to use something stronger for formatting – pyformat is quite light touch in terms of how much it formats your code. clang-format is extremely good and is just about the only thing I miss about writing C++. I look forward to yapf being as good, but don’t currently find it to be there yet (I need to rerun a variant on my bug finding mission I did for it last year at some point). gofmt is nearly the only thing about Go I am genuinely envious of.

Ideally you would have your entire project be a fixed point of the code formatter. That’s what I do for Hypothesis. If you haven’t historically done that it can be a pain though. Many formatting tools can be applied based on only the edited subset of the code. If you’re lucky enough to have one of those, make that part of your build process and have it automatically enforced.

Once you have this, you can now institute a rule that there should be no formatting discussion in code review because that’s the computer’s job.

Estimated cost: Totally tool dependent, but if you’re lucky it’s basically free.

Estimated benefit: From the increased consistency of your code, small but noticeable. The effect on code review is moderate to large, both in terms of time taken and quality of review.

### Documentation in the Repository

You should have a docs section in your repository with prose written about your code. It doesn’t go in a wiki. Wikis are the RCS of documentation. We’ve already established you should be using good version control and a monorepo, so why would you put your documentation in RCS?

Ideally your docs should use something like sphinx so that they compile to a (possibly internally hosted) site you can just access.

It’s hard to keep documentation up to date, I know, but it’s really worth it. At a bare minimum I think your documentation should include:

• Up to date instructions for how to get started developing with your code
• Detailed post-mortems of major incidents with your product

For most projects they should also include a change log which is updated as part of each pull request/change list/whatever.

It may also be worth using the documentation as a form of “internal blogging” where people write essays about things they’ve discovered about the problem domain, the tools you’re using or the local style of work.

Estimated cost: Low initial setup. Writing documentation does take a fair bit of time though, so it’s not cheap.

Estimated benefit: This has a huge robustness benefit, especially every time your team changes structure or someone needs to work on a new area of the code base. How much benefit you’ll derive varies depending on that, but it’s never none – if nothing else, everybody forgets things they don’t do often, but also the process of writing the documentation can hugely help the author’s understanding.

### Plan to always have more capacity than work

Nick Stenning made an excellent point on this recently: If your team is always working at full capacity then delays in responding to changes will sky rocket, even if they’re coming in at a rate you can handle.

As well as that, it tends to mean that maintenance tasks that can greatly improve your productivity will never get done – almost every project has a back log of things that are really annoying the developers that they’d like to fix at some point and never get around to. Downtime is an opportunity to work on that.

This doesn’t require some sort of formal 20% time arrangement, it just requires not trying to fit a quart into a pint pot. In particular, if you find you’ve scheduled more work than got done, that’s not a sign that you slightly over estimated the amount you could get done that’s a sign that you scheduled significantly too much work.

Estimated Cost: Quite expensive. Even if you don’t formally have 20% time, you’re probably still going to want to spend about 20% of your engineering capacity this way. It may also require significant experimentation to get your planning process good enough to stop overestimating your capabilities.

Estimated Benefit: You will be better able to respond to changes quickly and your team will almost certainly get more done than they were previously getting done in their 100% time.

### Projects should be structured as collections of libraries

Modularity is somewhat overrated, but it’s not very overrated, and the best way of getting it is to structure things as libraries. The best way to organise your project is not as a big pot of code, but as a large number of small libraries with explicit dependencies between them.

This works really well, is easy to do, and helps keep things clean and easy to understand while providing push back against it all collapsing into a tangled mess.

Some people may be tempted to do this as microservices instead, which is a great way to get all the benefits of libraries alongside all the benefits of having an unreliable network and a complicated and fragile deployment system.

There are systems like bazel that are specifically designed around structuring your project this way. I don’t have very fond memories of its origin system, and I’ve not used the open source version yet, but it is a good way of enforcing a good build structure. Otherwise the best way to do this is probably just to create subdirectories and use your language’s standard packing tools (which probably include a development mode for local development. e.g. pip install -e if you’re using Python).

Estimated cost: Quite low. Just start small – factor out bits of code and start new code in its own library. Evolve it over time.

Estimated benefit: Small to moderate productivity enhancement. Not likely to have a massive impact on quality, but it does make testing easier so it should have some.

### Ubiquitous working from home

I’m actually not a huge fan of massively distributed teams, mostly because of time zones. It tends to make late or early meetings a regular feature of peoples’ lives. I could pretend to be altruistic and claim that I disapprove of that because it’s bad for people with kids, which it is, but I also just really hate having to do those myself.

But the ability to work from home is absolutely essential to a productive work environment, for a number of reasons:

1. Open plan offices are terrible. They are noisy distraction zones that make it impossible to get productive work done. Unfortunately, this battle is lost. For whatever reason the consensus is that it’s more cost effective to cram developers in at 50% efficiency than it is to pay for office space. This may even be true. But working from home generally solves this by giving people a better work environment that the company doesn’t have to pay for.
2. Requiring physical presence is a great way for your work force to be constantly sick! People can and should take sick days, but if people cannot work from home then they will inevitably come in when they feel well enough to work but are nevertheless contagious. This will result in other people becoming ill, which will either result in them coming and spreading more disease or staying home and getting nothing done. Being able to work from home significantly reduces the incentive to come in while sick.
3. Not having physical access to people will tend to improve your communication patterns to be lower interrupt and more documentation driven, which makes them work better for everyone both in the office and not.

I do not know what the ideal fraction of work from home to work in the office is, but I’d bet money that if most people are not spending at least two days per week working from home then they would benefit from spending more. Also, things will tend to work better as the fraction increases: If you only have a few people working from home at any given point, the office culture will tend to exclude them. As you shift to it being more normal, work patterns will adapt to accommodate them better.

Estimated cost: There may be some technical cost to set this up – e.g. you might need to start running a VPN – but otherwise fairly low. However there may be quite a lot of political and social push back on this one, so you’re going to need a fair bit of buy in to get it done.

Estimated benefit: Depends on team size and current environment, but potentially very large productivity increase.

### No Long Working Hours

Working longer work weeks does not make for more productive employees, it just results in burning out, less effective work and spending more time in the office failing to get anything done. Don’t have work environments that encourage it.

In fact, it’s better if you don’t have work environments that allow it, because it will tend to result in environments where it goes from optional to implicitly mandatory due to who gets rewarded. It’s that reading managers’ minds thing again.

Estimated cost: Same as working from home: Low, but may require difficult to obtain buy in. Will probably also result in a transitional period of lower productivity while people are still burned out but less able to paper over it.

Estimated benefit: High productivity benefits, high quality benefits. Exhausted people do worse work.

### Good Work Culture

Or the “don’t work with jerks” rule.

People need to be able to ask questions without being afraid. People need to be able to give and receive feedback without being condescending or worrying that the other side will blow up at them or belittle them. People need to be able to take risks and be seen to fail without being afraid of how much it will cost them.

There are two major components to this:

1. Everyone needs to be on board with it and work towards it. You don’t need everyone to be exquisitely polite with everyone else at all times – a certain amount of directness is usually quite beneficial – but you do need to be helpful, constructive and not make things personal.
2. Some people are jerks and you should fire them if they don’t fix their ways.

It is really hard to do the second one, and most people don’t manage, but it’s also really important. Try to help them fix their ways first, but be prepared to let them go if you can’t, because if you’ve got a high performing jerk on the team it might be that they’re only or in large part high performing because they’re making everyone else perform worse.

Note: This includes not just developers but also everyone else in the company.

Estimated cost: High. Changing culture is hard. Firing people is hard, especially if they’re people who as individual performers might look like your best employees.

Estimated benefit: Depending on how bad things are currently, potentially extremely high. It will bring everyone’s productivity up and it will improve employee retention.

### Good Skill Set Mixing

You generally want to avoid both silos and low bus factors.

It’s important to have both overlapping and complementary skills on your team. A good rule of thumb is that any task should have at least two people who can do it, and any two people should have a number of significant tasks where one would obviously be better suited to work on it than another. The former is much more important than the latter, but both are important.

Having overlapping skills is important because it increases your resilience and capacity significantly: If someone is out sick or on holiday you may be at reduced capacity but there’s nothing you can’t do. It also means there is always a second perspective you can get on any problem you’re stuck with.

Having complementary skills is important because that’s how you expand capabilities: Two people with overlapping skills are much better able to work together than two people with nothing in common, but two people with identical skills will not be much better working together than either of them individually. On the other hand two people working together who have different skill sets can cover the full range of either of their skills.

This is a hard one to achieve, but it will tend to develop over time if you’re documenting things well and doing code review. It’s also important to bear in mind while hiring.

Estimated cost: Hard to say because what you need to do to achieve it is so variable, but it will probably require you to hire more people than you otherwise would to get the redundant skill sets you need, so it’s not cheap.

Estimated benefit: Huge improvement in both total team and individual productivity.

### Hire me to come take a look at what might be wrong

I do do software consulting after all. This isn’t what I normally consult on (I’m pretty focused on Hypothesis), but if you’d like the help I’d be happy to branch out, and after accidentally writing 5000 words on the subject I guess I clearly have a few things to say on the subject.

Estimated cost: My rates are very reasonable.

Estimated benefit: You’re probably better placed to answer this one than I am, but if this document sounds reasonable to you but you’re struggling to get it implemented or want some other things to try, probably quite high.

This entry was posted in programming on by .

# Static typing will not save us from broken software

Epistemic status: This piece is like virtually all writing about software and is largely anecdata and opinions. I think it’s more right than not, but then I would.

I learned to program in ML. I know Haskell to a reasonable degree of fluency. I’ve written quite a lot of Scala, including small parts of the compiler and standard library (though I think most or all of that is gone or rewritten by now. It’s been 8 years). I like static typing, and miss using statically typed languages more heavily than I currently do (which is hardly at all).

But I’m getting pretty frustrated with certain (possibly most) static type system advocates.

This frustration stems from the idea that static typing will solve all our problems, or even one specific problem: The ubiquity of broken software. There’s a lot of broken software out there, and the amount keeps going up.

People keep claiming that is because of bad choices of language, but it’s mostly not and static typing will not even slightly help fix it.

(Note: I’m getting a lot of people saying this is a strawman and that’s not what static typing advocates say. This post is in fact a response to several specific comments from specific people, but I didn’t want to name and shame. It’s not a strawman if the people I’m arguing against actually exist).

Broken software is a social and economic problem: Software is broken  because its not worth people’s while to write non-broken software. There are only two solutions to this problem:

1. Make it more expensive to write broken software
2. Make it cheaper to write correct software

Technical solutions don’t help with the first, and at the level of expense most people are willing to spend on software correctness your technical solution has to approach “wave a magic wand and make your software correct” levels of power to make much of an impact: The current level of completely broken software can only arise if there’s almost zero incentive for people to sink time into correctness of their IoT devices and they’re not engaged in even minimal levels of testing for quality.

When you’ve got that level of investment in quality anything that points out errors is more likely to be ignored or not used than it is to improve things.

I think this carries over to moderate levels of investment in correctness too, but for different reasons (and ones I’m less confident of).

“All” static typing tells you is that your program is well-typed. This is good and catches a lot of bugs by enforcing consistency on you. But at entry-level static typing most of those bugs are the sort that ends up with a Python program throwing a TypeError. Debugging those when they happen in production is a complete pain and very embarrassing, but it’s still the least important type of bug: A crash is noticeable if you’ve got even basic investment in monitoring (e.g. a sentry account and 5 lines of code to hook it in to your app). This is more true in some dynamic languages than others – Javascript is terrible for this because so many errors result in a value of undefined rather than an exception – but generally speaking in most languages these are quite straightforward errors both in manifestation and debugging.

Don’t get me wrong: Not having those bugs reach production in the first place is great. I’m all in favour. But because these bugs are relatively minor the cost of finding them needs to be lower than the cost of letting them hit production, else they start to eat into your quality budget and come at the cost of other more important bugs.

For more advanced usage, I’ve yet to be convinced that types are more effective than tests on modestly sized projects.

For large classes of problems, tests are just easier to write than types. e.g. an end to end test of a complicated user workflow is fairly easy to write, but literally nobody is going to encode it in the type system. Tests are also easier to add after the fact – if you find a bug it’s easy and unintrusive to add a test for it, but may require a substantial amount of work to refactor your code to add types that make the bug impossible. It can and often will be worth doing the latter if the bug is an expensive one, but it often won’t be.

In general, trying to encode a particular correctness property in the type system is rarely going to be easier than writing a good test for it, especially if you have access to a good property based testing library. The benefits of encoding it in the type system might make it worth doing anyway, for some bugs and some projects, but given the finite quality budget it’s going to come at the expense of other testing, so it really has to pull its weight.

Meanwhile, for a lot of current statically typed languages static typing ends up coming at the cost of testing in another entirely different way: Build times.

There are absolutely statically typed languages where build times are reasonable but this tends to be well correlated with them having bad type systems. e.g. Go is obsessed with good build times, but Go is also obsessed with having a type system straight out of the 70s which fights against you at every step of the way. Java’s compile times are sorta reasonable but the Java type system is also not particularly powerful. Haskell, Scala or Rust all have interesting and powerful type systems and horrible build times. There are counter-examples – OCaml build times are reportedly pretty good – but by and large the more advanced the type system the longer the build times.

And when this happens it comes with an additional cost: It makes testing much more expensive. I’m no TDD advocate, but even so writing good tests is much easier when the build/test loop is low. Milliseconds it’s bliss, seconds it’s fine, tens of seconds it starts to get a bit painful and if the loop is minutes honestly you’re probably not going to be writing many tests and if you are they’re probably not going to be very good.

So in order to justify its place in the quality budget, if your static types are substantially increasing build times they need to not just be better than writing tests (which, as discussed, they will often not be), they need to be better than all the tests you’re not going to write because of those increased build times.

To recap:

1. The most common bugs caught by static typing are also the least critical sort of bug.
2. In most contexts, catching a bug with a test is going to be cheaper than catching it with types. This is particularly true for bugs found after the fact.
3. Most existing static type systems also come with a build time cost that makes testing in general more expensive.

This means that by and large when the quality budget is constrained I would expect complicated typing to often hurt quality.

This obviously won’t always be true. For many scenarios the opposite will be true. e.g. I’d expect static typing to win out for correctness if:

• bugs (especially crashing bugs) are very expensive so you have a large correctness budget to play with and have already picked the low hanging fruit from testing.
• the project is very large. In these scenarios you may benefit a lot more from the sort of universal guarantees that static typing provides vs writing the same sort of tests over and over again, and the build times are probably already high enough that it’s painful to test well anyway.

The point is not that static typing is going to hurt quality in general, but that it’s a set of complicated trade-offs.

I don’t know how to calculate those trade-offs in general. It’s far from straightforward. But the point is that those trade-offs exist and that people who are pretending that static typing will solve the software quality crisis are ignoring them and, as a result, giving advice that will make the world a worse place.

And anecdotally the trade-off does seem to be a fairly tight one: My general experience of the correctness software written in fancy statically typed languages is not overwhelmingly positive compared to that of software written in dynamic languages. If anything it trends slightly negative. This suggests that for the scale of many projects the costs and benefits are close enough that this actually matters.

But even if that weren’t true, my original point remains: When there’s no budget for quality, tools which catch bugs won’t and can’t help. If static typing genuinely helped improve software quality for most of these projects, the result wouldn’t be that people used static typing and wrote better software as a result, it would be that they’d continue to write broken software and not use static typing as a result.

For the middle ground where we care about software correctness but have a finite budget, there’s the additional problem that the trade-offs change over time – early in the project when we don’t know if it will succeed people are less prepared to invest in quality, later in the project we’ve already picked our language and migrating over to static types is hard (in theory gradual typing systems can help with this. In practice I’ve yet to be convinced by them, but I’m trying to maintain an open mind. Meanwhile there’s always linters I guess).

This is also a lot of why I’ve chosen to work on Hypothesis, and why I think property based testing and similar approaches are probably a better way forward for a lot of us: Rather than having to get things right up front, you can add them to your tool chain and get real benefits from using them without having to first make fundamental changes to how you work.

Because despite the slightly bleak thesis of this post I do think we can write better software. It’s just that, as usual, there is no silver bullet which makes things magically better. Instead we have to make a decision to actually invest in quality, and we have to invest in tools and approaches that will allow us to take incremental steps to get there.

If that’s the situation you find yourself in I’d be more than happy to help you out. I’m available for consulting, remote or on site, and at very reasonable rates. Drop me a line if you’d like some help.

This entry was posted in programming on by .

# It might be worth learning an ML-family language

It’s long been a popular opinion that learning Haskell or another ML-family language will make you a better programmer. I think this is probably true, but I think it’s an overly specific claim because learning almost anything will make you a better programmer, and I’ve not been convinced that Haskell is a much better choice than many other things in terms of reshaping your thinking. I’ve never thought that you shouldn’t learn Haskell of course, I’ve just not been convinced that learning Haskell purely for the sake of learning Haskell was the best use of time.

But… I’ve been noticing something recently when teaching Python programmers to use Hypothesis that has made me reconsider somewhat. Not so much a fundamental reshaping of the way you think as a highly useful microskill that people seem to struggle to learn in dynamically typed languages.

That skill is this: Keeping track of what the type of the value in a variable is.

That may not seem like an important skill in a dynamic language, but it really is: Although functions will tend to be more lenient about what type of value they accept (is it a list or a tuple? Who cares!), they will tend to go wrong in interesting and confusing ways when you get it too wrong, and you then waste valuable debugging time trying to figure out what you did wrong. A good development workflow will typically let you find the problem, but it will still take significantly longer than just not making the mistake in the first place.

In particular this seems to come up when the types are related but distinct. Hypothesis has a notion of a “strategy”, which is essentially a data generator, and people routinely seem to get confused as to whether something is a value of a type, a strategy for producing values of that type, or a function that returns a strategy for producing the type.

It might be that I’ve just created a really confusing API, but I don’t think that’s it – people generally seem to really like the API and this is by far the second most common basic usage error people make with it (the first most common is confusing the functions one_of and sampled_from, which do similar but distinct things. I’m still trying to figure out better names for them).

It took me a while to notice this because I just don’t think of it as a difficult thing to keep track of, but it’s definitely a common pattern. It also appears to be almost entirely absent from people who have used Haskell (and presumably other ML-family languages – any statically typed language with type-inference and a bias towards functional programming really) but I don’t know of anyone who has tried to use Hypothesis knowing an ML-family language without also knowing Haskell).

I think the reason for this is that in an ML family language, where the types are static but inferred, you are constantly playing a learning game with the compiler as your teacher: Whenever you get this wrong, the compiler tells you immediately that you’ve done so and localises it to the place where you’ve made the mistake. The error messages aren’t always that easy to understand, but it’s a lot easier to figure out where you’ve made the mistake than when the error message is instead “AttributeError: ‘int’ object has no attribute ‘kittens'” in some function unrelated to where you made the actual error. In the dynamically typed context, there’s a larger separation between the observed problem and the solution, which makes it harder to learn from the experience.

This is probably a game worth playing. If people are making this error when using Hypothesis, they I’d expect them to be making it in many other places too. I don’t expect many of these errors are making it through to production (especially if your code is well tested), but they’re certainly wasting time while developing.

In terms of which ML-family language to choose for this, I’m not sure. I haven’t actually used it myself yet (I don’t really have anything I want to write in the space that it targets), but I suspect Elm is probably the best choice. They’ve done some really good work on making type errors clear and easy to understand, which is likely exactly what you need for this sort of learning exercise.

This entry was posted in programming, Python on by .

# Contributors do not save time

(This is based off a small tweet storm from yesterday).

There’s this idea that what open source projects need to become sustainable is contributors – either from people working at companies who use the project, or from individuals.

It is entirely wrong.

First off: Contributors are great. I love contributors, and I am extremely grateful to anyone who has ever contributed to Hypothesis or any of my other open source projects. This post is not intended to discourage anyone from contributing.

But contributors are great because they increase capabilities, not because they decrease the effort required. Each contributor brings fresh eyes and experience to the project – they’ve seen something you haven’t, or know something you don’t.

Generally speaking a contribution is work you weren’t going to do. It might be work you were going to do later. If you’re really unlucky it’s work you’re currently in the process of doing. Often it’s work that you never wanted to do.

So regardless of what the nature of the contribution, it creates a sense of obligation to do more work: You have to deal with the contributor in order to support some work you weren’t going to do.

Often these dealings are pleasant. Many contributions are good, and most contributors are good. However it’s very rare that contributions are perfect unless they are also trivial. The vast majority of contributions that I can just say “Thanks!” and click merge on are  things that fix typos. Most of the rest are ones that just fix a single bug. The rest need more than the couple of minutes work (not zero work, mind you) that it took to determine that it was such a contribution.

That work can take a variety of forms: You can click merge anyway and fix it yourself, you can click merge anyway and just deal with the consequences forever (I don’t recommend this one), you can talk the contributor through the process of fixing it themselves, or you can reject the contribution as not really something you want to do.

All of these are work. They’re sometimes a lot of work, and sometimes quite emotionally draining work. Telling someone no is hard. Teaching someone enough of the idiosyncracies of your project to help them contribute is also hard. Code review is hard.

And remember, all of this is at the bare minimum work on something that you weren’t previously going to do just yet, and may be work on something that you were never going to do.

Again, this is not a complaint. I am happy to put in that work, and I am happy to welcome new contributors.

But it is a description of the reality of the situation: Trying to fix the problems of unpaid labour in open source by adding contributors will never work, because it only creates more unpaid labour.

This entry was posted in programming on by .

# Fuzzing through multi-objective shrinking

This is an experiment I’ve been running for the last couple of days (on and off and with a bunch of tinkering). It was intended as a prototype for using glassbox in a next-gen version of Hypothesis, but it’s proven interesting in its own right.

The idea is a specific automated way of using a test case reducer as a fuzzer using branch instrumentation (I’m using afl‘s instrumentation via the afl-showmap command): For every branch we ever observe the program taking, we try to construct a minimal example that hits that branch.

This will tend to produce interesting examples because you throw away a lot of extraneous detail that isn’t required to hit that branch. This is is particularly true of “tolerant” parsers which try to recover from a lot of errors.

### How it works

The core idea is that we take a normal test case reducer and repeatedly apply it in a way that automatically turns it into a multi-objective reducer.

Say we have a function, label, which takes a binary string and returns a set of labels. Labels can be anything, but in the case of using AFL’s instrumentation they’re essentially branches the program can take along with a rough count of how many times that branch was taken (essentially because the branches are hashes so some different branches may end up equated with each other).

We replace the labelling function with a side-effectful version of it which returns the original results but also updates a table which maps each label to its “best” example we’ve seen so far. We consider a string better than another if it is either shorter or the same length but sorts lexicographically before the other (when viewed as a sequence of unsigned 8 bit integers).

We then repeatedly iterate the following process: Pick a label, take the best example for that label, and reduce that test case with respect to the condition that it has that label (updating the other test cases with every call).

There are various heuristics you could use to pick a label. The ones I’ve tried are:

• Pick one of the labels which currently has the best example
• Pick one of the labels which currently has the worst example
• Pick any label, uniformly at random

Uniformly at random seems to work the best: The others have a tendency to get stuck. In the case of ‘best’ there are a lot of small labels and it ends up spending a lot of time trying to shrink them all, not doing very interesting work in the process. In the case of ‘worst’ it tends to spend all its time trying to shrink very hard to shrink labels and not getting very far. Uniformly at random seems to consistently make progress and find interesting results.

#### Optimisations

There are a couple of extra useful things you can do to speed up the process.

The first is that every time label is called you can mark the string as known. Then when shrinking instead of shrinking by whether the string has the label, you shrink by whether the string is either the current best for the label or is unknown.

This works because if the string were simpler than the current best and already known, then the current best would already have been updated to that string.

This is the equivalent of caching the predicate for delta-debugging, but you don’t want to cache the label function because its outputs are complex values (they’re sets of labels, so there are $$2^n$$ distinct values even after interning) so end up consuming a lot of memory if you cache them.

The second is that you can often tell when a label is going to be useless to shrink and skip it. There are two things you can do here:

• If when you tried to shrink a label it made no changes, you can mark that label as ‘finished’. If another shrink later improves the label, you remove the finished mark. A finished label cannot be shrunk further and thus can be skipped.
• By maintaining a counter that is updated every time a label is improved or added to the table, you can tell if an attempt to shrink did anything at all by checking the counter before and after. If it did nothing, you can mark the string as finished. Any labels whose current best string is finished can also be skipped.

This also gives a way of terminating the fuzz when there’s nothing left that’s discoverable: If every label is skippable, you’re done.

### Observations

This seems to work quite well in practice. Starting from a relatively large initial example, it quickly increases the number of labels by about an order of magnitude (some of these are just difference in branch counts, as AFL counts not just whether the branch was hit but also a bucketed version of how many times).

It also works pretty well at finding bugs. I’ve been running it for about 48 hours total (a bit longer by clock time but I turned it off in the middle while I made some changes) and it’s found two bugs in a widely deployed file format parser that’s been stable for a couple of years (I’ve sent both to the author of the parser, and don’t want to say which one it is until I’ve got permission to do so. I don’t think either of them are security issues but hard to say for sure). One of them is confirmed novel, and I haven’t heard back about the other one yet. It found the first one after about 10 hours, but that appears to have been mostly luck – rerunning with a bunch of changes that otherwise improved the process hasn’t refound that bug yet.

Anecdotally, almost all of the examples produced are not valid instances of the format (i.e. the tool under test exits with a non-zero status code). This isn’t very surprising: The expectation is that it will give you just enough of the file to get you to the point you’re looking for and then throw away the rest, which is unlikely to get you a valid file unless the branch you’re looking for is taken after the file validity has already been determined.

### Comparison with AFL

In some ways this is obviously quite similar to AFL, given that it uses the same instrumentation, but in other ways it’s quite different. My suspicion is that overall this approach will work better as an approach to providing a corpus to AFL than it will just on its own, but it’s surprisingly competitive even without that.

In particular it seems like it hits an order of magnitude increase in the number of seen labels much faster than I would expect AFL to. I think it helps that it’s using AFL’s instrumentation much more extensively than AFL itself actually does – AFL just uses the instrumentation for novelty detection, whileas this approach actually treats each label as a target in its own right and thus can take much more advantage of it.

The AFL algorithm is roughly just to repeatedly iterates the following:

1. Pick an example from the corpus and mutate it
2. If the mutated example exhibits any labels that we’ve not previously seen, add it to the corpus

It’s not really interested in the labels beyond novelty detection, and it doesn’t ever prune the corpus down to smaller examples like this does.

This approach also has a “self-optimizing” character that AFL lacks: Because AFL never replaces examples in its corpus, if you start with large examples you’re stuck with large examples forever. Because of this, AFL encourages you to start with very small, fast examples. This approach on the other hand will take whatever large examples you throw at it and will generally turn them into small examples.

To be clear: This isn’t a better approach than AFL. Maybe if it were highly optimized, tuned and refined it would become at least as good, but even then they would both have strengths and weaknesses compared to each other. But it’s not obviously a worse approach either, and even now it has some interesting advantages over the approach that AFL takes.

This entry was posted in programming on by .