Static typing will not save us from broken software

Epistemic status: Like virtually all writing about software, this piece is largely anecdata and opinions. I think it’s more right than not, but then I would.

I learned to program in ML. I know Haskell to a reasonable degree of fluency. I’ve written quite a lot of Scala, including small parts of the compiler and standard library (though I think most or all of that is gone or rewritten by now. It’s been 8 years). I like static typing, and miss using statically typed languages more heavily than I currently do (which is hardly at all).

But I’m getting pretty frustrated with certain (possibly most) static type system advocates.

This frustration stems from the claim that static typing will solve all our problems, or even one specific problem: The ubiquity of broken software. There’s a lot of broken software out there, and the amount keeps going up.

People keep claiming this is because of bad choices of language, but it mostly isn’t, and static typing will not even slightly help fix it.

Broken software is a social and economic problem: Software is broken because it’s not worth people’s while to write non-broken software. There are only two solutions to this problem:

  1. Make it more expensive to write broken software
  2. Make it cheaper to write correct software

Technical solutions don’t help with the first, and at the level of expense most people are willing to spend on software correctness, your technical solution has to approach “wave a magic wand and make your software correct” levels of power to make much of an impact. The current level of completely broken software can only arise when there’s almost zero incentive for people to sink time into the correctness of their IoT devices and they’re not engaged in even minimal levels of testing for quality.

When you’ve got that level of investment in quality, anything that points out errors is more likely to be ignored or simply not used than it is to improve things.

I think this carries over to moderate levels of investment in correctness too, but for different reasons (and ones I’m less confident of).

“All” static typing tells you is that your program is well-typed. This is good, and it catches a lot of bugs by enforcing consistency on you. But at the entry level, most of those bugs are the sort that end up with a Python program throwing a TypeError. Debugging those when they happen in production is a complete pain and very embarrassing, but it’s still the least important type of bug: A crash is noticeable if you’ve got even basic investment in monitoring (e.g. a Sentry account and five lines of code to hook it into your app).
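
To give a sense of scale, “basic investment in monitoring” really is about five lines. Something like the following sketch, using the sentry_sdk package (the DSN is a placeholder you’d get from your own Sentry project):

    import sentry_sdk

    # The DSN below is a placeholder, not a real project key.
    sentry_sdk.init(dsn="https://<key>@<org>.ingest.sentry.io/<project>")

    # From here on, uncaught exceptions are reported to Sentry automatically
    # by the default integrations; that's all basic crash reporting needs.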

Don’t get me wrong: Not having those bugs reach production in the first place is great. I’m all in favour. But because these bugs are relatively minor, the cost of finding them needs to be lower than the cost of letting them hit production, or else they start to eat into your quality budget and come at the cost of other, more important bugs.

For more advanced usage, I’ve yet to be convinced that types are more effective than tests on modestly sized projects.

For large classes of problems, tests are just easier to write than types. e.g. an end-to-end test of a complicated user workflow is fairly easy to write, but literally nobody is going to encode it in the type system. Tests are also easier to add after the fact – if you find a bug it’s easy and unobtrusive to add a test for it, but it may require a substantial amount of work to refactor your code to add types that make the bug impossible. It can and often will be worth doing the latter if the bug is an expensive one, but often it won’t be.

In general, trying to encode a particular correctness property in the type system is rarely going to be easier than writing a good test for it, especially if you have access to a good property based testing library. The benefits of encoding it in the type system might make it worth doing anyway, for some bugs and some projects, but given the finite quality budget it’s going to come at the expense of other testing, so it really has to pull its weight.
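
To make that concrete, here’s a minimal sketch of the sort of property-based test I mean, using Hypothesis. It checks two correctness properties of Python’s built-in sorted() that would be awkward to state as types: the output is ordered, and it is a permutation of the input.

    from collections import Counter

    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_sorted_returns_an_ordered_permutation(xs):
        result = sorted(xs)
        # The output is in non-decreasing order...
        assert all(a <= b for a, b in zip(result, result[1:]))
        # ...and contains exactly the same elements as the input.
        assert Counter(result) == Counter(xs)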

Meanwhile, in a lot of current statically typed languages, static typing ends up coming at the cost of testing in an entirely different way: Build times.

There are absolutely statically typed languages where build times are reasonable, but this tends to be well correlated with them having bad type systems. e.g. Go is obsessed with good build times, but Go is also obsessed with having a type system straight out of the 70s which fights against you at every step of the way. Java’s compile times are sorta reasonable, but the Java type system is also not particularly powerful. Haskell, Scala and Rust all have interesting and powerful type systems and horrible build times. There are counter-examples – OCaml build times are reportedly pretty good – but by and large, the more advanced the type system, the longer the build times.

And when this happens it comes with an additional cost: It makes testing much more expensive. I’m no TDD advocate, but even so, writing good tests is much easier when the build/test loop is short. At milliseconds it’s bliss, at seconds it’s fine, at tens of seconds it starts to get a bit painful, and if the loop takes minutes you’re honestly not going to write many tests – and the ones you do write probably won’t be very good.

So if your static types are substantially increasing build times, then to justify their place in the quality budget they need to be not just better than writing tests (which, as discussed, they often won’t be), but better than all the tests you’re not going to write because of those increased build times.

To recap:

  1. The most common bugs caught by static typing are also the least critical sort of bug.
  2. In most contexts, catching a bug with a test is going to be cheaper than catching it with types. This is particularly true for bugs found after the fact.
  3. Most existing static type systems also come with a build time cost that makes testing in general more expensive.

This means that by and large when the quality budget is constrained I would expect complicated typing to often hurt quality.

This obviously won’t always be true. In many scenarios the opposite will be true. e.g. I’d expect static typing to win out for correctness if:

  • bugs (especially crashing bugs) are very expensive so you have a large correctness budget to play with and have already picked the low hanging fruit from testing.
  • the project is very large. In these scenarios you may benefit a lot more from the sort of universal guarantees that static typing provides vs writing the same sort of tests over and over again, and the build times are probably already high enough that it’s painful to test well anyway.

The point is not that static typing is going to hurt quality in general, but that it’s a set of complicated trade-offs.

I don’t know how to calculate those trade-offs in general. It’s far from straightforward. But the point is that those trade-offs exist and that people who are pretending that static typing will solve the software quality crisis are ignoring them and, as a result, giving advice that will make the world a worse place.

And anecdotally the trade-off does seem to be a fairly tight one: My general experience of the correctness of software written in fancy statically typed languages is not overwhelmingly positive compared to that of software written in dynamic languages. If anything it trends slightly negative. This suggests that, at the scale of many projects, the costs and benefits are close enough that this actually matters.

But even if that weren’t true, my original point remains: When there’s no budget for quality, tools which catch bugs won’t and can’t help. If static typing genuinely helped improve software quality for most of these projects, the result wouldn’t be that people used static typing and wrote better software as a result, it would be that they’d continue to write broken software and not use static typing as a result.

For the middle ground, where we care about software correctness but have a finite budget, there’s the additional problem that the trade-offs change over time: Early in a project, when we don’t know whether it will succeed, people are less prepared to invest in quality; later in the project we’ve already picked our language, and migrating over to static types is hard. (In theory gradual typing systems can help with this. In practice I’ve yet to be convinced by them, but I’m trying to maintain an open mind. Meanwhile there’s always linters, I guess.)

This is also a lot of why I’ve chosen to work on Hypothesis, and why I think property-based testing and similar approaches are probably a better way forward for a lot of us: Rather than having to get things right up front, you can add them to your toolchain and get real benefits from using them without first having to make fundamental changes to how you work.

Because despite the slightly bleak thesis of this post I do think we can write better software. It’s just that, as usual, there is no silver bullet which makes things magically better. Instead we have to make a decision to actually invest in quality, and we have to invest in tools and approaches that will allow us to take incremental steps to get there.

If that’s the situation you find yourself in I’d be more than happy to help you out. I’m available for consulting, remote or on site, and at very reasonable rates. Drop me a line if you’d like some help.

This entry was posted in programming.

Declaring code bankruptcy for the rest of 2016

This is a small PSA.

It probably hasn’t been too visible from the outside, but I’ve not been doing very well recently.

In particular, I’ve found my productivity has pretty much gone through the floor over the last couple of months. 2016 is stressing me out for a whole pile of reasons (about 80% of them the same ones that are stressing everyone else out), and I’m not dealing with it very well. This is making it very difficult to stay on track and motivated.

It’s not a problem when I’ve got something external to focus me (e.g. a client), but when I have to self-motivate on a programming project I find that I can’t right now, and the result is that I’m unproductive, which makes me depressed, which makes me even less productive.

So I’ve decided to stop. If I’m not getting any programming done and feeling bad about it, it’s clearly better to not get any programming done and not feel bad about it. Being depressed isn’t doing anyone any favours, let alone me. So, for the rest of the year I will not be writing any code unless someone is explicitly paying me to write it.

I currently have one client and a few potential clients, and my obligations to them will absolutely still be met (and probably be met a lot better than they would have in my previous mood). I am happy to accept new clients, but I probably won’t be actively seeking them out until the new year.

I’m also going to keep reviewing pull requests on Hypothesis and doing Hypothesis releases with other people’s stuff (there will probably be some Hypothesis releases with paid work from me too).

I’m probably also going to keep coding when it’s required to solve an immediate problem I have, and maybe when I need it to answer a question for a blog post or something.

So it’s not a complete cessation, but it is freedom from a sense of obligation: If a piece of programming doesn’t fall into one of these categories then I shouldn’t be doing it, and if I don’t have any programming to do in those categories I should be looking for something else to do rather than procrastinating to avoid some vague sense of obligation to be coding.

This is going to free up a lot of time that I’m currently filling with failing to program, so I’ll probably end up with some non-programming projects. I don’t yet know what these are going to be, but I’ve got a couple of candidates:

  • When I first started working on Hypothesis I was taking a work break in which I’d intended to brush off some mathematics textbooks. This didn’t happen. It might happen this time. I started working through some of the exercises in Bollobas’s Combinatorics earlier and I’m finding it surprisingly enjoyable. I may keep this up.
  • I’ve been working on getting in shape the last couple of months. I don’t want to throw myself into this too vigorously because I’m more likely to just do myself an injury, but I’m likely to step this up at least moderately.
  • The War On Sleep must still be waged.
  • I’ve been thinking of doing NaNoWriMo, but I probably won’t.

Other than that, I don’t know. Watch this space while I find out.

This entry was posted in Hypothesis, life, Python.

Some small single transferable vote elections

Attention conservation note: This is very much a stamp collecting post. Even I’m not sure it’s that interesting, but I thought I might as well write it up.

Single Transferable Vote is often referred to as a voting system, but it’s not really: It’s instead a very large family of voting systems, with a near infinite number of dials to turn to get different behaviours.

I’ve never really had much intuition for what the different dials do, so I thought I’d have a bit of play and construct some small elections that get different answers.

The two things I wanted to compare are the Droop quota vs the Hare quota, and the effect of the restart rule in the Wright system (in this system whenever a candidate is disqualified you start the whole process again from scratch with that candidate removed, resetting any candidates who have already been elected).

Droop vs Hare

Suppose we have three candidates, labelled 0, 1 and 2, and are trying to elect two of them. We have the following votes:

  • 6 votes for 0, 1, 2
  • 2 votes for 2, 0, 1
  • 1 vote for 0, 2, 1

Then the Hare quota elects candidates 0 and 2 while the Droop quota elects candidates 0 and 1.

The election plays out like this: In the first round, we elect candidate 0 as the clear winner, as it exceeds both the Hare and the Droop quota. Then in the second round, neither remaining candidate has enough votes to clear the quota, so one must drop out. With the Droop quota, candidate 2 drops out and candidate 1 is subsequently elected; with the Hare quota the reverse happens.

The reason this happens is that the Droop quota is slightly lower than the Hare quota (4 vs 4.5), and as a result the voters who voted for 0 retain slightly more of their voting weight for the next round. Because of the strong block vote for 0, 1, 2, this means that under the Droop quota 1 beats out 2 in the next round, whereas under the Hare quota it’s reversed.
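
(For reference, with nine voters and two seats the two quotas work out as follows; this is the usual integer form of the Droop quota and the exact form of the Hare quota.)

    votes, seats = 9, 2
    droop = votes // (seats + 1) + 1   # 9 // 3 + 1 == 4
    hare = votes / seats               # 9 / 2 == 4.5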

This seems broadly consistent with the descriptions I’ve seen about the Hare quota favouring smaller parties. I don’t know whether this shifts my opinion of it or not.

Aside note: One thing I hadn’t noticed before is that even with complete votes, the Droop quota need not actually produce a complete result. If you’re trying to elect two out of three candidates and you have three electors, then the Droop quota is 2. This means that you “spend” 2 votes electing the first candidate and no longer have enough to elect the second. I ended up excluding this case from the elections I considered.

Wright Restarts

Running with the Droop quota, the following election changes results if you restart it every time someone is disqualified (retaining the set of previously disqualified candidates):

We’re now electing two candidates out of four, once again with nine voters. The votes go as follows:

  • 6 votes for 0, 1, 2, 3
  • 2 votes for 2, 0, 1, 3
  • 1 vote for 3, 0, 2, 1

Without the Wright restart this elects 0 and 2; with it, it elects 0 and 1.

Without restarts the election plays out as follows: We elect 0, then 3 drops out, then 1 drops out, then we elect 2 and are finished.

With restarts what happens is that after 3 drops out, we rerun from scratch and then 2 drops out instead of 1.

I think what’s happening here is that the vote “3, 0, 2, 1” has a higher weight at that point without the Wright restart: With the restart it got counted as a vote for 0 in the initial round, and so got down-weighted along with the other votes for 0. This means that when it came time to decide between 1 and 2 dropping out, the decision goes the other way.

I feel like this makes sense as a tactical voting mitigation step: Without the restart, I can just put a no-hoper candidate as my first vote, wait out the rounds it takes for them to drop out and then have a stronger vote.


I wrote some code implementing STV with a couple of flags and asked Hypothesis to compare them. You can see the implementation here, but it’s very hacky. (There’s also a simplified sketch of the sort of count I mean after the notes below.)

A couple of things to note about it:

  • Despite using Hypothesis, it’s not very well tested, so it might be wrong.
  • There are still a lot of variations and elided details about what sort of STV this is. I implemented what I think of as “vanilla” STV but I’ve no idea if that’s an accurate depiction of the status quo for it.
  • One design choice I made was to throw out all elections that caused any ambiguous behaviour, for either choice of the flag. The reasoning for this is that these small elections are really proxies for large elections where each voter is thousands of real voters, so the ties would end up being broken by small random variation in almost all practical cases.
  • I was actually surprised how good a job Hypothesis did at generating and minimizing these. I thought I might have to write a custom shrinker but I didn’t.
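
Since the linked implementation is hacky and lives elsewhere, here is a minimal sketch of the kind of count I mean – not that implementation, just a toy version that assumes Gregory-style weighted surplus transfers and arbitrary tie-breaking. It does reproduce the two hand-worked elections above.

    from fractions import Fraction

    def stv(ballots, seats, use_hare=False, wright_restart=False):
        """A toy STV count with weighted (Gregory-style) surplus transfers.

        ballots is a list of preference tuples, most preferred first. Ties are
        broken arbitrarily, which is fine for the hand-picked elections above
        but not for real use.
        """
        candidates = {c for ballot in ballots for c in ballot}
        if use_hare:
            quota = Fraction(len(ballots), seats)
        else:
            quota = len(ballots) // (seats + 1) + 1  # Droop
        excluded = set()  # eliminated candidates, kept across Wright restarts

        while True:
            elected = set()
            weights = [Fraction(1)] * len(ballots)
            restarted = False
            while len(elected) < seats:
                continuing = candidates - elected - excluded
                if len(elected) + len(continuing) <= seats:
                    return elected | continuing
                # Tally each ballot towards its top continuing preference.
                tallies = {c: Fraction(0) for c in continuing}
                piles = {c: [] for c in continuing}
                for i, ballot in enumerate(ballots):
                    for c in ballot:
                        if c in continuing:
                            tallies[c] += weights[i]
                            piles[c].append(i)
                            break
                winners = [c for c in continuing if tallies[c] >= quota]
                if winners:
                    winner = max(winners, key=lambda c: tallies[c])
                    elected.add(winner)
                    # Scale the winner's ballots so only the surplus carries on.
                    for i in piles[winner]:
                        weights[i] *= (tallies[winner] - quota) / tallies[winner]
                else:
                    excluded.add(min(continuing, key=lambda c: tallies[c]))
                    if wright_restart:
                        restarted = True
                        break  # rerun the whole count without this candidate
            if not restarted:
                return elected

    # The two hand-worked elections from above.
    droop_vs_hare = [(0, 1, 2)] * 6 + [(2, 0, 1)] * 2 + [(0, 2, 1)]
    assert stv(droop_vs_hare, 2) == {0, 1}                 # Droop
    assert stv(droop_vs_hare, 2, use_hare=True) == {0, 2}  # Hare

    wright = [(0, 1, 2, 3)] * 6 + [(2, 0, 1, 3)] * 2 + [(3, 0, 2, 1)]
    assert stv(wright, 2) == {0, 2}
    assert stv(wright, 2, wright_restart=True) == {0, 1}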


I don’t know. This was less enlightening than I hoped it would be.

I feel like I’m slightly more in favour of Wright restarts than I was, but I was already modestly in favour of them, and I don’t really feel like I’ve shifted my opinion about Hare vs Droop one way or the other.

It might be interesting to expand this to other STV variations (e.g. Meek’s method) but they differ more in implementation than these simple flags did, so I didn’t feel like implementing that right now.

This entry was posted in voting.

On racist intermediaries in hiring processes

Up front disclaimer: I am not currently hiring. Right now I’m a one man shop just about managing to pay myself. This is based on past experience, and intended for the benefit of others. This just came to mind for a variety of reasons recently so I felt the need to write it down.

I’ve been part of a moderate number of hiring processes. I’ve not generally had final say, but my voice was pretty loud in the decisions and the designs of some of these processes.

In that time, I have never been part of hiring a black person. This is pretty bad, particularly given that I was entirely hiring for jobs in London which is about 13% black by population.

I’d like to fix it if I’m part of a hiring process in future, but at the time I didn’t even understand why it was happening. I’ve since partially figured it out.

It certainly wasn’t a deliberate choice on my part. I’m able to say with reasonable confidence in this case that it’s not the result of subconscious racial bias, or even the result of an unintentionally biased interview process.

That’s not to say I didn’t have those things. I’ve done my best to minimize them, but they’re hard to spot and eradicate completely. But regardless, in this case they weren’t the factor for a much simpler reason: To the best of my knowledge, no black candidates have ever applied for a position I was interviewing for (it’s possible that some have and not made it past the pre-screening stage. I don’t think so, but I don’t know for sure).

I don’t present this as a defence, but as the place to start when fixing this problem.

It took me far too long to even realise this was happening, and even then I didn’t really figure out what was going on until after I stopped being part of hiring processes.

I’ve mostly been part of small companies, and the biggest reason a small company doesn’t hire anyone is quite simple: The company hasn’t heard of the candidate, and the candidate hasn’t heard of the company (or maybe has heard of the company but didn’t know they were hiring / didn’t think to apply there).

This is sourcing: The process of getting candidates into your hiring pipeline in the first place.

The biggest clue in retrospect that sourcing was the main problem was a hiring process I was part of that did a very good job of removing bias: A lot of it was done with blind review, so we didn’t know who the candidate was when we were reviewing that part of the work. All the questions were kept consistent across different candidates and had relatively objective scoring. The words “culture fit” never appeared at any point in the process. Our job ads were maybe not great, but we’d run them by people and they weren’t awful and lacked obvious red flags.

The result was that although we still hired entirely white men, we got a lot of older people and a lot of eastern Europeans. This wasn’t a total failure: Ageism and prejudice against eastern Europeans are certainly problems in the tech industry, and the fact that this was quite different to our usual set of hires suggested that the system had worked, but it’s still pretty telling that the result remained a bunch of white men.

And this is the big problem of sourcing: You can (and should) remove all the bias from your interview process you want, but if the set of candidates entering the process isn’t diverse then the set of candidates you hire won’t be either.

Sourcing is really hard. It’s a specialised skill I don’t have, and I can’t teach you how to do it. If you want to know a bit more about sourcing, Eva Gonzalez did a good talk about it at PyCon UK.

Most small companies aren’t hiring regularly enough that they can afford to have someone on staff who has this specialised skill. As a result, we tend to work around our lack of skill in it in one of two ways:

  1. We hire our friends and friends of friends.
  2. We use external recruiters.

(This is often how it works at larger companies too, but I have less experience of hiring there so I won’t talk about it)

The first is problematic because it tends to reinforce demographic problems: The friends we hire tend to be from “the tech scene”, which tends to be even more young, male and white than the background of tech overall, and even without that friendships tend to be skewed along racial lines.

On the other hand, it’s basically impossible to persuade people not to do it, because it’s a good way of getting good candidates and most hiring processes end up caring far more about the candidates you get rather than the candidates you exclude. So I won’t talk about that further.

Recruiters though can be a major source of unintended bias.

The problem is that at the receiving end, you only really see the people who the recruiters let through. You don’t get to see the people they don’t send your way, and that’s where a lot of bias can creep in.

This doesn’t even necessarily have to be the result of direct racism on the part of the recruiters (I wouldn’t be surprised to learn that it often is. Certainly I know from female or non-binary friends that they’ve experienced a lot of sexism at the hands of recruiters. I… actually don’t really personally know any black people in tech inside the UK, so I can’t ask them), but there are plenty of other ways it can occur. e.g. a lot of recruiters select for only people who went to “top” universities, which tends to select for a certain socioeconomic class (which is correlated with race). In all honesty, these universities are probably still doing better at diversity than tech is, but when you combine the two factors it makes things even worse. Another issue is that recruiters often get candidates in the same way that tech firms do, by spreading along social lines – e.g. people recommend the recruiter who found them a job to their friends – so small effects get magnified again.

There are almost certainly more sources of bias at the recruiter level, but the key point is that it is very difficult for me to know what these are: When hiring, I have little to no visibility of them. When going through a recruitment process I don’t experience them because I’m white. There probably are good written resources on this, but I don’t know what they are despite a moderate amount of googling (I’ve read things like Hire More Women In Tech, which have good advice if you’re sourcing yourself but mostly seem to stop at “hire a professional recruiter” for this problem).

I also don’t know how to solve this. I didn’t really figure out what the problem was until it was too late last time, so I’ve no practical experience of making this work.

But here are some things I’d try next time, which might be worth others trying too:

  1. Listening to responses from other people to this blog post! “I used this recruiter and they appeared to do a good job of sourcing candidates who weren’t just white men” or “I used this recruiter and they weren’t racist to me” style recommendations are particularly welcome. More generally, I’m happy to hear any sort of recommendations on this subject at all.
  2. Prefer recruiters on retainer rather than on commission. I don’t know for sure that this will help, but this post by Thayer Prime on why she hates recruiting on commission is pretty good, and it lines up with a number of problems I’ve seen. Recruiters who are on commission will tend to be fairly aggressively uninterested in doing anything that decreases the hiring rate of the people they send you, even if it also decreases the false negative rate, which is why behaviours like selecting on the university people went to are so heavily represented.
  3. Look for recruiters who are, themselves, black or other people of colour. I think every recruiter I’ve worked with so far has been white (and most have been men), and I don’t think that’s a coincidence.
  4. Have a long hard talk with recruiters I’m using when I notice this is happening. If that doesn’t go satisfactorily, find a different recruiter.

I’ve no idea which of those are good ideas (OK, I’m pretty sure the first one is a good idea), and I don’t know which of them will help, but I do think this is a problem that needs solving, so we can but try.

It’s also likely that once the sourcing problem has been fixed (or at least improved) all other sorts of problems in the hiring process will be made visible, but that’s at least progress.


This entry was posted in Uncategorized.

Two extremes of problem solving

(Side note: I swear I’ve written this blog post before but if so I can’t find it).

Back in university a friend and I were often partnered on supervisions. As a result, I got to compare and contrast our problem solving styles a lot. They were quite different.

If I had to characterize the key difference between our styles (and bear in mind this is filtered by time, perspective and biases) it was at what point in solving the problem understanding happened.

For me, I would tend to mull over a problem, trying to get to the core nature of it. Eventually, once I had what I thought was the “key idea” of the problem, I would split the problem open along those lines, everything would fall into place and I’d end up with a nice tidy solution.

(Sometimes of course I would try to split the problem along those lines, find that I’d been wrong about the key idea, and have to go back to the drawing board. Possibly even often).

I don’t have access to his internal state, but I’d be surprised if that’s what he did. His approach looked more like bulldozing a path to a solution, always solving the part immediately in front of him.

The resulting solutions were typically longer than mine and, at least to me, much harder to follow.

Let’s call these two approaches “theory building” and “direct problem solving”.

At the time I thought my theory building approach was better. These days I’m not so sure.

One distinguishing feature of our two styles was that my solutions were nicer and tidier. But the other distinguishing feature is that he got to his solutions sooner and would often succeed in cases where I failed.

And I think this is often the case with these two approaches. In many ways it’s surprising how often theory building works – many problems just aren’t nice, and no amount of thinking about them is going to make them so. Many problems are actually nice but require tools that are so far from obvious a priori that you’ll never see them except in retrospect. Having tools which can just soldier on and deal with these sorts of things is incredibly important, because otherwise you’ll just get stuck and fail to make progress.

I do still think the theory building solutions are better, but sometimes what you really need is just to get things done and they’ll tend to fall down there. And sometimes the theory only becomes visible once you’ve seen the direct solution and can refine it down to its essence.

The feedback also goes the other way: Once you’ve done theory building, you’ve provided yourself with a tool you can use the next time you want to do direct problem solving.

I’m talking about mathematics here, but this carries over almost verbatim to programming. A lot of Hypothesis has been constructed theory building style, but there are parts of it where there was nothing to do but just brute force my way into a solution. I’m currently trying to finally crack py.test function scoped fixtures, and while there ended up being a clever idea or two in there that I needed to come up with to crack specific problems, by and large there’s nothing to do here but sheer brute force.

Ultimately I don’t think either approach is actually better, because I don’t think you can get very far if you try to make do with just one or the other. Excessive reliance on direct problem solving will sometimes lead you to some very strange and unnecessary places, while excessive reliance on theory building will eventually lead to you getting nowhere fast.

So the real solution is to let both work in tandem and refactor mercilessly: Theory build where you can and it’s easy or worthwhile, and direct problem solve where you can’t. But when you engage in direct problem solving, the theory building should be sitting there at the back of your mind trying to see what’s really going on, so that maybe it can come in, pick up the pieces and replace them with something nicer if that turns out to be possible.

This entry was posted in Uncategorized.