# Can you write correct software in a statically typed language?

Epistemic status: Mostly trolling. Mostly.

Attention Conservation Notice: Dispatches from the type system wars.

I’ve been writing Rust recently. There have been some frustrating aspects to it (mostly syntax), but one of the things I’m really enjoying is having the productivity benefits of a good type system again.

The fast feedback loop from the type checker really improves my ability to develop software. It’s almost impossible to write code without making lots of trivial errors, and having an automated system that catches those errors and reports them in a detailed, localised way really smooths the entire development process. The focus on modularity, and the ability to make certain behaviours actually impossible rather than merely discouraged, also cuts down on a lot of worries that would otherwise slow me down. It’s something I’ve missed a lot over the last fiveish years of mostly writing in dynamic languages.

But I have a concern…

One of the things I’ve gotten used to in my time in dynamic language land is the correctness benefits. How much of that am I giving up for this productivity boost?

I’m not saying you can’t write correct software in statically typed languages. Of course you can. You can write correct software in anything if you’re prepared to try hard enough (except maybe Malbolge).

Of course you can write bad software in dynamic languages. It’s never going to be hard to write bad software.

But it sure seems harder to write correct software in a statically typed language, doesn’t it?

Writing correct software requires that you get a lot of the project workflow right. All other things being equal, static checks help, but they’re mostly dominated by other far more significant factors: code review, run time assertions, a focus on user experience, processes, check-lists and, obviously, testing – though hopefully the claim that you should be testing your code is uncontroversial at this point. Not even the static typing advocates really believe they can write correct code with less than 100% coverage, do they?

The problem is that even though we like to pretend that “If It Compiles It Works!” is just a tasteless joke and that nobody actually believes it, people don’t seem to act like they entirely disbelieve it either, and as a result they come to rely on the compiler for things the compiler never really promised to give them.

I think it’s that the type system makes you sloppy. It sits there telling you “Hey I checked your code. LGTM” and even though you know it’s not a substitute for a real code review, it’s still your most common piece of feedback because it happens on every change you make to your code, and over time you start to believe it.

It also tells you that you can fix it later. One of the great productivity benefits of static typing is how that fast feedback loop helps you change your code. I spent almost an hour doing major surgery on this project yesterday with the code in a completely non-compiling state. It would have been hard doing that in a dynamic language, because the feedback when your project isn’t working at all is just too non-local and coarse grained: If nothing works then of course your tests are all going to fail, and the way they fail isn’t going to be that informative.

So given a choice between shipping and putting in place the measures that are required for correctness, static typing says “Oh, don’t worry, it’s probably good enough, and if it’s not you can fix it later”.

Alas, this is a lie. Code that hasn’t been reviewed, tested, and put in front of a user before shipping may be more likely to be correct in a statically typed language, but that still isn’t likely, and static types are a lovely adjunct to but inadequate substitute for dynamic checks (possibly unless you’re writing in something like Idris, which I do still want to try seriously at some point).

You can do all of these things in a statically typed language, and I’m certain responsible developers do, but they all somehow seem less urgent, don’t they? 95% coverage is probably good enough, and it compiles, so it’s probably fine, and the compiler did a pretty good job of checking the obvious so a detailed code review doesn’t matter that much.

And there’s always that subconscious feeling that if these things do turn out to matter you can fix it later.

Unfortunately, that too is a lie. Statically typed code is easy to refactor, but the thing you’re going to need to change later isn’t the code, it’s the process, and the process is made of people. People are surprisingly resistant to change, as we find out every time Twitter shifts something a pixel to the right and eleven billion people complain that Twitter just don’t understand their user base and how dare they move that thing.

You can fix process later, but the will to do that needs to be there, and there’s always the sense that the status quo is probably good enough, especially with the compiler sitting there telling you its sweet lies about how everything is fine.

In contrast, a dynamic language gives you no such illusions. You’re not just operating without a safety net, you’re operating with a Tullock Spike. That fast feedback that the statically typed language gave you about your code, the dynamically typed language will give you about your process: If you care about correctness (and I acknowledge that many people don’t, even among those who chose to write in dynamically typed languages), you feel the pain points in your process fast, and you fix them early while your project is still flexible.

And yet, I really enjoy the productivity benefits of static typing, and I’m reluctant to give them up.

I think the thing to do here is probably to just try to carry the correctness lessons of dynamic typing over when writing in statically typed languages. In the same way that learning Haskell can improve your productivity when writing in Python, maybe once you’ve learned Python it’s possible to write correct code in Haskell.

I said I was writing in Rust recently, and that’s true, but it’s actually more complicated than that: It’s a hybrid Ruby/Rust project, so I’ll get to enjoy the benefits (and drawbacks) of both.

This might offer another way to have my cake and eat it too, because it means that I can choose which language to write each part of the code in depending on its function and trade offs. Code that I really need to be correct can go in Ruby, but where I need flexibility and freedom to experiment, I’m still able to write it in a more productive language such as Rust.

### Postscript

People seem to be failing to get the joke, so I’m going to explain it. Apologies if this ruins it for you.

Should you conclude from reading this post that it is impossible to write correct code in a statically typed language? No. And if you concluded that you should, frankly, feel bad about your reasoning skills (even disregarding the fact that I said right at the top of the post that I was trolling).

I did not present any empirical evidence. I didn’t even present a balanced non-empirical picture of the landscape: I cherry-picked a bunch of effects, added a bit of hyperbole, backed it all up with condescension and anecdata, and from there concluded that because these factors existed they must dominate the outcome.

You know who else does that? Every single person arguing that static types are required for correctness (and every single person arguing that dynamic types are better for productivity, but honestly my impression having spent quite a lot of time on both sides of the language fence is that it’s mostly the static typing people who are picking the fights). I just thought it would be interesting to flip the argument on its head for once.

So, if you found my arguments unconvincing, good. But you should also have found the arguments for why static typing helps correctness or hurts productivity equally unconvincing.

Does static typing help correctness? Does dynamic typing help productivity?

Probably not, no. But the converse is probably not any more true.

The empirical studies on the question are really pretty inconclusive. Partly that’s because we’re rubbish at doing empirical software engineering, but maybe it’s because there really isn’t that much of a difference to observe, or because other differences dominate.

Either way, if you’re going to condescendingly claim that your side is obviously correct and the other side is awful, maybe you should wait until you have some actual evidence backing you up.

And maybe you should stop pretending that the things that we know do help correctness are somehow less necessary in statically typed languages.

This entry was posted in programming.

I was talking to a lecturer in the computational optimisation group about a problem I have in Hypothesis the other day. It took a while to convince her I had an interesting problem. Once I had, I started talking to her colleague about it and the confusion immediately reoccurred.

After a while, it dawned on me why this might be happening, and I realised that it is probably a thing that has been causing subtle confusion in my conversations for a while.

The probable source of the confusion is the word “set”.

The way I framed my problem was roughly as follows:

I have a set of strings defined by some oracle function and I am trying to find a shortlex minimal element of the set.

What I mean by this is that I have a function f that takes a string, and I’m trying to find a shortlex (shortest length, lexicographically minimal among those) string such that f(s) is True. There’s no implication that I am holding all of the values in memory at once – the set in question is technically finite, but only in the sense that it hasn’t got more than about $$10^{20000}$$ elements in it.
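To make that framing concrete, here is a deliberately naive sketch of what “shortlex minimal element of an oracle-defined set” means. The `shortlex_minimal` helper is my own, purely for illustration – with a set of size $$10^{20000}$$ you could never actually enumerate like this, which is rather the point:

```python
from itertools import product

def shortlex_minimal(oracle, alphabet, max_length):
    # Enumerate candidate strings in shortlex order: shortest lengths
    # first, and lexicographic order within each length.
    for length in range(max_length + 1):
        for chars in product(sorted(alphabet), repeat=length):
            s = "".join(chars)
            if oracle(s):
                return s
    return None
```

For a toy oracle like `lambda s: "ba" in s` over the alphabet `"ab"`, this returns `"ba"` – the shortest accepted string, and the lexicographically first among those of that length.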

So when I say “set” it’s very much a mathematician’s notion of a set – it’s some abstract-ish definition that defines a bunch of values, not a concrete thing for working with those values. You could enumerate it if you wanted to, but there’s no implication that doing so is easy.

But what computer scientists typically mean by set is “A data structure that supports efficient membership queries”. The set is fundamentally something you can enumerate – you’ve got a collection of stuff right there that you’ve put in your data structure. It’s not that they’re unfamiliar with the mathematical notion of a set, it’s just not the thing that leaps to mind when they hear the word.

So when I say “set” and mean it in the mathematical sense to mean “Here is the shape of the problem”, computer scientists will often hear it in the computer scientist sense to mean “The problem is small enough that I can fit everything in memory”.

So, given that confusion, it’s not surprising that my problem sounded trivial! It’s just O(n), right? Take your n items and calculate the minimum of them in the desired ordering, bam. Done.

There’s a line (probably from George Bernard Shaw) that the United States and Great Britain are two countries separated by a common language. The same seems to be true of mathematics and computer science – we share a lot of words in our jargon, and some of them mean the same thing and some of them don’t, so there are a lot of false cognates.

It’s almost worse than the problem with natural languages, because as far as I can tell most of our shared jargon almost means the same thing in each. You probably won’t go very long without learning that you’re not in fact pregnant, merely embarrassed, but as I’ve discovered you can go quite some time saying “set” without realising that people are hearing something other than what you intended.

This entry was posted in programming.

# The other half of binary search

Binary search is one of those classic algorithms that most people who know about algorithms at all will know how to do (and many will even be able to implement correctly! Probably fewer than think they can though – it took me a long time to go from thinking I could implement binary search correctly to actually being able to implement it correctly).

Some of this is because the way people think about binary search is somewhat flawed. It’s often treated as being about sorted arrays of data, when that’s really only one application of it. So let’s start with a review of what the right way to think about binary search is.

We have two integers $$a$$ and $$b$$ (probably non-negative, but it doesn’t matter), with $$a < b$$. We also have some function $$f$$ that takes integers $$a \leq i \leq b$$, with $$f(a) \neq f(b)$$. We want to find $$c$$ with $$a \leq c < b$$ such that $$f(c) \neq f(c + 1)$$.

i.e. we’re looking to find a single exact point at which a function changes value. For functions that are monotonic (that is, either non-increasing or non-decreasing), this point will be unique, but in general it may not be.

To recover the normal idea of binary search, suppose we have some array $$xs$$ of length $$n$$. We want to find the smallest insertion point for some value $$v$$.

To do this, we can use the function $$f(i)$$ that returns whether $$xs[i] < v$$. Either this function is constantly true (in which case every element is < v and v should be inserted at the end), constantly false (in which case v should be inserted at the beginning), or the index i with $$f(i) \neq f(i + 1)$$ is the point after which $$v$$ should be inserted.

This also helps clarify the logic for writing a binary search:

```python
def binary_search(f, lo, hi):
    # Invariant: f(hi) != f(lo)
    while lo + 1 < hi:
        assert f(lo) != f(hi)
        mid = (lo + hi) // 2
        if f(mid) == f(lo):
            lo = mid
        else:
            hi = mid
    return lo
```


Every iteration we cut the interval in half - because we know the gap between them is at least one, this must reduce the length. If $$f$$ gives the same value to the midpoint as to lo, it must be our new lower bound, if not it must be our new upper bound (note that generically in this case we might not have $$f(mid) = f(hi)$$, though in the typical case where $$f$$ only takes two values we will).
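To tie this back to the sorted array case, here is the insertion point worked through concretely. The `insertion_point` wrapper is my own illustrative helper, and the snippet repeats the binary_search definition so that it stands alone; sentinel values of $$f$$ at $$-1$$ and $$n$$ handle the constantly-true and constantly-false cases:

```python
def binary_search(f, lo, hi):
    # Invariant: f(lo) != f(hi)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if f(mid) == f(lo):
            lo = mid
        else:
            hi = mid
    return lo

def insertion_point(xs, v):
    # f(i) is "xs[i] < v", extended with sentinels so that
    # f(-1) is always True and f(len(xs)) is always False.
    def f(i):
        if i < 0:
            return True
        if i >= len(xs):
            return False
        return xs[i] < v
    # The change point c has f(c) != f(c + 1), i.e. c is the last
    # index with xs[c] < v, so v should be inserted just after it.
    return binary_search(f, -1, len(xs)) + 1
```

For example, `insertion_point([1, 3, 5, 7, 9], 6)` gives 3, and values smaller or larger than everything in the list give 0 and `len(xs)` respectively.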

Anyway, all of this is beside the point of this post, it's just scene setting.

Because the point of this post is this: Is this actually optimal?

Generically, yes it is. If we consider the functions $$f_k(i) = i < k$$, each value we examine can only cut out half of these functions, so we must ask at least $$\log_2(b - a)$$ questions, so binary search is optimal.

But that's the generic case. In a lot of typical cases we have something else going for us: often we expect change to be quite frequent, or at least to be very close to the lower bound. For example, suppose we were binary searching for a small value in a sorted list. Chances are good it's going to be a lot closer to the left hand side than the right, but we're going to do a full $$\log_2(n)$$ calls every single time.

We can solve this by starting the binary search with an exponential probe - we try small values, growing the gap by a factor of two each time, until we find one that gives a different value. This then gives us a (hopefully smaller) upper bound, and a lower bound somewhat closer to that.

```python
def exponential_probe(f, lo, hi):
    gap = 1
    while lo + gap < hi:
        if f(lo + gap) == f(lo):
            lo += gap
            gap *= 2
        else:
            return lo, lo + gap
    return lo, hi
```


We can then put these together to give a better search algorithm, by using the exponential probe as the new upper bound for our binary search:

```python
def find_change_point(f, lo, hi):
    assert f(lo) != f(hi)
    return binary_search(f, *exponential_probe(f, lo, hi))
```


When the value found is near or after the middle, this will end up being more expensive by a factor of about two - we have to do an extra $$\log_2(n)$$ calls to probe up to the midpoint - but when the heuristic wins it potentially wins big - it will often take the $$\log_2(n)$$ factor (which although not huge can easily be in the 10-20 range for reasonable sized gaps) and turn it into 1 or 2. Complexity wise, this will run in $$O(\log(k - lo))$$, where $$k$$ is the value returned, rather than the original $$O(\log(hi - lo))$$.
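As a rough sanity check of that trade off, one can count oracle calls for a change point close to the lower bound of a large range. The `counted` wrapper is my own, and the definitions from above are repeated so the snippet stands alone:

```python
def binary_search(f, lo, hi):
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if f(mid) == f(lo):
            lo = mid
        else:
            hi = mid
    return lo

def exponential_probe(f, lo, hi):
    gap = 1
    while lo + gap < hi:
        if f(lo + gap) == f(lo):
            lo += gap
            gap *= 2
        else:
            return lo, lo + gap
    return lo, hi

def find_change_point(f, lo, hi):
    return binary_search(f, *exponential_probe(f, lo, hi))

def counted(f):
    # Wrap f so we can count how many times it gets called.
    calls = [0]
    def wrapped(i):
        calls[0] += 1
        return f(i)
    return wrapped, calls

# A change point very close to the lower bound of a large range.
f = lambda i: i < 3
probe_f, probe_calls = counted(f)
plain_f, plain_calls = counted(f)
with_probe = find_change_point(probe_f, 0, 10**6)
without_probe = binary_search(plain_f, 0, 10**6)
```

Both searches find the same change point, but on this example the probing version makes 6 oracle calls against roughly 40 for the plain binary search.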

This idea isn't as intrinsically valuable as binary search, because it doesn't really improve the constant factors or the response to random data, but it's still very useful in a lot of real world applications. I first came across it in the context of timsort, which uses this to find a good merge point when merging two sublists in its merge step.

Edit to note: It was pointed out to me on Twitter that I'm relying on Python's bigints to avoid the overflow problem that binary search will have if you implement it on fixed sized integers. I did know this at one point, but I confess I had forgotten. The above code works fine in Python, but if int is fixed size you want the following slightly less clear versions:

```python
def midpoint(lo, hi):
    if lo <= 0 and hi >= 0:
        return (lo + hi) // 2
    else:
        return lo + (hi - lo) // 2
```

```python
def binary_search(f, lo, hi):
    # Invariant: f(hi) != f(lo)
    while lo + 1 < hi:
        assert f(lo) != f(hi)
        mid = midpoint(lo, hi)
        if f(mid) == f(lo):
            lo = mid
        else:
            hi = mid
    return lo
```

```python
def exponential_probe(f, lo, hi):
    gap = 1
    midway = midpoint(lo, hi)
    while True:
        if f(lo + gap) == f(lo):
            lo += gap
            if lo >= midway:
                break
            else:
                gap *= 2
        else:
            hi = lo + gap
            break
    return lo, hi
```


These avoid calculating any intermediate integers which overflow in the midpoint calculation:

• If $$lo \leq 0$$ and $$hi \geq 0$$ then $$lo \leq hi + lo \leq hi$$, so it is representable.
• If $$lo \geq 0$$ then $$0 \leq hi - lo \leq hi$$, so it is representable.

The reason we need the two different cases is that e.g. if $$lo$$ were INT_MIN and $$hi$$ were INT_MAX, then $$hi - lo$$ would overflow but $$lo + hi$$ would be fine. Conversely if $$lo$$ were INT_MAX - 1 and $$hi$$ were INT_MAX, $$hi - lo$$ would be fine but $$hi + lo$$ would overflow.

The following should then be a branch free way of doing the same:

```python
def midpoint(lo, hi):
    large_part = lo // 2 + hi // 2
    small_part = ((lo & 1) + (hi & 1)) // 2
    return large_part + small_part
```


We calculate (x + y) // 2 as x // 2 + y // 2, and then we fix up the rounding error this causes by calculating the midpoint of the low bits correctly. The intermediate parts don't overflow because we know the first sum fits in $$[lo, hi]$$, and the second fits in $$[0, 1]$$. The final sum also fits in $$[lo, hi]$$ so also doesn't overflow.

I haven't verified this part too carefully, but Hypothesis tells me it at least works for Python's big integers, and I think it should still work for normal C integers.
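For what it's worth, here is the kind of exhaustive check one could run by hand, repeating the branch free midpoint from above. This only exercises Python's floor-division semantics on (arbitrary precision) ints, so it's reassurance rather than a proof about C's truncating division:

```python
def midpoint(lo, hi):
    # (lo + hi) // 2 without ever computing lo + hi directly.
    large_part = lo // 2 + hi // 2
    small_part = ((lo & 1) + (hi & 1)) // 2
    return large_part + small_part

# Exhaustively check the identity over a small signed range.
for lo in range(-64, 65):
    for hi in range(lo, 65):
        assert midpoint(lo, hi) == (lo + hi) // 2
```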

This entry was posted in programming, Python.

# Linear time (ish) test case reduction

A problem I’m quite interested in, primarily from my work on Hypothesis, is test case reduction: Taking an example that produces a bug, and producing a smaller version that triggers the same bug.

Typically a “test-case” here means a file that when fed to a program triggers a crash or some other wrong behaviour. In the abstract, I usually think of sequences as the prototypical target for test case reduction. You can view a file as a sequence in a number of usefully different ways – e.g. as bytes, as lines, etc. and they all are amenable to broadly similar algorithms.

The seminal work on test-case reduction is Zeller and Hildebrandt’s “Simplifying and Isolating Failure-Inducing Input”, in which they introduce delta debugging. Delta-debugging is essentially an optimisation to the greedy algorithm which removes one item at a time from the sequence and sees if that still triggers the bug. It repeatedly applies this greedy algorithm to increasingly fine partitions of the test case, until the partition is maximally fine.

Unfortunately the greedy algorithm as described in their paper, and as widely implemented, contains an accidentally quadratic bug. This problem is not present in the reference implementation, but it is present in many implementations of test case reduction found in the wild, including Berkeley delta and QuickCheck. Hypothesis gets this right, and has for so long that I forgot until quite recently that this wasn’t more widely known.

To see the problem, lets look at a concrete implementation. In Python, the normal implementation of the greedy algorithm looks something like this:

```python
def greedy(test_case, predicate):
    while True:
        for i in range(len(test_case)):
            attempt = list(test_case)
            del attempt[i]
            if predicate(attempt):
                test_case = attempt
                break
        else:
            break
    return test_case
```

We try deleting each index in turn. If that succeeds, we start again. If not, we’re done: No single item can be deleted from the list.

But consider what happens if our list is, say, the numbers from 1 through 10, and we succeed at deleting a number from the list if and only if it’s even.

When we run this algorithm we try the following sequence of deletes:

• delete 1 (fail)
• delete 2 (succeed)
• delete 1 (fail)
• delete 3 (fail)
• delete 4 (succeed)
• delete 1 (fail)

Every time we succeed in deleting an element, we start again from scratch. As a result, we have a classic accidentally quadratic bug.
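One can watch the blow-up directly by counting predicate calls on that example. The `count_calls` harness is my own; the predicate encodes "a deletion succeeds iff the deleted element was even" by checking that every odd number is still present:

```python
def greedy(test_case, predicate):
    while True:
        for i in range(len(test_case)):
            attempt = list(test_case)
            del attempt[i]
            if predicate(attempt):
                test_case = attempt
                break
        else:
            break
    return test_case

def count_calls(n):
    original = list(range(1, n + 1))
    odds = [x for x in original if x % 2 == 1]
    calls = [0]
    def predicate(xs):
        calls[0] += 1
        # An attempt succeeds iff it still contains every odd number,
        # i.e. the element we deleted was even.
        return all(o in xs for o in odds)
    result = greedy(original, predicate)
    assert result == odds
    return calls[0]
```

On this predicate, doubling n from 10 to 20 takes the call count from 25 to 75 – the cost of each deletion grows with how far into the list it is, because of the restart.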

This is easy to avoid though. Instead of starting again from scratch, we continue at our current index:

```python
def greedy(test_case, predicate):
    deleted = True
    while deleted:
        deleted = False
        i = 0
        while i < len(test_case):
            attempt = list(test_case)
            del attempt[i]
            if predicate(attempt):
                test_case = attempt
                deleted = True
            else:
                i += 1
    return test_case
```

At each stage we either successfully reduce the size of the test case by 1 or we advance the loop counter by one, so this loop makes progress. It stops in the same case as before (when we make it all the way through with no deletions), so it still achieves the same minimality guarantee that we can’t delete any single element.
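Counting predicate calls on the evens example from earlier shows the difference between the two versions. This is again a self-contained sketch, with my own names for the two variants and a `run` harness of my own:

```python
def greedy_restart(test_case, predicate):
    # Original version: restart from the beginning after every deletion.
    while True:
        for i in range(len(test_case)):
            attempt = list(test_case)
            del attempt[i]
            if predicate(attempt):
                test_case = attempt
                break
        else:
            break
    return test_case

def greedy_continue(test_case, predicate):
    # Improved version: keep going from the current index.
    deleted = True
    while deleted:
        deleted = False
        i = 0
        while i < len(test_case):
            attempt = list(test_case)
            del attempt[i]
            if predicate(attempt):
                test_case = attempt
                deleted = True
            else:
                i += 1
    return test_case

def run(algorithm, n):
    original = list(range(1, n + 1))
    odds = [x for x in original if x % 2 == 1]
    calls = [0]
    def predicate(xs):
        calls[0] += 1
        # Succeeds iff the deleted element was even.
        return all(o in xs for o in odds)
    assert algorithm(original, predicate) == odds
    return calls[0]
```

For n = 20 the restarting version makes 75 predicate calls here, while the continuing one makes 30: one full pass plus the final verification pass.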

The reason this is only linear time “ish” is that you might need to make up to n iterations of the loop (this is also true with the original algorithm) because deleting an element might unlock another element. Consider e.g. the predicate that our sequence must consist of the numbers 1, …, k for some k – we can only ever delete the last element, so we must make k passes through the list. Additionally, if we consider the opposite where it must be the numbers from k to n for some k, this algorithm is quadratic while the original one is linear!

I believe the quadratic case to be essential, and it’s certainly essential if you consider only algorithms that are allowed to delete at most one element at a time (just always make the last one they try in any given sequence succeed), but anecdotally most test cases found in the wild don’t have nearly this high a degree of dependency among them, and indexes that previously failed to delete tend to continue to fail to delete.

A model with a strong version of this as its core assumption shows up the complexity difference: Suppose you have $$k$$ indices out of $$n$$ which are essential, and every other index can be deleted. Then this algorithm will always run in $$n + k$$ steps ($$n$$ steps through the list the first time, $$k$$ steps at the end to verify), while the classic greedy algorithm will always run in $$O(k^2 + n)$$ steps.

Although this model is overly simplistic, anecdotally examples found in the wild tend to have something like this which acts as an “envelope” of the test case – there’s some large easily deletable set and a small essential core. Once you’re down to the core you have to do more work to find deletions, but getting down to the core is often the time consuming part of the process.

As a result I would be very surprised to find any instances in the wild where switching to this version of the algorithm was a net loss.

This entry was posted in programming, Python.

# Ordering and mapping regular languages

I’ve had the following question for a while: How do I create a mapping of keys to values where the keys are regular expressions, and two regular expressions are considered equivalent if they correspond to the same language?

An example of why you might want to do this is e.g. when constructing a minimal deterministic finite automaton for a regular language you end up labelling states by regular expressions that represent the language matched when starting from that state. In order for the automaton to be minimal you need to have any two equivalent regular expressions correspond to the same state, so you need a way of taking a new regular expression and finding out which state it should go to.

It’s easy (if potentially expensive) to test regular expression equivalence once you know how, so the naive implementation of this is just to do a linear scan of all the regular expressions you’ve found so far. It’s O(n) lookup but it’s at least an existence proof.

In the past I’ve ended up implementing a crude hash function for regular languages and then just used a hash table. It works, but collisions are common unless you’re quite clever with your hashing, so it doesn’t work well.

But it turns out that there is a better way! Rather than using hashed data structures you can use ordered ones, because it turns out that there is a natural and easy to compute (or at least not substantially harder than testing equivalence) total ordering over the set of regular languages.

That way is this: If you have two regular languages $$L$$ and $$M$$ that are not equivalent, there is some $$x \in L \triangle M$$, the symmetric difference. That is, we can find an $$x$$ which is in one but not the other. Let $$x$$ be the shortlex minimal such word (i.e. the lexicographically first word amongst those of minimal length). Then $$L < M$$ if $$x \in L$$, else $$M < L$$.

The work in the previous post on regular language equivalence is thus enough to calculate the shortlex minimal element of an inequivalent pair of languages (though I still don’t know if the faster of the two algorithms gives the minimal one. But you can use the fast algorithm for equivalence checking and then the slightly slower algorithm to get a minimal refutation), so we can readily compute this ordering between two regular expressions. This, combined with any sort of ordered collection type (e.g. a balanced binary tree of some sort) gives us our desired mapping.

But why does this definition provide a total order?

Well, consider the enumeration of all words in increasing shortlex order as $$w_0, \ldots, w_n, \ldots$$. Let $$l_n = 1$$ if $$w_n \in L$$, else $$l_n = 0$$. Define $$m_n$$ similarly for $$M$$.

Then the above definition is equivalent to the reverse of the lexicographical ordering between $$l$$ and $$m$$! If $$w_k$$ is the smallest word in the symmetric difference then $$k$$ is the first index at which $$l$$ and $$m$$ differ. If $$w_k \in L$$ then $$l_k = 1$$ and $$m_k = 0$$, so $$l > m$$, and vice versa. The lexicographical order is a total order, and the reverse of a total order is a total order, so the above definition is also a total order.
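For decidable languages over a small alphabet you can play with this definition directly. This is a toy enumeration of my own, not the automaton-based comparison from the previous post, and it only gives an answer if a difference shows up within the length bound:

```python
from itertools import product

def language_compare(in_l, in_m, alphabet, max_length):
    # Enumerate words in shortlex order and compare membership at the
    # first point of difference. Returns -1 if L < M, 1 if M < L, and
    # 0 if no difference was found within the length bound.
    for length in range(max_length + 1):
        for chars in product(sorted(alphabet), repeat=length):
            w = "".join(chars)
            in_first, in_second = in_l(w), in_m(w)
            if in_first != in_second:
                return -1 if in_first else 1
    return 0
```

Modelling words as one-character languages is enough to check small cases: for example, any language containing the empty word compares below one that doesn't.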

This definition has a number of nice properties:

• Any language containing the empty word sorts before any language that doesn’t
• The function $$L \to \overline{L}$$ is order reversing.
• $$L \cap M \leq L, M \leq L \cup M$$

I originally thought that union was coordinate-wise monotonic, but it’s not. Suppose we have four words $$a < b < c < d$$, and consider the languages $$L = \{a, d\}, M = \{b, c\}$$. Then $$L < M$$ because the deciding value is $$a$$. But now consider $$P = \{a, b\}$$. Then $$L \cup P > M \cup P$$ because the deciding element now ends up being $$c$$, which is in $$M$$.

I’ve yet to try this in practice, so it might turn out that there are some interestingly pathological failure modes for this comparison function, but I don’t think there are likely to be any that aren’t also present in testing regular expression equivalence itself.

Another open question here is which sorted map data structure to use? The comparison is relatively expensive, so it might be worth putting a bit of extra effort in to balance it. As such an AVL tree might be a fairly reasonable choice. I’m not sure.


This entry was posted in Automata Theory, programming.