Category Archives: programming

(Some of) my problems with correctness research

Epistemic status: I think this is mostly right, but I’m sure there are gaps in my understanding of the field. Also you should be intrinsically suspicious of any argument that concludes that the author’s own field of research is going to change the world in a way that everyone else’s failed to do.

My previous post on looking for grandiose introductions in correctness research got a very mixed reception. A lot of people who I normally mostly agree with objected to it, and several people agreed with it for reasons that I disagree with vehemently. This wasn’t the universal reaction, and there were plenty of decent responses to it, but it left me feeling that I’d been really unclear about what I actually meant, so this is my attempt to unpack it.

One response, which I thought was a nice piece of writing in its own right but which didn’t really address my problems with correctness research, came from John Regehr. To quote in full:

I tried to write a software correctness paper introduction that would make @DRMacIver happy:

The market for software, generally speaking, does not reward correct software so much as software that is acceptably correct and that can be built cheaply and rapidly. Thus, while our work on software correctness is highly unlikely to improve the quality of software, it may plausibly make it possible to create acceptable software a bit more cheaply and rapidly.

I broadly agree with this sentiment, though I flip-flop a bit on whether I think tools can actually make it possible to make more correct software or whether they just let you spend the time you’d have spent on correctness on other things. The technical term for this is Risk Compensation – when you increase safety measures, people behave in a more risky manner to improve performance. When this entirely compensates for the safety measure you have Risk Homeostasis. The general conclusion seems to be that you usually have Risk Compensation and rarely or never have Risk Homeostasis. (This paragraph has entirely exhausted my current knowledge of the subject, but I plan to fix that in the near future).

The main reason this doesn’t actually make me happy is that despite my previous post, it’s not really the introductions I object to. In many ways I like the introductions, and I’m certainly not above a bit of grandiosity myself. The Hypothesis documentation literally says that our goal in developing it is “to drag the world kicking and screaming into a new and terrifying age of high quality software”.

The problem is that I don’t think that correctness research is currently doing a very good job of matching its means to its ends.

In discussion about this, Paolo G. Giarrusso made the following distinction:

there’s a difference between fundamental and applied research (with a fluid boundary). And then there’s “fake applied” research, which is applied to fake problems

I think this is a good distinction. I’d maybe hesitate to label any particular problem as “fake” – I can’t actually point to any specific research in the field of correctness and say “this problem doesn’t exist” (and even if I could I’d prefer not to do so in public), but I do think a lot of people are working on problems that most software developers wouldn’t really care about. It may still be perfectly good fundamental research, and it’s not necessarily obvious until you’ve made some progress on the problem whether it’s actually useful, but saying “Software is terrible and this is our solution to it” is not a great look if you’re working on something that will never actually solve the problem of bad software.

When I say “solve the problem” I do not mean that academics have an obligation to produce tools for industry. The job of an academic is not to do software engineering, it’s to explore the space of ideas. I do think that it’s good for academics to produce relatively solid and easy to use tools – I think it was John Regehr who pointed out to me that by doing so you make it much easier for other people to use you as a baseline and get a bajillion citations as a result, so there are even good incentives for this – but as I’ve discovered the hard way there’s a really large gap between doing that and actually making an impact on industry.

Instead I mean something closer to developing ideas such that you can “Just add engineering (and sales, and marketing, and all the rest)” to go from the research idea to actually producing such a tool. In this sense the original QuickCheck academia-solves the problem, and Quviq industry-solves the problem. Often you find out in the course of going from academia-solves to industry-solves that there were other problems you needed to academia-solve before you could complete the project, but that’s OK (personally I love it when this happens).

I think there is a lot of research that no matter how much engineering, sales, and marketing, you added to them, people just wouldn’t care about the solution. There is no path from academia-solves to industry-solves for this research, because the result would essentially have been yet another startup that spends five years trying to tell you about their great idea before ending up on Our Incredible Journey.

That may be fine! I’ve spent enough time as a pure mathematician to be entirely happy with people working on problems that will never be of use to anyone. Also they turn out to be useful after all surprisingly often because you can reuse their ideas in other areas that are useful.

But the disconnect between the promise (We want to change the world!) and the solutions (by implementing this tool that would make almost every software developer yawn and/or go “huh?” when we tried to explain it to them) does make me unhappy.

It’s not necessarily that I want any of the specific research I’m complaining about not to exist – like I said, I’m fine with it as foundational research. I even get why they feel the need to claim it’s practical research – there’s a certain social and financial obligation to do so in our field – and if they know that’s what they’re doing and are just gaming the system to do something they think is worthwhile, good for them.

But what makes me sad is that there are gaps in the research that I think would not be there if the crisis of buggy software were actually a problem we were trying to solve, and I think these gaps mostly exist because of misaligned incentives (admittedly I always think that).

The problem, I think, with most correctness research is that it has a major false consensus problem: Researchers believe (or at least act like they believe) that software developers at large are roughly like them, and so they build tools that they think are useful to those hypothetical versions of themselves that just happen to be working on different problems.

This is by no means an academia specific problem. It’s a problem that almost every single software development team has to some greater or lesser extent. There’s a reason the user experience people have to shout “You Are Not The User” so often in meetings (if they even get invited to meetings, and if they’re not then you have real problems. PS. Your company does have user experience people, right?).

But I think it shows up in academic software correctness research in one really important way (I’m sure it shows up in many ways, but this is the one that hits closest to home for me): The Oracle Problem.

Colloquially, the Oracle Problem is the issue that you can’t tell whether a piece of software is correct unless you know what it’s supposed to do.

When testing, every test has two parts (these may be quite intertwined, but they’re still in some sense logically distinct):

  1. A test case: the concrete set of inputs that the program is to be run on
  2. A test oracle: a way of determining whether the program is incorrect on those inputs (typically this only works in one direction: the oracle can’t say for sure that the program is exactly right on those inputs, but it can catch various ways in which it might be wrong)
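
To make that split concrete, here is what it looks like in an ordinary (made-up) Python unit test, with the two parts labelled:

def test_sorting_preserves_length():
    xs = [3, 1, 2, 1]  # the test case: a concrete input to run the program on
    result = sorted(xs)
    assert len(result) == len(xs)  # the test oracle: one way the result could be wrong

The oracle here is one-directional in exactly the sense above: it can catch some wrong outputs, but passing it doesn’t show that sorted is correct.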

And this split also defines a very large split between academic research in software correctness and potential consumers of software correctness tools: For an academic, the problem of test-case generation is an interesting object of study which they know a great deal about, but the problem of finding a test oracle is typically boring and would require a great deal of non-research legwork to figure out what it is that this software is actually supposed to do.

Typically research works around this problem in one of a few ways:

  1. Use a test oracle that points out only things that are obviously bad – segfaults, assertion failures, deadlocks, etc.
  2. Use a specification to derive a test oracle (e.g. via metamorphic testing – finding changes to the input that should have no effect but do. My group at Imperial do a lot of this)
  3. Use multiple implementations of the same specification to do differential testing – run each implementation on the same input and see if you get different results (this is typically how e.g. Csmith is used)
  4. Try to guess an oracle – e.g. assume that the current behaviour is correct and use it for regression testing, try to build a simple model of the behaviour and treat deviations as potential bugs, etc.

The first can be useful, especially if you’re writing in an unsafe language, and the second and third are very useful when you are in the limited set of circumstances to which they can be applied. I have yet to be convinced that I have ever had a use case that would be served by tools in the fourth category, although I do believe they exist.
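
For a sense of what the second and third look like in code, here is a hedged sketch (my_sort is a stand-in for whatever implementation is under test; this isn’t taken from any particular paper or tool):

import random

def check_metamorphic(my_sort, xs):
    # Metamorphic oracle: permuting the input should not change the output,
    # so any difference indicates a bug without needing to know the "right" answer.
    shuffled = list(xs)
    random.shuffle(shuffled)
    assert my_sort(shuffled) == my_sort(xs)

def check_differential(my_sort, xs):
    # Differential oracle: compare against an independent implementation
    # of the same specification (here, Python's built-in sorted).
    assert my_sort(xs) == sorted(xs)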

If you want to know more about these solutions, I can recommend “The Oracle Problem in Software Testing: A Survey”. I won’t go into them more here.

But there’s a fifth category of solution to the problem (which is also mentioned in the survey article), which is the one that QuickCheck and its descendants adopt: Make the end user write the test oracle.

In many ways the most valuable insight implicit in the success of QuickCheck is that this order of priorities is reversed in software developers: Software developers don’t really want to write test-case generators, but as long as you let them write them in their host language then they don’t mind writing test oracles because they’re already used to doing that every time they write a test! (In contrast they do seem to mind writing test cases, which they also have to do when writing a test. I think the distinction is that each test oracle feels unique, while writing test cases feels like drudge work, but I’m speculating wildly here).
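
This is exactly the division of labour Hypothesis (and QuickCheck before it) aims for: the library does the test-case generation, and the user states the oracle as an ordinary assertion. A minimal sketch, in the spirit of the round-trip example from the Hypothesis documentation (the run-length encoder below is a toy, not real software under test):

from hypothesis import given, strategies as st

def encode(xs):
    # Toy run-length encoder, standing in for "the software under test".
    out = []
    for x in xs:
        if out and out[-1][0] == x:
            out[-1][1] += 1
        else:
            out.append([x, 1])
    return out

def decode(pairs):
    return [x for x, n in pairs for _ in range(n)]

@given(st.lists(st.integers()))       # the tool's job: generating test cases
def test_decode_inverts_encode(xs):
    assert decode(encode(xs)) == xs   # the user's job: stating the oracle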

From my point of view, this is a great solution. If you want to adequately test the behaviour of a piece of software then at some point you’re going to have to find out what it is supposed to do. Why not just let the end user tell you that?

I also think that if we want to make a dent on the bad software problem then this is the only solution that can possibly work. The overwhelming majority of software does not have any specification to speak of, but many software developers are willing and able to write tests (much software is completely untested. There is almost nothing we can do as correctness researchers to fix this problem, because if the authors of said software cared about the problem then they probably wouldn’t be in this situation). So correctness tools which actually make an impact on real world correctness have to get the developers to say what it is the software is actually supposed to do (and doing it in their host language is a much easier sell, though doing it in a derived language is viable, and people have shown themselves willing to write tests in languages like Gherkin).

So, great, we make the users do all the work on solving this hard problem and we go back to working on the bits that we care about. What’s the problem?

Well, there are two big problems:

  1. It’s not the job of academia to build usable tools, and we/they usually don’t. So we don’t have any users to write these oracles.
  2. Even when tools have users, getting good empirical information about them is like pulling teeth. Especially when the tool is integrated into their development workflow so they don’t necessarily notice specific events involving it.

My personal research solves the first problem by starting from a widely used (518 open source users and counting, squee) tool rather than a research project. As to how to solve the second problem? Uh, I’ll get back to you.

Unfortunately, this means that the most useful version of the problem to work on is one that it is actually very hard for academia to do anything with – we mostly don’t have users, and if we did have users it would be hard to get a good set of examples for empirical evaluation out of them (especially if they’re corporate users).

These problems are surmountable, but doing so is a lot more work than just picking an open source project, running your tool against it, and seeing what happens.

How much of a problem is this? I don’t know. One could argue that it’s not one:

  • In some sense test-case generation is test-case generation, and a test oracle is just a way of making a program that crashes on some invalid behaviour, so there’s no reason you can’t take an arbitrary test-case generator and hook it up to an arbitrary “property-based test” that runs based on it.
  • Certainly when writing such tools for industry (which there seems to be very little financial incentive to do, which is a whole different problem. I’m not any more impressed with software correctness industry than I am with software correctness academia), you can still draw on a lot of the academic research that didn’t have this sort of focus – e.g. I learned a lot from reading the compiler testing literature despite not doing compiler testing.
  • The situation is not completely without hope – there’s a decent amount (though less than I’d like) of QuickCheck derived literature ticking along and e.g. Alex Groce’s TSTL is in another interesting corner of this space that isn’t a QuickCheck derived work (Update: Alex points out in the comments that in some sense it is a QuickCheck derived work, but I think it’s sufficiently different from QuickCheck in usage that I’d still count it as its own thing). That’s more or less it as far as examples I know about, but I hope there are others.

But I do think it creates certain significant gaps. For example:

  1. There’s been very little research on making it easy to write good, composable test-case generators outside the context of QuickCheck (and there hasn’t been that much in that context, and I’m not super happy with any of the answers I’ve seen). Most research I’ve seen on test-case generation works on the assumption that there’s more or less only one thing that you’re generating. (There’s a sketch of the sort of composition I mean just after this list.)
  2. It’s actually not true that a test-case generator is a test-case generator. Nezha is an example of how the fact that you’re doing differential testing can usefully inform your test case generation (mutational fuzzing in this case). Is the fact that we’re mostly testing on trivial properties in some way distorting the sorts of test cases we generate?
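
By composable I mean the way small generators can be combined into generators for larger structures. In Hypothesis terms it looks something like the following (the User and Order types are invented purely for illustration):

from dataclasses import dataclass
from hypothesis import strategies as st

@dataclass
class User:
    name: str
    age: int

@dataclass
class Order:
    user: User
    items: list

# Small strategies compose into strategies for larger structures,
# mirroring the shape of the data they generate.
users = st.builds(User, name=st.text(min_size=1), age=st.integers(min_value=0, max_value=120))
orders = st.builds(Order, user=users, items=st.lists(st.text()))

The resulting strategies can then be passed to @given like any other, which is what makes them pleasant to build up incrementally.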

How do we fix this? Gods, I don’t know. The number of incentives you have to change to fix it is just appalling, and I lack both the lever and the fulcrum to move them.

I do think the sort of thing I’m doing with Hypothesis of starting a research program based on an existing widely used tool is probably a good one, and that open source can form a lot of the bridge – both by providing the engineering that it’s not academia’s job to provide (not that it’s our job to provide it either, we’re just suckers who like doing work for free. Go us), and also by providing a ready source of visible users who can at least lower the bar for evaluation (even if it’s still not easy), but I don’t think this is even close to sufficient.

So fixing it? Pass. I bring you problems, not solutions.

It sure would be nice if we could at least acknowledge that this is a problem though.


An Unkind Question about Software Correctness Research

Dear Lazyweb,

I’m about to be something of a jerk, and I’d like your help in doing it.

I have noticed a common pattern in a lot of software correctness research, which is that papers often follow this format:

  1. Oh god software, it’s everywhere.
  2. And it doesn’t work very well at all, amirite?
  3. This is only going to get worse.
  4. We’re basically all doomed.
  5. Therefore in order to fix this we propose the following.
  6. We have made an improvement.
  7. To a category of tool that literally nobody uses.
  8. Happy to help.
  9. Love, The Authors.

What are your “favourite” examples of this genre? I’m particularly interested in examples from software testing research but I’ll take anything vaguely correctness related.

The “That nobody actually uses” part of it is also somewhat inessential – I’m mostly interested in good examples of the motivation section, and in many ways having examples from things people actually use is quite helpful because it gives me papers I can point to without feeling too mean about it!

Answers accepted in comments, emails, tweets, whatever.


Edit to add: I do not mean to suggest that such papers are necessarily bad papers. Many of them are interesting and intelligent research.

I’m also not interested in being mean to individual papers or authors (I originally approved a comment which I have since retracted because I decided it was too far in that direction).

What I want to call into question is our priorities. We have all of these introductions which claim that what we’re trying to do is solve these real world problems, but if we’re not addressing the fundamental question of why nobody is using our stuff then maybe we need to admit that that’s not actually what we’re trying to do.

As a pure mathematician I’m perfectly on board with a program of pure research, but as a software developer I am much more interested in the question of what a research program that actually cared about these questions would look like.


Can you write correct software in a statically typed language?

Epistemic status: Mostly trolling.

Attention Conservation Notice: Dispatches from the type system wars.

I’ve been writing Rust recently. There have been some frustrating aspects to it (mostly syntax), but one of the things I’m really enjoying is having the productivity benefits of a good type system again.

The fast feedback loop from the type checker really improves my ability to develop software – it’s almost impossible to write code without making lots of trivial errors, and having an automated system that catches those errors and tells me about them in a detailed, localised way really smooths the entire development process. The focus on modularity, and the ability to make certain behaviours actually impossible rather than merely discouraged, also cuts down on a lot of worries that would otherwise slow me down. It’s something I’ve missed a lot over the last fiveish years of mostly writing in dynamic languages.

But I have a concern…

One of the things I’ve gotten used to in my time in dynamic language land is the correctness benefits. How much of that am I giving up for this productivity boost?

I’m not saying you can’t write correct software in statically typed languages. Of course you can. You can write correct software in anything if you’re prepared to try hard enough (except maybe Malbolge).

Of course you can write bad software in dynamic languages. It’s never going to be hard to write bad software.

But it sure seems harder to write correct software in a statically typed language, doesn’t it?

Writing correct software requires that you get a lot of the project workflow right. All other things being equal, static checks help, but they’re mostly dominated by other far more significant factors: code review, run time assertions, a focus on user experience, processes, check-lists and, obviously, testing – though hopefully the idea that you should be testing your code is uncontroversial at this point. Not even the static typing advocates really believe they can write correct code with less than 100% coverage, do they?

The problem is that even though we like to pretend that “If It Compiles It Works!” is just a tasteless joke and that nobody actually believes it, people don’t seem to act like they entirely disbelieve it either, and as a result they come to rely on the compiler for things the compiler never really promised to give them.

I think it’s that the type system makes you sloppy. It sits there telling you “Hey I checked your code. LGTM” and even though you know it’s not a substitute for a real code review, it’s still your most common piece of feedback because it happens on every change you make to your code, and over time you start to believe it.

It also tells you that you can fix it later. One of the great productivity benefits of static typing is how that fast feedback loop helps you change your code. I spent almost an hour doing major surgery on this project yesterday with the code in a completely non-compiling state. It would have been hard doing that in a dynamic language, because the feedback when your project isn’t working at all is just too non-local and coarse grained: If nothing works then of course your tests are all going to fail, and the way they fail isn’t going to be that informative.

So given a choice between shipping and putting in place the measures that are required for correctness, static typing says “Oh, don’t worry, it’s probably good enough, and if it’s not you can fix it later”.

Alas, this is a lie. Code that hasn’t been reviewed, tested, and put in front of a user before shipping may be more likely to be correct in a statically typed language, but that still isn’t likely, and static types are a lovely adjunct to but inadequate substitute for dynamic checks (possibly unless you’re writing in something like Idris, which I do still want to try seriously at some point).

You can do all of these things in a statically typed language, and I’m certain responsible developers do, but they all somehow seem less urgent, don’t they? 95% coverage is probably good enough, and it compiles, so it’s probably fine, and the compiler did a pretty good job of checking the obvious so a detailed code review doesn’t matter that much.

And there’s always that subconscious feeling that if these things do turn out to matter you can fix it later.

Unfortunately, that too is a lie. Statically typed code is easy to refactor, but the thing you’re going to need to change later isn’t the code, it’s the process, and the process is made of people. People are surprisingly resistant to change, as we find out every time Twitter shifts something a pixel to the right and eleven billion people complain that Twitter just don’t understand their user base and how dare they move that thing.

You can fix process later, but the will to do that needs to be there, and there’s always the sense that the status quo is probably good enough, especially with the compiler sitting there telling you its sweet lies about how everything is fine.

In contrast, a dynamic language gives you no such illusions. You’re not just operating without a safety net, you’re operating with a Tullock Spike. That fast feedback that the statically typed language gave you about your code, the dynamically typed language will give you about your process: If you care about correctness (and I acknowledge that many people don’t, even among those who chose to write in dynamically typed languages), you feel the pain points in your process fast, and you fix them early while your project is still flexible.

And yet, I really enjoy the productivity benefits of static typing, and I’m reluctant to give them up.

I think the thing to do here is probably to just try to carry the correctness lessons of dynamic typing over when writing in statically typed languages. In the same way that learning Haskell can improve your productivity when writing in Python, maybe once you’ve learned Python it’s possible to write correct code in Haskell.

I said I was writing in Rust recently, and that’s true, but it’s actually more complicated than that: It’s a hybrid Ruby/Rust project, so I’ll get to enjoy the benefits (and drawbacks) of both.

This might offer another way to have my cake and eat it too, because it means that I can choose which language to write each part of the code in depending on its function and trade offs. Code that I really need to be correct can go in Ruby, but where I need flexibility and freedom to experiment, I’m still able to write it in a more productive language such as Rust.

Postscript

People seem to be failing to get the joke, so I’m going to explain it. Apologies if this ruins it for you.

Should you conclude from reading this post that it is impossible to write correct code in a statically typed language? No. And if you concluded that you should, frankly, feel bad about your reasoning skills (even disregarding the fact that I said right at the top of the post that I was trolling).

I did not present any empirical evidence. I didn’t even present a balanced non-empirical picture of the landscape: I cherry-picked a bunch of effects, added a bit of hyperbole, backed it all up with condescension and anecdata, and from there concluded that because these factors existed they must dominate the outcome.

You know who else does that? Every single person arguing that static types are required for correctness, (and every single person arguing that dynamic types are better for productivity, but honestly my impression having spent quite a lot of time on both sides of the language fence is that it’s mostly the static typing people who are picking the fights). I just thought it would be interesting to flip the argument on its head for once.

So, if you found my arguments unconvincing, good. But you should also have found the arguments for why static typing helps correctness or hurts productivity equally unconvincing.

Does static typing help correctness? Does dynamic typing help productivity?

Probably not, no. But the converse is probably not any more true.

The empirical studies on the question are really pretty inconclusive. Partly that’s because we’re rubbish at doing empirical software engineering, but maybe it’s because there really isn’t that much of a difference to observe, or because other differences dominate.

Either way, if you’re going to condescendingly claim that your side is obviously correct and the other side is awful, maybe you should wait until you have some actual evidence backing you up.

And maybe you should stop pretending that the things that we know do help correctness are somehow less necessary in statically typed languages.


Let’s talk about sets

I was talking to a lecturer in the computational optimisation group about a problem I have in Hypothesis the other day. It took a while to convince her I had an interesting problem. Once I had, I started talking to her colleague about it and the confusion immediately reoccurred.

After a while, it dawned on me why this might be happening, and I realised that it is probably a thing that has been causing subtle confusion in my conversations for a while.

The probable source of the confusion is the word “set”.

The way I framed my problem was roughly as follows:

I have a set of strings defined by some oracle function and I am trying to find a shortlex minimal element of the set.

What I mean by this is that I have a function f that takes a string, and I’m trying to find a shortlex (shortest length, lexicographically minimal among those) string such that f(s) is True. There’s no implication that I am holding all of the values in memory at once – the set in question is technically finite, but only in the sense that it hasn’t got more than about \(10^{20000}\) elements in it.
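
To make the framing concrete, here is a toy sketch of what I mean (the oracle and alphabet are made up; the point is that this brute-force enumeration is exactly what you can’t do at the real sizes):

from itertools import product

def shortlex_minimal(f, alphabet, max_length):
    # Enumerate strings in shortlex order (by length, then lexicographically,
    # taking the alphabet in sorted order) and return the first one the
    # oracle accepts. Hopeless for sets with ~10^20000 elements, of course.
    for length in range(max_length + 1):
        for chars in product(sorted(alphabet), repeat=length):
            s = "".join(chars)
            if f(s):
                return s
    return None

# Hypothetical oracle: strings containing "ab". The shortlex minimal element is "ab".
print(shortlex_minimal(lambda s: "ab" in s, "ab", 4))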

So when I say “set” it’s very much a mathematician’s notion of a set – it’s some abstract-ish definition that defines a bunch of values, not a concrete thing for working with those values. You could enumerate it if you wanted to, but there’s no implication that doing so is easy.

But what computer scientists typically mean by set is “A data structure that supports efficient membership queries”. The set is fundamentally something you can enumerate – you’ve got a collection of stuff right there that you’ve put in your data structure. It’s not that they’re unfamiliar with the mathematical notion of a set, it’s just not the thing that leaps to mind when they hear the word.

So when I say “set” and mean it in the mathematical sense to mean “Here is the shape of the problem”, computer scientists will often hear it in the computer scientist sense to mean “The problem is small enough that I can fit everything in memory”.

So, given that confusion, it’s not surprising that my problem sounded trivial! It’s just O(n), right? Take your n items and calculate the minimum of them in the desired ordering, bam. Done.

There’s a line (probably from George Bernard Shaw) that the United States and Great Britain are two countries separated by a common language. The same seems to be true of mathematics and computer science – we share a lot of words in our jargon, and some of them mean the same thing and some of them don’t, so there are a lot of false cognates.

It’s almost worse than the problem with natural languages, because as far as I can tell most of our shared jargon almost means the same thing in each. In Spanish you probably won’t go very long before learning that “embarazada” means pregnant rather than embarrassed, but as I’ve discovered you can go quite some time saying “set” without realising that people are hearing something other than what you intended.


The other half of binary search

Binary search is one of those classic algorithms that most people who know about algorithms at all will know how to do (and many will even be able to implement correctly! Probably fewer than think they can though – it took me a long time to go from thinking I could implement binary search correctly to actually being able to implement it correctly).

Some of this is because the way people think about binary search is somewhat flawed. It’s often treated as being about searching sorted arrays, when that’s really only one application of it. So let’s start with a review of the right way to think about binary search.

We have two integers \(a\) and \(b\) (probably non-negative, but it doesn’t matter), with \(a < b\). We also have some function \(f\) defined on the integers \(a \leq i \leq b\), with \(f(a) \neq f(b)\). We want to find \(c\) with \(a \leq c < b\) such that \(f(c) \neq f(c + 1)\).

i.e. we’re looking to find a single exact point at which a function changes value. For functions that are monotonic (that is, either non-increasing or non-decreasing), this point will be unique, but in general it may not be.

To recover the normal idea of binary search, suppose we have some array \(xs\) of length \(n\). We want to find the smallest insertion point for some value \(v\).

To do this, we can use the function \(f(i)\) that returns whether \(xs[i] < v\). Either this function is constantly true (in which case every element is < v and v should be inserted at the end), constantly false (in which case v should be inserted at the beginning), or the index i with \(f(i) \neq f(i + 1)\) is the point after which \(v\) should be inserted.

This also helps clarify the logic for writing a binary search:

def binary_search(f, lo, hi):
    # Invariant: f(hi) != f(lo)
    while lo + 1 < hi:
        assert f(lo) != f(hi)
        mid = (lo + hi) // 2
        if f(mid) == f(lo):
            lo = mid
        else:
            hi = mid
    return lo

Every iteration we cut the interval in half - because we know the gap between lo and hi is at least two, the midpoint is strictly between them, so this must reduce the length. If \(f\) gives the same value to the midpoint as to lo, it must be our new lower bound, if not it must be our new upper bound (note that generically in this case we might not have \(f(mid) = f(hi)\), though in the typical case where \(f\) only takes two values we will).

Anyway, all of this is beside the point of this post; it's just scene setting.

Because the point of this post is this: Is this actually optimal?

Generically, yes it is. If we consider the functions \(f_k(i) = i < k\), each value we examine can only cut out half of these functions, so we must ask at least \(\log_2(b - a)\) questions, and in that sense binary search is optimal.

But that's the generic case, and in a lot of typical cases we have something else going for us: often we expect the change to happen early, i.e. for the change point to be very close to the lower bound. For example, suppose we were binary searching for a small value in a sorted list. Chances are good it's going to be a lot closer to the left hand side than the right, but we're going to do a full \(\log_2(n)\) calls every single time.

We can solve this by starting the binary search with an exponential probe - we try small values, growing the gap by a factor of two each time, until we find one that gives a different value. This then gives us a (hopefully smaller) upper bound, and a lower bound somewhat closer to it.

def exponential_probe(f, lo, hi):
    gap = 1
    while lo + gap < hi:
        if f(lo + gap) == f(lo):
            lo += gap
            gap *= 2
        else:
            return lo, lo + gap
    return lo, hi

We can then put these together to give a better search algorithm, by using the exponential probe as the new upper bound for our binary search:

def find_change_point(f, lo, hi):
    assert f(lo) != f(hi)
    return binary_search(f, *exponential_probe(f, lo, hi))
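
Tying this back to the sorted-array example from earlier, here is a sketch of how you might wrap this up to find an insertion point (assuming the definitions above; the edge-case handling is mine, so treat it as illustrative rather than battle-tested):

def find_insertion_point(xs, v):
    # Smallest index at which v can be inserted into the sorted list xs
    # while keeping it sorted, using f(i) = xs[i] < v as the search function.
    if not xs or not (xs[0] < v):
        return 0
    if xs[-1] < v:
        return len(xs)
    # Now f(0) is True and f(len(xs) - 1) is False, so a change point exists.
    return find_change_point(lambda i: xs[i] < v, 0, len(xs) - 1) + 1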

When the value found is near or after the middle, this will end up being more expensive by a factor of about two - we have to do an extra \(\log_2(n)\) calls to probe up to the midpoint - but when the heuristic wins it potentially wins big - it will often take the \(\log_2(n)\) factor (which although not huge can easily be in the 10-20 range for reasonable sized gaps) and turn it into 1 or 2. Complexity wise, this will run in \(O(\log(k - lo))\), where \(k\) is the value returned, rather than the original \(O(\log(hi - lo))\).

This idea isn't as intrinsically valuable as binary search, because it doesn't really improve the constant factors or the response to random data, but it's still very useful in a lot of real world applications. I first came across it in the context of timsort, which uses this to find a good merge point when merging two sublists in its merge step.

Edit to note: It was pointed out to me on Twitter that I'm relying on python's bigints to avoid the overflow problem that binary search will have if you implement it on fixed sized integers. I did know this at one point, but I confess I had forgotten. The above code works fine in Python, but if int is fixed size you want the following slightly less clear versions:

def midpoint(lo, hi):
    if lo <= 0 and hi >= 0:
        return (lo + hi) // 2
    else:
        return lo + (hi - lo) // 2

def binary_search(f, lo, hi):
    # Invariant: f(hi) != f(lo)
    while lo + 1 < hi:
        assert f(lo) != f(hi)
        mid = midpoint(lo, hi)
        if f(mid) == f(lo):
            lo = mid
        else:
            hi = mid
    return lo

def exponential_probe(f, lo, hi):
    gap = 1
    midway = midpoint(lo, hi)
    while True:
        if f(lo + gap) == f(lo):
            lo += gap
            if lo >= midway:
                break
            else:
                gap *= 2
        else:
            hi = lo + gap
            break
    return lo, hi

These avoid calculating any intermediate integers which overflow in the midpoint calculation:

  • If \(lo \leq 0\) and \(hi \geq 0\) then \(lo \leq hi + lo \leq hi\), so is representable.
  • If \(lo \geq 0\) then \(0 \leq hi - lo \leq hi\), so is representable.
  • If \(hi < 0\) (which forces \(lo < 0\) too) then \(0 \leq hi - lo \leq -1 - lo \leq\) INT_MAX, so is representable.

The reason we need the two different cases is that e.g. if \(lo\) were INT_MIN and \(hi\) were INT_MAX, then \(hi - lo\) would overflow but \(lo + hi\) would be fine. Conversely if \(lo\) were INT_MAX - 1 and \(hi\) were INT_MAX, \(hi - lo\) would be fine but \(hi + lo\) would overflow.

The following should then be a branch free way of doing the same:

def midpoint(lo, hi):
    large_part = lo // 2 + hi // 2
    small_part = ((lo & 1) + (hi & 1)) // 2
    return large_part + small_part

We calculate (x + y) // 2 as x // 2 + y // 2, and then we fix up the rounding error this causes by calculating the midpoint of the low bits correctly. The intermediate parts don't overflow because we know the first sum fits in \([lo, hi]\), and the second fits in \([0, 1]\). The final sum also fits in \([lo, hi]\) so also doesn't overflow.

I haven't verified this part too carefully, but Hypothesis tells me it at least works for Python's big integers, and I think it should still work for normal C integers.
