# (Some of) my problems with correctness research

Epistemic status: I think this is mostly right, but I’m sure there are gaps in my understanding of the field. Also you should be intrinsically suspicious of any argument that concludes that the author’s own field of research is going to change the world in a way that everyone else’s failed to do.

My previous post on looking for grandiose introductions in correctness research got a very mixed receptions. A lot of people who I normally mostly agree with objected to it, and several people agreed with it for reasons that I disagree with vehemently. This wasn’t the universal reaction, and there were plenty of decent responses to it, but it left me feeling that I’d been really unclear about what I actually meant, so this is my attempt to unpack it.

One response that I thought was a nice piece of writing in its own right but I didn’t really think worked as addressing my problems with correctness research was from John Regehr. To quote in full:

I tried to write a software correctness paper introduction that would make @DRMacIver happy:

The market for software, generally speaking, does not reward correct software so much as software that is acceptably correct and that can be built cheaply and rapidly. Thus, while our work on software correctness is highly unlikely to improve the quality of software, it may plausibly make it possible to create acceptable software a bit more cheaply and rapidly.

I broadly agree with this sentiment, though I flip-flop a bit on whether I think tools can actually make it possible to make more correct software or whether they just let you spend the time you’d have spent on correctness on other things. The technical term for this is Risk Compensation – when you increase safety measures, people behave in a more risky manner to improve performance. When this entirely compensates for the safety measure you have Risk Homeostasis. The general conclusion seems to be that you usually have Risk Compensation and rarely or never have Risk Homeostatis. (This paragraph has entirely exhausted my current knowledge of the subject, but I plan to fix that in the near future).

The main reason this doesn’t actually make me happy is that despite my previous post, it’s not really the introductions I object to. In many ways I like the introductions, and I’m certainly not above a bit of grandiosity myself. The Hypothesis documentation literally says that our goal in developing it is “to drag the world kicking and screaming into a new and terrifying age of high quality software“.

The problem is that I don’t think that correctness research is currently doing a very good job of matching its means to its ends.

there’s a difference between fundamental and applied research (with a fluid boundary). And then there’s “fake applied” research, which is applied to fake problems

(Paolo has asked me to add here that he has similar concerns to the ones I express below about the use of the word “fake” and that he was thinking of it in terms of “The Fake Applied Problem” in “The Craft of Research”)

I think this is a good distinction. I’d maybe hesitate to label any particular problem as “fake” – I can’t actually point to any specific research in the field of correctness and say “this problem doesn’t exist” (and even if I could I’d prefer not to do so in public), but I do think a lot of people are working on problems that most software developers wouldn’t really care about. It may still be perfectly good fundamental research, and it’s not necessarily obvious until you’ve made some progress in the problem whether it’s actually useful, but saying “Software is terrible and this is our solution to it” is not a great look if you’re working on something that will never actually solve the problem of bad software.

When I say “solve the problem” I do not mean that academics have an obligation to produce tools for industry. The job of an academic is not to do software engineering, it’s to explore the space of ideas. I do think that it’s good for academics to produce relatively solid and easy to use tools – I think it was John Regehr who pointed out to me that by doing so you make it much easier for other people to use you as a baseline and get a bajillion citations as a result, so there are even good incentives for this – but as I’ve discovered the hard way there’s a really large gap between doing that and actually making an impact on industry.

Instead I mean something closer to developing ideas such that you can “Just add engineering (and sales, and marketing, and all the rest)” to go from the research idea to actually producing such a tool. In this sense the original QuickCheck academia-solves the problem, and Quviq industry-solves the problem. Often you find out in the course of going from academia-solves to industry-solves that there were other problems you needed to academia-solve before you could complete the project, but that’s OK (personally I love it when this happens).

I think there is a lot of research that no matter how much engineering, sales, and marketing, you added to them, people just wouldn’t care about the solution. There is no path from academia-solves to industry-solves for this research, because the result would essentially have been yet another startup that spends five years trying to tell you about their great idea before ending up on Our Incredible Journey.

That may be fine! I’ve spent enough time as a pure mathematician to be entirely happy with people working on problems that will never be of use to anyone. Also they turn out to be useful after all surprisingly often because you can reuse their ideas in other areas that are useful.

But the disconnect between the promise (We want to change the world!) and the solutions (by implementing this tool that would make almost every software developer yawn and/or go “huh?” when we tried to explain it to them) does make me unhappy.

It’s not necessarily that I want any of the specific research I’m complaining about not to exist – like I said, I’m fine with it as foundational research. I even get why they feel the need to claim it’s practical research – there’s a certain social and financial obligation to do so in our field – and if they know that’s what they’re doing and are just gaming the system to do something they think is worthwhile, good for them.

But what makes me sad is that there are gaps in the research that I think would not be there if the crisis of buggy software were actually a problem we were trying to solve, and I think these gaps mostly exist because of misaligned incentives (admittedly I always think that).

The problem, I think, with most correctness research is that it has a major false consensus problem: Researchers believe (or at least act like they believe) that software developers at large are roughly like them, and so they build tools that they think are useful to those hypothetical versions of themself that just happen to be working on different problems.

This is by no means an academia specific problem. It’s a problem that almost every single software development team has to some greater or lesser extent. There’s a reason the user experience people have to shout “You Are Not The User” so often in meetings (if they even get invited to meetings, and if they’re not then you have real problems. PS. Your company does have user experience people, right?).

But I think it shows up in academic software correctness research in one really important way (I’m sure it shows up in many ways, but this is the one that hits closest to home for me): The Oracle Problem.

Colloquially, the Oracle Problem is the issue that you can’t tell whether a piece of software is correct unless you know what it’s supposed to do.

When testing, every test has two parts (these may be quite intertwined, but they’re still in some sense logically distinct:

1. A test case: the concrete set of inputs that the program is to be run on
2. A test oracle: a way of determining whether the program is incorrect on those inputs (typically this is only one way: i.e. the oracle can’t say for sure that the program is exactly right on those inputs, but it can prove various ways in which it might be wrong)

And this split also defines a very large split between academic research in software correctness and potential consumers of software correctness tools: For an academic, the problem of test-case generation is an interesting object of study which they know a great deal about, but the problem of finding a test-case oracle is typically boring and would require a great deal of non-research legwork to figure out what it is that this software is actually supposed to do.

Typically research works around this problem in one of a couple of ways:

1. Use a test oracle that points out only things that are obviously bad – segfaults, assertion failures, deadlocks, etc.
2. Use a specification to derive a test oracle (e.g. via metamorphic testing – finding changes to the input that should have no effect but do. My group at Imperial do a lot of this)
3. Use multiple implementations of the same specification to do differential testing – run each implementation on the same input and see if you get different results (this is typically how e.g Csmith is used)
4. Try to guess an oracle – e.g. assume that the current behaviour is correct and use it for regression testing, try to build a simple model of the behaviour and treat deviations as potential bugs, etc.

The first can be useful, especially if you’re writing in an unsafe language, and the second and third are very useful when you are in the limited set of circumstances to which they can be applied. I have yet to be convinced that I have ever had a use case that would be served by tools in the fourth category, although I do believe they exist.

If you want to know more about these solutions, I can recommend “The Oracle Problem in Software Testing: A Survey“. I won’t go into them more here.

But there’s a fifth category of solution to the problem (which is also mentioned in the survey article), which is the one that QuickCheck and its descendants adopt: Make the end user write the test oracle.

In many ways the most valuable insight implicit in the success of QuickCheck is that this order of priorities is reversed in software developers: Software developers don’t really want to write test-case generators, but as long as you let them write them in their host language then they don’t mind writing test oracles because they’re already used to doing that every time they write a test! (In contrast they do seem to mind writing test cases, which they also have to do when writing a test. I think the distinction is that each test oracle feels unique, while writing test cases feels like drudge work, but I’m speculating wildly here).

From my point of view, this is a great solution. If you want to adequately test the behaviour of a piece of software then at some point you’re going to have to find out what it is supposed to do. Why not just let the end user tell you that?

I also think that if we want to make a dent on the bad software problem then this is the only solution that can possibly work. The overwhelming majority of software does not have any specification to speak of, but many software developers are willing and able to write tests (much software is completely untested. There is almost nothing we can do as correctness researchers to fix this problem, because if the authors of said software cared about the problem then they probably wouldn’t be in this situation). So correctness tools which actually make an impact on real world correctness have to get the developers to say what it is the software is actually supposed to do (and doing it in their host language is a much easier sell, though doing it in a derived language is viable, and people have shown willing to write tests in languages like Gherkin).

So, great, we make the users do all the work on solving this hard problem and we go back to working on the bits that we care about. What’s the problem?

Well, there are two big problems:

1. It’s not the job of academia to build usable tools, and we/they usually don’t. So we don’t have any users to write these oracles.
2. Even when tools have users, getting good empirical information about them is like pulling teeth. Especially when the tool is integrated into their development workflow so they don’t necessarily notice specific events involving it.

My personal research solves the first problem by starting from a widely used (518 open source users and counting, squee) tool rather than a research project. As to how to solve the second problem? Uh, I’ll get back to you.

Unfortunately, this means that the most useful version of the problem to work on is one that it is actually very hard for academia to do anything with – we mostly don’t have users, and if we did have users it would be hard to get a good set of examples for empirical evaluation out of them (especially if they’re corporate users).

These problems are surmountable, but doing so is a lot more work than just picking an open source project, running your tool against it, and seeing what happens.

How much of a problem is this? I don’t know. One could argue that it’s not one:

• In some sense test-case generation is test-case generation, and a test oracle is just a way of making a program that crashes on some invalid behaviour, so there’s no reason you can’t take an arbitrary test-case generator and hook it up to an arbitrary “property-based test” that runs based on it.
• Certainly when writing such tools for industry (which there seems to be very little financial incentive to do, which is a whole different problem. I’m not any more impressed with software correctness industry than I am with software correctness academia), you can still draw on a lot of the academic research that didn’t have this sort of focus – e.g. I learned a lot from reading the compiler testing literature despite not doing compiler testing.
• The situation is not completely without hope – there’s a decent amount (though less than I’d like) of QuickCheck derived literature ticking along and e.g. Alex Groce’s TSTL is in another interesting corner of this space that isn’t a QuickCheck derived work (Update: Alex points out in the comments that in some sense it is a QuickCheck derived work, but I think it’s sufficiently different from QuickCheck in usage that I’d still count it as its own thing). That’s more or less it as far as examples I know about, but I hope there are others.

But I do think it creates certain significant gaps. For example:

1. There’s been very little research on making it easy to write good, composable, test-case generators outside the context of QuickCheck (and there hasn’t been that much in that context and I’m not super happy with any of the answers I’ve seen). Most research I’ve seen on test-case generation works on the assumption that there’s more or less only one thing that you’re generating.
2. It’s actually not true that a test-case generator is a test-case generator. Nezha is an example of how the fact that you’re doing differential testing can usefully inform your test case generation (mutational fuzzing in this case). Is the fact that we’re mostly testing on trivial properties in some way distorting the sorts of test cases we generate?

How do we fix this? Gods, I don’t know. The number of incentives you have to change to fix it is just appalling, and I lack both the lever and the fulcrum to move them.

I do think the sort of thing I’m doing with Hypothesis of starting a research program based on an existing widely used tool is probably a good one, and that open source can form a lot of the bridge – both by providing the engineering that it’s not academia’s job to provide (not that it’s our job to provide it either, we’re just suckers who like doing work for free. Go us), and also by providing a ready source of visible users who can at least lower the bar for evaluation (even if it’s still not easy), but I don’t think this is even close to sufficient.

So fixing it? Pass. I bring you problems, not solutions.

It sure would be nice if we could at least acknowledge that this is a problem though.

This entry was posted in programming on by .

## 9 thoughts on “(Some of) my problems with correctness research”

1. Alex Groce

David, very nice post — I have to think about longer responses, but I’ll note technically TSTL is probably somewhat QuickCheck derived, in that I admired QuickCheck. But since I cut my teeth testing file systems at JPL, I wondered “what would make writing a file system tester not hellish?” and then decided to grant myself the one false premise that file systems were actually written in Python, not C.

1. david Post author

Oh, sure. I know you know about QuickCheck, and I didn’t imagine the TSTL work appeared in a vacuum, but in that sense almost everything is QuickCheck derived. :-) QuickCheck is “just” random testing + test case reduction + usability, and it’s had enough influence on the field that almost everything has or should have been touched by it!

I do think TSTL fits into a different corner of the design space in a way that something like Hypothesis doesn’t though – Hypothesis has very little implementation in common with QuickCheck, but shares most of the same aesthetics, while TSTL is a very different interface to the problem.

Out of curiousity, given its lineage in file system testing, what’s the historical relationship (if any) between TSTL and eXplode (https://dl-acm-org.iclibezp1.cc.ic.ac.uk/citation.cfm?id=1298469)?

1. Alex Groce

No real historical relationship; I’d read the paper, and I think talked to Dawson some about it at the time, but I wasn’t thinking of eXplode at all. The JPL file system testers were either pure random testing with hand-tuned probabilities, or later a SPIN model that could do quite a few things, including random testing. For one thing, iirc, eXplode is quite breadth-first-search oriented, right? We had a very DFS-centric approach at JPL, and that came into TSTL. Really, TSTL was just the combination of:

– SPIN’s embedding of C, “build a custom tool for every harness”, and finite nondeterministic choice (so TSTL from the start aimed to force you to stay finite breadth, and originally was going to be used to build model checkers too; except the model checkers were always hopelessly inefficient)

– adaptation-based programming, which also emphasized a finite set of actions to use in the table for Q-learning

– the pool-based concept of unit tests Jamie Andrews really clarified, that was what Randoop etc. were using

– memories of what was painful (other than writing in C, which is a fine language for some things, but not that) about writing the original tester and even the SPIN code at JPL

2. pozorvlak

The market for software, generally speaking, does not reward correct software so much as software that is acceptably correct and that can be built cheaply and rapidly.

Or as the Camel Book puts it, “a correct Perl program is one that does the job before your boss ﬁres you.” Like many Perl quotes, this is (a) funny, (b) somewhat horrifying, (c) likely to be used as a stick with which to beat Perl, but (d) much more widely applicable. Software that is delivered too late to be useful because it hews to unnecessarily high standards of correctness has failed, but software that does not meet high-enough standards of correctness has not done the job.

On test oracles, “assume that the current behaviour is correct and use it for regression testing” has at least two important use-cases, both of which I’ve encountered:
I need to refactor this legacy code to make it testable, and I don’t want to change its behaviour while doing so because I don’t yet understand what it’s supposed to do.Third-party clients rely on undocumented behaviour of this system; I can’t afford to break them, but I do need to refactor/replace/add functionality to it.

3. Alex Groce

“Is the fact that we’re mostly testing on trivial properties in some way distorting the sorts of test cases we generate?”

My guess would be (sort of) no, and probably yes. On the one hand, for many SUTs, just getting coverage of the code under test is challenging, and advances that can be shown to improve mean coverage for many SUTs by X, given time budget Y, are of general practical and theoretical interest.

But all coverage is not equal, and you can (for example) fuzz a file system by generating one meeelion paths where mount fails in fascinating and unlikely ways, or get less coverage (in some cases) by generating a limited set of likely-valid mountable images (or even just an empty filesystem) and then executing some operations and doing a differential test. The latter is often going to find both more bugs and more interesting bugs, in some settings.

Coverage != interesting behavior, and exploring the meat of a system is usually more interesting than exploring the parser, etc.

4. david Post author

> My guess would be (sort of) no, and probably yes. On the one hand, for many SUTs, just getting coverage of the code under test is challenging, and advances that can be shown to improve mean coverage for many SUTs by X, given time budget Y, are of general practical and theoretical interest.

Yeah, coverage-maximizing is definitely interesting in its own right, but I think there are interesting categories of test where it’s mostly useless and you really need to know what is being tested in order to do anything, and I’m not sure much is being studied there at all (with the targeted property-based testing work maybe serving as one counterexample).

For example, testing numeric code. If you’ve got a high performance numeric routine, there’s probably very little branching going on at all, so coverage maximizing is almost trivial, but there’s an awful lot of weird corner cases that you need to think about – what if you underflow here, what if you overflow there, etc. These never show up as coverage events (although I guess you could do new instrumentation so that they do!), but they still have a significant impact on your testing.

ETA: I don’t know how much the property matters here. My suspicion is “some”. It might be that all that’s needed is better coverage tracking and existing coverage maximizing routines are fine here, property irrespective, but again the targeted property-based testing work might serve as a counter-example where it does matter.

1. pozorvlak

> For example, testing numeric code. If you’ve got a high performance numeric routine, there’s probably very little branching going on at all, so coverage maximizing is almost trivial, but there’s an awful lot of weird corner cases that you need to think about

Apparently the hardware people handle this by defining attributes of the state (eg “x == INT_MAX”, “y is near zero”) in a special verification language, and then specifying which combinations of those attributes are of interest. They then measure coverage of the set of interesting-combinations-of-attributes, and try to maximise it using a combination of clever algorithms (ML, formal models, concolic testing, etc) and lots and lots of test runs. More here. From a quick scan of Löscher and Sagonas’s 2017 paper, it looks to me like the difference between this and targeted property-based testing (and Hypothesis’ assume) is that you can define many disjoint regions of interest, and attempt to cover them with a single test run – if your generated examples lie in region B rather than region A, you mark region B as covered rather than saying “we were looking for region A examples, better throw this one away”.

5. Alex Groce

Yeah, in theory any (safety) property can be converted into a branch on assert(false) and so, hey, presto, branch coverage is all you need, whatever the Beatles say. But in practice, nobody much does this, so there’s room for improvement. Also, there are interesting properties that are at least expensive and unpleasant to turn into a branch. For example, differential testing.