# Test-Case Selection and Choice Theory

Attention conservation notice: Honestly you probably don’t care about this post unless the title sounds really intriguing to you. I’m not sure this matters at all.

At its most abstract, the test case selection problem is a function $$T : \mathcal{P}(A) \setminus \{\emptyset\} \to A$$ such that $$T(U) \in U$$. i.e. we have some set $$A$$ of possible test cases, a bug that occurs in some non-empty subset of the test cases, and we want to select a test case to represent the bug. Call such a $$T$$ a choice function.

(The axiom of choice states that choice functions always exist, but our sets of test cases are actually always finite, so we don’t need that here).

This elides some complexity in that:

1. Our test case selection may be non-deterministic. In this case we can (by fixing the set of random choices made) treat this as a choice-function valued random variable, so it still reduces to this.
2. The test-case selection might fail. We can ignore this problem by assuming that it will succeed eventually (assuming it doesn’t do something silly) and just running it until it does. For some methods this may take rather a long time and we might have to worry about implementation details like the heat death of the universe, but ¯\_(ツ)_/¯

This is actually a coincidence of naming for the most part, but choice functions are interestingly related to choice theory: The problem of expressing a preference between alternatives, and describing axioms of “rationality” for doing so.

One interesting axiom of rationality is that of contraction consistency. Contraction consistency is the requirement that if $$T(A) \in B \subseteq A$$ then $$T(B) = T(A)$$. i.e. if you picked $$T(A)$$ as the best element of $$A$$, removing elements from $$A$$ that aren’t $$T(A)$$ shouldn’t change your opinion!

On the face of it this seems reasonable, but it actually imposes a very strong restriction on what $$T$$ can look like.

Theorem: If $$T$$ is contraction consistent then $$T(A) = \min\limits_\prec A$$ for some total order $$\prec$$.

Proof:

Define $$a \prec b$$ if $$a = T(\{a, b\})$$. This is antisymmetric because the set $$\{a, b\}$$ doesn’t depend on the order of $$a$$ and $$b$$, and reflexive because $$(T(\{a\}) = \in \{a\}$$, and total because $$T(\{a, b\} \in \{a, b\}$$ so either $$a \prec b$$ or $$b \prec a$$.

So to show that it’s a total order we now need to show that it’s transitive. Suppose $$a \prec b$$ and $$b \prec c$$. Then $$T(\{a, b, c\}) = a$$: If it were $$c$$ then this would violate contraction consistency when considering $$\{b, c\}$$, and if it were $$b$$ then it would violate it when considering $$\{a, b\}$$. Now, by constraction consistency, $$T(\{a, c\}) = T(\{a, b, c\}) = a$$, so $$a \prec c$$.

Now suppose $$b \in A$$. By constraction consistency, $$T(\{T(A), b\}) = T(A)$$. Therefore $$T(A) \prec b$$, and so $$T(A) = \min\limits_\prec A$$ as desired.

QED

What this means in practice is that the only contraction consistent test-case selection methods must be equivalent to bounded exhaustive enumeration: You have some total ordering over your test cases, and in order to return a test case $$a \in A$$ you must have verified that $$b \not\in A$$ for every $$b \prec a$$. This is potentially very expensive if all you have is membership queries for $$A$$! If you have more structure on $$A$$ (e.g. because of symbolic execution) then you could potentially rule out this happening without actually performing those membership queries.

If you adopt the classic approach of generating a random test case then running test-case reduction on it, you will typically require substantially fewer membership queries for the set, but the $$\prec$$ relationship in the above proof may not even be transitive!

Say, for example, that we have $$A = \mathbb{N}$$, and our test case reduction algorithm consists of iterating the operations $$n \to n / 2$$ and $$n \to n – 1$$ to a fixed point as long as they remain in the set. So for example if we have $$A$$ as the set of even numbers and start from $$8$$ then we will go $$8 \to 4 \to 2$$, having tried $$7, 3, 1$$ and concluded they weren’t in the set. If we’d started from $$10$$ though we’ve have become stuck because both $$10 – 1$$ and $$10 / 2$$ are odd. This is then an example of intransitivity, because we have $$5 \prec 10$$, and $$4 \prec 5$$, but $$4 \not\prec 10$$, and this intransitivity is in large part responsible for our failure to find a global minimum.

(Note that it would be perfectly possible to have $$\prec$$ be transitive and have $$T$$ not be contraction-consistent – just make $$T$$ do something different whenever $$|A| > 3$$. Contraction consistency implies the transitivity of $$\prec$$, but we still needed the full consistency to show that $$T(A)$$ was the $$\prec$$-minimum).

This is roughly equivalent to the observation that classic choice theory only really works for logically omniscient agents: Test-case reduction is actually a boundedly rational agent that is unable (or unwilling) to run exponential time algorithms, while a logically omniscient agent is perfectly happy to do that and considers it a free action.

There is also an interesting topological perspective that allows us to firm up what we mean by $$T$$ being “black box”, but I think I’ll leave that for another time.

This entry was posted in Numbers are hard, programming on by .

# Doing the Two-Step for Efficient Reversible Deletes

Attention conservation notice: I’m back to blogging about tech again. Sorry.

Credit: Thanks to Suzy Hamilton for useful information about dance types.

I recently read Donald Knuth’s dancing links paper. It was sent to me as an example of good writing, which I don’t entirely agree with (it’s locally good writing, but I didn’t really understand it until I rewrote the first couple of pages in my own words and notation and then it was fine), but regardless of whether it’s good writing it’s definitely an interesting technique that I was previously vaguely aware of but didn’t appreciate.

Roughly the idea of dancing links is this:

1. For efficiently implementing depth first search for solving some hard combinatorial problem (e.g. SAT solving, but also others. The main example of the paper is the exact cover problem) it is very useful to be able to have mutable data structures with undoable operations. In your recursive call you make a bunch of changes, and you undo them all before returning.
2. A particularly useful category of undoable change is to be able to maintain a list of elements that you can delete individual elements from, then undo those deletes.
3. A doubly-linked list allows you to efficiently implement that by removing a node from the list and just keeping it around (with its pointers still intact) and then just reinserting it when you undo. Apparently this is like the links in the doubly-linked list “dancing”, hence the name.

1 and 2 are interesting points that I think will have a significant impact on how I design things. I am very glad to have read the paper for that.

Point 3, eh. No thanks.

There are two problems with it:

1. Doubly-linked lists are a bullshit data structure
2. Point 1 of this list is in large part responsible for one of my main problems with wrapping my head around the paper: If you give me an interesting algorithmic approach then I’m going to want to wrap it up as an ADT so I can treat it in terms of an abstract set of operations it implements and reason about that separately from the algorithm. Doing this as a doubly linked list makes that hard because of what the deletion operation ends up looking like.

It turns out that it’s actually really easy to fix both these problems if you are willing to relax one of the constraints very slightly: The desire to keep all of the elements in a particular order.

There is a data structure that supports the following operations:

• Initialise with N elements in O(N).
• Get the current number of elements in O(1)
• Get the element at index i in O(1)
• Delete the element at index i, putting the element that was previously at index length – 1 in its place (or just delete it if i = length – 1) in O(1).
• Undo a single delete O(1)

So if you perform k deletes and then call undo k times, the list is back in the state it was before the deletes.

Moreover there is very little indirection involved (everything is just stored in an array) so algorithms using this involve almost no pointer-chasing, unlike a doubly linked list, and is likely substantially more efficient as a result (but I have done literally zero benchmarking to demonstrate this). It also uses less memory per item.

One important thing to note is that although the order of the elements changes, it does so in a sufficiently predictable way that forward iteration over the list is easy: You just iterate as normal, and if you want to delete the current element you just skip incrementing the loop index.

How does this data structure work?

1. Put all the desired elements in an array of length N. Store a length field and an undo stack. The length field keeps track of the end of the currently present elements – everything that has been deleted is kept in the array but stored starting from index length.
2. When you delete the element at index i, push i onto the undo stack, decrement the length by 1, and swap the elements at position i and position length.
3. When you undo, pop i from the undo stack, swap positions i and length, and then increment length.

That’s it. We’re done.

This is probably a known technique – it only took me about an hour of thinking about the problem to come up with it, so it would be frankly bizarre if nobody had noticed this since the dancing links technique was invented in 1979 – but some cursory googling didn’t find anything, so if nothing else this post will hopefully make the idea more visible and maybe someone who knows of it can send me a link (dancing or otherwise).

In general this technique of “You can remove an element from an array in O(1) if you don’t mind losing the order” is surprisingly powerful and I think underappreciated. It’s pretty straightforward once you see it, but allows for some pleasingly elegant data structure designs.

This entry was posted in programming on by .

# Monads as an interface for composition

Fair warning: I’m going to be talking about monads. It’s in the title and everything. This post is even a little bit of a monad tutorial, as I’m not going to assume that you know what a monad is, though that’s not the main point of it.

A thing I’ve been thinking about recently is that a lot of the success of QuickCheck and descendants as a way of getting people who are not testing specialists to use better testing techniques comes from the fact that random generation is a monad (QuickCheck’s use of it technically isn’t a monad, but the difference is unimportant enough that I can elide it for this post. You could fix this problem and literally nobody would notice).

A point in favour of this theory is that even though Hypothesis has abandoned normal random generation as its underlying method, it’s done so in favour of adopting a different monad (monadic parser combinators, if you’re curious).

Fundamentally the thing that makes QuickCheck so effective is that it is providing a library of composable data generators – you can take existing generators and turn them into generators for lists of that value, apply functions to them, etc.

This makes it extremely easy to define generators for your own data types, and the reason you can do that is fundamentally that the composition interface is the monad one.

When you first encounter Haskell you see something like the following:

main = do name <- getLine putStrLn $"Hello " ++ name And OK you might think it looks a little syntactically strange, but it basically looks like a normal program, and it’s only when people tell you that there’s something called a monad involved that it starts to seem like anything strange is going on. But the first intuition is correct: Nothing strange is going on, and this code really does do the obvious thing (there’s a little bit of non-obviousness for people in terms of the difference between “let x = y” and “x <- y”, but I’m going to ignore that). It really is “a normal program”. The thing that is unusual is not this program, but the fact that it is an instance of a more general one. Suppose you instead had the following: sayHello get put = do name <- get put$ "Hello " ++ name   main = sayHello getLine putStrLn

And now we have this split into two parts, with some code that is abstracted over how you get and put the name (maybe you want to run this as part of a website as well as a command line? I don’t know. It’s an artificial example. Bear with me here), and a main definition that calls that abstracted function with concrete implementations for terminal I/O that read and write lines.

This is all type inferred, but lets be explicit about the types:

sayHello :: IO String -> (String -> IO ()) -> IO () sayHello get put = do name <- get put $"Hello " ++ name main :: IO () main = sayHello getLine putStrLn So sayHello takes one argument that is an IO String (“an effectful program that returns a string”) and one argument that is a String -> IO () (“a function that takes a string and does some effects”), and main just passes the corresponding values to it for our command line program. This is just extracting a function from our original program, business as usual except that because it’s Haskell we have a specific type for “things that have effects” and we have all the usual higher order function goodness. Where monads come in is the observation that we’ve actually used nothing about IO in the definition of sayHello – all we’ve used is do notation and strings, so we could abstract it to anything that supports do notation, which is precisely the monad interface. sayHello :: (Monad m) => m String -> (String -> m ()) -> m () sayHello get put = do name <- get put$ "Hello " ++ name   main :: IO () main = sayHello getLine putStrLn

This is exactly the same program, all we’ve done is generalise the type signature for sayHello to only use the operations it requires from IO rather than the full complexity of the IO type.

(You could make this type signature more generic yet by replacing () with a type parameter, but I think the concrete version is clearer for explaining this point).

So that’s all that monads are: The type signature for things that support do notation (plus some laws that ensure do notation behaves reasonably sensibly), and do notation just looks like normal programs.

So what are monads? Well they’re a monoid in the category of endofunctors err a type class with the operations return and >>= satisfying the laws no basically a burrito in a space suit actually never mind it doesn’t matter what monads are, but the reason why monads are useful is that they represent data types that you can compose together with do notation, and do notation looks basically “like writing normal programs that return values”, which is extremely convenient because we can build on our normal experience and intuition about writing such programs.

That’s not to say monads are normal programs that return values, only that we can build them up as if we were. e.g. a monad might “return” no values. e.g. suppose we ran the above code with the maybe monad:

main :: IO () main = print \$ sayHello Nothing (\_ -> Just ())

This takes our existing sayHello function but the get argument always fails (returns Nothing) and the put argument just returns () without doing anything (because the only things it could do are either that or fail). This program now prints the string Nothing, because rather than ever “returning a value”, instead the computation “failed”. You could think of this as saying that normal programs with exceptions also compose like monads. Similarly a monad might “return” multiple values (e.g the list monad), do backtracking (a parsing monad), or many other things.

What any given monad actually does can be almost arbitrarily complicated, but the monad interface isolates us from that by allowing us to compose them together as if they were such programs anyway.

In QuickCheck this comes in by letting us compose generators together easily using do notation:

integerRange :: Int -> Int -> Gen Int integerRange = undefined -- Imagine we had a real implementation here     orderedPair :: Gen (Int, Int) orderedPair = do a <- integerRange 0 10 b <- integerRange a 10 return (a, b)

The access to do notation means that these generators compose more or less how we’d expect programs to without us having to have any knowledge of the underlying implementation or source of randomness. Or even how these will be run – in QuickCheck, Gen more or less is just a program that returns a random value, but e.g. in Hypothesis one of the possibilities of drawing is that it may fail because it violates some assumption or is unable to generate a good value, and that can just as easily be rolled into the monadic interface.

Python doesn’t have do notation for monads, but it does have free side effects, so in Hypothesis we use the @composite decorator instead of do notation, but the motivation for that was absolutely “Boy I wish there was a way to compose Hypothesis generators that was as easy as do notation”.

But if we don’t have do notation, what does it being a monad buy us?

Well, directly not much. Oh, sure, if you have monads as a first class concept in your programming language, you can define a whole bunch of useful functions that take generic monads, but they’re not that useful. They’re nice to haves. You can just write the special case version in a couple of lines if you really need to.

But even without do notation, it’s not too hard to use the monadic interface. We could write the above as

orderedPair = integerRange 0 10 >>= \a -> (integerRange a 10 >>= \b -> return (a, b))

This does the same thing as the do notation, and it just uses more explicit chaining. It’s unwieldy, but people have shown themselves perfectly able to translate the sort of normal program composition reasoning when writing promises, and before I added @composite to Hypothesis people were happy to use flatmap (which is the Hypothesis equivalent of >>=), and this is no different just because it’s not tied down to a specific monad implementation.

Even if you never consciously express the fact that your data type is a monad and can’t represent it in your programming language, the fact that it is one still puts it in a particular corner of the design space, and it’s a particularly attractive one because of how easy monads are to use.

This ease of use is then a lot of what has enabled the popularity of QuickCheck – because it’s so easy to define generators for your own data types, people can and do, and this then enables QuickCheck to be used in so many contexts.

It’s also what constrains QuickCheck to being a particular local maximum I think. Ensuring the monad property is hard, because many natural operations do not compose monadically. Shrinking is an example where the natural composition is not monadic – it doesn’t compose like a normal function, because you have to be able to “reverse the arrows” – to turn a  Shrinker a into a Shrinker b, you need to be able to transform both from a to b and from b to a, and monadic composition isn’t bidirectional like that. test.check and Hypothesis both build shrinking into the generation to make it monadic (though test.check struggles with actually making the monadic composition shrink well).

The way Hypothesis escapes this is by building in some of these operations at a lower level. Instead of operating in a monad representing a random number generator, it operates in a parsing monad. In a parsing monad, “running” the monad corresponds to reading some bytes of input and either failing with a parse error or returning a value (possibly multiple values if you want to handle ambiguity). Parsing monads are neat and you can see a proper example of them in things like parsec, but Hypothesis gets to be relatively simplistic in its implementation of them because it doesn’t have to do anything like back-tracking.

The reason you can use a parsing monad for this is because non-determinism is a form of parsing: A pseudo-random number generator is basically “just” an infinite stream of bits, and consuming those bits to get non-determinism is just parsing the stream. You can then extend this to a monad which parses finite sequences of bits by having the parse fail if you ever read try to read past the end of the sequence. Shrinking in Hypothesis then becomes a matter of trying to minimize these finite input bit sequences in the shortlex order (first minimize the length, then minimize lexicographically. The latter lets us do things like shrinking integers towards zero). This lets us escape this local maximum by basically avoiding all the hard parts of composition: We just need to be able to construct a value, and all of the things that we manipulate are of a single concrete type, so we never need to do the reverse transformation back into the original type because shrinking isn’t operating on values at all, just on the underlying representation that is being parsed.

One of these days (hopefully soon) I’ll publish a paper expanding on that, but that’s just a taster and not the important part of this post. The important part of this post is that the public interface of Hypothesis and QuickCheck being a monad gives them very powerful composition properties that makes them much easier to use.

If you’re not writing a fairly general purpose library, you probably don’t need to understand this (not that you shouldn’t try, you just don’t need to). Just use the monadic interface and don’t worry about it too much – people are doing this all the time with specific things that happen to be monads and it works out great for them.

But if you are writing a fairly general purpose library – promises, test-case generation, etc. then the question “Can I make this type a monad?” seems to push you into a particularly useful corner of the design space, because it enables users to build instances of your type in a very pleasant to use way.

This entry was posted in programming on by .

# (Some of) my problems with correctness research

Epistemic status: I think this is mostly right, but I’m sure there are gaps in my understanding of the field. Also you should be intrinsically suspicious of any argument that concludes that the author’s own field of research is going to change the world in a way that everyone else’s failed to do.

My previous post on looking for grandiose introductions in correctness research got a very mixed receptions. A lot of people who I normally mostly agree with objected to it, and several people agreed with it for reasons that I disagree with vehemently. This wasn’t the universal reaction, and there were plenty of decent responses to it, but it left me feeling that I’d been really unclear about what I actually meant, so this is my attempt to unpack it.

One response that I thought was a nice piece of writing in its own right but I didn’t really think worked as addressing my problems with correctness research was from John Regehr. To quote in full:

I tried to write a software correctness paper introduction that would make @DRMacIver happy:

The market for software, generally speaking, does not reward correct software so much as software that is acceptably correct and that can be built cheaply and rapidly. Thus, while our work on software correctness is highly unlikely to improve the quality of software, it may plausibly make it possible to create acceptable software a bit more cheaply and rapidly.

I broadly agree with this sentiment, though I flip-flop a bit on whether I think tools can actually make it possible to make more correct software or whether they just let you spend the time you’d have spent on correctness on other things. The technical term for this is Risk Compensation – when you increase safety measures, people behave in a more risky manner to improve performance. When this entirely compensates for the safety measure you have Risk Homeostasis. The general conclusion seems to be that you usually have Risk Compensation and rarely or never have Risk Homeostatis. (This paragraph has entirely exhausted my current knowledge of the subject, but I plan to fix that in the near future).

The main reason this doesn’t actually make me happy is that despite my previous post, it’s not really the introductions I object to. In many ways I like the introductions, and I’m certainly not above a bit of grandiosity myself. The Hypothesis documentation literally says that our goal in developing it is “to drag the world kicking and screaming into a new and terrifying age of high quality software“.

The problem is that I don’t think that correctness research is currently doing a very good job of matching its means to its ends.

there’s a difference between fundamental and applied research (with a fluid boundary). And then there’s “fake applied” research, which is applied to fake problems

(Paolo has asked me to add here that he has similar concerns to the ones I express below about the use of the word “fake” and that he was thinking of it in terms of “The Fake Applied Problem” in “The Craft of Research”)

I think this is a good distinction. I’d maybe hesitate to label any particular problem as “fake” – I can’t actually point to any specific research in the field of correctness and say “this problem doesn’t exist” (and even if I could I’d prefer not to do so in public), but I do think a lot of people are working on problems that most software developers wouldn’t really care about. It may still be perfectly good fundamental research, and it’s not necessarily obvious until you’ve made some progress in the problem whether it’s actually useful, but saying “Software is terrible and this is our solution to it” is not a great look if you’re working on something that will never actually solve the problem of bad software.

When I say “solve the problem” I do not mean that academics have an obligation to produce tools for industry. The job of an academic is not to do software engineering, it’s to explore the space of ideas. I do think that it’s good for academics to produce relatively solid and easy to use tools – I think it was John Regehr who pointed out to me that by doing so you make it much easier for other people to use you as a baseline and get a bajillion citations as a result, so there are even good incentives for this – but as I’ve discovered the hard way there’s a really large gap between doing that and actually making an impact on industry.

Instead I mean something closer to developing ideas such that you can “Just add engineering (and sales, and marketing, and all the rest)” to go from the research idea to actually producing such a tool. In this sense the original QuickCheck academia-solves the problem, and Quviq industry-solves the problem. Often you find out in the course of going from academia-solves to industry-solves that there were other problems you needed to academia-solve before you could complete the project, but that’s OK (personally I love it when this happens).

I think there is a lot of research that no matter how much engineering, sales, and marketing, you added to them, people just wouldn’t care about the solution. There is no path from academia-solves to industry-solves for this research, because the result would essentially have been yet another startup that spends five years trying to tell you about their great idea before ending up on Our Incredible Journey.

That may be fine! I’ve spent enough time as a pure mathematician to be entirely happy with people working on problems that will never be of use to anyone. Also they turn out to be useful after all surprisingly often because you can reuse their ideas in other areas that are useful.

But the disconnect between the promise (We want to change the world!) and the solutions (by implementing this tool that would make almost every software developer yawn and/or go “huh?” when we tried to explain it to them) does make me unhappy.

It’s not necessarily that I want any of the specific research I’m complaining about not to exist – like I said, I’m fine with it as foundational research. I even get why they feel the need to claim it’s practical research – there’s a certain social and financial obligation to do so in our field – and if they know that’s what they’re doing and are just gaming the system to do something they think is worthwhile, good for them.

But what makes me sad is that there are gaps in the research that I think would not be there if the crisis of buggy software were actually a problem we were trying to solve, and I think these gaps mostly exist because of misaligned incentives (admittedly I always think that).

The problem, I think, with most correctness research is that it has a major false consensus problem: Researchers believe (or at least act like they believe) that software developers at large are roughly like them, and so they build tools that they think are useful to those hypothetical versions of themself that just happen to be working on different problems.

This is by no means an academia specific problem. It’s a problem that almost every single software development team has to some greater or lesser extent. There’s a reason the user experience people have to shout “You Are Not The User” so often in meetings (if they even get invited to meetings, and if they’re not then you have real problems. PS. Your company does have user experience people, right?).

But I think it shows up in academic software correctness research in one really important way (I’m sure it shows up in many ways, but this is the one that hits closest to home for me): The Oracle Problem.

Colloquially, the Oracle Problem is the issue that you can’t tell whether a piece of software is correct unless you know what it’s supposed to do.

When testing, every test has two parts (these may be quite intertwined, but they’re still in some sense logically distinct:

1. A test case: the concrete set of inputs that the program is to be run on
2. A test oracle: a way of determining whether the program is incorrect on those inputs (typically this is only one way: i.e. the oracle can’t say for sure that the program is exactly right on those inputs, but it can prove various ways in which it might be wrong)

And this split also defines a very large split between academic research in software correctness and potential consumers of software correctness tools: For an academic, the problem of test-case generation is an interesting object of study which they know a great deal about, but the problem of finding a test-case oracle is typically boring and would require a great deal of non-research legwork to figure out what it is that this software is actually supposed to do.

Typically research works around this problem in one of a couple of ways:

1. Use a test oracle that points out only things that are obviously bad – segfaults, assertion failures, deadlocks, etc.
2. Use a specification to derive a test oracle (e.g. via metamorphic testing – finding changes to the input that should have no effect but do. My group at Imperial do a lot of this)
3. Use multiple implementations of the same specification to do differential testing – run each implementation on the same input and see if you get different results (this is typically how e.g Csmith is used)
4. Try to guess an oracle – e.g. assume that the current behaviour is correct and use it for regression testing, try to build a simple model of the behaviour and treat deviations as potential bugs, etc.

The first can be useful, especially if you’re writing in an unsafe language, and the second and third are very useful when you are in the limited set of circumstances to which they can be applied. I have yet to be convinced that I have ever had a use case that would be served by tools in the fourth category, although I do believe they exist.

If you want to know more about these solutions, I can recommend “The Oracle Problem in Software Testing: A Survey“. I won’t go into them more here.

But there’s a fifth category of solution to the problem (which is also mentioned in the survey article), which is the one that QuickCheck and its descendants adopt: Make the end user write the test oracle.

In many ways the most valuable insight implicit in the success of QuickCheck is that this order of priorities is reversed in software developers: Software developers don’t really want to write test-case generators, but as long as you let them write them in their host language then they don’t mind writing test oracles because they’re already used to doing that every time they write a test! (In contrast they do seem to mind writing test cases, which they also have to do when writing a test. I think the distinction is that each test oracle feels unique, while writing test cases feels like drudge work, but I’m speculating wildly here).

From my point of view, this is a great solution. If you want to adequately test the behaviour of a piece of software then at some point you’re going to have to find out what it is supposed to do. Why not just let the end user tell you that?

I also think that if we want to make a dent on the bad software problem then this is the only solution that can possibly work. The overwhelming majority of software does not have any specification to speak of, but many software developers are willing and able to write tests (much software is completely untested. There is almost nothing we can do as correctness researchers to fix this problem, because if the authors of said software cared about the problem then they probably wouldn’t be in this situation). So correctness tools which actually make an impact on real world correctness have to get the developers to say what it is the software is actually supposed to do (and doing it in their host language is a much easier sell, though doing it in a derived language is viable, and people have shown willing to write tests in languages like Gherkin).

So, great, we make the users do all the work on solving this hard problem and we go back to working on the bits that we care about. What’s the problem?

Well, there are two big problems:

1. It’s not the job of academia to build usable tools, and we/they usually don’t. So we don’t have any users to write these oracles.
2. Even when tools have users, getting good empirical information about them is like pulling teeth. Especially when the tool is integrated into their development workflow so they don’t necessarily notice specific events involving it.

My personal research solves the first problem by starting from a widely used (518 open source users and counting, squee) tool rather than a research project. As to how to solve the second problem? Uh, I’ll get back to you.

Unfortunately, this means that the most useful version of the problem to work on is one that it is actually very hard for academia to do anything with – we mostly don’t have users, and if we did have users it would be hard to get a good set of examples for empirical evaluation out of them (especially if they’re corporate users).

These problems are surmountable, but doing so is a lot more work than just picking an open source project, running your tool against it, and seeing what happens.

How much of a problem is this? I don’t know. One could argue that it’s not one:

• In some sense test-case generation is test-case generation, and a test oracle is just a way of making a program that crashes on some invalid behaviour, so there’s no reason you can’t take an arbitrary test-case generator and hook it up to an arbitrary “property-based test” that runs based on it.
• Certainly when writing such tools for industry (which there seems to be very little financial incentive to do, which is a whole different problem. I’m not any more impressed with software correctness industry than I am with software correctness academia), you can still draw on a lot of the academic research that didn’t have this sort of focus – e.g. I learned a lot from reading the compiler testing literature despite not doing compiler testing.
• The situation is not completely without hope – there’s a decent amount (though less than I’d like) of QuickCheck derived literature ticking along and e.g. Alex Groce’s TSTL is in another interesting corner of this space that isn’t a QuickCheck derived work (Update: Alex points out in the comments that in some sense it is a QuickCheck derived work, but I think it’s sufficiently different from QuickCheck in usage that I’d still count it as its own thing). That’s more or less it as far as examples I know about, but I hope there are others.

But I do think it creates certain significant gaps. For example:

1. There’s been very little research on making it easy to write good, composable, test-case generators outside the context of QuickCheck (and there hasn’t been that much in that context and I’m not super happy with any of the answers I’ve seen). Most research I’ve seen on test-case generation works on the assumption that there’s more or less only one thing that you’re generating.
2. It’s actually not true that a test-case generator is a test-case generator. Nezha is an example of how the fact that you’re doing differential testing can usefully inform your test case generation (mutational fuzzing in this case). Is the fact that we’re mostly testing on trivial properties in some way distorting the sorts of test cases we generate?

How do we fix this? Gods, I don’t know. The number of incentives you have to change to fix it is just appalling, and I lack both the lever and the fulcrum to move them.

I do think the sort of thing I’m doing with Hypothesis of starting a research program based on an existing widely used tool is probably a good one, and that open source can form a lot of the bridge – both by providing the engineering that it’s not academia’s job to provide (not that it’s our job to provide it either, we’re just suckers who like doing work for free. Go us), and also by providing a ready source of visible users who can at least lower the bar for evaluation (even if it’s still not easy), but I don’t think this is even close to sufficient.

So fixing it? Pass. I bring you problems, not solutions.

It sure would be nice if we could at least acknowledge that this is a problem though.

This entry was posted in programming on by .

# An Unkind Question about Software Correctness Research

Dear Lazyweb,

I’m about to be something of a jerk, and I’d like your help in doing it.

I have noticed a common pattern in a lot of software correctness research, which is that they often follow the following format:

1. Oh god software, it’s everywhere.
2. And it doesn’t work very well at all, amirite?
3. This is only going to get worse.
4. We’re basically all doomed.
5. Therefore in order to fix this we propose the following.
6. We have made an improvement.
7. To a category of tool that literally nobody uses.
8. Happy to help.
9. Love, The Authors.

What are your “favourite” examples of this genre? I’m particularly interested in examples from software testing research but I’ll take anything vaguely correctness related.

The “That nobody actually uses” part of it is also somewhat inessential – I’m mostly interested in good examples of the motivation section, and in many ways having examples from things people actually use is quite helpful because it gives me papers I can point to without feeling too mean about it!