Linear time (ish) test case reduction

A problem I’m quite interested in, primarily from my work on Hypothesis, is test case reduction: Taking an example that produces a bug, and producing a smaller version that triggers the same bug.

Typically a “test case” here means a file that, when fed to a program, triggers a crash or some other wrong behaviour. In the abstract, I usually think of sequences as the prototypical target for test case reduction. You can view a file as a sequence in a number of usefully different ways – e.g. as bytes, as lines, etc. – and they are all amenable to broadly similar algorithms.

The seminal work on test-case reduction is Zeller and Hildebrandt’s “Simplifying and Isolating Failure-Inducing Input”, in which they introduce delta debugging. Delta debugging is essentially an optimisation to the greedy algorithm which removes one item at a time from the sequence and sees if that still triggers the bug. It repeatedly applies this greedy algorithm to increasingly fine partitions of the test case, until the partition is maximally fine.
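The details of delta debugging itself don’t matter much for what follows, but for intuition, here is a rough sketch of the partition idea in Python. This is my own simplification, not the algorithm from the paper: real delta debugging also tries keeping each chunk on its own, and it adjusts the granularity slightly differently.

def ddmin_sketch(test_case, predicate, granularity=2):
    # Split the test case into roughly `granularity` chunks and try
    # deleting each chunk in turn. If no chunk can be deleted, refine
    # the partition until the chunks are single elements.
    while len(test_case) >= 2:
        chunk = max(1, len(test_case) // granularity)
        deleted = False
        for start in range(0, len(test_case), chunk):
            attempt = test_case[:start] + test_case[start + chunk:]
            if predicate(attempt):
                test_case = attempt
                deleted = True
                break
        if deleted:
            continue
        if chunk == 1:
            break
        granularity = min(len(test_case), granularity * 2)
    return test_case

Note that once the chunks are single elements this degenerates into exactly the one-at-a-time greedy loop discussed below, restart-from-the-start and all.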

Unfortunately the greedy algorithm as described in their paper, and as widely implemented, contains an accidentally quadratic bug. This problem is not present in the reference implementation, but it is present in many implementations of test case reduction found in the wild, including Berkeley delta and QuickCheck. Hypothesis gets this right, and has for so long that I forgot until quite recently that this wasn’t more widely known.

To see the problem, let’s look at a concrete implementation. In Python, the normal implementation of the greedy algorithm looks something like this:

def greedy(test_case, predicate):
    while True:
        for i in range(len(test_case)):
            attempt = list(test_case)
            del attempt[i]
            if predicate(attempt):
                test_case = attempt
                break
        else:
            break
    return test_case

We try deleting each index in turn. If that succeeds, we start again. If not, we’re done: No single item can be deleted from the list.

But consider what happens if our list is, say, the numbers from 1 through 10, and we succeed at deleting a number from the list if and only if it’s even.

When we run this algorithm we try the following sequence of deletes:

  • delete 1 (fail)
  • delete 2 (succeed)
  • delete 1 (fail)
  • delete 3 (fail)
  • delete 4 (succeed)
  • delete 1 (fail)

Every time we succeed in deleting an element, we start again from scratch. As a result, we have a classic accidentally quadratic bug.
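To make the blow-up concrete, here’s a small counting harness (my own illustration, not code from the post). The predicate encodes “the bug is still present as long as every odd number survives”, so deleting an even number succeeds and deleting an odd number fails, exactly as in the sequence above:

def count_calls(reducer, n):
    # Reduce the list [1, ..., n] and count predicate invocations.
    odds = set(range(1, n + 1, 2))
    calls = 0

    def predicate(xs):
        nonlocal calls
        calls += 1
        # "Still triggers the bug" iff all the odd numbers survive.
        return odds.issubset(xs)

    reducer(list(range(1, n + 1)), predicate)
    return calls

count_calls(greedy, n) grows roughly quadratically in n with the implementation above, and roughly linearly with the fixed implementation below.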

This is easy to avoid though. Instead of starting again from scratch, we continue at our current index:

def greedy(test_case, predicate):
    deleted = True
    while deleted:
        deleted = False
        i = 0
        while i < len(test_case):
            attempt = list(test_case)
            del attempt[i]
            if predicate(attempt):
                # On success we keep i the same: the next element has
                # shifted down into position i, so it gets tried next.
                test_case = attempt
                deleted = True
            else:
                i += 1
    return test_case

At each stage we either successfully reduce the size of the test case by one or we advance the loop counter by one, so this loop always makes progress. It stops in the same case as before (when we make it all the way through with no deletions), so it still achieves the same minimality guarantee: we can’t delete any single element.

The reason this is only linear time “ish” is that you might need to make up to n passes of the outer loop (this is also true of the original algorithm), because deleting an element might unlock another element. Consider e.g. the predicate that our sequence must consist of the numbers 1, …, k for some k – we can only ever delete the last element, so each pass scans the whole remaining list to delete a single element, and both versions take quadratically many steps. It is also possible, though contrived, for this version to lose to the original: if each successful deletion only unlocks a deletion at an index before our current position – say the elements may only be removed from the front of the list in the order 2, 1, 4, 3, 6, 5, … – then this algorithm wastes most of a pass every couple of deletions and is quadratic, while the original, which always restarts from the front, is linear.

I believe the quadratic case to be essential, and it’s certainly essential if you consider only algorithms that are allowed to delete at most one element at a time (an adversary can always arrange for the last deletion tried from any given sequence to be the one that succeeds). Anecdotally, though, most test cases found in the wild don’t have anywhere near this degree of dependency among their elements, and indices that previously failed to delete tend to keep failing to delete.

A model with a strong version of this as its core assumption shows up the complexity difference: suppose you have \(k\) indices out of \(n\) which are essential, and every other index can be deleted. Then this algorithm will always run in \(n + k\) steps (\(n\) steps through the list the first time, \(k\) steps at the end to verify that nothing more can be deleted), regardless of where the essential indices sit. The classic greedy algorithm, by contrast, re-tests every essential index that comes before the element it is about to delete, so its cost depends on the arrangement: roughly \(n\) steps if the essential core sits at the end of the list, but \(\Theta(nk)\) steps if it sits at the front.
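Here is a similar counting sketch (again mine, purely illustrative) for that model, with the essential core placed at the front of the list – the worst arrangement for the classic version:

def count_calls_with_core(reducer, n, k):
    # The first k values are essential; every other value may be deleted.
    core = set(range(1, k + 1))
    calls = 0

    def predicate(xs):
        nonlocal calls
        calls += 1
        return core.issubset(xs)

    reducer(list(range(1, n + 1)), predicate)
    return calls

With n = 1000 and k = 10 the fixed version makes n + k = 1010 predicate calls, while the classic version makes (k + 1)(n - k) + k = 10,900.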

Although this model is overly simplistic, examples found in the wild anecdotally tend to have something like this structure, which acts as an “envelope” of the test case – there’s some large easily deletable set and a small essential core. Once you’re down to the core you have to do more work to find deletions, but getting down to the core is often the time-consuming part of the process.

As a result I would be very surprised to find any instances in the wild where switching to this version of the algorithm was a net loss.


Startup Life

Regular readers of this blog who quite sensibly don’t use Twitter (or do use Twitter and don’t follow me there) have probably not noticed that I have a new short story out: Startup Life.

As with most of my fiction, this one is about exploring important philosophical and political questions.

Specifically it’s about exploring the important political and philosophical question: What would happen if Dr Vicky Frankenstein joined Vampire Ada Lovelace’s cybernetics startup?

Um. OK. Maybe it’s not that important or philosophical. But I’m really pleased with how it turned out and if you like my fiction at all I recommend reading it.


Life Changes Announcement: Academia Edition

This is just a small blog post to let people know what’s going on in my life.

I wrote a little while ago that I was thinking about starting doing a PhD. Well, thought turned into action remarkably quickly, and in October I will be starting a PhD with Dr Alistair Donaldson’s Multicore Programming Group at Imperial College London.

Despite the group title, my focus will be almost entirely on the sorts of things I’ve already been doing for the last few years and probably won’t involve that much multicore programming at all (I’m interested in them because they do a lot of work on language based fuzzing for GPUs, and also because Ally and I really hit it off and there’s a bunch of other interesting related work at Imperial). I’ll be focusing on Hypothesis and related technology, but with a goal of validating and publishing some of the research I’ve been doing on it from a more academic standpoint.

Separately, though not unrelatedly, I’ve also started a part-time research assistant job at Imperial with the same group. I’ll be doing three days a week there for the next three months, helping them with some of their work on GLFuzz (I somehow convinced Ally I knew a thing or two about programming in general and writing fuzzers in particular. Not sure how). So I am now officially an academic.

Feels weird, to be honest. Also not really sure it’s sunk in yet.

For now this doesn’t mean I’ll be around London any more often than I have been, as it will mostly be a remote working position, but come October or so when I start my PhD I will be moving back. I have decidedly mixed feelings about this, but it will be very good to see my friends and family who live there more often.


Programmer at Large: Can we not?

This is the latest chapter in my web serial, Programmer at Large. The first chapter is here and you can read the whole archives here or on the Archive of Our Own mirror. This chapter is also mirrored at Archive of Our Own.


“I think that went well, don’t you?”

I let out a huff of frustration.

“What on the ground was that about?”

Sam cuddled up closer into a two person conversation stance.

“Brian is just a bit… sensitive about the subject of sex. Not sure why. Their profile says they have a high disgust reflex, but I’ve never been sure if it says that because of the sex thing or whether it’s the reason for it.”

“OK, but call me off consensus here, but wasn’t that just straight up rumour mongering? We’re not supposed to do that, right? What am I missing?”

Sam grimaced and (check tagging) looked uncomfortable.

“There are… special cases. Sex is one of them. Kimiko could legally challenge Brian about this if they wanted, but without that Brian is technically within socially acceptable bounds, though they’re being a bit gauche about it.”

I called up Kimiko’s social graph into a shared workspace and highlighted ties that had risen and then fallen. Brian was connected to about a third of them. I switched it to timeline mode and the pattern was even clearer.

“Gauche? Look at that. This isn’t gauche, this is practically aggression!”

I was getting mood warnings again, and Sam was stroking my back in a calming manner.

“Arthur, calm down. This isn’t your problem.”

“How is this not my problem? This is a clear social breakdown on the ship! Those are everybody’s problem! It says so right there in the charter!”

“Look, it’s complicated.”

“That’s what people always say when they think the rules don’t apply to them!”

Sam grimaced again and shook their head.

“Ugh, Arthur, I can’t do this right now. I’m sorry. I’m not angry at you, and I understand why you feel this way, but this is a much higher effort conversation than I have the capacity for at the moment. Can we drop it?”

Sam and I have had more than a few conversations like this, and they probably could tell how this one was going to go.

The problem was that it was hitting right at the core of the things I find hardest about shipboard society.

The goals of our charter are mostly worthy, and the rules it defines are mostly a fair way of achieving those goals. It’s not perfect, but nothing created by humanity is. We’ve learned that the hard way.

The official rules are sometimes very hard for me to follow, but I accept their legitimacy and where I struggle too much I have software to help keep me on track.

But then there are all the unofficial rules, which are impossible to keep straight because nobody knows what they are in any formal sense, they just somehow magically know how to follow them. Every time I ask people to explain, we just both get frustrated – they tell me things they want to be true, and I get mad because they obviously aren’t.

And when the unofficial rules override the official rules you get this sort of completely hypocritical situation where people just say “it’s complicated” and can’t really explain why.

But it wasn’t Sam’s fault the ship was like this, and I could definitely understand not feeling able to talk about it. Even if it wasn’t basic courtesy, I’d respect that.

I was glad they had clarified they weren’t angry with me though.

I took a deep breath, pulled myself together, and nodded acceptance.

“Of course. Sorry I got carried away there, but this stuff… gets to me.”

“I understand. It gets to me too sometimes.”

“It’s fine to not talk about it, but then I need to not talk right now. I think this is going to be on my mind for a while and I’m probably not going to be able to stay off the subject.”

“That’s fine. Do you want to go? Or should we hang out together quietly for a bit?”

I hesitated briefly, but the need to show Sam that I wasn’t angry at them won out.

“I don’t need to be anywhere, and the company would be nice if you’re still willing.”

“Of course I am! And I’ve got plenty of quiet things I can get on with, so this works well.”

“Great.”

We shifted around a bit so we weren’t directly facing each other and could each have a hand free to work with.

The first thing I did was drop a note in Kimiko’s inbox saying I’d like to talk to them at their convenience. I flagged it as low-urgency but mildly important. Their status showed as busy at the moment, so they wouldn’t get the notification until later.

The second thing I did was start calling up social network diagrams.

Kimiko was indeed poorly connected to a lot of the rest of the crew. They had the obvious contacts in the biology sections, and there was a fairly densely connected group of about fifty people that they were part of that didn’t have any obvious reason for the connection.

I guessed that was probably the sexual subset of the crew, assuming Brian hadn’t simply been wrong or lying.

The network structure here was weird. The group was much more densely connected relative to its external connections than a group this size should be. It looked a lot like a clique or a minority interest group, and we weren’t supposed to have those. I looked up the various group metrics to see why it hadn’t been flagged.

Apparently the answer was that it consistently sat just under the alerting threshold on every single metric – slightly too large to be a clique, slightly too small to be a minority interest. The standard clustering algorithms all cleaved the group roughly in half, though they didn’t agree on which halves. Average group centrality was low but not quite low enough to be worth a flag. And so on – we have about ninety social unity metrics and this group managed to just avoid alerting on every single one of them.

If I’d found a system in the plumbing like this then I would have immediately flagged it up for review. It’s in the nature of difficult systems to push right up against the alerting boundaries, and often it’s a sign that you need a new alerting metric.

Properly that was exactly what I should have done here too: The rules don’t distinguish systems made out of humans from systems made out of machines. If you see anomalous structure you should flag it for review.

But I had a nagging feeling that doing that here would be… bad. I resolved to wait until after I talked to Kimiko, and raised the importance level of my request to meet slightly, while still leaving it as non-urgent. This had clearly been going on for a while, and just because I only found out about it now didn’t make it suddenly urgent.

The whole scenario left me feeling intensely uncomfortable, but on the plus side I’d found my own little exception to the rules to be hypocritical about. Maybe I was starting to understand the rest of the crew after all.


Next chapter: Does that work?

Like this? Why not support me on Patreon! It’s good for the soul, and will result in you seeing chapters as they’re written (which is currently not very much earlier than they’re published, but hopefully I’ll get back on track soon).


Ordering and mapping regular languages

I’ve had the following question for a while: How do I create a mapping of keys to values where the keys are regular expressions, and two regular expressions are considered equivalent if they correspond to the same language?

One example of why you might want to do this: when constructing a minimal deterministic finite automaton for a regular language, you end up labelling states by regular expressions that represent the language matched when starting from that state. In order for the automaton to be minimal you need any two equivalent regular expressions to correspond to the same state, so you need a way of taking a new regular expression and finding out which state it should go to.

It’s easy (if potentially expensive) to test regular expression equivalence once you know how, so the naive implementation of this is just to do a linear scan of all the regular expressions you’ve found so far. It’s O(n) lookup but it’s at least an existence proof.
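As a sketch (my own, and assuming some languages_equivalent(a, b) equivalence check is available – it isn’t part of any standard library), the naive version looks like this:

class NaiveRegexMap:
    # O(n) lookup: scan every key seen so far and test equivalence.
    def __init__(self, languages_equivalent):
        self.languages_equivalent = languages_equivalent
        self.items = []  # list of (regex, value) pairs

    def put(self, regex, value):
        for i, (key, _) in enumerate(self.items):
            if self.languages_equivalent(key, regex):
                self.items[i] = (key, value)
                return
        self.items.append((regex, value))

    def get(self, regex, default=None):
        for key, value in self.items:
            if self.languages_equivalent(key, regex):
                return value
        return default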

In the past I’ve ended up implementing a crude hash function for regular languages and then just using a hash table. It works, but collisions are common unless you’re quite clever with your hashing, so it doesn’t work well.
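One crude hash along those lines (a sketch of the general idea rather than the code I actually wrote): hash the membership pattern of a fixed list of probe words. Equivalent languages agree on every word, so they always hash equal, but any two languages the probe words fail to distinguish will collide.

def crude_language_hash(language, probe_words):
    # `language` is assumed to support membership testing for words,
    # e.g. a compiled matcher. Collisions are exactly the pairs of
    # languages that agree on all of the probe words.
    return hash(tuple(word in language for word in probe_words))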

But it turns out that there is a better way! Rather than using hashed data structures you can use ordered ones, because there is a natural and easy to compute (or at least not substantially harder than testing equivalence) total ordering over the set of regular languages.

That way is this: If you have two regular languages \(L\) and \(M\) that are not equivalent, there is some \(x \in L \triangle M\), the symmetric difference. That is, we can find an \(x\) which is in one but not the other. Let \(x\) be the shortlex minimal such word (i.e. the lexicographically first word amongst those of minimal length). Then \(L < M\) if \(x \in L\), else \(M < L\).

The work in the previous post on regular language equivalence is thus enough to calculate the shortlex minimal element of the symmetric difference of an inequivalent pair of languages (though I still don’t know whether the faster of the two algorithms gives the minimal one – you can use the fast algorithm for equivalence checking and then the slightly slower one to get a minimal refutation), so we can readily compute this ordering between two regular expressions. This, combined with any sort of ordered collection type (e.g. a balanced binary tree of some sort), gives us our desired mapping.
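As a sketch, suppose the previous post’s machinery is wrapped up as a function shortlex_min_difference(L, M) that returns the shortlex-least word in the symmetric difference, or None if the languages are equivalent (that name and signature are mine). The comparison is then just:

def compare_languages(L, M, shortlex_min_difference):
    # Returns a negative, zero or positive number, cmp-style, using the
    # ordering described above. Assumes membership testing (x in L).
    x = shortlex_min_difference(L, M)
    if x is None:
        return 0
    # The language that contains the distinguishing word sorts first.
    return -1 if x in L else 1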

But why does this definition provide a total order?

Well, consider the enumeration of all words in increasing shortlex order as \(w_0, \ldots, w_n, \ldots\). Let \(l_n = 1\) if \(w_n \in L\), else \(l_n = 0\). Define \(m_n\) similarly for \(M\).

Then the above definition is equivalent to the reverse of the lexicographical ordering between \(l\) and \(m\)! If \(w_k\) is the smallest word in the symmetric difference then \(k\) is the first index at which \(l\) and \(m\) differ. If \(w_k \in L\) then \(l_k = 1\) and \(m_k = 0\), so \(l > m\), and vice versa. The lexicographical order is a total order, and the reverse of a total order is a total order, so the above definition is also a total order.

This definition has a number of nice properties:

  • Any language containing the empty word sorts before any language that doesn’t
  • The function \(L \to \overline{L}\) is order reversing.
  • \(L \cup M \leq L, M \leq L \cap M\)

I originally thought that union was coordinate-wise monotonic, but it’s not. Suppose we have four words \(a < b < c < d\), and consider the languages \(L = \{a, d\}, M = \{b, c\}\). Then \(L < M\) because the deciding value is \(a\). But now consider \(P = \{a, b\}\). Then \(L \cup P > M \cup P\) because the deciding element now ends up being \(c\), which is in \(M\).
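This is easy to check mechanically if we cheat and stand in finite sets of words for the regular languages (a toy verification of mine, not part of any real tooling):

def language_less(L, M):
    # The ordering above, specialised to finite languages given as sets
    # of words: find the shortlex-least word in the symmetric difference.
    difference = L ^ M
    if not difference:
        return False
    x = min(difference, key=lambda w: (len(w), w))  # shortlex order
    return x in L

L, M, P = {"a", "d"}, {"b", "c"}, {"a", "b"}
assert language_less(L, M)          # "a" decides, and it is in L
assert language_less(M | P, L | P)  # now "c" decides, and it is in M | P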

I’ve yet to try this in practice, so it might turn out that there are some interestingly pathological failure modes for this comparison function, but I don’t think there are likely to be any that aren’t also present in testing regular expression equivalence itself.

Another open question here is which sorted map data structure to use. The comparison is relatively expensive, so it might be worth putting in a bit of extra effort to keep the tree well balanced; as such, an AVL tree might be a fairly reasonable choice. I’m not sure.
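Whichever structure ends up being used, the comparison plugs in via the usual key-wrapping trick. Here is a stand-in sketch of mine – a sorted list with binary search rather than a real balanced tree – that takes a two-argument comparator (e.g. the earlier compare_languages with its helper partially applied):

import bisect
import functools

class SortedRegexMap:
    # Keys kept in a list sorted by the language ordering; lookups cost
    # O(log n) comparisons, inserts additionally pay O(n) list movement,
    # which a balanced tree such as an AVL tree would avoid.
    def __init__(self, compare_languages):
        self._key = functools.cmp_to_key(compare_languages)
        self._keys = []    # wrapped keys, kept sorted
        self._values = []

    def put(self, language, value):
        wrapped = self._key(language)
        i = bisect.bisect_left(self._keys, wrapped)
        if i < len(self._keys) and self._keys[i] == wrapped:
            self._values[i] = value
        else:
            self._keys.insert(i, wrapped)
            self._values.insert(i, value)

    def get(self, language, default=None):
        wrapped = self._key(language)
        i = bisect.bisect_left(self._keys, wrapped)
        if i < len(self._keys) and self._keys[i] == wrapped:
            return self._values[i]
        return default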


Want more blog posts like this? Join the 30 others who are supporting my writing on Patreon!
