# Adaptive delta debugging

Note: This blog post is mostly intended to be expository about a nice technique, so I’m omitting a lot of the evidence for my claims. Sorry. I hope to write those up better at some later date.

As I explained previously, I’m quite interested in the problem of faster test case reduction.

In my previous post I explained how to improve on the basic greedy algorithm for test case reduction by better choices of where to try to delete. Now I want to show how to improve on delta debugging itself by better choices of deletion size.

Delta debugging can be thought of as the following code:

def ddmin(test_case, predicate): block_size = len(test_case) // 2 while block_size > 0: prev = test_case i = 0 while i < len(test_case): attempt = list(test_case) del attempt[i:i+block_size] if predicate(attempt): test_case = attempt else: i += block_size if test_case == prev: block_size //= 2 return test_case

(this is not how delta debugging is described in the original paper, which is in terms of partition size, but it is compatible with their description and is more straightforward to adapt to the linear time version of greedy search)

This is the same as the greedy search I outlined previously, but instead of always deleting ranges of size one we try to delete long ranges. When this doesn’t work, we decrease the size of the range and try again.

For many cases this produces a dramatic speedup. e.g. it can reduce a long list to a single element list in log(n) steps.

Unfortunately for many other cases this ends up being substantially slower than the naive greedy algorithm. In simulation I’ve found that if the final test case is about 10-15% of the size of the original test case then delta debugging starts to be more expensive than just running the greedy search. For cases where the test case is already minimal, delta debugging pays an additional cost between 50% and 100% on top of the greedy algorithm.

The problem is basically that delta debugging is not making good choices of block size for these cases, and doesn’t have a better way to choose a block size than just try it and see what happens. In order to improve on this behaviour we need to get a block size that is more appropriate for the particular target we’re shrinking.

There turns out to be an easy way to do this. Instead of trying to choose the block size globally, we choose it locally, finding a block size for the current index that produces a large delete. The key tool for doing this will be the binary probe and search algorithm I outlined before.

At each index in our greedy search instead of just trying to delete one element, we search for a long block that we can delete starting from there and delete that:

def adaptive_ddmin(test_case, criterion): prev = None while prev != test_case: prev = test_case i = 0 while i < len(test_case): def is_block_deletable(k): return criterion(test_case[:i] + test_case[i+k:]) block_size = find_change_point(is_block_deletable, 0, len(test_case) - i) if block_size > 0: test_case = test_case[:i] + test_case[i + block_size:] i += 1 return test_case

This has all the advantages of ddmin, but is able to automatically figure out the appropriate block size. In almost all circumstances it will perform far fewer shrinks than ddmin. The counterexamples are basically when there is a great deal that can be deleted, where in some cases it will perform an additional log(n) deletes to discover that fact.

It also performs much more closely to greedy search in the cases where greedy search beats delta debugging. It doesn’t always achieve that close to greedy search – each block deletion deletes block_size elements and checks that you can’t delete an additional element. This takes O(log(block_size)) calls and would take block_size + 1 calls in the greedy algorithm, so as block_size gets large this algorithm starts to win big. For small block sizes though it’s not a win, and you can see it making up to 50% more calls (as bad as classic delta debugging! Although in very different scenarios) in some cass.

This can be improved by special casing the algorithm to do a linear probe for very small values. We replace it with:

def find_change_point(f, lo, hi): base = f(lo) linear_cap = min(hi + 1, lo + 6) for i in range(lo + 1, linear_cap): if f(i) != base: return i - 1 return binary_search(f, *exponential_probe(f, linear_cap - 1, hi))

This special case removes some of the worst cases, and leaves us with the the following:

block_size ratio
1 1.000000
2 1.000000
3 1.000000
4 1.000000
5 1.000000
6 1.142857
7 1.000000
8 1.111111
9 1.000000
10 0.909091
11 0.833333

Above this the ratio consistently remains better than 1. So the worst case scenario is that this algorithm ends up making < 15% more calls than the greedy algorithm.

It also consistently beats ddmin in simulation. I tested them both on a model of reducing a sequence to a unique subset (the predicate was that the example was interesting if and only if it contained a particular set of indices), with subsets chosen at various fixed expected fractions of the sequence. It was essentially always better, over a range from 1% to 99% of the set. It fairly consistently makes about 75% as many calls to get the same result, but I don’t have a theoretical argument that that should be the case.

This may not seem like a big deal, but this sort of technique is often applied in cases where individual calls can take multiple seconds (e.g. because you need to compile and run some code), so anything that gives you a speedup like this is greatly appreciated.

It also appears to be a major improvement in practice as well as in simulation. I’m not currently using this technique in hypothesis (though I do plan to), but I have a branch of structureshrink using this and it’s proving to be a major win.

This entry was posted in Code on by .

# Mergeable compressed lists of integers

Alexander Shorin’s work on more configurable unicode generation in Hypothesis has to do some interesting slicing of ranges of unicode categories. Doing both generation and shrinking in particular either required two distinct representations of the data or something clever. Fortunately I’d previously figured out the details of the sort of data structure that would let you do the clever thing a while ago and it was just a matter of putting the pieces together.

The result is an interesting purely functional data structure based on Okasaki and Gill’s “Fast Mergeable Integer Maps”. I’m not totally sure we’ll end up using it, but the data structure is still interesting in its own right.

The original data structure, which is the basis of Data.IntMap in Haskell, is essentially a patricia trie treating fixed size machine words as strings of 0s and 1s (effectively a crit-bit trie). It’s used for implementing immutable mappings of integers with fast operations on them (O(log(n)) insert, good expected complexity on union).

With some small twists on the data structure you can do some interesting things with it.

1. Ditch the values (i.e. we’re just representing sets)
2. Instead of tips being a single key, tips are a range of keys start <= x < end.
3. Split nodes are annotated with their size and the smallest interval [start, end) containing them.

When using this to represent sets of unicode letters this is extremely helpful – most of the time what we’re doing is we’re just removing one or two categories, or restricting the range, which results in a relatively small number of intervals covering a very large number of codepoints.

Let T be the number of intervals and W the word size. The data structure has the following nice properties:

1. Getting the size of a set is O(1) (because everything is size annotated or can have its size calculated with a single arithmetic operation)
2. Indexing to an element in sorted order is O(log(T)) because you can use the size annotation of nodes to index directly into it – when indexing a split node, check the size of the left and right subtrees and choose which one to recurse to.
3. The tree can be automatically collapse tointervals in many cases, because a split node is equivalent to an interval if end = start + size, which is a cheap O(1) check
4. Boolean operations are generally O(min(W, T)), like with the standard IntSet (except with intervals instead of values)
5. Range restriction is O(log(T)).

Note that it isn’t necessarily the case that a tree with intervals [x, y) and [y, z) in it will compress this into the interval [x, z) because their common parent might be further up the tree.

An extension I have considered but not implemented is that you could potentially store very small subtrees as arrays in order to flatten it out and reduce indirection.

In particular the efficient indexing is very useful for both simplification and generation, and the fact that merging efficiently is possible means that we can keep two representations around: One for each permitted category (which helps give a better distribution when generating) and one for the full range (which makes it much easier to simplify appropriately).

Here is an implementation in Python. It’s not as fast as I’d like, but it’s not unreasonably slow. A C implementation would probably be a nice thing to have and is not too difficult to do (no, really. I’ve actually got a C implementation of something similar lying around), but wouldn’t actually be useful for the use case of inclusion in Hypothesis because I don’t want to add a C dependency to Hypothesis just for this.

This entry was posted in Code, programming, Python, Uncategorized on by .

# Using tmux to test your console applications

Edit to add: The ideas in this post are now implemented in a library. Feel free to read this post for understanding the concepts, but you’re probably better off just using that.

If you’ve noticed that there haven’t been 5000 words of blogging and seven Hypothesis releases in the last week and wondered whether I was dead, fear not. I was just on a fort.

The theme of this fort was that we were going to work on a C project. People made the mistake of listening to me, and one thing lead to another, and the result is Conch, a security free curses interface to our terrible Twitter clone, Bugle. Here’s a sneak peek:

This post is mostly not about Conch, except that working on it was what lead to this concept.

We used check for testing a lot of the implementation, which was fine if a little laborious. However testing the front end (oh god, why do we have front end code written in C?) proved challenging. ncurses was not our friend here.

I tried using pexpect for this, which is like expect but written in python instead of TCL, but ran into a bunch of problems. It has an ANSI virtual terminal, but for whatever reason (this might have been my fault) it got very confused and ended up with lots of problems with partial drawing of things and leaving the screen in a wrong state.

So I put my thinking cap on, read some man pages, and applied some twisted imagination to the problem and came up with a solution that works great, albeit at some small cost to my dignity and sense of taste.

The solution is this: What I need is a robust virtual terminal I can control and inspect the state of.

Fortunately I have one. It’s called tmux.

Tmux is pretty great. It has a whole bunch of functionality, is a rock solid virtual terminal that is widely used with a wide variety of programs, and you can control it all externally via the command line. Putting these all together lead to a bunch of primitive operations out of which I could basically build selenium style testing for the console.

I’m still figuring out the details. When I do I’ll probably turn this into a library for testing console applications rather than the current ad-hoc thing I have, but basically there’s a relatively small set of operations you can build this testing out of:

1. First we allocate a unique ID for our test. This should be long and random to avoid conflicting with existing sessions. Henceforth it will be called $ID. 2. “tmux -L$ID  new-session -d <your shell command>” will start your program running in a fresh tmux session under your id. The -d is necessary because you will not be starting this from a controlling terminal in your program, so you want it to start detached. If you want you can specify width and height with -x and -y respectively.
3. At the end, “tmux -L $ID kill-server” will shut down all sessions in your test tmux server, including the child processes. 4. In order to capture the current contents of your tmux session you can run: “tmux -L$ID capture-pane; tmux -L $ID show-buffer; tmux -L$ID delete-buffer”. This will save a “screenshot” of the currently active pane (of which there is only one if you’ve just used these commands) to a new paste buffer, print the paste buffer to stdout, then delete the buffer.
5. In order to send key presses to the running program you can use “tmux -L $ID send-key <some-char>”. These can either be ascii characters or a variety of control ones. e.g. PageUp, PageDown and Enter do what you expect. Adding C- as a modifier will hold down control, so e.g. C-c and C-d would be Control-c and Control-d with their usual interpretations (send an interrupt to the running program, send EOF). 6. In order to send non-ascii or larger text you can use do “tmux -L$ID set-buffer <my text>; tmux -L $ID paste-buffer; tmux -L$ID delete-buffer”, which will set a paste buffer, paste it to the active pane, and then delete the buffer.

(Some of the above is not actually what I did, because I figured out some better ways using commands I’d previously missed while writing this post).

The main things that are hard to do with this, and why for now this is a blog post rather than a piece of open source software, is getting the PID and exit code out for the program you’ve started and resizing the window. I know how to do both of those (running a manager process and starting inside a pane respectively), but it’s fiddly and I haven’t got the details right. When I do, expect all of the above to be baked into a library.

This entry was posted in Code, Python on by .

# Now you too can have a magic bug-finding machine

…at least, you can if you write python (though I’ve already received some interest in porting this to other languages from people).

I’ve spent today hacking on a pure python testmachine library. It works rather nicely: It’s significantly less magical than hypothesis, gives you a lot more power and produces very clean output.

Here’s some example output, which finds an unbalanced tree:

 t1 = Leaf() t2 = Leaf() t3 = Leaf() t4 = t3.join(t2) t5 = t4.join(t4) t6 = t5.join(t1) assert t6.balanced()

I’ve uploaded an initial version to pypi, but this should definitely be considered a “for you to have a play with” version rather than something that you should get annoyed with me if I break the API. I’m very much still experimenting with how this works, but initial impression is that the answer is “rather well”, and I’m not too unhappy with the core API.

This entry was posted in Code on by .

# Improving code coverage with testmachine

I feel a little bad continuing to write about testmachine. At this point I’m basically just going “I have a magic bug finding machine and you don’t! Nyah nyah”. I’m trying to come up with a way to library-ise it, and I have some ideas, but they’re still gelling.

So, for now, nyah  nyah. Here’s what my magic bug finding machine can do.

Yesterday I added another 50+ test cases to intmap’s deterministic test suite. Most of them were entirely automatically generated.

How’d I generate them? Well, I decided that I should aim for a much higher degree of code coverage than I had previously, so I started running dettest (the deterministic test suite) under gcov to find what my code coverage was like. Then when I found lines that weren’t executed by dettest I started adding assertions to them (actually I added an UNREACHABLE macro which is only an assertion if you compile with a particular preprocessor directive). Then I ran testmachine to find the assertions and added its output to the test case.

This was especially helped by the fact that last night I wrote a compiler of sorts which took the stack programs that testmachine runs and compiles them to SSA C code which doesn’t reference the test machine library. It’s the world’s dodgiest compiler, and I found and fixed a bunch of issues in it today, but it’s really useful. For example here’s a test case it generated. The numbers and strings it uses are a bit weird, but it otherwise looks like a very nice hand written test case.

These tests aren’t all amazingly useful. A lot of them just create an intmap through a bunch of operations and then free it without asserting any properties about it. So even if I get to 100% coverage this way it’s very likely that there are bugs that it won’t catch. But even this is still useful – intmap has some internal integrity checks it can run, and valgrind can do invalid access and leak checks on those tests.

Also, knowing that the code can be executed turns out to be useful. I’ve found at least one interesting class of bug in the course of this which meant that the query optimiser was basically never firing because I was forcing everything with an intmap_is_empty check before doing algebraic operations. This meant that a whole bunch of code could never be triggered, which was showing up as lines the tests were never hitting.

There are still a few (7 and counting) UNREACHED annotations in intmap.c that I haven’t been able to figure out how to trigger. A few of them I’m pretty sure are legitimately impossible to trigger and I’m OK with that for now (e.g. I’ve got find_single_in which is a completely internal function that gets used mostly for intersection. It has a test for empty but is currently never passed an empty value). One or two I think it’s just that testmachine generates them with too low probability. Some of them look suspicious and I’ve not yet got to the bottom of them.

It’s hard to tell what sort of code coverage I’ve got now because I’ve not been able to figure out how to get unified metrics across multiple different programs, and I’ve got several different ways of invoking the test suite (I have some compiler flags designed to make bugs easier to find, a slightly more vanilla build, and an optimised build with debug turned off). Also, I’ve been able to demonstrate in a few cases that gcov is lying about a line being unreached (it claims it’s never executed but if I add an UNREACHED annotation there the tests catch it). But on any given run I’ve got gcov reporting about 95% line coverage just on the deterministic test suite. I started at around 70%. In pure numbers, this isn’t a huge improvement, but I find that getting the first 80% of anything like coverage done is comparatively easy, and it tends to be the last steps that are like pulling teeth, so for about a day’s work that’s a pretty satisfying amount.

This entry was posted in Code on by .