Category Archives: Uncategorized

Conferences Should Publish Menus

When organising conferences and other events, you often end up feeding people. This is a surprisingly fraught endeavour, but one that can be made less so with this one entirely sensible trick: Publish a menu in advance, ideally with an ingredients list.

Doing so gives people a much better understanding of whether they’re going to be able to eat anything at your event1, or whether they need to bring their own food.

People are complicated to feed. Many of us have dietary requirements of varying degrees of specificity, and many of those requirements are borderline incompatible. As a result, trying to cater for everyone’s preferences in a large group of people is legitimately hard.

People with complicated dietary requirements are used to this, but the ambiguity is a source of considerable annoyance – it’s fine bringing your own food, but it would be nice to know if you have to bring your own food. Having this information in advance is very life improving, and is significantly lower effort for venues to provide than trying to cater for everyone would be.

In addition, there is a significant curb cut effect – even people who are well catered for will find this information life improving, because they will spend less time hunting for the one or two things they can eat, and will have a better idea of what to expect in advance – this will significantly cut down on wait times and generally improve the dining experience.

You should do this even if you think you are doing a very good job of catering for a wide variety of dietary requirements, because the reality of the range of possible dietary requirements is inevitably much larger than you think it is. e.g. I can’t eat raw onion, lamb, or broccoli, and I bet those are not things you have taken into account in your planning. By publishing information about what food you are providing, you allow people who are experts in their own dietary requirements to make that judgement.

This does not, of course, mean that once you have published these menus your obligations are done. Most people are not that hard to cater for, and yet are still reliably badly served. There is definitely low hanging fruit that it is worth any event of a sufficient size (say 20+ people) catering for by default – e.g. providing a vegan and gluten free option (a more specific example of low hanging fruit is to serve actual fruit in breaks as well as just pastries) is not that hard and will help a lot of people – but even if you decide not to do that you can still publish the menu and at least make that visible to the people it affects.

This entry was posted in Uncategorized on by .

Notes on Test-Case Reduction

Attention Conservation Notice: Niche, somewhat poorly organised.

Epistemic Status: Reasonably confident that following this advice is better than trying to figure things out on your own, as it’s the distillation of about four years of work in the subject, but much of it is unvalidated and individual bits may be suboptimal.

There’s a lot of knowledge in my head about test-case reduction that doesn’t as far as I know exist anywhere else, as well as plenty that exists in maybe two or three other people’s heads but isn’t written down anywhere. I thought I would write some of it down.

What is test-case reduction?

Test-case reduction is the process of taking some value (a test case, usually represented as a sequence of bytes, though in property-based testing land it can be an arbitrary value) that demonstrates some property of interest (usually that it triggers a bug in some program) and turning it into a smaller and simpler test case with the same property.

This is boring to do by hand, so we generally want to have tools that do this for us – test-case reducers. When I say test-case reduction in this post I will generally mean this form of automated test-case reduction done by such a reducer.

Test-case reduction is sometimes called test-case minimization, or shrinking. I prefer the test-case reduction terminology and will use that in this post, but they’re all the same thing.

Generally speaking test-case reduction is black box. You have some oracle function that e.g. runs the software and sees if it crashes and returns yes or no. You may understand the format of the test cases (e.g. C-Reduce and Conjecture’s shrinker are the two most sophisticated test-case reducers I know of and they both make extensive use of the fact that they can understand the format), and you may have a little visibility of what the oracle is doing (Conjecture tracks the API the oracle uses to access the data and uses that to guide the reduction), but you definitely don’t have anything like a detailed understanding of the grammar that leads to the property of interest.

Why do we want to do test-case reduction? Roughly three reasons:

  1. Smaller examples are nicer for everyone concerned, computer and human alike. They run faster, are easier to read, and if you need to report them as upstream bugs probably don’t have any incriminating data about your internal software and IP.
  2. Test-case reduction helps to normalize bugs – if you have a thousand bug reports then you don’t really want to triage that. Many of those bugs will become the same after reduction, reducing your workload
  3. The knowledge that a test-case is reduced tells you what is essential about it. Although it may not be the smallest value that demonstrates the bug, every feature of it is in some way essential – you can’t just delete a bit and expect it to work. This significantly aids the developer in localizing the fault.

A fourth reason that I think is underexplored is that test-case reduction is really good at finding bugs. I’d like to do some more work on this, and have found at least one probably-security bug in a widely deployed piece of software through this (I reported it upstream years ago and it’s fixed in the latest version, but I won’t say what). I’ll ignore this use case for now, as it’s a bit non-central – generally reducers work well for this when they’re failing to achieve the actual desired goal of test-case reduction.

Importantly, none of these reasons require the reducer to do a perfect job – for the third one it has to do a comprehensible job (i.e. you only know what details are essential if you know what details the reducer can remove), but beyond that every little bit helps – a reducer that is not very good is still better than nothing, and incremental improvements to the reducer will result in incrementally better results.

This is good, because doing test-case reduction perfectly is basically impossible. Because of the black-box nature of the problem, almost all test-case reduction problems can only be solved perfectly with an exponential search.

Instead essentially every test-case reducer is based on a form of local search – rather than trying to construct a minimal example from scratch, we make simplifying transformations to the known initial example that try to make it smaller and simpler. We stop either when the user gets bored or when no further transformations work.

This approach dates back to Delta Debugging from Hildebrant and Zeller’s classic paper on delta debugging, Simplifying and Isolating Failure-Inducing Input. Delta-debugging takes test cases that are a sequence of some form and tries to find a test case that is one-minimal – i.e. such that removing any single element of the sequence loses you the property of interest. It has a bunch of optimisations on top of the naive greedy algorithm, but I think these are mostly the wrong optimisations. The sequence reducer we use in Hypothesis that I’ll talk a little bit about later shares almost nothing in common with delta debugging other than the basic greedy algorithm.

Due to this lineage, Regehr et al. in the C-Reduce paper “Test-Case Reduction for C Compiler Bugs” refer to these local search based approaches as generalised delta debugging. I’m just going to refer to them as local search.

All of the best reducers are highly specialised to a particular category of test case, with a lot of elaborate local search procedures designed to work around particular difficulties. Writing such a reducer is inevitably a lot of work, but even simple reducers can be worthwhile. In this post I hope to give you some general lessons that will help you either work on an existing reducer or write your own simple ones (I don’t especially encourage you to write your own complicated one, though I’d love to hear about it if you do).

Good Oracles are Hard to Find

The oracle for a test-case reduction problem is the predicate that takes a test case and determines if it is interesting or not. Writing an effective oracle is surprisingly tricky due to two basic problems: Test-case validity and slippage.

The test-case validity problem is when you have some oracle whose domain doesn’t contain all of the values that the reducer might want to try, and its behaviour is incorrect outside of that domain. For example, suppose you had a test that compiles a program with multiple different optimisation levels and asserts the outputs of running the compiled programs are all the same. This is a perfectly good oracle function, but a C program that it accepts only corresponds to a compiler bug if it does not execute any undefined behaviour. Thus when doing reduction you might start with a legimate bug but turn it into a C program that executes undefined behaviour and thus still satisfies the oracle but is no longer a bug. You could mitigate this by using a semantics checking interpreter that first checks if the program executes any undefined behaviour, but that’s expensive.

The problem of slippage is when your oracle function is too broad and thus captures bugs that are not the ones you intended. Suppose your oracle just checked if the program crashes. In the course of reduction you might find an entirely different (possibly easier to trigger) crash and move to that one instead.

There are a number of mitigation strategies I use for this in Hypothesis. The main ones are the following:

  1. The oracle functions we use are always “An exception of this type is thrown from this line”. This is reasonably robust to slippage.
  2. Because of how we apply test-case reduction, any reduced example is one that could have been generated, so we can only reduce to invalid examples if we could have generated invalid examples in the first place. This makes it much easier for users to fix their tests by adding preconditions when this happens.
  3. We record all successful reductions and when we rerun the test we replay those reductions first. This means that if we slipped to a different bug or an invalid test at some point and that has been fixed (either by fixing the bug or the test), we don’t lose our previous work – we just resume from the point at which the slippage occurred.

Because of this I’m going to ignore these problems for the rest of this post and just assume that the oracle function is fixed and known good.

Reduction Orders

An approach that is slightly idiosyncratic to me is that I like to make sure I have a reduction order – some total order over the test cases such that given any two distinct test cases we consider the smaller value in this order to be better than the other.

It’s more normal for this to be a preorder, where two distinct values can be either unrelated or equally good. The most common choice for this preorder is just that you measure the size of the test case in some way and always prefer smaller test cases.

Some reducers (e.g. QuickCheck) do not have any defined order at all. I think this makes it hard to reason about the reducer and easy to introduce bugs when writing it, so I dislike this.

The reason I prefer to have a total order is that it makes it much easier to determine when a test-case reducer is making progress, and it is helpful for the purposes of test-case normalization because it means that any oracle has a canonical best solution (the minimal test case satisfying the oracle) so perfect normalization is possible at least in principle even if we can’t necessarily find it.

The order I almost always use for test cases represented as sequences of bytes is the shortlex order. There are specific reasons that this is a good choice for Hypothesis, but I think it works pretty reasonably in general – certainly you want to always prefer shorter test cases, and among test cases of the same size having a reasonably local and easy to reason about order matters more than what that order actually is.

As Alex Groce points out in the comments, TSTL also effectively uses the shortlex order for its normalization, and the proof of termination in “One Test to Rule Them All” relies on this. In the papers terminology the two concepts are split, with reduction referring to reducing the length and normalization referring to canonicalization operations that decrease the test-case lexicographically while keeping the length fixed. I tend not to separate the two and refer to the whole process as test-case reduction, as they tend to be very intertwined (especially with Hypothesis’s autopruning, which I will explain below, which means that even operations that look like “pure normalization” can have dramatic effects on the size of the test case).

Caching and Canonicalization

Oracle functions are often expensive, and natural ways of writing test-case reducers will often try invoking it with the same value over and over again. It can be very useful to cache the oracle function to reduce the cost of re-execution. Almost all test-case reducers do this except for ones based on QuickCheck’s approach which sadly can’t due to limitations of the API and its generality (essentially the problem is that you can generate and reduce values which you can’t compare for equality).

As well as all the usual difficulties around caching and invalidation (Hypothesis currently ignores these by just not invalidating the cache during reduction, but I need to fix this), there are a number of things that are peculiar to test-case reduction that are worth noting here.

The first is that if you can assume that your reducer is greedy in that it only ever considers values which are smaller than the best test case you’ve seen so far, you don’t actually need to cache things, you just need to record whether you’ve seen them at all – you can assume that if you’ve seen a value before and it’s not your current best value then it must not be interesting. This allows you to use a more efficient data structure like a bloom filter. This can be very helpful if your test cases are large or you try a lot of them during reduction. theft is the only implementation I’m aware of that does this. You could adapt this to non-greedy reducers by having two levels of caching: One that contains all known good examples, and a bloom filter for all known bad examples. I’ve never seen this done but it would probably work well.

The second thing you can do is canonicalization. You may have access to e.g. a formatter that will transform your test case into a canonical form (of course, the formatting might itself break the property. This implicitly turns your oracle into the oracle “the formatted version of this satsisfies the original oracle”, but that’s probably fine). You can now do the calculation in two stages: First look up your test case in the cache. If you get a cache miss, canonicalize it and look that up in the cache. If that’s still a cache miss, run the oracle. Either way, update the cache entry for both the original and canonicalized form. I’m not sure if anyone does this – I’ve done it in some experimental reducers, and Hypothesis does something like it but a bit more complicated (canonicalization happens as a side effect of test execution, so it can’t run it without executing the test, but does update the cache with the canonical version as well as the non-canonical one) – but it works pretty well.

Keeping Track of the Best Example

The way I design reducers is that there is a single object that manages the state of reduction. It exposes a function, say test_oracle, which handles the caching as above.

What it also does is that whenever it runs the actual oracle and the result is interesting, if the test case in question is better than the best it’s seen so far, it updates its best known example. This is part of why having a defined reduction order is helpful.

The reason this is useful is because it is often helpful to make decisions based on the result of the oracle function (e.g. for adaptive searches). It can be an annoying amount of book-keeping to make sure you keep everything updated correctly, and this nicely automates that.

This is also useful because (especially if you save it externally somewhere – Hypothesis has an example database, file-based reducers might have a copy on disk or something) it means that your reducer is automatically an anytime algorithm – that is, you can interrupt it at any point and get its current best answer rather than losing all your work.

Cost of Reduction

The cost measure I tend to design around is the number of oracle invocations. This is an extremely crude measure and I ignore it where useful to do so.

The only actual cost that matters is of course wall clock runtime. The problem with this is that it tends to be almost entirely dependent on the vagaries of the oracle function – it could have essentially arbitrary complexity (exactly the same reducer will have very different timing profiles if the oracle is quadratic vs linear), or you could be regularly hitting a timeout bug that you aren’t interested in, or just about anything else. It’s a black box, it can do whatever it wants.

The most common thing that you see in practice is that the oracle gets much faster when the reducer has been running for a while, because it tends to be very fast to run small examples and very slow to run large examples, even if smaller changes don’t necessarily have as predictable effects. As a result, I tend to design reducers to actually prioritise progress over performance. e.g. the sequence reducer that we use in Hypothesis can technically make up to twice as many oracle calls as an alternative design in some circumstances, but has a huge benefit of avoiding stalls – long periods where many oracle invocations are made and none of them return true so the reducer fails to make any progress.

An additional benefit of making progress is that it tends to make future attempts to reduce cheaper – e.g. if you’ve successfully deleted some data, the test case is now smaller, so there are fewer transformations you need to make.

A minor side benefit of prioritising progress is that it makes the reducer’s progress much more fun to watch.

Reduction Passes

Most non-trivial reducers are organised into reduction passes. A reduction pass is basically just a specialized reducer that does one sort of operation – e.g. you might have a reduction pass that deletes values, and a separate reduction pass that moves them around.

There’s nothing logically special about reduction passes – they’re just functions – but they do seem to be a common and useful way of organising test-case reducers.

The big question when you organise your reducer this way then becomes: What order do you run the reduction passes in?

Generally the goal is that by the end of the reduction process you want your best test case to be a fixed point of every reduction pass – none of them can make progress – but that gives you a lot of freedom about which to run when, and that seems to make a huge difference for a couple of reasons:

  1. Some reduction passes are much more expensive than others (e.g. constant time vs linear vs quadratic), so it makes sense to run them once the example is already small.
  2. Often one pass will unlock another, allowing a previously stuck pass to make progress.
  3. Some passes will substantially increase the cache hit rate on later runs, which can have a large performance impact.

Current pass ordering schemes are a bit ad hoc and not very well justified. I’m not sure anyone really knows how to do this well, including me. This is a good area for research (which I’m doing some of).

That being said I’m reasonably sure the following things are true:

  1. You should run at least some other passes before you rerun a pass – a pass that has just run is much less likely to make good progress the second time you run it.
  2. It’s worth having a pass that is primarily designed to promote cache hits run fairly early on (e.g. Hypothesis has an alphabet_minimize reduction pass, which tries to bulk replace all bytes of the same value with smaller ones, for this purpose).
  3. It can be worth deprioritising passes that don’t seem to do anything useful.
  4. You should try to put off running quadratic or worse passes for as long as you possibly can (Ideally don’t have them at all or, if you do, do whatever you can to get the constant factors way down). They are often best treated as “emergency passes” that only run when all of the fast passes get stuck.
  5. In general it is worth running passes which will reduce the size of the test case before passes that won’t, and among those passes you should generally prefer to run cheaper ones earlier.

Reduction Pass Design

The way I write them, a reduction pass is basically just a function that takes a starting test case and calls the oracle function with a number of other test cases (reductions). After that the best one will have been updated, or it won’t and thus the pass is currently at a fixed point.

The questions worth asking about a reduction pass are then basically:

  1. What reductions does it try when…
    • The pass fails to reduce the current best test case?
    • The pass successfully reduces the current best test case?
  2. What order does it try them in?

The answer to these questions matters a lot – the reductions you try have a big impact on the cost of reduction and the quality of the final result, and the order can have a big impact on progress (and thus implicitly on the cost of reduction. Where it affects the quality of the result it’s usually by accident).

The question of what reductions to try depends a lot on what you’re trying to do, but generally speaking I find it’s best to make reduction passes that try at most an \(O(1)\) number of big changes (e.g. replacing a number with \(0\)) and at most \(O(n)\) small changes (e.g. trying subtracting one from every number, deleting every element).

The reason the questions of which operations to run differs based on whether the pass successfully reduces if you’ve successfully reduced then you know the pass will be run again before the reducer stops, so you can rely on that next run for ensuring any properties you wanted the pass to ensure. Thus you have a fair bit of freedom to do whatever you want, and can skip things that seem unlikely to work. The book-keeping doesn’t have to be too precise.

One useful rule of thumb is that when you have successfully reduced something you should try other similar reductions before moving on to the next thing. In particular it is useful to run adaptive algorithms – ones which try to turn small reductions into much larger ones. e.g. when you successfully delete something, try deleting progressively larger regions around it using an exponential probe and binary search. This can often replace \(O(n)\) oracle invocations with only \(O(\log(n))\).

One big advantage of this approach of avoiding big operations but using adaptive versions of small operations is that you get most of the benefits of large operations but don’t have to pay their cost. In the case where no or few reductions are possible, delta debugging will make up to twice as many calls as a more naive greedy algorithm, while the adaptive algorithm will make exactly the same number.

It should be noted that the opposite is also true when delta debugging makes progress – the adaptive algorithm may make twice as many calls for some easy to reduce examples. The difference is that both it and delta debugging are only making \(O(\log(n)\) calls in this case, so the slow down is less noticeable. It’s also not hard to create examples that are pathological for either algorithm, but the common case is the more interesting one, and in general it seems to be common that delta debugging quickly slows down compared to the adaptive or greedy algorithms.

The question of what order to run operations in is tricky and probably deserves its own entire article, but a short version:

  • It’s often a good idea to run the reductions in a random order. The reason for this is that it gets a good trade off between progress and cost – larger reductions are great for reducing cost but they often don’t work. Smaller reductions are more likely to work but if you only make small reductions you’ll take forever to progress.
  • When you do make progress you should try to avoid doing things that look too much like things you’ve already tried this pass. e.g. a common mistake is that when trying to delete elements from a sequence you start again at the beginning when you’ve deleted one. This is a great way to turn a linear time pass into a quadratic time one.
  • In general it is better to run dissimilar operations next to eachother. For example in a quadratic pass that tries each pair of indices it is better to run in order (0, 1), (1, 2), …, (0, 2), (1, 3), … than it is to run as (0, 1), (0, 2), …, (1, 2), (1, 3), …. The latter order is much more likely to make progress early on because it spreads out each index across the whole iteration so you don’t get long stalls where one index just doesn’t work at all.

A nicely illustrative (and well commented) example of all of these principles is Hypothesis’s sequence deletion algorithm.

Salient features are:

  • Indices are deleted in a random order.
  • When an index is deleted we try to binary search to expand this into an entire surrounding interval that can be deleted.
  • We never try to delete the same index more than once in a given run, even when we’ve made changes that would in principle allow it to be possible.

Another common pattern that is useful when doing this sort of thing is to loop over a varying list with an index into it. For example a simpler deletion algorithm would try deleting the element at index i and if that fails, increment i. In this loop either the index can increase or the length can decrease. If we never succeed, we’ve attempted to delete every element of the list. If we do succeed, we didn’t retry the elements earlier in the list but that’s OK because the pass will run again.

Stopping Conditions

Because a reducer’s task is essentially unbounded (strictly it is finite, but in practice the number of shorter smaller strings than a typical test case exceeds the number of atoms in the universe), one important design consideration is when to stop reducing – you can’t hope to stop reducing when you’ve got the optimal answer, because even when you have it you can’t verify that it’s optimal.

The natural stopping point for local search based approaches is that you stop when all of the reduction passes fail to make progress and you’re at a local minimum.

Other common stopping criteria are:

  • When the user tells you to stop (which is part of why the anytime nature of reduction is useful).
  • After you have performed N oracle invocations.
  • After some time limit has passed.
  • After you have successfully reduced the test case N times.

The latter condition is the one Hypothesis uses – after it has successfully reduced the test case 500 times it stops reduction. I think this approach originally comes from QuickCheck. It’s a slightly non-obvious condition but generally works pretty well. The reasoning is basically that if you’re making a lot of successful shrinks then the shrinks you’re making must be relatively small, which means you’re probably stuck in a zig zagging pattern (I’ll expand on these below) and could be here a while if you actually waited for the reducer to finish. Because in Hypothesis you will often have a user sitting there waiting on the end result, it’s better to terminate early.

It’s generally rare for Hypothesis to hit this limit in practice these days – the current pass selection is very good at adapting to the shape of the test case so tends to turn small shrinks into large shrinks unless it’s working on a very large test case. Generally when we find cases where it hits the limit we’ll fix the shrinker by adding or improving a reduction pass.

Having these early stopping conditions slightly ruins the utility of reduction for highlighting necessary features of the reduced test cases, because you no longer have the guarantee that the final test case is fully reduced, but is something of a necessary evil – it’s much better to return a suboptimal but still good result in a reasonable amount of time than it is to wait on getting a perfect result.

Better Reduction through (Bad) Parsing

A big obstacle to test-case reduction is that often the transformations you want to make don’t work not because of something related to the property of interest but because there’s some more basic validity constraint on the test cases you need. For example, suppose we were trying to delete a byte at a time using delta debugging or a sequence reducer like Hypothesis’s. This would work fine for some formats, but what if we were trying to reduce C or some other language where braces and brackets have to balance? We would never be able to remove those braces because you have to remove the balanced pair simultaneously.

Misherghi and Su pointed out in “HDD: Hierarchical Delta Debugging” that you can solve this problem if you have a formal grammar for the language of valid test cases, by using the grammar to suggest reductions. This observation is very sensible and reasonable and went on to have approximately zero users ever because it’s really annoying to get a good formal grammar for test cases and nobody ever bothered to do the work to make it usable. This may be changing as there is now picireny, which I haven’t tried but looks pretty good.

However there’s a much simpler variant of this, which starts from the observation that this works even if you don’t have a fully blown grammar for the format but just have some bad guesses about what it might look like.

For example, a classic observation is that delta debugging makes much more sense if you run it line oriented rather than byte oriented. You “parse” the file into a series of lines and then run the reducer on that sequence. You can do this just as well with any other separator you liked – spaces, arbitrary bytes, repeated tokens, etc. – and it works pretty well as a set of reduction passes.

You can also e.g. guess that it might be a brace oriented language and try balancing braces and see if it works. If it does, try doing what you would have done with a proper grammar (e.g. remove the contents of the braces, remove the entire region, just remove the braces and leave their contents intact).

I’ve yet to turn many of these observations into anything I’d consider a production ready test-case reducer, but I’ve done enough experimentation with them to know that they work pretty well. They’re not perfect – some formats are hard to reduce without knowing a fair bit about them – but they’re surprisingly good, particularly if you take a lot of different bad parses and use each of them as a separate pass.

More generally, a philosophy of taking a guess and see if it works seems to play out pretty well in test-case reduction, because the worst case scenario is that it doesn’t work and you wasted a bit of time, and often it’s possible to make quite good guesses.

Another advantage of the “bad parsing” approach is that you can often make changes that just happen to work but don’t necessarily correspond to grammatically sensible transformations.

This works particularly well if you have multiple different ways of parsing the data – e.g. Hypothesis has some reduction passes that work using the underlying token structure of its representation, and some that work on the higher level hierarchical structure. Many of the lexical reductions are boundary violating (e.g. Hypothesis can reduce the nested list \([[0, 1], [2, 3]]\) to \([[0, 1, 2, 3]]\) by deleting data in a region that cuts across a hierarchical boundary. Similarly you can imagine doing this on the text form by deleting the two innermost brackets which don’t correspond to a balanced pair). Switching between different viewpoints on how to parse the data gives you the ability to define fairly orthogonal reduction passes, each of which can perform transformations that the other would miss.

Misc Details

Here are some additional useful details that didn’t fit anywhere else.


A thing that has proven phenomenally useful in Hypothesis is autopruning. By the nature of the problem in Hypothesis, you can determine how much of the test case is used. This lets you replace any successful test case with a potentially much shorter prefix by pruning out the unused bits. This happens automatically whenever you call the test function.

I don’t know how broadly applicable this is. Certainly it’s easy to implement without that tracking – every time you successfully reduce the test case, try replacing it with a prefix – first delete the last byte and if that works try binary searching down to some shorter prefix. I suspect this is more useful in Hypothesis’s application than anywhere else though.

Pessimistic Concurrency

An approach I haven’t personally used yet but is found in Google Project Zero’s halfempty is pessimistic concurrency. C-Reduce uses a similar strategy.

Essentially the idea is to adopt a parallelisation strategy that assumes that the result of the oracle is almost always false. When you invoke the test oracle and miss the cache, rather than running the oracle then and there you just return false (i.e. pessimistically assume that the reduction didn’t work) and carry on. You then schedule the real oracle to run in parallel and if it happens to return true you either rerun the pass or restart the reduction process from that successful reduction.

I suspect this approach interacts badly with some of my opinions on reducer design – in particular what I actually observe when watching reduction run is not that success is unlikely but that there are long periods of mostly failure and modest periods of mostly success. I imagine one can adapt this to choose between sequential and parallel modes appropriately though.

I need to look into this area more. I haven’t so far because Python is terrible at concurrency so it hasn’t really been worth my while in Hypothesis, but I’m doing some experiments with a new file-based test case reducer where I’ll probably implement something like this.

Zig Zagging

This is an interesting thing I have observed happening. It’s reasonably well explained in this Hypothesis pull request.

The problem of zig zagging occurs when a small change unlocks a small change unlocks another small change etc. This is particularly pernicious when the small change is not one that changes the size of the example, because it can put you into exponential time failure modes.

The example in the issue is when you have a test that fails if and only if \(|m – n| \leq 1\). Each iteration of reduction will either reduce \(m\) or \(n\) by \(2\), and so you end up running a total of \(m + n\) reductions.

Hypothesis solves this by noticing when the example size isn’t changing and looking for the byte values that are changing and trying to lower them all simultaneously. It handles this case very well, but it’s not a fully general solution.


One of the big questions in test-case reduction is whether you want to backtrack – that is, rather than constantly be trying to reduce the current best example, is it sometimes worth restarting reduction from a larger starting point.

The reason this might be worth doing is that often larger examples are easier to reduce and will get you to a nicer local minimum. One example of backtracking might be running a formatter. For example, suppose you had a pass that could delete single lines, and you had the code ‘while(true){print(“hello”);}’ and wanted to reduce with respect to there being an infinite loop. Your line based reducer can’t do anything about this because it’s all on a single line and you have to retain the loop, but if a formatter ran and put the print in its own line then you’d have increased the size of the test case, but once you’ve run the line based pass you’ll have decreased it.

A classic approach here is related to the earlier observation about bad parsing and languages with braces, which is the use of topformflat to make files more amenable to delta debugging. topformflat doesn’t usually increase the size of the file, but it’s at best a lateral move.

C-Reduce currently uses these sorts of backtracking moves fairly extensively – it applies topformflat, but it also e.g. inlines function calls which may be a significant size increase.

At the moment Hypothesis does not do any sort of backtracking. I keep toying with the idea, but inevitably end up coming up with purely greedy approaches that solve the problem that I have at the time better than backtracking would. One of the big differences here though is that Hypothesis has to run in much shorter time periods than C-Reduce does, so frequently even if it would be worth adding in backtracking it may not be worth the extra wait for the user


This was mostly a grab bag of stuff, so I don’t know that I really have a conclusion, except that I think test-case reduction is really interesting and I wish more people worked on it, and I hope this post gave you some sense of why.

If you’d like to learn more, I’d recommend starting with the Conjectureshrinker from Hypothesis. You might also find it useful to look through the relevant issues and pull requests, which I try to make sure are well explained and often have good discussion and commentary.

This suggestion isn’t just personal bias – along with C-Reduce, Conjecture is easily in the top two most sophisticated examples of test-case reducers you’re likely to find (I genuinely can’t think of any other examples that come even close to being comparable). The two are quite different, and studying C-Reduce is also a worthwhile activity, but the advantages of starting with Hypothesis is that it’s much smaller, much better commented, not written in Perl, and that you will make me happy by doing so.

This entry was posted in Uncategorized on by .

Situated Software

Attention Conservation Notice: Sometimes I write about tech.

There’s a concept I’ve been using recently that I find quite helpful as a descriptive tool, which is the notion of software being situated.

Software which is situated is software which is highly specific to some particular context – e.g. a single person, a single task or category of task, a single computer. It is non-reusable outside of that context, and as a result it is able to make very in depth (and usually implicit) assumptions about reality.

This isn’t a binary. All software has a context, so no software is truly unsituated, but that context can be more or less specific. The question is not whether software is situated, but where and how strongly it is situated.

There are a number of interesting ways that software can be situated.

Hypothesis is not especially situated software (libraries tend not to be), but Hypothesis’s build system is very situated despite being designed to run on a variety of different computers and CI systems – it is adapted to a very specific task (running Hypothesis build tasks).

Another situated codebase I work on is the code that powers my notebook. This code is what Simon Tatham calls SymbiosisWare. Code that is part of the extended self rather than a distinct project in its own right (although jml has recently adapted the code for his own use. It’s important to remember that even situated software can be reused, it just requires adaptation to the new environment). In this case the context to which it is situated is that of a single person.

Another piece of situated software is the code that exists to manage experiments I run on my workstation, Szalinski. Szalinski is a bit of a beast of a machine and I use it to run experiments on test-case reduction, so there is a lot of code that is very specific to it and its storage environment. This code is mostly written so that it could be run on another machine with broadly similar configuration, but we’d almost certainly find a huge wealth of baked in assumptions by making it do so.

A fourth category of situated software is most software companies’ internal code (except the code for software that they’re actually shipping to end users, if there is any): Generally a lot of the individual applications are not that situated, because you’ve generalised them enough to make them run in dev, staging, and production, but there’s still a large ecosystem of glue and deployment code that is highly situated to a specific environment that is only found within the company.

Note that the same code base can contain a variety of code that is situated to a greater or lesser degree. Your junk drawer module is probably full of highly situated code even if the project it supports is not especially situated, and many highly situated codebases will contain nice well factored code that you could easily extract out into its own reusable library if you wanted to.

“Situated” is not a negative term. I like all three of the situated codebases I work on. The great thing about situated code is that it’s easy to create, and because you don’t have a huge range of use cases to consider it is also often very easy to modify. This can make it worthwhile to create highly situated code in cases where less situated code would be too high cost.

It’s not a positive term either. Situating software absolutely has its downsides.

The first is the nature of situation – if you have a highly specific context outside of which the code is not useful, then the code is not useful outside of that highly specific context! The Hypothesis build system is great. We keep thinking about turning it into a generic tool, but honestly that sounds exhausting so we never do. As a result Hypothesis’s build has some great functionality that nobody else gets to benefit from.

The second is that situated software can be difficult to get started with as a newcomer. It gains much of its power from being embedded in a particular context, but that context is often implicit to some large degree, which means that as a newcomer you essentially have to reverse engineer it. In the same way that you have to be embedded in the context to use the situated code, you have to be embedded in the context to even understand the situated code, and that comes with a certain amount of up front work. I’ve been seeing this in the context of getting some students up to scratch on Hypothesis internals – there’s a lot of highly situated code in the insides of the fuzzer, with all sorts of non-obvious baked in assumptions.

Despite not giving a way to describe the quality of the code, I still find it a very useful descriptive tool, partly because it lets you name and observe certain trade offs that might not otherwise be obvious.

In particular, knowing that situated code exists and is valid is important because it gives you permission to write situated code! There are many circumstances in which this is a useful thing to do and I think we resist doing it because it feels like we’d be writing bad code.

But code is only good or bad in context, and I think many of the things we would dismiss as bad practice are only bad in the context of reusability. Much of the code we feel guilty about, we shouldn’t, because it’s actually perfectly acceptable code, it’s just highly situated.

This entry was posted in Uncategorized on by .

Nonspecifically Neurodivergent

Epistemic status: Seems legit.

Hermeneutical Status: I think this is useful? It’s useful to me and other people seem to think it’s useful to them.

Here’s a joke for you: Suppose you take subclinical manifestations of about six different recognised conditions and disorders and mix them together, what do you get?

Speaking from personal experience, what you get is very annoyed.

I’ve started using the term “nonspecifically neurodivergent” to describe myself recently. Other people have been saying that the term also resonates with them, so I thought it would be useful to write it up in order to make it more broadly available.

What does it mean to be nonspecifically neurodivergent? Well, roughly the feeling I’m trying to capture here is normal people are really weird.

The goal here is not to describe any particular aspect of how our brains work, but instead the shared experience of whether you get to reliably assume other people’s brains work like your own.

Roughly what I wanted is a term that would capture how I feel in the following two conversations patterns:

Neurotypical: *tells amusing story about themselves, a bartender, and a goat*

Me: Hmm. I see, I see. There’s one bit I didn’t understand… could you back up to the thing where you assumed that all of their minds worked exactly like yours and that didn’t stress you out in any way?

NT: Huh? Which bit was that?

Me: Literally all of it.


More or less any neurodivergent person: *describes some aspect of their lived experience that most people would find deeply strange*

Me: Hmm. I do not necessarily share this experience but yes that seems like a sensible model of how brains work. Good. Carry on.

I can describe specific ways in which my brain is weird (Objection: my brain isn’t weird, normal people’s brains are weird! Counter-objection: Unfortunately weird is a consensus term and I’m outvoted), but often the most salient part of my experience not the specifics, but instead the fact it’s weird at all: it works in ways that require a high degree of effort to relate to the experience of people around me, in both directions.

One of the nice things about realising this as a pattern is that it makes a lot of sense out of who is and isn’t an easy person for me to have conversations with. If the other party does not have this experience then not only do I have to carry the load of translation, the very fact that I am doing that will be counted as a mark against me. If we both know that translation is required, it just becomes part of how we talk to each other and that’s fine.

To some extent this is just an aspect of the experience of neurodivergence, but I think “nonspecifically” adds something to it, which is that I feel much more comfortable self-describing as “nonspecifically neurodivergent” than I do “neurodivergent”. It captures two important aspects for me:

  1. I do not have, and probably am never going to bother to acquire, anything in the way of a formal diagnosis. It’s possible I don’t even clear any of the thresholds that would be required for such.
  2. What is important in this particular context is the aggregate effect more than any specific manifestation.

I don’t know if this is a particularly good term to be using, but it’s been helping me, and other people I’ve talked to about it seem to instantly get it and also find it helpful, so until something better comes along I’m probably going to keep using it.

This entry was posted in Uncategorized on by .

FAQ: How do you read so many books?

I’ve had a number of people comment on my slightly ridiculous rate of book consumption recently (you may have noticed if you follow the notebook blog that I read multiple nonfiction books per week), and ask how I do it.

Well, it’s very simple (which is not the same as easy):

  1. I read very fast.
  2. I spend a lot of time reading.
  3. I only read good books.

Much of this happened somewhat organically, so I can’t necessarily offer great advice on how to do the same, but I’ll try to share some things that have worked for me.

Read Very Fast

This may seem obvious. Yes, the secret to reading a lot is to be fast at reading, that is true for all the obvious reasons. The less obvious way that this is helpful is that it makes reading cheap – you can experiment around it, you can use it to test things, and you have a fast feedback loop.

The first one I can’t offer that much advice on. I just read this way, it happened naturally.

Some friends built The Reading Gym for testing your reading speed and learning to improve it. I don’t know how well it works – when I tried it it broke because it didn’t expect an initial test result of more than 900 words per minute.

I’ve heard good things about Break Through Rapid Reading, but not strongly enough or from people I know well enough that this is anything more than a “Here’s a thing you can try if you want” suggestion.

A trick that it occurs to me might work well is to learn to do reading where you’re not actually trying to read the text, only asking questions about it – start with some written material and a question you want to answer about it, and try to answer that question in less time than it would take you to read the whole thing. You could ask semantic questions (who is mentioned in this article?) or even purely syntactic ones (find all occurrences of this word in this article). 

The reason I suggest this is that this sort of nonlinear scanning reading seems to be the main key to how I speed read, and I believe is how people do it in general. It also has the nice property of being a single skill that you can focus on in isolation.

The main advice I’d be willing to definitively offer here is that whatever your reading speed is, make sure you have a good estimate of it and how that plays out in practice. Being able to predict how long a book will take you to read is very helpful.

Spend a lot of time reading

Also fairly obvious I think. If you read more then you will read more.

There are a couple of things I have found help with getting myself to do this:

  1. Physical books over ebooks. I like physical books, and they create more of a pressure on me to read them. I try to make sure I am carrying at least one, sometimes two, physical books with me at all times.
  2. Squeeze reading in between other things, e.g. I get in a good hour or two of reading per day from commuting.
  3. Try to use reading as a default downtime activity, in preference to other things like watching TV, checking Twitter, etc.
  4. The trick of putting the bookmark at the beginning of the next chapter that I mentioned in Local Goal Setting is good for keeping me reading without getting side tracked.
  5. Making a conscious effort to return to a book if I ever put it down mid chapter.

Of course the real things that allow me to spend a lot of time reading are:

  1. I have tolerably good mental and physical health, or at least it’s all bad in ways that don’t stop me processing information.
  2. I have no dependents.
  3. I don’t have a real job and can spend days where I wouldn’t otherwise be very productive due to illness, bad sleep, etc. reading instead and catch up the time later.

I can’t really offer advice on how to replicate that experience.

Only read good books

This section is the one where I probably have the most useful advice, as it’s instrumental in how I’ve gone from reading about one non-fiction book a month to reading about three per week.

The number of books available to you read is vastly greater than your capacity to read them, even if you read fast and extensively. I recommend solving this by not reading the very large percentage of those books that are bad and otherwise not worth your time.

Saying I only read good books is a bit disingenuous. It feels true, but it requires rather specific interpretations of the words “good” and “read”. A more accurate explanation would be that I am engaged in a constant process of evaluation of whether a book is still worth my time, and if it is not then I stop reading it.

I use roughly the following evaluation criteria:

  1. Am I going to get something valuable out of reading (or continuing to read) this book? Useful insight, enjoyment, and having a book you would recommend to others are all examples of something valuable here.
  2. How much effort is finishing this book likely to be?

If the effort is greater than the reward, it’s time to stop reading, and it’s probably time to get rid of the book (see the next section).

This sometimes involves a bit of “pregaming” reading where I’m not exactly reading it but I read some of it (say the introduction) and maybe flip through the book a bit. Useful questions to ask at this point are whether it’s well written and whether there’s something in it that you’re looking forward to.

It’s also worth doing a page count. Long books are higher effort, so require a higher standard of proof of value.

I use roughly the following rules of thumb:

  • If the book is under a hundred pages I’m probably just going to read it. I can do that in two hours, no problem. I can always stop if I hate it.
  • If the book is between one and two hundred pages it’s worth spending a bit of time determining if it’s worthwhile, especially if it’s not that well written.
  • If the book is more than three hundred pages I am very unlikely to read it if it’s badly written. If it’s well written then I hold it to a higher standard of value but am still willing to read it.
  • If the book is more than five hundred pages then I basically want someone signing a promise in blood that it’s worth my time to read it.

Note that four hundred pages is not an especially long book. It could be argued that two hundred pages is an unusually short book, and one hundred pages is a very slim volume indeed.

There is a very simple reason for this discrepancy: Most people writing books are full of themselves and do not respect your time as a reader. Granted that some long books are long because of the amount they pack in, but generally speaking I find that you need to be doing a lot with your book to justify even three hundred pages, and you should probably be doing less with it.

(Note that I do not count reference books and other books that are not intended to be read cover to cover in this).

 In terms of judging value, the following questions I think are useful to ask:

  1. Am I enjoying this book? It can still be worth reading if the answer is no, but if the answer is yes then by all means keep reading it.
  2. Who would I recommend this book to? If the answer is “nobody” then it’s probably not that useful.
  3. How will my life change as a result of reading this book?
  4. What new thing will I understand as a result of reading this book?

Answering these questions can help you understand the value you’re getting out of the book, and from there you can decide whether it’s worth the effort.

Some good rules of thumb on the process:

  1. You should be able to determine within five minutes if it’s worth reading the first fifty or so pages.
  2. You should know within a chapter or two (say 50ish pages) if it’s worth reading the rest of the book.
  3. If you get through the first half of the book and you don’t feel your time has been well spent so far, it’s unlikely to be worth reading the rest of the book.

If a book fails to clear these thresholds, stop reading it.

As well as resulting in reading better books, I find this process helps a lot with the other two. A good book is much easier to spend a lot of time on than a bad one. If the book is not engaging you, you will find it very easy to remember other things you could be doing instead. It’s also easier to read quickly – if it is poorly written then you will struggle and read it more slowly as a result.

The important thing to remember with this part of the process is that you have no obligation to read a book. If a book is failing to be interesting or useful, why are you still reading it? There are better books you could be reading instead.

This entry was posted in Uncategorized on by .