(Some of) my problems with correctness research

Epistemic status: I think this is mostly right, but I’m sure there are gaps in my understanding of the field. Also you should be intrinsically suspicious of any argument that concludes that the author’s own field of research is going to change the world in a way that everyone else’s failed to do.

My previous post on looking for grandiose introductions in correctness research got a very mixed reception. A lot of people who I normally mostly agree with objected to it, and several people agreed with it for reasons that I disagree with vehemently. This wasn’t the universal reaction, and there were plenty of decent responses to it, but it left me feeling that I’d been really unclear about what I actually meant, so this is my attempt to unpack it.

One response, which I thought was a nice piece of writing in its own right but which I don’t think really worked as an answer to my problems with correctness research, came from John Regehr. To quote in full:

I tried to write a software correctness paper introduction that would make @DRMacIver happy:

The market for software, generally speaking, does not reward correct software so much as software that is acceptably correct and that can be built cheaply and rapidly. Thus, while our work on software correctness is highly unlikely to improve the quality of software, it may plausibly make it possible to create acceptable software a bit more cheaply and rapidly.

I broadly agree with this sentiment, though I flip-flop a bit on whether I think tools can actually make it possible to make more correct software or whether they just let you spend the time you’d have spent on correctness on other things. The technical term for this is Risk Compensation – when you increase safety measures, people behave in a more risky manner to improve performance. When this entirely compensates for the safety measure you have Risk Homeostasis. The general conclusion seems to be that you usually have Risk Compensation and rarely or never have Risk Homeostasis. (This paragraph has entirely exhausted my current knowledge of the subject, but I plan to fix that in the near future).

The main reason this doesn’t actually make me happy is that despite my previous post, it’s not really the introductions I object to. In many ways I like the introductions, and I’m certainly not above a bit of grandiosity myself. The Hypothesis documentation literally says that our goal in developing it is “to drag the world kicking and screaming into a new and terrifying age of high quality software“.

The problem is that I don’t think that correctness research is currently doing a very good job of matching its means to its ends.

In discussion about this, Paolo G. Giarrusso made the following distinction:

there’s a difference between fundamental and applied research (with a fluid boundary). And then there’s “fake applied” research, which is applied to fake problems

I think this is a good distinction. I’d maybe hesitate to label any particular problem as “fake” – I can’t actually point to any specific research in the field of correctness and say “this problem doesn’t exist” (and even if I could I’d prefer not to do so in public), but I do think a lot of people are working on problems that most software developers wouldn’t really care about. It may still be perfectly good fundamental research, and it’s not necessarily obvious until you’ve made some progress on the problem whether it’s actually useful, but saying “Software is terrible and this is our solution to it” is not a great look if you’re working on something that will never actually solve the problem of bad software.

When I say “solve the problem” I do not mean that academics have an obligation to produce tools for industry. The job of an academic is not to do software engineering, it’s to explore the space of ideas. I do think that it’s good for academics to produce relatively solid and easy to use tools – I think it was John Regehr who pointed out to me that by doing so you make it much easier for other people to use you as a baseline and get a bajillion citations as a result, so there are even good incentives for this – but as I’ve discovered the hard way there’s a really large gap between doing that and actually making an impact on industry.

Instead I mean something closer to developing ideas such that you can “Just add engineering (and sales, and marketing, and all the rest)” to go from the research idea to actually producing such a tool. In this sense the original QuickCheck academia-solves the problem, and Quviq industry-solves the problem. Often you find out in the course of going from academia-solves to industry-solves that there were other problems you needed to academia-solve before you could complete the project, but that’s OK (personally I love it when this happens).

I think there is a lot of research where, no matter how much engineering, sales, and marketing you added, people just wouldn’t care about the solution. There is no path from academia-solves to industry-solves for this research, because the result would essentially have been yet another startup that spends five years trying to tell you about their great idea before ending up on Our Incredible Journey.

That may be fine! I’ve spent enough time as a pure mathematician to be entirely happy with people working on problems that will never be of use to anyone. Also, such work turns out to be useful after all surprisingly often, because its ideas can be reused in other areas that are useful.

But the disconnect between the promise (We want to change the world!) and the solutions (by implementing this tool that would make almost every software developer yawn and/or go “huh?” when we tried to explain it to them) does make me unhappy.

It’s not necessarily that I want any of the specific research I’m complaining about not to exist – like I said, I’m fine with it as foundational research. I even get why they feel the need to claim it’s practical research – there’s a certain social and financial obligation to do so in our field – and if they know that’s what they’re doing and are just gaming the system to do something they think is worthwhile, good for them.

But what makes me sad is that there are gaps in the research that I think would not be there if the crisis of buggy software were actually a problem we were trying to solve, and I think these gaps mostly exist because of misaligned incentives (admittedly I always think that).

The problem, I think, with most correctness research is that it has a major false consensus problem: Researchers believe (or at least act like they believe) that software developers at large are roughly like them, and so they build tools that they think are useful to those hypothetical versions of themselves that just happen to be working on different problems.

This is by no means an academia specific problem. It’s a problem that almost every single software development team has to some greater or lesser extent. There’s a reason the user experience people have to shout “You Are Not The User” so often in meetings (if they even get invited to meetings, and if they’re not then you have real problems. PS. Your company does have user experience people, right?).

But I think it shows up in academic software correctness research in one really important way (I’m sure it shows up in many ways, but this is the one that hits closest to home for me): The Oracle Problem.

Colloquially, the Oracle Problem is the issue that you can’t tell whether a piece of software is correct unless you know what it’s supposed to do.

When testing, every test has two parts (these may be quite intertwined, but they’re still in some sense logically distinct):

  1. A test case: the concrete set of inputs that the program is to be run on
  2. A test oracle: a way of determining whether the program is incorrect on those inputs (typically this only works in one direction: the oracle can’t say for sure that the program is exactly right on those inputs, but it can detect various ways in which it might be wrong)
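
To make the split concrete, here is a deliberately silly sketch of a test against Python’s built-in sorted: the first line is the test case, the assertion is the oracle.

```python
def test_sorting_preserves_length():
    xs = [3, 1, 2, 1]  # test case: the concrete input the program is run on
    # Test oracle: one way the result could be wrong. It can't show the result
    # is exactly right, only catch this particular kind of wrongness.
    assert len(sorted(xs)) == len(xs)
```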

And this split also marks a very large divide between academic research in software correctness and potential consumers of software correctness tools: for an academic, the problem of test-case generation is an interesting object of study which they know a great deal about, but the problem of finding a test oracle is typically boring and would require a great deal of non-research legwork to figure out what it is that this software is actually supposed to do.

Typically research works around this problem in one of a couple of ways:

  1. Use a test oracle that points out only things that are obviously bad – segfaults, assertion failures, deadlocks, etc.
  2. Use a specification to derive a test oracle (e.g. via metamorphic testing – finding changes to the input that should have no effect but do. My group at Imperial do a lot of this)
  3. Use multiple implementations of the same specification to do differential testing – run each implementation on the same input and see if you get different results (this is typically how e.g. Csmith is used)
  4. Try to guess an oracle – e.g. assume that the current behaviour is correct and use it for regression testing, try to build a simple model of the behaviour and treat deviations as potential bugs, etc.

The first can be useful, especially if you’re writing in an unsafe language, and the second and third are very useful when you are in the limited set of circumstances to which they can be applied. I have yet to be convinced that I have ever had a use case that would be served by tools in the fourth category, although I do believe they exist.
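
As a rough sketch of what the second and third of these look like in code (the callables here – run, add_dead_code, and the implementations – are hypothetical stand-ins rather than any particular tool):

```python
def differential_check(implementations, test_input):
    # Differential testing: several implementations of one specification should
    # agree on every input; any disagreement is a bug in at least one of them.
    results = [impl(test_input) for impl in implementations]
    assert all(r == results[0] for r in results), results

def metamorphic_check(run, program, program_input, add_dead_code):
    # Metamorphic testing: a transformation that should have no effect on
    # observable behaviour (e.g. adding dead code) but turns out to.
    assert run(program, program_input) == run(add_dead_code(program), program_input)
```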

If you want to know more about these solutions, I can recommend “The Oracle Problem in Software Testing: A Survey“. I won’t go into them more here.

But there’s a fifth category of solution to the problem (which is also mentioned in the survey article), which is the one that QuickCheck and its descendants adopt: Make the end user write the test oracle.

In many ways the most valuable insight implicit in the success of QuickCheck is that this order of priorities is reversed for software developers: Software developers don’t really want to write test-case generators, but as long as you let them write them in their host language then they don’t mind writing test oracles, because they’re already used to doing that every time they write a test! (In contrast they do seem to mind writing test cases, which they also have to do when writing a test. I think the distinction is that each test oracle feels unique, while writing test cases feels like drudge work, but I’m speculating wildly here).

From my point of view, this is a great solution. If you want to adequately test the behaviour of a piece of software then at some point you’re going to have to find out what it is supposed to do. Why not just let the end user tell you that?
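
Concretely, in Hypothesis that looks something like this (a minimal sketch): the library supplies the test-case generation, and the developer writes the oracle as an ordinary assertion, just as they would in any other test.

```python
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sorting_is_idempotent(xs):
    # Hypothesis generates the test cases (lists of integers); the assertion
    # below is the oracle, written by the developer in their host language.
    once = sorted(xs)
    assert sorted(once) == once
```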

I also think that if we want to make a dent on the bad software problem then this is the only solution that can possibly work. The overwhelming majority of software does not have any specification to speak of, but many software developers are willing and able to write tests (much software is completely untested. There is almost nothing we can do as correctness researchers to fix this problem, because if the authors of said software cared about the problem then they probably wouldn’t be in this situation). So correctness tools which actually make an impact on real world correctness have to get the developers to say what it is the software is actually supposed to do (and doing it in their host language is a much easier sell, though doing it in a derived language is viable, and people have shown willing to write tests in languages like Gherkin).

So, great, we make the users do all the work on solving this hard problem and we go back to working on the bits that we care about. What’s the problem?

Well, there are two big problems:

  1. It’s not the job of academia to build usable tools, and we/they usually don’t. So we don’t have any users to write these oracles.
  2. Even when tools have users, getting good empirical information about them is like pulling teeth. Especially when the tool is integrated into their development workflow so they don’t necessarily notice specific events involving it.

My personal research solves the first problem by starting from a widely used (518 open source users and counting, squee) tool rather than a research project. As to how to solve the second problem? Uh, I’ll get back to you.

Unfortunately, this means that the most useful version of the problem to work on is one that it is actually very hard for academia to do anything with – we mostly don’t have users, and if we did have users it would be hard to get a good set of examples for empirical evaluation out of them (especially if they’re corporate users).

These problems are surmountable, but doing so is a lot more work than just picking an open source project, running your tool against it, and seeing what happens.

How much of a problem is this? I don’t know. One could argue that it’s not one:

  • In some sense test-case generation is test-case generation, and a test oracle is just a way of making a program that crashes on some invalid behaviour, so there’s no reason you can’t take an arbitrary test-case generator and hook it up to an arbitrary “property-based test” that runs based on it.
  • Certainly when writing such tools for industry (which there seems to be very little financial incentive to do, which is a whole different problem. I’m not any more impressed with software correctness industry than I am with software correctness academia), you can still draw on a lot of the academic research that didn’t have this sort of focus – e.g. I learned a lot from reading the compiler testing literature despite not doing compiler testing.
  • The situation is not completely without hope – there’s a decent amount (though less than I’d like) of QuickCheck derived literature ticking along and e.g. Alex Groce’s TSTL is in another interesting corner of this space that isn’t a QuickCheck derived work (Update: Alex points out in the comments that in some sense it is a QuickCheck derived work, but I think it’s sufficiently different from QuickCheck in usage that I’d still count it as its own thing). That’s more or less it as far as examples I know about, but I hope there are others.

But I do think it creates certain significant gaps. For example:

  1. There’s been very little research on making it easy to write good, composable test-case generators outside the context of QuickCheck (and there hasn’t been that much in that context either, and I’m not super happy with any of the answers I’ve seen – see the sketch after this list for the kind of composability I mean). Most research I’ve seen on test-case generation works on the assumption that there’s more or less only one thing that you’re generating.
  2. It’s actually not true that a test-case generator is a test-case generator. Nezha is an example of how the fact that you’re doing differential testing can usefully inform your test case generation (mutational fuzzing in this case). Is the fact that we’re mostly testing on trivial properties in some way distorting the sorts of test cases we generate?
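
This is the kind of composability I mean, sketched with Hypothesis strategies (the particular strategies chosen are arbitrary):

```python
from hypothesis import strategies as st

# Small generators composed into generators for richer structures.
users = st.builds(
    dict,
    name=st.text(min_size=1),
    age=st.integers(min_value=0, max_value=120),
)
# A generator for lists of (user, score) pairs, built out of the above.
scored_users = st.lists(st.tuples(users, st.floats(0, 1)), max_size=10)
```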

How do we fix this? Gods, I don’t know. The number of incentives you have to change to fix it is just appalling, and I lack both the lever and the fulcrum to move them.

I do think the sort of thing I’m doing with Hypothesis of starting a research program based on an existing widely used tool is probably a good one, and that open source can form a lot of the bridge – both by providing the engineering that it’s not academia’s job to provide (not that it’s our job to provide it either, we’re just suckers who like doing work for free. Go us), and also by providing a ready source of visible users who can at least lower the bar for evaluation (even if it’s still not easy), but I don’t think this is even close to sufficient.

So fixing it? Pass. I bring you problems, not solutions.

It sure would be nice if we could at least acknowledge that this is a problem though.


An Unkind Question about Software Correctness Research

Dear Lazyweb,

I’m about to be something of a jerk, and I’d like your help in doing it.

I have noticed a common pattern in a lot of software correctness research papers, which often follow this format:

  1. Oh god software, it’s everywhere.
  2. And it doesn’t work very well at all, amirite?
  3. This is only going to get worse.
  4. We’re basically all doomed.
  5. Therefore in order to fix this we propose the following.
  6. We have made an improvement.
  7. To a category of tool that literally nobody uses.
  8. Happy to help.
  9. Love, The Authors.

What are your “favourite” examples of this genre? I’m particularly interested in examples from software testing research but I’ll take anything vaguely correctness related.

The “That nobody actually uses” part of it is also somewhat inessential – I’m mostly interested in good examples of the motivation section, and in many ways having examples from things people actually use is quite helpful because it gives me papers I can point to without feeling too mean about it!

Answers accepted in comments, emails, tweets, whatever.


Edit to add: I do not mean to suggest that such papers are necessarily bad papers. Many of them are interesting and intelligent research.

I’m also not interested in being mean to individual papers or authors (I originally approved a comment which I have since retracted because I decided it was too far in that direction).

What I want to call into question is our priorities. We have all of these introductions which claim that what we’re trying to do is solve these real world problems, but if we’re not addressing the fundamental question of why nobody is using our stuff then maybe we need to admit that that’s not actually what we’re trying to do.

As a pure mathematician I’m perfectly on board with a program of pure research, but as a software developer I am much more interested in the question of what a research program that actually cared about these questions would look like.


Refining L* search

Attention conservation notice: If you don’t care about fine-grained algorithmic improvements to inference algorithms for regular languages, you don’t care about this post.

Also, apologies: I know MathJax isn’t currently working, so you will see some raw LaTeX. If you turn on unsafe scripts (i.e. “not loaded over https”) it will work, but due to some technical hiccups this is currently difficult for me to fix. I’m on the case.

L* search (Angluin, D. Learning regular sets from queries and counterexamples. Inf. Comput. 75, 2 (1987), 87–106) is a really neat algorithm that lets you infer a deterministic finite automaton (DFA) for an unknown language.

I’ve long suspected that it would be useful in Hypothesis. Unfortunately, to date reality has disagreed with me. The key problem is that L* search is really expensive, and any attempt to use it would almost immediately blow Hypothesis’s normal budget for the number of tests run.

Starting from a trivial automaton that accepts no strings, it proceeds in two stages:

  1. It uses membership queries for the language to complete the DFA, adding new states when a transition provably does not lead to any of the existing known states, and stopping only when it has a consistent transition table for every known state and every member of the alphabet.
  2. It then asks for a counter-example to its conjectured DFA. If there is none, we have correctly inferred the language. If there is one, it is used to improve our ability to distinguish states, enlarging the number of states by at least one.

Because L* search is maintaining a minimal DFA, where the only states that are distinguished are ones that it can actually prove are inequivalent, this guarantees that the algorithm terminates with a correct DFA after at most \(n\) counter-example steps, where \(n\) is the number of states in the minimal DFA of the language.

It does this by maintaining a set of experiments. These are strings that can be used to distinguish states – if by following the experiment from two states you end up in an accepting state in one and a non-accepting state in another, the two states must be inequivalent.
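
In code the distinguishing check is tiny – a sketch, with member standing in for the membership query and states represented by their access strings:

```python
def distinguishable(s, t, experiments, member):
    # Two access strings provably reach inequivalent states if some experiment
    # leads to an accepting state from one and a non-accepting state from the other.
    return any(member(s + e) != member(t + e) for e in experiments)
```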

The rest of this post assumes you understand L* search. If you don’t, I recommend reading my prior more detailed explanation of it.

I use an improvement proposed in “Rivest, R. L., and Schapire, R. E. Inference of finite automata using homing sequences. Inf. Comput. 103, 2 (1993), 299–347” that lets you take the cost of the counter-example step down to logarithmic in the length of the counter-example and ensures that each counter-example only adds one experiment to the list.

The problem is that the completion step is still extremely expensive: It requires evaluating \(|A||S||E|\) membership queries, where \(A\) is the alphabet, \(S\) is the set of states we currently have, and \(E\) is the set of experiments we currently have (you can remove some of these because some state transitions are provable from the representation of the states as strings – \(s\) always transitions to \(sa\) under \(a\) if both strings are in the state set – but you can’t remove more than \(|S| - 1\) evaluations this way).

One of the problems I have with using this in Hypothesis is that \(|A| = 256\), which is both intrinsically fairly large and also actually larger than Hypothesis’s typical testing budget. Thus there’s no way to fit even one L* pass into a normal test run!

So, can we get that \(|A|\) factor down?

It turns out, yes, at the cost of potentially needing more counter-examples (which is a cost I’m entirely willing to pay).

The observation follows one made in “Owens, S., Reppy, J. H., and Turon, A. Regular-expression derivatives re-examined. J. Funct. Program. 19, 2 (2009), 173–190.”, which is that when you have a large alphabet you will typically find that, at a given state in a DFA, many characters are equivalent to each other (that is, they result in a transition to the same state). For example, when \(|A| \geq |S|\), even though there are \(|A|\) possible valid characters at a given state there are at most \(|S|\) up to equivalence.

If we take advantage of this, we could reduce the number of queries substantially, and it turns out that it is easy to do.

The idea is simple: When we create a state, we initialise a partition refinement data structure alongside it. Initially this is set to consider all members of the alphabet to be in the same equivalence class.

When expanding our table, we now assume that any characters in the same partition are equivalent, and we only actually perform membership queries with, say, the smallest element of the partition. Thus instead of the \(|A||S|\) term we only have a \(\sum |P_i|\) term, where \(P_i\) is the set of partitions associated with state \(i\). In particular this is \(O(|S| \min(|A|, |S|))\).

Then when we find an experiment \(e\) and a transition \(s \to t\) via \(c\), instead of immediately adding \(e\) to our experiment set we check whether we just have too coarse a partition. We do this by checking whether \(sce \in \mathcal{L} \iff sc'e \in \mathcal{L}\), where \(c'\) is the “canonical” element of the partition that we selected when determining that this was the correct transition (if \(c = c'\) we can skip this bit). If they disagree then we refine the partition, splitting it in two according to whether \(sde \in \mathcal{L}\) for each \(d\) in the partition. We then add the experiment to the list and rebuild the DFA. If this does not result in expanding our list of states (which can only happen if we did end up refining the partition), we then remove the experiment from the list! This can happen because we can end up with partitions that our existing experiments were already sufficient to distinguish but where we never checked, and by not adding redundant experiments in that case we bound the number of experiments by the number of states.
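
Here is a sketch of the per-state partition structure this needs (simplified – in particular it ignores how the canonical characters feed back into the rest of the table):

```python
class AlphabetPartition:
    """Characters conjectured to be equivalent at a single state."""

    def __init__(self, alphabet):
        # Initially every character is assumed to transition to the same state.
        self.classes = [sorted(alphabet)]

    def canonical(self, c):
        # The representative we actually spend membership queries on.
        for cls in self.classes:
            if c in cls:
                return cls[0]
        raise ValueError(c)

    def refine(self, test):
        # Split each class according to test (e.g. whether s + d + e is in the
        # language), keeping conjectured-equivalent characters together.
        new_classes = []
        for cls in self.classes:
            matched = [d for d in cls if test(d)]
            unmatched = [d for d in cls if not test(d)]
            new_classes.extend(group for group in (matched, unmatched) if group)
        self.classes = new_classes
```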

This is correct for the same reasons that L* is correct – at any given stage, either our partitions are correct, or we refine them. However it will take an additional \(\sum (|P_i| - 1)\) counter-example steps to converge (where the \(P_i\) here are the partitions in the final DFA, i.e. the true partitions).

Counter-examples are somewhat costly – regardless of how much it costs to find one (which may be “free” in some sense), a counter-example \(u\) takes \(O(\log(|u|))\) membership queries to process. If we assume all counter-examples are minimized, so that \(|u| \leq n\) where \(n\) is the number of states in the true minimal DFA, then that’s \(O(\log(n))\).

Suppose the true minimal DFA has \(n\) states and that \(m = \sum |P_i|\) in it. Then at each stage we either add one or more states, or we increase the number of partitions, and thus we perform at most \(n + \sum (|P_i| - 1) = n + m - n = m\) counter-example queries.

Each completion step takes \(O(m |E|)\) to run, and we only added experiments in the case where we increased the number of states, so each step runs in \(O(m n)\), and the entire thing runs in at most \(O((mn + \log(n)) m)\). \(m \geq n\), so we can discard the \(\log(n)\) term and the whole thing runs in \(O(m^2 n)\).

In contrast, in classic L* each step takes \(O(|A| n^2)\) and we run at most \(n\) steps, so the complexity is \(O(|A| n^3)\). So this is a complexity improvement if and only if \(n^2 |A| \geq m^2\) asymptotically, or in other words if \(\frac{m}{n} \leq \sqrt{|A|}\) (technically this is an invalid way to reason about asymptotics, but those were pretty tight estimates). With \(|A| = 256\), for example, that means the refinement wins as long as states have no more than about 16 character equivalence classes on average.

\(\frac{m}{n}\) is the average number of distinct equivalent classes over states. For my use case I expect this to be very low – typically the number of equivalence classes will be one!

For the more general use case this isn’t obviously a win – if \(|A|\) is small it almost certainly isn’t. e.g. when \(|A| = 2\) this condition is equivalent to saying that 60% of states don’t depend on the character to determine their transition at all.

I haven’t attempted to reason about it much, but one possible fix would be an adaptive algorithm that tracks the average on the states and partitions found so far, and as soon as it exceeds \(\sqrt{|A|}\) switches over to normal L*. My suspicion is that if you measure cost purely in terms of membership queries, this adaptive version is a substantial saving on almost any language model.


Novelty Requires Explanation

Epistemic status: Reasonably confident, but I should probably try to back this up with numbers about how often elementary results actually do get missed.

Attention conservation notice: More than a little rambling.

Fairly regularly you see news articles about how some long-standing problem that has stumped experts for years has been solved, usually with some nice simple solution.

This might be a proof of some mathematical result, a translation of the Voynich manuscript, a theory of everything. Those are the main ones I see, but I’m sure there are many others that I don’t see.

These are almost always wrong, and I don’t even bother reading them any more.

The reason is this: If something is both novel and interesting, it requires an explanation: Why has nobody thought of this before?

Typically, these crackpot solutions (where they’re not entirely nonsensical) are so elementary that someone would surely have discovered them before now.

Even for non-crackpot ideas, I think this question is worth asking when you discover something new. As well as being a useful validity check for finding errors and problems, if there is a good answer then it can often be enlightening about the problem space.

Potentially, it could also be used as a heuristic in the other direction: If you want to discover something new, look in places where you would have a good answer to this question.

There are a couple of ways this can play out, but most of them boil down to numbers: If a lot of people have been working on a problem for a long time during which they could have discovered your solution, they probably would have. As nice as it would be to believe that we were uniquely clever compared to everyone else, that is rarely the case.

So an explanation basically needs to show some combination of:

  1. Why not many people were working on the problem
  2. Why the time period during which they could have discovered your technique is small

The first is often a bad sign! If not many people work on the problem, it might not be very interesting.

This could also be a case of bad incentives. For example, I’ve discovered a bunch of new things about test case reduction, and I’m pretty sure most of that is because not many people work on test case reduction: It’s a useful tool (and I think the problem is interesting!), but it’s a very niche problem at a weird intersection of practical needs and academic research where neither side has much of a good incentive to work on it.

As a result, I wouldn’t be surprised if an appreciable percentage of the person-hours ever spent on test-case reduction were done by me! Probably not 10%, but maybe somewhere in the region of 1-5%. This makes it not very surprising for me to have discovered new things about it even though the end result is useful.

More often I find that I’m just interested in weird things that nobody else cares about, which can be quite frustrating and can make it difficult to get other people excited about your novel thing. If that’s the case, you’re probably going to have a harder time marketing your novel idea than you did discovering it.

The more interesting category of problem is the second: Why have the people who are already working on this area not previously thought of this?

The easiest way out of this is simply incremental progress: If you’re building on some recent discovery then there just hasn’t been that much time for them to discover it, so you’ve got a reasonable chance of being the first to discover it!

Another way is by using knowledge that they were unlikely to have – for example, by applying techniques from another discipline with little overlap in practice with the one the problem is from. Academia is often surprisingly siloed (but if the problem is big enough and the cross-disciplinary material is elementary enough, this probably isn’t sufficient. It’s not that siloed).

An example of this seems to be Thomas Royen’s recentish proof of the Gaussian Correlation Inequality (disclaimer: I don’t actually understand this work). He applied some fairly hairy technical results that few people working on the problem were likely to be familiar with, and as a result was able to solve something people had been working on for more than 50 years.

A third category of solution is to argue that everyone else had a good chance of giving up before finding your solution: e.g. if the solution is very complicated or involved, it has a much higher chance of being novel (and also a much higher chance of being wrong, of course)! Another way this can happen is if the approach looks discouraging in some way.

Sometimes all of these combine. For example, I think the core design of Hypothesis is a very simple, elegant, idea, that just doesn’t seem to have been implemented before (I’ve had a few people dismissively tell me they’ve encountered the concept before, but they never could point me to a working implementation).

I think there are a couple reasons for this:

  1. Property-based testing just doesn’t have that many people working on it. The number might top 100, but I’d be surprised if it topped 200 (Other random testing approaches could benefit from this approach, but not nearly as much. Property-based testing implements lots of tiny generators and thus feels many of the problems more acutely).
  2. Depending on how you count, there’s maybe been 20 years during which this design could have been invented.
  3. Simple attempts at this approach work very badly indeed (In a forthcoming paper I have a hilarious experiment in which I show that something only slightly simpler than what we do completely and totally fails to work on the simplest possible benchmark).

So there aren’t that many people working on this, they haven’t had that much time to work on it, and if they’d tried it it probably would have looked extremely discouraging.

In contrast I have spent a surprising amount of time on it (largely because I wanted to and didn’t care about money or academic publishing incentives), and I came at it the long way around so I was starting from a system I knew worked, so it’s not that surprising that I was able to find it when nobody else had (and does not require any “I’m so clever” explanations).

In general there is of course no reason that there has to be a good explanation of why something hasn’t been discovered before. There’s no hard cut off line where something goes from “logically must have been discovered” to “it’s completely plausible that you’re the first” (discontinuous functions don’t exist!), it’s just a matter of probabilities. Maybe it’s very likely that somebody hasn’t discovered it before, but maybe you just got lucky. There are enough novel things out there that somebody is going to get lucky on a fairly regular basis, it’s probably just best not to count on it being you.

PS. I think it very unlikely this point is novel, and I probably even explicitly got it from somewhere else and forgot where. Not everything has to be novel to be worthwhile.


Times to exhaust finite distributions when sampling with replacement

Due to reasons, a maths problem has been utterly hijacking my brain since Friday. I think I now have it solved to my satisfaction. I’m morally certain this is nothing new and has been solved a thousand times before, but I couldn’t figure out the right keywords to Google, and each time I posed it clearly enough to ask for help I realised a new avenue to explore and so ended up solving it before I had to ask.

The problem boils down to this:

Suppose I have some random variable \(X\), taking values \(1, \ldots, n\) with probabilities \(P(X = i) = p_i > 0\).

Update: steve mcc points out that the uniform version of this is normally called the Coupon Collector’s Problem. I have definitely heard that before but could not for the life of me remember it. Given that term it’s easy to find a bunch of prior art on this. I’ve yet to do a review. I still found the specific details interesting and it was fun to work on, but it’s up to you if you want to read this given that…

Now suppose I have infinitely many independent \(X_i\) each with the same distribution as \(X\).

How long do we expect it to take until we’ve seen every value \(1, \ldots, n\)? i.e. if \(T\) is the first \(i\) such that for every \(1 \leq k \leq n\) there is some \(j \leq i\) with \(X_j = k\), what is \(E(T)\)?

I don’t have an exact calculation for \(E(T)\) and my experiments on special cases suggest that there is no nice formula. I nevertheless consider the problem solved to my satisfaction by the following three results:

  1. \(E(T) \geq n H(n)\), where \(H(n) = 1 + \frac{1}{2} + \ldots + \frac{1}{n}\) is the nth harmonic number, and this bound is achieved when \(X\) is the uniform distribution.
  2. \(E(T) \leq \sum \frac{1}{p_i}\), and this bound is achievable in the limit (and possibly exactly but I don’t think so) by some distributions I will describe below.
  3. \(E(T) \geq \frac{n^2 - 1}{4 E(X)}\). This bound is not especially tight but is good enough for asymptotics.

In particular, if \(X\) comes from a family of distributions in which \(n\) is allowed to vary and \(E(X)\) remains bounded, \(E(T)\) is asymptotically at least quadratic.
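
Before the proofs, here’s a quick way to sanity-check these numerically (a simulation sketch – none of this is part of the argument):

```python
import random
from math import fsum

def time_to_exhaust(probs, rng=random):
    # One sample of T: keep drawing until every value 0..n-1 has been seen.
    seen, draws = set(), 0
    while len(seen) < len(probs):
        seen.add(rng.choices(range(len(probs)), weights=probs)[0])
        draws += 1
    return draws

def estimate_e_t(probs, samples=10_000):
    return fsum(time_to_exhaust(probs) for _ in range(samples)) / samples

n = 10
uniform = [1 / n] * n
print(estimate_e_t(uniform), n * fsum(1 / k for k in range(1, n + 1)))
# The estimate should be close to n * H(n) ≈ 29.29, where the first bound is exact.
```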

Proofs:

Suppose \(f(n)\) is a lower bound on \(E(T)\) for distributions on \(n\) values. Suppose we draw \(i\) for our first draw. Then we may reduce exhausting \(\{1, \ldots, n\}\) to exhausting the other values, which takes at least \(f(n - 1)\) draws of values other than \(i\). But each such draw takes in expectation \(\frac{1}{1 - p_i}\) draws of \(X\), as the time to draw a value other than \(i\) is a geometric distribution with parameter \(1 - p_i\).

Therefore \(E(T) \geq 1 + f(n - 1)\sum \frac{p_i}{1 - p_i}\). This sum is minimized when the \(p_i\) all have the same value, which must be \(\frac{1}{n}\). So by substituting in, we have \(E(T) \geq 1 + \frac{n}{n - 1} f(n - 1)\), with equality when \(X\) is uniform. Thus we can take \(f(n) = 1 + \frac{n}{n - 1} f(n - 1)\). \(n H(n) = n (H(n - 1) + \frac{1}{n}) = 1 + n H(n - 1) = 1 + \frac{n}{n - 1} (n - 1) H(n - 1)\), and thus \(n H(n)\) is the solution to this recurrence relationship.

Thus \(E(T) \geq n H(n)\) and this bound is tight when \(X\) is uniform.

To see the upper bound, consider the following process: Let \(S_0 = 0\) and let \(S_k\) be the first \(i > S_{k - 1}\) with \(X_i = k\). Then certainly \(S_n \geq T\), as by that point we must have seen each value, and so \(E(T) \leq E(S_n)\). But \(S_k - S_{k - 1}\) is a geometric distribution with parameter \(p_k\), so \(E(S_k - S_{k - 1}) = \frac{1}{p_k}\). Thus by summing up the terms we have \(E(S_n) = \sum \frac{1}{p_k}\) as desired.
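
For example, in the uniform case this gives \(\sum \frac{1}{p_i} = n^2\), while the true value is \(n H(n)\), so the gap between the lower and upper bounds can be large even when the lower one is exact.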

To see that this is tight is a bit of a mess and I’m going to omit the calculations. The family of distributions that demonstrates it are as follows: Let \(p_n = \epsilon^{n - 1}\) and \(p_i = \epsilon^{i - 1}(1 - \epsilon)\) for \(i < n\) (The intuition here is that each \(i\) captures \(1 - \epsilon\) worth of the remaining probability). Then both the bound and \(E(T)\) end up as \(\epsilon^{1 - n} + o(\epsilon^{1 - n})\), so as \(\epsilon \to 0\) their ratio converges to \(1\).

The final bound is the most interesting one, and I think captures a lot of the intuition that \(E(T)\) is minimized by being “more uniform”. Because the value of \(E(T)\) is invariant under permutations of \(\{1, \ldots, n\}\), we can reorder the values such that \(p_{i + 1} \leq p_i\).

For distributions satisfying this constraint, \(E(X)\) is strictly maximized by the uniform distribution (that is, the maximum value is obtained on the uniform distribution and any other distribution attains a strictly smaller value).

To prove the bound, first observe that if \(p = P(X \geq i)\) then \(E(T) \geq \frac{n - i}{p}\) (in fact \(E(T) \geq \frac{(n - i) H(n - i)}{p}\), but the calculations become messier with that factor in, so I chose to drop the harmonic term).

The reason is that to finish we must draw each of the \(n - i\) values which are strictly greater than \(i\), and every such draw lands on a value \(\geq i\); since a draw lands there with probability only \(p\) (the other \(1 - p\) worth of draws end up drawing \(< i\)), we need at least \(\frac{n - i}{p}\) draws in expectation.

But we can then use the result that \(P(X \geq i) \leq \frac{E(X)}{i}\) (because \(X \geq 0\)). Thus by combining these we have \(E(T) \geq \frac{i (n - i)}{E(X)}\).

\(i\) is arbitrary, so we can choose it to maximize this expression. If \(n\) is even it is maximised by \(i = \frac{n}{2}\); if \(n = 2k + 1\) it is maximised by \(i = k\). Combining these cases we get a maximised value of at least \(\frac{n^2 - 1}{4}\). Plugging this into our original bound, we get the desired result.

You can get more bounds by considering the same argument with \(E(X^k)\) for arbitrary \(k\). The general bound is actually that \(E(T) \geq \frac{n^{k + 1}}{(k + 1)(1 + k^{-1})^k E(X^k)} + O(n^k)\), but the details are messy so I’ve omitted them here. These bounds are still interesting, as when \(E(X^k)\) is small this indicates that \(X\) is very tightly concentrated around small values of \(i\), which causes discovering all of the values to take a very long time. As we saw with the example for demonstrating that the \(\sum \frac{1}{p_i}\) bound was tight, it can even be exponential in \(n\)!

You can also get similar bounds using exponential moments, e.g. \(E(T) \geq \frac{e^{s(n - 1)}}{E(e^{sX})}\), so if \(s\) is small enough that \(E(e^{sX})\) is bounded as \(n\) varies then we will see this sort of exponential growth (you can get slightly tighter bounds if \(s \geq \frac{1}{n}\) but it’s probably not worth bothering).

If you found this post interesting, may I highly recommend “Birthday paradox, coupon collectors, caching algorithms and self-organizing search.“, which I found only after steve mcc put me on to the right search terms. It doesn’t contain these bounds, so is in that sense orthogonal, but it has some really cool general techniques.


So what’s this all about?

I was interested in the question of how well random testing can do at exploring its space. In this model each \(X = i\) represents the random tester finding a distinct example up to equivalence.

Typical experience is that random testing “plateaus” – its rate of discoveries goes down over time – and that seems to be more or less backed by this model: \(E(T)\) is the amount of time it takes to explore its feasible search space, and this grows superlinearly in \(n\).

In the case where you manage to be fully uniform over inequivalent states, this isn’t too bad – a logarithmic slow down is pretty manageable – but if there is any sort of concentration around some common point (which will be \(i = 1\) after reordering), it is likely that finding new examples becomes much more expensive over time as each new example is discovered.

In order to say how applicable this is we’d need to do some sort of studying of what random testers actually do, and we’d need some sort of equivalence oracle which is basically impossible, so for the moment I don’t have any real way of actually applying this result to the problem that motivated it, but it is at least suggestive.
