Times to exhaust finite distributions when sampling with replacement

Due to reasons, a maths problem has been utterly hijacking my brain since Friday. I think I now have it solved to my satisfaction. I’m morally certain this is nothing new and has been solved a thousand times before, but I couldn’t figure out the right keywords to Google, and each time I posed it clearly enough to ask for help I realised a new avenue to explore, and so ended up solving it before I had to ask.

The problem boils down to this:

Suppose I have some random variable \(X\), taking values \(1, \ldots, n\) with probabilities \(P(X = i) = p_i > 0\).

Update: steve mcc points out that the uniform version of this is normally called the Coupon Collector’s Problem. I have definitely heard that before but could not for the life of me remember it. Given that term it’s easy to find a bunch of prior art on this. I’ve yet to do a review. I still found the specific details interesting and it was fun to work on, but it’s up to you if you want to read this given that…

Now suppose I have infinitely many independent \(X_i\) each with the same distribution as \(X\).

How long do we expect it to take until we’ve seen every value \(1, \ldots, n\)? That is, if \(T\) is the random variable whose value is the first \(i\) such that for every \(1 \leq k \leq n\) there is some \(j \leq i\) with \(X_j = k\), what is \(E(T)\)?
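Before getting to the results, here is a minimal Python sketch for estimating \(E(T)\) by simulation, in case you want to experiment along (the names in it are my own, nothing standard; it uses 0-based indices rather than \(1, \ldots, n\), which doesn’t affect the answer):

```python
# A minimal Monte Carlo sketch for estimating E(T).
import random

def time_to_exhaust(p):
    """Draw i.i.d. samples weighted by p until every index has appeared."""
    seen = set()
    draws = 0
    while len(seen) < len(p):
        draws += 1
        seen.add(random.choices(range(len(p)), weights=p)[0])
    return draws

def estimate_expected_time(p, trials=10_000):
    return sum(time_to_exhaust(p) for _ in range(trials)) / trials

# e.g. uniform on 10 values; compare with 10 * H(10), which is about 29.29
print(estimate_expected_time([0.1] * 10))
```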

I don’t have an exact calculation for \(E(T)\), and my experiments on special cases suggest that there is no nice formula. I nevertheless consider the problem solved to my satisfaction by the following three results:

  1. \(E(T) \geq n H(n)\), where \(H(n) = 1 + \frac{1}{2} + \ldots + \frac{1}{n}\) is the \(n\)th harmonic number, and this bound is achieved when \(X\) is the uniform distribution.
  2. \(E(T) \leq \sum \frac{1}{p_i}\), and this bound is achievable in the limit (and possibly exactly, but I don’t think so) by some distributions I will describe below.
  3. \(E(T) \geq \frac{n^2 - 1}{4 E(X)}\). This bound is not especially tight, but it is good enough for asymptotics.

In particular, if \(X\) comes from a family of distributions in which \(n\) is allowed to vary and \(E(X)\) remains bounded, \(E(T)\) is asymptotically at least quadratic.
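To make these concrete, here is a small Python sketch comparing all three bounds against an exact computation of \(E(T)\) for small \(n\). The exact value uses a standard inclusion-exclusion identity that I don’t prove in this post: \(E(T) = \sum_{\emptyset \neq S \subseteq \{1, \ldots, n\}} (-1)^{|S| + 1} / p(S)\), where \(p(S) = \sum_{i \in S} p_i\); it follows by summing \(P(T > t)\) over \(t\).

```python
# Compare the three bounds against an exact E(T) for small n, using the
# inclusion-exclusion identity described above. Only practical for small n,
# since it sums over all 2^n - 1 nonempty subsets.
from itertools import combinations

def exact_expected_time(p):
    return sum(
        (-1) ** (size + 1) / sum(subset)
        for size in range(1, len(p) + 1)
        for subset in combinations(p, size)
    )

def harmonic(n):
    return sum(1.0 / k for k in range(1, n + 1))

p = [0.4, 0.3, 0.2, 0.1]  # an arbitrary example, sorted so p_{i+1} <= p_i
n = len(p)
mean = sum((i + 1) * pi for i, pi in enumerate(p))  # E(X), values being 1..n

print("exact E(T)    =", exact_expected_time(p))
print("lower bound 1 =", n * harmonic(n))           # n H(n)
print("upper bound 2 =", sum(1 / pi for pi in p))   # sum of 1/p_i
print("lower bound 3 =", (n * n - 1) / (4 * mean))  # (n^2 - 1) / (4 E(X))
```

For the uniform distribution the exact value agrees with the first bound, which makes a handy sanity check.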

Proofs:

Suppose \(f(n)\) is a lower bound on \(E(T)\) for distributions on \(n\) values. Suppose we draw \(i\) on our first draw. Then exhausting \(\{1, \ldots, n\}\) reduces to exhausting the other \(n - 1\) values, which takes in expectation at least \(f(n - 1)\) draws of the conditional distribution on values other than \(i\). But each such draw takes in expectation \(\frac{1}{1 - p_i}\) draws of \(X\), as the time to draw a value other than \(i\) is geometrically distributed with parameter \(1 - p_i\).

Therefore \(E(T) \geq 1 + f(n - 1)\sum \frac{p_i}{1 - p_i}\). This sum is minimized when the \(p_i\) are all equal, which forces each to be \(\frac{1}{n}\). So by substituting in, we have \(E(T) \geq 1 + \frac{n}{n - 1} f(n - 1)\), with equality when \(X\) is uniform. Thus we may take \(f\) to satisfy the recurrence \(f(n) = 1 + \frac{n}{n - 1} f(n - 1)\). Now, \(n H(n) = n (H(n - 1) + \frac{1}{n}) = 1 + n H(n - 1) = 1 + \frac{n}{n - 1} (n - 1) H(n - 1)\), and thus \(n H(n)\) is the solution to this recurrence (with the base case \(f(1) = 1 = 1 \cdot H(1)\)).

Thus \(E(T) \geq n H(n)\) and this bound is tight when \(X\) is uniform.
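As a quick sanity check on the algebra, the recurrence is easy to verify numerically:

```python
# Check that f(n) = n * H(n) satisfies f(n) = 1 + (n / (n - 1)) * f(n - 1).
def harmonic(n):
    return sum(1.0 / k for k in range(1, n + 1))

for n in range(2, 10):
    lhs = n * harmonic(n)
    rhs = 1 + n / (n - 1) * ((n - 1) * harmonic(n - 1))
    assert abs(lhs - rhs) < 1e-9, (n, lhs, rhs)
print("n * H(n) satisfies the recurrence for n = 2..9")
```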

To see the upper bound, consider the following process: let \(S_0 = 0\) and let \(S_k\) be the first \(i > S_{k - 1}\) with \(X_i = k\). Then certainly \(S_n \geq T\), as by that point we must have seen every value, and so \(E(T) \leq E(S_n)\). But \(S_k - S_{k - 1}\) is geometrically distributed with parameter \(p_k\), so \(E(S_k - S_{k - 1}) = \frac{1}{p_k}\). Summing over \(k\), we have \(E(S_n) = \sum \frac{1}{p_k}\) as desired.
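Here is a small simulation of that sequential process, to illustrate that the mean of \(S_n\) does come out as \(\sum \frac{1}{p_k}\) (again, the names are my own):

```python
# Simulate the sequential process from the proof: wait for value 0, then
# value 1, and so on. The empirical mean of S_n approaches sum(1/p_k).
import random

def s_n(p):
    """Return S_n: total draws taken waiting for each value in turn."""
    draws = 0
    for k in range(len(p)):
        # Keep drawing until value k appears; this leg is geometric with
        # parameter p[k], so it contributes 1/p[k] draws in expectation.
        while random.choices(range(len(p)), weights=p)[0] != k:
            draws += 1
        draws += 1  # count the successful draw too
    return draws

p = [0.5, 0.3, 0.2]
trials = 20_000
print(sum(s_n(p) for _ in range(trials)) / trials)  # about 2 + 3.33 + 5 = 10.33
```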

To see that this is tight is a bit of a mess and I’m going to omit the calculations. The family of distributions that demonstrates it is as follows: let \(p_n = \epsilon^{n - 1}\) and \(p_i = \epsilon^{i - 1}(1 - \epsilon)\) for \(i < n\) (the intuition here is that each \(i\) captures \(1 - \epsilon\) worth of the remaining probability). Then both the bound and \(E(T)\) end up as \(\epsilon^{1 - n} + o(\epsilon^{1 - n})\), so as \(\epsilon \to 0\) their ratio converges to \(1\).
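Here is that family in code, reusing the inclusion-exclusion computation of the exact \(E(T)\) from the earlier sketch. The ratio of \(\sum \frac{1}{p_i}\) to the exact value visibly heads towards \(1\) as \(\epsilon\) shrinks:

```python
# The family above: p_i = eps^(i-1) * (1 - eps) for i < n, p_n = eps^(n-1).
# Uses the same inclusion-exclusion formula for the exact E(T) as before,
# so it is only feasible for small n.
from itertools import combinations

def exact_expected_time(p):
    return sum(
        (-1) ** (size + 1) / sum(subset)
        for size in range(1, len(p) + 1)
        for subset in combinations(p, size)
    )

n = 5
for eps in [0.5, 0.1, 0.01]:
    p = [eps ** (i - 1) * (1 - eps) for i in range(1, n)] + [eps ** (n - 1)]
    upper = sum(1 / pi for pi in p)
    print(f"eps = {eps}: bound / exact = {upper / exact_expected_time(p):.4f}")
```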

The final bound is the most interesting one, and I think captures a lot of the intuition that \(E(T)\) is maximized by being “more uniform”. Because the value of \(E(T)\) is invariant under permutations of \(\{1, \ldots, n\}\), we can reorder the values such that \(p_{i + 1} \leq p_i\).

For distributions satisfying this constraint, \(E(X)\) is strictly maximized by the uniform distribution (that is, the maximum value is attained at the uniform distribution, and any other distribution attains a strictly smaller value).

To prove the bound, first observe that if \(p = P(X \geq i)\) then \(E(T) \geq \frac{n - i}{p}\) (in fact \(E(T) \geq \frac{(n - i + 1) H(n - i + 1)}{p}\), by applying the first bound to the conditional distribution on values at least \(i\), but the calculations become messier with the harmonic factor in, so I chose to drop it).

The reason is that to finish we must draw each of the \(n - i + 1 > n - i\) values that are at least \(i\), and each takes an expected number of draws at least \(\frac{1}{p}\) to appear, because \(1 - p\) worth of draws land on values less than \(i\).

But we can then use Markov’s inequality, \(P(X \geq i) \leq \frac{E(X)}{i}\) (which holds because \(X \geq 0\)). Combining these, we have \(E(T) \geq \frac{i (n - i)}{E(X)}\).

\(i\) is arbitrary, so we can choose it to maximize this expression. If \(n\) is even it is maximized by \(i = \frac{n}{2}\), giving \(\frac{n^2}{4}\); if \(n = 2k + 1\) it is maximized by \(i = k\) (or \(k + 1\)), giving \(k(k + 1) = \frac{n^2 - 1}{4}\). Combining these cases, the maximized value is at least \(\frac{n^2 - 1}{4}\). Plugging this into our original bound, we get the desired result.
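The maximization step is easy to check numerically for both parities of \(n\):

```python
# Check that the integer maximum of i * (n - i) is always at least
# (n^2 - 1) / 4, covering the even and odd cases at once.
for n in range(2, 50):
    best = max(i * (n - i) for i in range(1, n))
    assert best >= (n * n - 1) / 4, (n, best)
print("max of i * (n - i) >= (n^2 - 1) / 4 for n = 2..49")
```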

You can get more bounds by considering the same argument with \(E(X^k)\) for arbitrary \(k\): Markov’s inequality gives \(P(X \geq i) \leq \frac{E(X^k)}{i^k}\), so \(E(T) \geq \frac{i^k (n - i)}{E(X^k)}\). The general bound that comes out of maximizing over \(i\) is \(E(T) \geq \frac{n^{k + 1}}{(k + 1) (1 + k^{-1})^k E(X^k)} + O(n^k)\), but the details are messy so I’ve omitted them here. These bounds are still interesting, as when \(E(X^k)\) is small this indicates that \(X\) is very tightly concentrated around small values of \(i\), which causes discovering all of the values to take a very long time. As we saw with the example demonstrating that the \(\sum \frac{1}{p_i}\) bound was tight, it can even be exponential in \(n\)!
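If you don’t fancy wading through the algebra, the leading constant is easy to check numerically: the maximizer of \(i^k (n - i)\) sits near \(i = \frac{kn}{k + 1}\), and the integer maximum approaches the claimed constant as \(n\) grows.

```python
# Check the leading constant in the k-th moment bound: the integer maximum
# of i^k * (n - i) should approach n^(k+1) / ((k + 1) * (1 + 1/k)^k).
for k in [1, 2, 3]:
    n = 1000
    best = max(i ** k * (n - i) for i in range(1, n))
    predicted = n ** (k + 1) / ((k + 1) * (1 + 1 / k) ** k)
    print(f"k={k}: best / predicted = {best / predicted:.5f}")  # close to 1
```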

You can also get similar bounds using exponential moments: e.g. taking \(i = n - 1\) gives \(E(T) \geq \frac{e^{s(n - 1)}}{E(e^{sX})}\), so if \(s\) is small enough that \(E(e^{sX})\) remains bounded as \(n\) varies then we will see this sort of exponential growth (you can get slightly tighter bounds if \(s \geq \frac{1}{n}\), but it’s probably not worth bothering).

If you found this post interesting, may I highly recommend “Birthday paradox, coupon collectors, caching algorithms and self-organizing search”, which I found only after steve mcc put me on to the right search terms. It doesn’t contain these bounds, so is in that sense orthogonal, but it has some really cool general techniques.


So what’s this all about?

I was interested in the question of how well random testing can do at exploring its space. In this model, each value \(i\) represents a distinct example up to equivalence, and drawing \(X = i\) corresponds to the random tester finding an example in class \(i\).

Typical experience is that random testing “plateaus”: its rate of discovery goes down over time. That seems to be more or less backed by this model: \(E(T)\) is the amount of time it takes to explore its feasible search space, and this grows superlinearly in \(n\).

In the case where you manage to be fully uniform over inequivalent states, this isn’t too bad (a logarithmic slowdown is pretty manageable), but if there is any sort of concentration around some common point (which will be \(i = 1\) after reordering), finding new examples is likely to become much more expensive over time as each new example is discovered.

To say how applicable this is, we’d need to study what random testers actually do, and we’d need some sort of equivalence oracle, which is basically impossible. So for the moment I don’t have any real way of applying this result to the problem that motivated it, but it is at least suggestive.
