David R. MacIver's Blog: Faster random sampling by handling rejection well

Faster random sampling by handling rejection well

1 January 2017

Attention conservation notice: You probably don’t have the problem this post solves. It might be interesting anyway.

As part of some ongoing development for Hypothesis, I find myself in need of a well optimised solution to the discrete sampling problem: We have weights \(w_1, \ldots, w_n\) and we want to produce a number in \(\{1, \ldots, n\}\) such that \(i\) is drawn with probability proportional to \(w_i\).

All solutions to this problem are intrinsically \(O(n)\): If you have \(w_i = 1\) for \(i \neq j\) and \(w_j = 10^6\) then you need to inspect \(j\) to know whether to generate \(i \neq j\) often or rarely.

You can however use that \(O(n)\) work once to build a data structure for repeated drawing. For example you can use the alias method to build a data structure that lets you do future draws in \(O(1)\).

Which is great and all if you’re expecting to repeatedly draw from the same distribution over and over again. Unfortunately that’s not what my workload looks like - the distributions used are many and varied. Distributions will be reused, but not in any easy to predict manner.

You can of course do cached lookup to an alias method sampler. This works, but if your cache hit rate gets too high (which it often will be in my workload) this doesn’t actually improve matters. Additionally, depending on how good your caching is and how much is in there, it might not be much cheaper to look up something in the cache than it is to just create a new one from scratch.

So what we’d like is a method that is significantly cheaper to initialise. Fortunately there is one, it’s called rejection sampling!

You just need to calculate \(W = \max w_i\). Now you iterate the following procedure:

Pick \(i\) uniformly at random
With probability \(\frac{w_i}{W}\) return \(i\)
Go to 1

This is very cheap to initialise but not so great in terms of run time: If the weights are uniform then it’s optimal (because you always return \(i\) in step 2), but if they’re very far from uniform then you’re going to make a lot of trips (potentially \(O(n)\) in the case where one weight dominates everything else) through that loop.

But there turns out to be a hybrid solution that works pretty well and gets us most of the cheapness of initialising rejection sampling with most of the performance of our better sampling methods, with a tuning parameter deciding how much of each we want.

Pick an integer parameter \(R > 0\) (good choices for \(R\) are probably \(1\) or \(2\)) and then run the following algorithm:

Pick \(i\) uniformly at random
With probability \(\frac{w_i}{W}\) return \(i\)
If we’ve reached this point fewer than \(R\) times, go to 1.
Look up or initialise a new sampler for these weights and draw from that.

So we perform the cheap rejection sampling until it’s obvious that it’s not helping much and then move to a proper sampling method.

For cases where we’re near the optimum case for rejection sampling we’ll do well and mostly use it, for the rest of the cases we’ll spend a bit of an extra cost. Potentially we could also do some extra book keeping - if we tend to find rejection sampling never works we could just turn off that phase of the algorithm entirely.

Experiments so far in the thing I’m implementing suggests that this idea provides a modest improvement compared to initializing the sampler (I’m actually using a data structure from Succinct Sampling from Discrete Distributions, on grounds of it being substantially cheaper to initialize at the cost of being slightly more expensive to run. Plausibly this would give a much bigger improvement if I were using the alias method).

This probably isn’t that useful a trick. The work load I’m optimising for is a bit odd, and the performance improvements under that workload are only modest. Additionally, the baseline is already fairly fast. But it’s not something I’d seen before and it seemed worth noting down.

(Something something Patreon something. It feels a bit cheap to plug it here given the relatively niche nature of this post, but hey I was doing the Hypothesis research that lead to it for free, so I guess it balances out?)