Suppose you have some regular language: that is, a language you can define through some mix of single characters, concatenation, repetition and alternation. For example, the regular expression (0|1)*2 defines the language consisting of any number of 0s and 1s followed by a single 2.
Suppose further that instead of matching you want to generate an instance of such a regular language. This can be quite useful as a building block for interesting data generators, so I’ve been looking into doing it in Hypothesis. Lucas Wiman’s revex is an example of an existing project doing something similar.
It’s easy to do this naively from our building blocks:
- To generate a single character, just return that character.
- To generate one of x|y, pick one of x or y uniformly at random and then generate from that.
- To generate xy, generate from x and y independently and concatenate the results.
- To generate from x*, pick a number of repetitions (e.g. from a geometric distribution), then generate that many instances of x and concatenate them.
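To make this concrete, here is a minimal sketch of the naive generator. It assumes a small, hypothetical AST for regular expressions built from "char", "alt", "cat" and "star" nodes; none of these names come from an existing library.

```python
import random


# Hypothetical regex AST: ("char", c), ("alt", a, b), ("cat", a, b), ("star", a).
def naive_generate(node):
    kind = node[0]
    if kind == "char":
        # A single character: just return it.
        return node[1]
    if kind == "alt":
        # x|y: pick one branch uniformly at random and generate from it.
        return naive_generate(random.choice([node[1], node[2]]))
    if kind == "cat":
        # xy: generate from each part independently and concatenate.
        return naive_generate(node[1]) + naive_generate(node[2])
    if kind == "star":
        # x*: pick a geometrically distributed number of repetitions.
        n = 0
        while random.random() < 0.5:
            n += 1
        return "".join(naive_generate(node[1]) for _ in range(n))
    raise ValueError("unknown node type %r" % (kind,))


# (0|1)*2 from the example above:
# naive_generate(("cat", ("star", ("alt", ("char", "0"), ("char", "1"))), ("char", "2")))
```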
This almost works but it has a major problem: It’s quite biased.
Consider the regular expression (00)|(1[0-9]). That is, we match either the string 00 or the string 1 followed by any digit. There are 11 strings matched by this language, but under our above model we’ll pick 00 half the time!
Additionally, it’s biased in a way that depends on the structure of our regular expression. We could just as easily have written this as (10)|((00)|(1[1-9])). Now we produce 10 half the time, 00 a quarter of the time, and one of the remaining 9 strings the remaining quarter of the time.
So how do we define a distribution that is independent of our particular choice of regular expression?
One goal that might be reasonable is to be unbiased: To produce every string in the language with equal probability. Unfortunately, this doesn’t make sense if our language is infinite: There is no uniform distribution on an infinite countable set.
We can however hope for a way that has the weaker property of being unbiased only among strings with the same length. So in the above example we’d have got a uniform distribution, but if we had, say, the language 0*1 of arbitrarily many zeros followed by a single 1, some lengths of string would be more likely than others.
There turns out to be a nice way of doing this: Boltzmann samplers (warning: the linked paper is quite information dense and I only understand about a third of it myself). The Boltzmann samplers for a language define a family of probability distributions over it, controlled by a single real-valued parameter, \(x \in [0, 1]\).
The idea is that you pick each string of length \(n\) with probability proportional to \(x^n\) (note: each string. So if there are \(k\) strings of length \(n\) then you pick some string of length \(n\) with probability proportionate to \(k x^n\)).
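Written out explicitly, the normalising constant is the sum over the whole language, so the probability of drawing a particular string \(s \in L\) is

\[ P(s) = \frac{x^{|s|}}{\sum\limits_{n=0}^{\infty} |L_n| x^n}, \]

where \(L_n\) is the set of strings in \(L\) of length \(n\). We’ll meet the sum in the denominator again shortly.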
In general there is not a defined Boltzmann sampler for every possible value of \(x\). The closer \(x\) is to \(1\), the slower the probability of picking a string drops off. If \(x = 1\) then we have a uniform distribution (so this is only possible when the language is finite). If \(x = 0\) then we can only generate the empty string. The Boltzmann sampler will be defined as long as the corresponding infinite sum converges, which will definitely happen when \(x < \frac{1}{|A|}\) (where \(A\) is the alphabet for our language) but may happen for larger \(x\) if the language is relatively constrained, and the average size of the strings will increase as \(x\) increases.
It’s not a priori obvious that simulating a Boltzmann sampler is any easier than our original problem, but it turns out to be. The reason is that for certain primitive languages the Boltzmann sampler is easy to compute (e.g. for a language consisting only of a single fixed string it just returns that string), and for certain natural operations for combining languages (especially, for our purposes, disjoint unions) we can simulate the Boltzmann sampler for the combined language using the Boltzmann samplers for the base languages.
We’ll see how this works in a moment, but first a digression.
An idea that turns out to be intimately connected to Boltzmann samplers is that of the counting generating function. If you have a language \(L\), you can define \(g_L(x) = \sum\limits_{n=0}^\infty |L_n| x^n\), where \(L_n\) is the subset of the language consisting of strings of length \(n\).
Counting generating functions have a number of important properties about how they combine when you combine languages. The one that will be most relevant for us is that if \(L\) and \(M\) are two disjoint languages then \(g_{L \cup M}(x) = g_L(x) + g_M(x)\). This is because \((L \cup M)_n = L_n \cup M_n\). Because these are disjoint this means that \(|(L \cup M)_n| = |L_n| + |M_n|\). (It’s important to note that this doesn’t work when the languages aren’t disjoint: you over-count the intersection.)
Counting generating functions will be useful because we can treat them as a sort of “mass function” that tells us how to weight our decisions: If we can go one of two ways, but one of them has a lot more strings below it, we should pick that one more often. This is the basis of the connection between counting generating functions and Boltzmann samplers.
In particular, if we have two disjoint languages \(L\) and \(M\) and we can simulate the Boltzmann sampler of parameter \(x\) for each, we can simulate the Boltzmann sampler with parameter \(x\) for \(L \cup M\) as follows: We pick \(L\) with probability \(\frac{g_L(x)}{g_L(x) + g_M(x)}\), else we pick \(M\), then we simulate the Boltzmann sampler for the language we ended up with.
This is almost exactly the same as our simulation from the beginning; we’ve just changed the weightings! It’s important to remember that this only works if \(L\) and \(M\) are disjoint, though.
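As a sketch in code: assume we have functions sample_L and sample_M that simulate the two Boltzmann samplers, and g_L and g_M that evaluate the counting generating functions (all four names are placeholders for this illustration, not an existing API).

```python
import random


def sample_union(x, g_L, g_M, sample_L, sample_M):
    """Simulate the Boltzmann sampler with parameter x for the disjoint
    union of L and M, given samplers and counting generating functions
    for L and M separately."""
    weight_L = g_L(x)
    weight_M = g_M(x)
    # Pick L with probability g_L(x) / (g_L(x) + g_M(x)), else M.
    if random.random() * (weight_L + weight_M) < weight_L:
        return sample_L(x)
    return sample_M(x)
```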
This property, it turns out, is enough to compute a Boltzmann sampler for any regular language.
The key idea is this: If you compile a regular language to a deterministic finite automaton (DFA), you can now look at the languages matched from each state in that DFA. Each of these languages can be represented as a disjoint union of other languages, which gives us a system of equations for the counting generating functions that we can just solve.
If \(L[i]\) is the language matched when starting from state i, then \(L[i] = \bigcup\limits_{c \in A} c L[\tau(i, c)] \cup E[i]\), where \(\tau\) is the transition function and \(E[i]\) is the language that just matches the empty string if \(i\) is an accepting state, else is the empty language.
That is, every string matched when starting from \(i\) is either the empty string, or it is a single character prepended to a string from the language you get starting from the state that character causes a transition to. This is a disjoint union! Given a string in the union it is either empty, or you can look at its first character to identify which one of the branches it belongs to.
This means we can calculate the generating function as \(g_{L[i]}(x) = x \sum\limits_{c \in A} g_{L[\tau(i, c)]}(x) + \epsilon[i]\), where \(\epsilon[i]\) is \(1\) if this is an accepting state or \(0\) otherwise.
(The extra factor of \(x\) comes from the fact that we’ve prepended an extra character, so every string in that language has its length increased by \(1\).)
But this is a linear system of equations in the generating functions! There’s that unknown variable \(x\), but you can just treat that as meaning it’s a linear system of equations whose parameters live in the field of rational functions.
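Concretely, writing \(z\) for the formal variable as the code below does, the per-state equations collect into the matrix form

\[ (I - z T)\, g(z) = \epsilon, \]

where \(T\) is the matrix whose \((i, j)\) entry counts the characters \(c\) with \(\tau(i, c) = j\), \(g(z)\) is the vector of counting generating functions \(g_{L[i]}(z)\), and \(\epsilon\) is the 0/1 vector of accepting states. This is exactly the system that code constructs and solves.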
In particular, and importantly, we can use sympy to solve this rather than having to write our own linear solving routines (we can’t use a standard numpy-esque solver because of the symbolic parameter):
```python
import sympy
from sympy import SparseMatrix


def compute_generating_functions(accepting, transitions):
    """Solve for the counting generating function of the language matched
    starting from each state of the DFA.

    accepting: list of booleans, one per state.
    transitions: list of dicts mapping a character to its target state.
    """
    assert len(accepting) == len(transitions)
    n = len(accepting)
    z = sympy.Symbol('z', real=True)
    # Build the matrix I - z * T, where T[i, j] counts the transitions
    # from state i to state j.
    weights = {}
    for i in range(n):
        weights[(i, i)] = 1
        for _, j in transitions[i].items():
            key = (i, j)
            weights[key] = weights.get(key, 0) - z
    matrix = SparseMatrix(n, n, weights)
    # The right hand side is 1 for accepting states and 0 otherwise.
    vector = sympy.Matrix(n, 1, list(map(int, accepting)))
    # Solve (I - z * T) g = epsilon for the vector of generating functions.
    return z, matrix.LUsolve(vector)
```
This simultaneously solves the linear equations for all of the states of the DFA. So a result looks like this:
```python
In []: rd.compute_generating_functions([False, False, True], [{0: 1}, {0: 2}, {0: 2}])
Out[]:
(z, Matrix([
[z**2/(-z + 1)],
[   z/(-z + 1)],
[   1/(-z + 1)]]))
```
This calculates the generating functions for the language that matches any string of two or more 0 bytes.
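As a quick sanity check, the entry for the start state expands as \(\frac{z^2}{1 - z} = z^2 + z^3 + z^4 + \ldots\): exactly one string of each length at least two and none shorter, which is what we’d expect for that language.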
Now that we’ve calculated the generating functions, we can randomly walk the DFA to run the desired Boltzmann sampler! The counting generating function provides the weights for our random walks, and moving to one of the languages in our disjoint union is precisely either stopping the walk or transitioning to another state.
We start at the initial state, and at each stage we choose between one of \(|A| + 1\) actions: We can either emit a single character and transition to the corresponding new state, or we can stop. Each of these is weighted by the generating function of the corresponding language applied to our Boltzmann sampling parameter: The option “halt now” is given weight \(1\) if the current state is accepting (and isn’t available otherwise), and the action “emit character c” is given weight \(x g_{L[\tau(i, c)]}(x)\) – i.e. \(x\) times the state weight we’ve already calculated for the target state.
This is just a discrete distribution over a finite number of possibilities, so we can use the alias method to build a sampler for it.
Thus our algorithm becomes:
- Build a DFA for the regular expression
- Use sympy to calculate the counting generating function for the language matched starting from each state in our DFA
- Pick a Boltzmann parameter
- Calculate all the state weights
- For each state (possibly lazily on demand) initialize an alias sampler for the actions that can be taken at that state
- Repeatedly draw from the sampler, getting either a stop action or an emit-character action. When we emit a character, make the corresponding state transition.
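Here is a rough sketch of steps 3 to 6, reusing compute_generating_functions from above. It uses a plain weighted choice (random.choices) rather than a pre-built alias sampler, purely to keep the example short, and boltzmann_walk is my own placeholder name rather than part of any existing library.

```python
import random


def boltzmann_walk(accepting, transitions, parameter):
    """Generate one string from the DFA's language by simulating the
    Boltzmann sampler with the given parameter (assumed to lie within
    the region of convergence)."""
    z, gfs = compute_generating_functions(accepting, transitions)
    # Evaluate each state's generating function at the chosen parameter.
    state_weights = [float(gf.subs(z, parameter)) for gf in gfs]
    result = []
    state = 0
    while True:
        actions = []
        weights = []
        if accepting[state]:
            # "Halt now" has weight 1 and is only available in accepting states.
            actions.append(None)
            weights.append(1.0)
        for c, j in transitions[state].items():
            # "Emit c and move to state j" has weight x * weight of state j.
            actions.append((c, j))
            weights.append(parameter * state_weights[j])
        choice = random.choices(actions, weights=weights)[0]
        if choice is None:
            return result
        c, state = choice
        result.append(c)


# e.g. boltzmann_walk([False, False, True], [{0: 1}, {0: 2}, {0: 2}], 0.5)
# returns a list of two or more 0s.
```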
I’m not going to claim that this approach is efficient by any stretch. The DFA may be exponential in the size of our starting regular expression, and we’re then solving an \(n \times n\) linear equation on top of that. However for a lot of languages the resulting DFA won’t be too large, and this approach is fairly reasonable. It’s also nice as an existence proof.
Additionally, the first five steps only need to be done once per choice of parameter, and the first two don’t depend on the parameter at all, so they can be cached and reused across parameter values. Only the final step needs to be redone for each string we generate, and it at least is generally very fast – it’s O(n) in the length of the generated string, with pretty good constant factors, as drawing from a pre-built alias sampler is very fast. You can also initialise the alias samplers lazily as you walk the graph.
If you want to see all of this fitting together, I created a small project with these concepts in it. It’s far from production ready, and is only lightly tested, but it seems to work pretty well based on initial experimentation and should be enough to understand the ideas.
(And if you want to see more posts like this, why not support my blog on Patreon?)