Category Archives: Python

Linear time (ish) test case reduction

A problem I’m quite interested in, primarily from my work on Hypothesis, is test case reduction: Taking an example that produces a bug, and producing a smaller version that triggers the same bug.

Typically a “test-case” here means a file that when fed to a program triggers a crash or some other wrong behaviour. In the abstract, I usually think of sequences as the prototypical target for test case reduction. You can view a file as a sequence in a number of usefully different ways – e.g. as bytes, as lines, etc. and they all are amenable to broadly similar algorithms.

The seminal work on test-case reduction is Zeller and Hildebrant’s “Simplifying and Isolating Failure-Inducing Input“, in which they introduce delta debugging. Delta-debugging is essentially an optimisation to the greedy algorithm which removes one item at a time from the sequence and sees if that still triggers the bug. It repeatedly applies this greedy algorithm to increasingly fine partitions of the test case, until the partition is maximally fine.

Unfortunately the greedy algorithm as described in their paper, and as widely implemented, contains an accidentally quadratic bug. This problem is not present in the reference implementation, but it is present in many implementations of test case reduction found in the wild, including Berkeley delta and QuickCheck. Hypothesis gets this right, and has for so long that I forgot until quite recently that this wasn’t more widely known.

To see the problem, lets look at a concrete implementation. In Python, the normal implementation of the greedy algorithm looks something like this:

def greedy(test_case, predicate):
    while True:
        for i in range(len(test_case)):
           attempt = list(test_case)
           del attempt[i]
           if predicate(attempt):
               test_case = attempt
    return test_case

We try deleting each index in turn. If that succeeds, we start again. If not, we’re done: No single item can be deleted from the list.

But consider what happens if our list is, say, the numbers from 1 through 10, and we succeed at deleting a number from the list if and only if it’s even.

When we run this algorithm we try the following sequence of deletes:

  • delete 1 (fail)
  • delete 2 (succeed)
  • delete 1 (fail)
  • delete 3 (fail)
  • delete 4 (succeed)
  • delete 1 (fail)

Every time we succeed in deleting an element, we start again from scratch. As a result, we have a classic accidentally quadratic bug.

This is easy to avoid though. Instead of starting again from scratch, we continue at our current index:

def greedy(test_case, predicate):
    deleted = True
    while deleted:
        deleted = False
        i = 0
        while i <  len(test_case):
           attempt = list(test_case)
           del attempt[i]
           if predicate(attempt):
               test_case = attempt
               deleted = True
               i += 1
    return test_case

At each stage we either successfully reduce the size of the test case by 1 or we advance the loop counter by one, so this loop makes progress. It stops in the same case as before (when we make it all the way through with no deletions), so it still achieves the same minimality guarantee that we can’t delete any single element.

The reason this is only linear time “ish” is that you might need to make up to n iterations of the loop (this is also true with the original algorithm) because deleting an element might unlock another element. Consider e.g. the predicate that our sequence must consist of the numbers 1, …, k for some k – we can only every delete the last element, so we must make k passes through the list. Additionally, if we consider the opposite where it must be the numbers from k to n for some k, this algorithm is quadratic while the original one is linear!

I believe the quadratic case to be essential, and it’s certainly essential if you consider only algorithms that are allowed to delete at most one element at a time (just always make the last one they try in any given sequence succeed), but anecdotally most test cases found in the wild don’t have nearly this high a degree of dependency among them, and indexes that previously failed to delete tend to continue to fail to delete.

A model with a strong version of this as its core assumption shows up the complexity difference: Suppose you have \(k\) indices out of \(n\) which are essential, and every other index can be deleted. Then this algorithm will always run in \(n + k\) steps (\(n\) steps through the list the first time, \(k\) steps at the end to verify), while the classic greedy algorithm will always run in \(O(k^2 + n)\) steps.

Although this model is overly simplistic, anecdotally examples found in the wild tend to have something like this which acts as an “envelope” of the test case – there’s some large easily deletable set and a small essential core. Once you’re down to the core you have to do more work to find deletions, but getting down to the core is often the time consuming part of the process.

As a result I would be very surprised to find any instances in the wild where switching to this version of the algorithm was a net loss.

This entry was posted in programming, Python on by .

A worked example of designing an unusual data structure

Due to reasons, I found myself in need of a data structure supporting a slightly unusual combination of operations. Developing it involved a fairly straightforward process of refinement, and a number of interesting tricks, so I thought it might be informative to people to walk through (a somewhat stylised version of) the design.

The data structure is a particular type of random sampler, starting from a shared array of values (possibly containing duplicates). Values are hashable and comparable for equality.

It needs to support the following operations:

  1. Initialise from a random number generator and a shared immutable array of values so that it holds all those values.
  2. Sample an element uniformly at random from the remaining values, or raise an error if there are none.
  3. Unconditionally (i.e. without checking whether it’s present) remove all instances of a value from the list.

The actual data structure I want is a bit more complicated than that, but those are enough to demonstrate the basic principles.

What’s surprising is that you can do all of these operations in amortised O(1). This includes the initialisation from a list of n values!

The idea behind designing this is to start with the most natural data structure that doesn’t achieve these bounds and then try to refine it until it does. That data structure is a resizable array. You can sample uniformly by just picking an index into the array. You can delete by doing a scan and deleting the first element that is equal to the value. This means you have to be able to mutate the array, so initalising it requires copying.

Which means it’s time for some code.

Let’s start by writing some code.

First lets write a test suite for this data structure:

from collections import Counter
from hypothesis.stateful import RuleBasedStateMachine, rule, precondition
import hypothesis.strategies as st
from sampler import Sampler
class FakeRandom(object):
    def __init__(self, data):
        self.__data = data
    def randint(self, m, n):
        return self.__data.draw(st.integers(m, n), label="randint(%d, %d)" % (
            m, n
class SamplerRules(RuleBasedStateMachine):
    def __init__(self):
        super(SamplerRules, self).__init__()
        self.__initialized = False
    @precondition(lambda self: not self.__initialized)
    def initialize(self, values, data):
        self.__initial_values = values
        self.__sampler = Sampler(values, FakeRandom(data))
        self.__counts = Counter(values)
        self.__initialized = True
    @precondition(lambda self: self.__initialized)
    def sample(self):
        if sum(self.__counts.values()) != 0:
            v = self.__sampler.sample()
            assert self.__counts[v] != 0
                assert False, "Should have raised"
            except IndexError:
    @precondition(lambda self: self.__initialized)
    def remove(self, data):
        v = data.draw(st.sampled_from(self.__initial_values))
        self.__counts[v] = 0
TestSampler = SamplerRules.TestCase

This uses Hypothesis’s rule based stateful testing to completely describe the valid range of behaviour of the data structure. There are a number of interesting and possibly non-obvious details in there, but this is a data structures post rather than a Hypothesis post, so I’m just going to gloss over them and invite you to peruse the tests in more detail at your leisure if you’re interested.

Now lets look at an implementation of this, using the approach described above:

class Sampler(object):
    def __init__(self, values, random):
        self.__values = list(values)
        self.__random = random
    def sample(self):
        if not self.__values:
            raise IndexError("Cannot sample from empty list")
        i = self.__random.randint(0, len(self.__values) - 1)
        return self.__values[i]
    def remove(self, value):
        self.__values = [v for v in self.__values if v != value]

The test suite passes, so we’ve successfully implemented the operations (or our bugs are too subtle for Hypothesis to find in a couple seconds). Hurrah.

But we’ve not really achieved our goals: Sampling is O(1), sure, but remove and initialisation are both O(n). How can we fix that?

The idea is to incrementally patch up this data structure by finding the things that make it O(n) and seeing if we can defer the cost for each element until we actually definitely need to incur that cost to get the correct result.

Let’s start by fixing removal.

The first key observation is that if we don’t care about the order of values in a list (which we don’t because we only access it through random sampling), we can remove the element present at an index in O(1) by popping the element that is at the end of the list and overwriting the index with that value (if it wasn’t the last index). This is the approach normally taken if you want to implement random sampling without replacement, but in our use case we’ve separated removal from sampling so it’s not quite so easy.

The problem is that we don’t know where (or even if) the value we want to delete is in our array, so we still have to do an O(n) scan to find it.

One solution to this problem (which is an entirely valid one) is to have a mapping of values to the indexes they are found in. This is a little tricky to get right with duplicates, but it’s an entirely workable solution. It makes it much harder to do our O(1) initialize later though, so we’ll not go down this route.

Instead the idea is to defer the deletion until we know of an index for it, which we can do during our sampling! We keep a count of how many times a value has been deleted and, if we end up sampling it and the count is non-zero, we remove it from the list and decrement the count by one.

This means that we potentially pay an additional O(deletions) cost each time we sample, but these costs are “queued up” from the previous delete calls, and once incurred do not repeat, so this doesn’t break our claim of O(1) amortised time – the costs we pay on sampling are just one-off deferred costs from earlier.

Here’s some code:

class Sampler(object):
    def __init__(self, values, random):
        self.__values = list(values)
        self.__random = random
        self.__deletions = set()
    def sample(self):
        while True:
            if not self.__values:
                raise IndexError("Cannot sample from empty list")
            i = self.__random.randint(0, len(self.__values) - 1)
            v = self.__values[i]
            if v in self.__deletions:
                replacement = self.__values.pop()
                if i != len(self.__values):
                    self.__values[i] = replacement
                return v
    def remove(self, value):

So now we’re almost done. All we have to do is figure out some way to create a mutable version of our immutable list in O(1).

This sounds impossible but turns out to be surprisingly easy.

The idea is to create a mask in front of our immutable sequence, which tracks a length and a mapping of indices to values. Whenever we write to the mutable “copy” we write to the mask. Whenever we read from the copy, we first check that it’s in bounds and if it is we read from the mask. If the index is not present in the mask we read from the original sequence.

The result is essentially a sort of very fine grained copy on write – we never have to look at the whole sequence, only the bits that we are reading from, so we never have to do O(n) work.

Here’s some code:

from collections import Counter
class Sampler(object):
    def __init__(self, values, random):
        self.__values = values
        self.__length = len(values)
        self.__mask = {}
        self.__random = random
        self.__deletions = set()
    def sample(self):
        while True:
            if not self.__length:
                raise IndexError("Cannot sample from empty list")
            i = self.__random.randint(0, self.__length - 1)
                v = self.__mask[i]
            except KeyError:
                v = self.__values[i]
            if v in  self.__deletions:
                j = self.__length - 1
                    replacement = self.__mask.pop(j)
                except KeyError:
                    replacement = self.__values[j]
                self.__length = j
                if i != j:
                    self.__mask[i] = replacement
                return v
    def remove(self, value):

And that’s it, we’re done!

There are more optimisations we could do here – e.g. the masking trick is relatively expensive, so it might make sense to switch back to a mutable array once we’ve masked off the entirety of the array, e.g. using a representation akin to the pypy dict implementation and throwing away the hash table part when the value array is of full length.

But that would only improve the constants (you can’t get better than O(1) asymptotically!), so I’d be loathe to take on the increased complexity until I saw a real world workload where this was the bottleneck (which I’m expecting to at some point if this idea bears fruit, but I don’t yet know if it will). We’ve got the asymptotics I wanted, so lets stop there while the implementation is fairly simple.

I’ve yet to actually use this in practice, but I’m still really pleased with the design of this thing.  Starting from a fairly naive initial implementation, we’ve used some fairly generic tricks to patch up what started out as O(n) costs and turn them O(1). As well as everything dropping out nicely, a lot of these techniques are probably reusable for other things (the masking trick in particular is highly generic).

Update 09/4/2017: An earlier version of this claimed that this solution allowed you to remove a single instance of a value from the list. I’ve updated it to a version that removes all values from a list, due to a comment below correctly pointing out that that approach biases the distribution. Fortunately for me in my original use case the values are all distinct anyway so the distinction doesn’t matter, but I’ve now updated the post and the code to remove all instances of the value from the list.

Do you like data structures? Of course you do! Who doesn’t like data structures? Would you like more data structures? Naturally. So why not sign up for my Patreon and tell me so, so you can get more exciting blog posts like this! You’ll get access to drafts of upcoming blog posts, a slightly increased blogging rate from me, and the warm fuzzy feeling of supporting someone whose writing you enjoy.

This entry was posted in programming, Python on by .

My time: Now available to the highest bidder

This is an experiment.

Would you like me to do some work for you? Specifically, one day worth of work?

Well, that option is of course always open to you and you can email me to talk about it. But maybe you’ve been worried that I’m too busy, too expensive, etc. Or maybe you just haven’t got around to it.

So I’m going to try an experiment: I am auctioning off one work day of my time next month to the highest bidder. It can be for anything you like – Hypothesis, writing, having someone come in to take a look at your business and give you some pointers on your software development, or really anything else at all. This can be more or less whatever you want as long as it’s something I’m in principle willing to do for money (and if it’s paying me to do a concrete thing like write some software or an article rather than general consulting, I’m happy for this to include rights assignment).

Why am I doing this? The prompting thought was that I’m looking into ways to fill precisely one day of my time per month once I start doing my PhD, which will hopefully be in September – PhD stipends are a lot lower than developer salaries or day rates, so a top up would be nice, but I don’t want to do my PhD part time.

But once I thought about it I realised that it just seemed like a really fun experiment to try, so I figured I might as well try it now! Worst case scenario, we all learn something interesting and the cost of a day of my time.

Here are the rules:

  1. You are getting one day of work. If I think what you have asked for cannot be done in one day I will tell you up front, but if it takes longer than expected then I’m not going to continue working on it for free. I recommend picking things that either are intrinsically time bounded or can be stopped at any time and still be useful, but small projects that I’m confident can be done in a day are OK too.
  2. The auction will be a Vickrey auction. That is, the highest bidder wins the auction and pays the second highest bidder’s bid. That means that you will usually pay less than your bid (And also means that the game theoretically correct amount for you to bid is exactly the amount my time is worth to you).
  3. If there is only one bidder then that means they get a day of my work for free.
  4. The auction will run until midnight BST (UTC + 1) on April 30th. No bids made after that point will be counted.
  5. If there is a tie for top place then I will pick the one I most want to do among them (or flip a coin or something if I’m torn) and charge the full bid.
  6. If I judge a job to be illegal, unethical, or I just really don’t want to do it for any reason then I will remove it from the list and not count it for either the top or the second place bid. I will let you know if I do this if you would have been the first or second place bid and I don’t think you’re being deliberately abusive.
  7. I will publish both the winning bid and what they actually paid (and, if it is acceptable to the winner, who won) once the auction completes.
  8. The day of work must be redeemed at some point in May, after which I will send you an invoice. If we can’t come to a mutually acceptable time in May I will rerun the auction without your bid and offer the spot to the new winner and you won’t get charged anything.
  9. If the job cannot be done remotely I will also invoice you for travel (and, if necessary, accommodation) on top of the bid.

Note that yes you can bid ridiculously small amounts, or even zero, and if that’s the winning bid then I will honour it (though I will probably not repeat the experiment for very long if that happens). You can also bid ridiculously high amounts and hope that the next best bid isn’t ridiculous, but I don’t recommend it (and may query your willingness to pay before accepting…). The winning strategy really is to bid what you think my time is worth to you.

Interested? Just fill out this form! I will let you know if you win the bid, and we can then arrange dates accordingly.

If you’re still on the fence, here are some ideas for how a day of my time can be useful to you:

  1. I can write a short article (or story if you prefer!) about something I already know a reasonable amount about. Very few of my blog posts take more than a day of work.
  2. I can read a paper and discuss it with you and/or try to write a short precis and introduction to its ideas (whether this is feasible will vary based on the paper, but I can get a pretty good idea if it’s going to be with a brief skim).
  3. I can add tests using Hypothesis to a small Python project.
  4. If working with someone who knows that project well, I can add tests using Hypothesis to a larger Python project.
  5. I can come in to your company and answer peoples’ questions on a subject of your choosing, which will probably be Hypothesis, but could be Integer Linear Programming, Voting Theory, or general software development.
  6. I can come in to your company and take a look at your development practices and processes and offer you tips on how to improve them.
  7. I can come in to your company and help with improving your interviewing process.

Those are just some ideas though! The point is that I have a variety of expertise, a lot of which can be usefully applied in under a day if you pick a reasonably small project or just use me as a general consultant.

Note that if you do have any general desire to bring me in as a consultant you should put in a bid, because you will almost certainly get a very good deal on this auction compared to my normal rates.

This entry was posted in programming, Python, Work on by .

The Easy Way to Compile Automata to Regular Expressions

I have a confession: despite knowing a reasonably large amount about formal language theory and automata, I’ve never bothered to learn how to prove how to turn a finite automaton back into a regular expression.

My attitude has always been that I have a vague sense of how a proof would go and couldn’t be bothered to sort out the details. I’ve probably skimmed a proof in the past but it looked complicated and uninformative so I didn’t bother.

Due to reasons (a certain theorem about how you could represent regular expressions in a particular way required the details of a proof) I found I was forced to care recently, so I decided to actually sit down and figure it out.

The standard proofs did indeed seem complicated and uninformative, but fortunately it turns out that the general method of understanding regular language theory fundamentals applies in this case.

That method, of course, being to ask “What Would Janusz Brzozowski Do?”.

Starting from a Nondeterministic Finite Automaton (without \(\epsilon\)-transitions) there turns out to be a natural algebraic process which is fairly reminiscent of Gaussian Elimination that results in a fairly small regular expression matching the language.

The idea is this: Suppose we have a non-deterministic automaton without \(\epsilon\)-transitions. Label the states \(0, \ldots, n\). We consider the languages \(R_i\) of strings matched when starting from state \(i\).

By looking at the automaton we get a set of algebraic equations for each of these, which we can progressively rewrite and simplify to remove loops in them (e.g. where \(R_i\) is defined self-referentially, or the definitions of \(R_i\) and \(R_j\) depend on each other). Once we’ve successfully done that, we can just substitute in terms until we get an expression for \(R_0\), which will be our final result.

Note: In what follows we’ll adopt the usual convention of equating regular expressions with their regular language. Variables which correspond to a language will be upper case, ones corresponding to an expression will be lower-case.

Let \(s_{ij}\) be a regular expression matching the language of strings that cause a tradition from state \(i\) to \(j\). This is a language consisting of strings of length \(1\) – if the character \(c\) causes a transition from \(i\) to \(j\) then \(c \in s_{ij}\). If there are no transitions from \(i\) to \(j\) then \(s_{ij}\) is \(\phi\), the regular expression matching no strings, otherwise it is a union of finitely many basic regular expressions.

Note that in the general case there may be strings matched by both \(s_{ij}\) and \(s_{ik}\) with \(j \neq k\). This is because the automaton is non-deterministic. In a deterministic automaton the expressions \(s_{ij}, s_{ik}\) with \(i \neq k\) will have no common matches.

Let \(t_i\) be \(\phi\) if state \(i\) is not accepting and \(\epsilon\) if it is. i.e. \(t_i\) is a regular expression matching the set of empty strings matched from state \(i\).

Now, we have that \(R_i = t_i \cup \bigcup\limits_j s_{ij} R_j\) – a string in \(R_i\) is either the empty string (if \(i\) is accepting), or the result of transitioning to another state and matching from there.

We’ll now begin the process of substituting in terms to remove cycles.

The key idea is to find \(t_i’, s_{ij}’\) such that whenever \(j \leq i\), \(s_{ij}’ = \phi\). This means there can be no cycles because there are no backwards or self-references between the expressions.

It will no longer be the case that \(s_{ij}’\) only matches strings of length \(1\), but it will be the case that they don’t match the empty string (this is important).

How do we construct these modified terms? By induction!

Suppose we’ve constructed \(t_i\) and \(s_{ij}\) for \(i < k\). If we just substitute in and rearrange all of those terms into \(R_k = t_k \cup \bigcup\limits_j s_{kj} R_j\) then we almost get what we want: The only terms that can be non-empty are the ones where \(j \geq k\), because we’ve eliminated all the terms that were \(< k\) and replaced them with terms that are \(\geq k\).

So all we need to do is figure out how to remove the \(R_k\) term.

Fortunately, there’s a great tool for doing that. it’s called Arden’s theorem:

If we have languages \(X, A, B\) with \(\epsilon \not\in A\) and\(X = AX \cup B\), then \(X = A* B\) .

Conceptually this is fairly intuitive because you can just rerwite it as \(X = A(AX \cup B) \cup B\) and continue this infinitely to get \(X = B \cup AB \cup AAB \cup \ldots\), which is precisely \(X = A*B\). However this intuition breaks down somewhat – the condition that \(A\) does not contain the empty string is essential, because if we picked \(A = \{\epsilon\}\) and \(B = \emptyset\} then this reduces to \(X = X\), which every language satisfies.

The problem is essentially that the term in the \(\ldots\) “does not converge” in this case, so we need to do some more work to actually prove this rigorously.


First note that \(A*B\) is a possible solution for \(X\) because \(A* =  (A A*) \cup \epsilon\), so  \(A* B = (A A* B) \cup \epsilon B = A (A* B) \cup B\).

Now note that any language satisfying the equation has the following property: \(x \in X\) if and only if either \(x \in B\) or \(x = uv\) with \(u \in A\) and \(v \in X\).

This is simply by the definition: If \(x \not\in B\) then \(x \in AX\), so \(x\) starts with a string of \(A\), which is non-empty by assumption. Conversely, if \(x \in B\) then \(x \in X\) by definition, and if \(x = uv\) with \(u \in A\) and \(v \in L\).

Now, suppose we have two non-identical solutions to the equation. Say \(L\) and \(M\). Suppose there exists \(x \in L \setminus M\). Pick \(x\) so that it has minimal length among such words..

Then certainly \(x \not\in B\) or it would be in both, so by assumption we can write \(x = uv\) with \(u \in A\) and \(v \in L\).

But then we must have \(v \in M\), by the minimality of \(x\). But if \(v \in M\) then necessarily \(uv = x \in M\). This contradicts the assumption. Therefore no such \(x\) exists, and the two languages are equal.


So it’s important that all the coefficients on the right hand side don’t match the empty string, but this is again true inductively: The initial \(s_{ij}\) do not match the empty string, and every coefficient is either a concatenation or a union of two languages that don’t match the empty string, so in turn does not match the empty string.

This means that Arden’s theorem gives us the tool we need to remove the coefficient for \(R_k\) on the right hand: All we need to do is to prepend the star of that coefficient to the other terms and remove \(R_k\) from the equation.

This completes our inductive step, which in turn completes our algorithm: We run the induction over the whole set of equations, we now no longer have loops, and we can just substitute back in to get \(R_0\) as desired.

Let’s work through an example now to clear up the details.

Suppose we have the states 0, 1, 2. Only state 2 is accepting. Our characters are a, b and c. The allowed transitions are:

  • From 0: \(a, b \to 1\).
  • From 1: \(a \to 1, b \to 2\)
  • From 2: \(c \to 0, 1\).

This gives us the following three initial equations:

  • \(R_0 = (a | b) R_1\)
  • \(R_1 =a R_1 \cup b R_2\)
  • \(R_2 = \epsilon \cup c R_0 \cup c R_1\)

We now perform substitutions as follows:

  1. Our equation for \(R_0\) is in the desired form already so nothing needs to be done.
  2. Our equation for \(R_1\) has \(R_1\) on the right hand side, so we use Arden’s theorem to rewrite it as \(R_1 = a* b R_2\). It is now in the desired form.
  3. We substitute in our previous equation for \(R_0\) and get \(R_2 = \epsilon \cup c (a|b) R_1 \cup c R_1 = \epsilon \cup c(a|b|\epsilon) R_1\).
  4. We substitute in our previous equation for \(R_1\) and get  \(R_2  = \epsilon \cup c(a|b|\epsilon) a* b R_2 \).
  5. We apply Arden’s theorem one final time and get \(R_2 = (c(a|b|\epsilon) a* b)* | \epsilon = (c(a|b|\epsilon) a* b)*\).
  6. We now substitute this into our equation for \(R_1\) and get \(R_1 = a* b (c(a|b|\epsilon) a* b)*\)
  7. We now substitute this into our equation for \(R_0\) and get  \(R_0 = (a | b) a* b (c(a|b|\epsilon) a* b)*\)

We now have a (not all that pleasant) regular expression matching our original deterministic finite automaton, as desired.

In general I’m unclear on whether this is the “best” method for doing this. The above explanation is a lightly modified version of the one presented here, which compares it with several other methods, where it seems to come out ahead of the others.

It certainly isn’t optimal, in the sense that it doesn’t always produce a minimal regular expression. However, neither do any of the other standard methods. This seems to be a hard problem, and there aren’t as far as I know any good algorithms for doing it.

However, regardless of whether it’s the best, it is certainly the one that seems clearest to me: Once you have the key mechanism of Arden’s theorem, you just perform simple algebraic manipulation and the result pops out at the end.

If you want to see this working in practice, I’ve rather arbitrarily added in some code for doing this (albeit only with deterministic finite automata. The principle is the same though) to my FALBS project, which has basically become a dumping ground for all my formal languages code, along with a test that this does the right thing.

(Want more posts like this? Why not support my Patreon! It will cause me to write more, give you some influence over what I write, and give you access to early drafts of upcoming posts).

This entry was posted in Automata Theory, programming, Python on by .

Fully Automated Luxury Boltzmann Sampling for Regular Languages

Suppose you have some regular language. That is, some language you can define through some mix of single characters, concatenation, repetition and alternation. e.g. the language defined by the regular expression (0|1)*2 which consists of any number of 0s and 1s followed by a single 2.

Suppose further instead of matching you want to generate an instance of such a regular language. This can be quite useful as a building block for building interesting data generators, so I’ve been looking into doing that in Hypothesis. Lucas Wiman’s revex is an example of an existing project doing similarly.

It’s easy to do this naively from our building blocks:

  • To generate a single character, just return that character.
  • To generate one of x|y, pick one of x or y uniformly at random and then generate from that.
  • To generate xy, generate from x and y independently and concatenate the results.
  • To generate from x* pick the number of repetitions (e.g. as a geometric distribution) and then generate that many instances and concatenate them.

This almost works but it has a major problem: It’s quite biased.

Consider the regular expression (00)|(1[0-9]). That is, we match either the string 00 or the string 1 followed by any digit. There are 11 strings matched by this language, but under our above model we’ll pick 00 half the time!

Additionally, it’s biased in a way that depends on the structure of our regular expression. We could just have easily have written this as (10)|((00)|(1[1-9)). Now we produce 10 half the time, 00 a quarter of the time, and one of the remaining 9 strings the remaining quarter of the time.

So how do we define a distribution that is independent of our particular choice of regular expression?

One goal that might be reasonable is to be unbiased: To produce every string in the language with equal probability. Unfortunately, this doesn’t make sense if our language is infinite: There is no uniform distribution on an infinite countable set.

We can however hope for a way that has the weaker property of being unbiased only among strings with the same length. So in the above example we’d have got a uniform distribution, but if we had say the language 0*1 of arbitrarily many zeros followed by a single 1, some lengths of string would be more likely than others.

There turns out to be a nice way of doing this!  They’re called Boltzmann Samplers (warning: The linked paper is quite information dense and I only understand about a third of it myself). The Boltzmann samplers for a language define a family of probability distributions over it, controlled by a single real valued parameter, \(x \in [0, 1]\).

The idea is that you pick each string of length \(n\) with probability proportional to \(x^n\) (note: each string. So if there are \(k\) strings of length \(n\) then you pick some string of length \(n\) with probability proportionate to \(k x^n\)).

In general there is not a defined Boltzmann sampler for every possible value of \(x\). The closer \(x\) is to \(1\), the slower the probability of picking a string drops off. If \(x = 1\) then we have a uniform distribution (so this is only possible when the language is finite). If \(x = 0\) then we can only generate the empty string. The Boltzmann sampler will be defined as long as the corresponding infinite sum converges, which will definitely happen when \(x < \frac{1}{|A|}\) (where \(A\) is the alphabet for our language) but may happen for larger \(x\) if the language is relatively constrained, and the average size of the strings will increase as \(x\) increases.

It’s not a priori obvious that simulating a Boltzmann sampler is any easier than our original problem, but it turns out to be. The reason is that for certain primitive languages the Boltzmann sampler is easy to compute (e.g. for a language consisting only of a single fixed string it just returns that string), and for certain natural operations for combining languages (especially for our purposes disjoint unions) we can then simulate the Boltzmann sampler using the simulated Boltzmann samplers for the base languages.

We’ll see how this works in a moment, but first a digression.

An idea that turns out to be intimately connected to Boltzmann samplers is that of the counting generating function. If you have a language \(L\), you can define \(g_L(x) = \sum\limits_{n=0}^\infty |L_n| x^n\), where \(L_n\) is the subset of the language consisting of strings.

Counting generating functions have a number of important properties about how they combine when you combine languages. The one that will be most relevant for us is that if \(L\) and \(M\) are two disjoint languages then \(g_{L \cup M}(x) = g_L(x) + g_M(x)\). This is because \((L \cup M)_n = L_n \cup L_m\). Because these are disjoint this means that \(|(L \cup M)_n| = |L_n| + |M_n|\). (It’s important to note that this doesn’t work when the languages aren’t disjoint: You over-count the intersection).

Counting generating functions will be useful because we can treat them as a sort of “mass function” that tells us how to weight our decisions: If we can go one of two ways, but one of them has a lot more strings below it, we should pick that one more often. This is the basis of the connection between counting generating functions and Boltzmann samplers.

In particular, if we have two disjoint languages \(L\) and \(M\) and we can simulate the Boltzmann sampler of parameter \(x\) for each, we can simulate the Boltzmann sampler with parameter \(x\) for \(L \cup M\) as follows: We pick \(L\) with probability \(\frac{g_L(x)}{g_L(x) + g_M(x)}\), else we pick \(M\), then we simulate the Boltzmann sampler for the language we ended up with.

This is almost exactly the same as our simulation from the beginnings, we’ve just changed the weightings! It’s important to remember that this only works if \(L\) and \(M\) are disjoint though.

This property, it turns out, is enough to compute a Boltzmann sampler for any regular language.

The key idea is this: If you compile a regular language to a deterministic finite automaton (DFA), you can now look at the languages matched from each state in that DFA. Each of these languages can be represented as a disjoint union of other languages, which gives us a system of equations for the counting generating functions that we can just solve.

If \(L[i]\) is the language matched when starting from state i, then \(L[i] = \bigcup\limits_{c \in A}  c L[\tau(i, c)] \cup E[i]\), where \(\tau\) is the transition function and \(E[i]\) is the language that just matches the empty string if \(i\) is an accepting state, else is the empty language.

That is, every string matched when starting from \(i\) is either the empty string, or it is a single character prepended to a string from the language you get starting from the state that character causes a transition to. This is a disjoint union! Given a string in the union it is either empty, or you can look at its first character to identify which one of the branches it belongs to.

This means we can calculate the generating function as \(g_{L[i]}(x) = x \sum\limits_{c \in A} g_{L[\tau(i, c)]} + \epsilon[i]\), where \(\epsilon[i]\) is \(1\) if this is an accepting state or \(0\) otherwise.

(The extra factor of \(x\) comes from the fact that we’ve added an extra character to the beginning language, so every string has its length increased by \(1\)).

But this is a linear system of equations in the generating functions! There’s that unknown variable \(x\), but you can just treat that as meaning it’s a linear system of equations whose parameters live in the field of rational functions.

In particular, and importantly, we can use sympy to solve this rather than having to write our own linear solving routines (we can’t use a standard numpy-esque solver because of the symbolic parameter):

def compute_generating_functions(accepting, transitions):
    assert len(accepting) == len(transitions)
    n = len(accepting)
    z = sympy.Symbol('z', real=True)
    weights = {}
    for i in range(n):
        weights[(i, i)] = 1
        for _, j in transitions[i].items():
            key = (i, j)
            weights[key] = weights.get(key, 0) - z
    matrix = SparseMatrix(
        n, n, weights
    vector = sympy.Matrix(n, 1, list(map(int, accepting)))
    return z, matrix.LUsolve(vector)

This simultaneously solves the linear equations for all of the states of the DFA. So a result looks like this:

In []: rd.compute_generating_functions([False, False, True], [{0: 1}, {0: 2}, {0: 2}])
(z, Matrix([
 [z**2/(-z + 1)],
 [   z/(-z + 1)],
 [   1/(-z + 1)]]))

This calculates the generating functions for the language that matches any string of two or more 0 bytes.

Now that we’ve calculated the generating functions, we can randomly walk the DFA to run the desired Boltzmann sampler! The counting generating function provides the weights for our random walks, and moving to one of the languages in our disjoint union is precisely either stopping the walk or transitioning to another state.

We start at the initial state, and at each stage we choose between one of \(|A| + 1\) actions: We can either emit a single character and transition to the corresponding new state, or we can stop. Each of these are weighted by the generating function of the language applied to our Boltzmann sampling parameter: The option “halt now” is given weight \(1\), and the action “emit character c” is given weight \(x g_{L[\tau(i, c)]}(x)\) – i.e. \(x\) times the state weight we’ve already calculated for the target state.

This is just a discrete distribution over a finite number of possibilities, so we can use the alias method to build a sampler for it.

Thus our algorithm becomes:

  1. Build a DFA for the regular expression
  2. Use sympy to calculate the counting generation function for the language matched starting from each state in our DFA
  3. Pick a Boltzmann parameter
  4. Calculate all the state weights
  5. For each state (possibly lazily on demand) initialize an alias sampler for the actions that can be taken at that state
  6. Repeatedly draw from the sampler, either drawing a stop action or drawing an emit character action. Where we emit a character, make the corresponding state transition.

I’m not going to claim that this approach is efficient by any stretch. The DFA may be exponential in the size of our starting regular expression, and we’re then solving an \(n \times n\) linear equation on top of that. However for a lot of languages the resulting DFA won’t be too large, and this approach is fairly reasonable. It’s also nice as an existence proof.

Additionally, we can cache the first 5 steps and reuse them. Only the 6th step needs to be redone for new values of the parameter, and it at least is generally very fast – It’s O(n) in the length of the generated string, with pretty good constant factors as drawing from a pre-built alias sampler is generally very fast. You can also initialise the alias samplers lazily as you walk the graph.

If you want to see all of this fitting together, I created a small project with these concepts in it. It’s far from production ready, and is only lightly tested, but it seems to work pretty well based on initial experimentation and should be enough to understand the ideas.

(And if you want to see more posts like this, why not support my blog on Patreon?)

This entry was posted in programming, Python on by .