Category Archives: programming

Looking into doing a PhD

As regular readers of this blog have probably figured out, I’m a researchy sort of person.

A lot of my hobbies – maths, voting theory, weird corners of programming, etc – are research oriented, and most of my work has had some sort of research slant to it.

The last two years I’ve basically been engaged in a research project working on Hypothesis. It’s come quite far in that time, and I feel reasonably comfortable saying that it’s the best open source property based testing library on most metrics you’d care to choose. It has a number of novel features and implementation details that advance the state of the art.

It’s been pretty great working on Hypothesis like this, but it’s also been incredibly frustrating.

The big problem is that I do not have an academic background. I have a masters in mathematics (more technically I have a BA, an MA, and a CASM. Cambridge is weird. It’s entirely equivalent to a masters in mathematics though), but that’s where I stopped. Although it says “DR” in my online handle and the domain of this blog, those are just my initials and not my qualification.

As a result, I have little to no formal training or experience in doing academic research, and a similarly low understanding of who’s who and what’s what within the relevant fields. So I’ve been reading papers and trying to figure out the right people to talk to all on my own, and while it’s gone OK it’s still felt like fumbling around in the dark.

Which leads to the obvious solution that I spoilered in the title: If the problem is that I’m trying to do research outside of an academic context, the solution is to do research in an academic context.

So I’d like to do a PhD that is either about Hypothesis, or about something close enough to Hypothesis that each can benefit from the other.

There’s probably enough novel work in Hypothesis already that I could “just” clean it up, factor it out, and turn it into a PhD thesis as it is, but I’m not really expecting to do that (though I’d like that to be part of it). There are a number of additional directions that I think it would be worth exploring, and I expect most PhD funding will come with a focus subject attached which I would be happy to adapt to (a lot of the most interesting innovations in Hypothesis came because some external factor forced me to think about things in ways I wouldn’t otherwise have!). If you’d like to know more, I’ve written up a fairly long article about Hypothesis and why I think it’s interesting research on the main Hypothesis site.

Which, finally, brings me to the main point of the post: What I want from you.

I’m already looking into and approaching potential universities and interesting researchers there who might be good supervisors or able to recommend people who are. I’ve been in touch with a couple (some of whom might be reading this post. Hi), but I would also massively appreciate suggestions and introductions.

So, if you work in relevant areas or know of people who do and think it would be useful for me to talk to, please drop me an email at [email protected]. Or just leave a comment on this blog post, tweet at me, etc.

This entry was posted in Hypothesis, life, programming, Python.

Laziness is better when it’s visible

This is a trick I invented a while ago. I’m morally certain it’s a reinvention rather than an invention, but I’ve not really seen it in use and at the very least it doesn’t seem to be widely known. I recently ran into a situation where a library would benefit from it greatly and doesn’t use it, so I thought I would write it up.

Suppose we have a bunch of strings and we want to concatenate them.

def cat(my_strings):
    result = ''
    for s in my_strings:
        result += s
    return result

That’s how you concatenate strings, right?

Oh no, accidentally quadratic!

You can fix this using a special string buffer type, or in Python by just using ''.join(my_strings), but wouldn’t it be nice if you didn’t have to? It’s often very convenient to build things up using expressions. Although it’s no great hardship to use bulk operations for strings, you run into the same problem in e.g. pulp, where you have more complex expressions (and no corresponding sum method in the library). It would be great if this all just worked.

One way to do this sort of thing is to switch to an immutable tree based representation like a rope where the concatenation operation has a more reasonable complexity (usually log(n)).

But that then comes with its own costs. Using a tree structure slows down access and iteration – only by an O(log(n)) factor, but with relatively high constants. As a result you end up paying a relatively high performance penalty for what was mostly wanted just as a convenience (ropes do have other advantages, but that’s not what we’re looking at here).

But what if you didn’t have to do any of that? What if you could get O(1) concatenation and all the benefits of the compact representation? It sounds implausible, but I’m here to tell you you can!

Or, rather, that you almost can. Both of the above are true: You do get O(1) concatenation and you do get all the benefits of the compact representation, but you do end up paying some additional cost, because the idea is to use laziness. So although concatenation is O(1) you end up paying an additional O(n) cost the first time you want to use the result. Fortunately this still avoids the problem of sums being quadratic.

The key idea is to use a data structure that can swap out its implementation on demand. It starts by just storing the abstract expression tree that led to it, and then switches to a more efficient representation as soon as you need it to.

e.g. the version of this for the above is that a string becomes a binary tree where the leaves are buffers and the branches indicate that the string is a concatenation of its left and right parts. Concatenation is then just creating a new split node, which is O(1).

Then, once we want the compact representation (which will tend to be as soon as we start doing interesting operations on the data – because the expression tree is entirely unnormalized there is very little we can usefully do to it that isn’t an O(n) operation!), we calculate that, store the result on the string and throw away the expression data that brought us here.

That is, as soon as we have forced the string, it switches to a new representation using the forced buffer, essentially replacing the split node with a leaf node.

This feels like we’re back where we started – if you’re doing this lazily like that then you’re just summing together two string children so you’re quadratic again – but we’re not, for one very important reason: Because the implementation of the laziness is under our control, we can tell whether a string has already been forced or not. When forcing a node we then don’t force its child nodes, but instead just walk the tree and behave appropriately when we get to the leaves.

This sort of thing can be very useful, because the common case where it wins is when you have a complex expression graph and only actually care about a very small fraction of the subexpressions (e.g. in the sum case).

This isn’t always a win, in that it does behave suboptimally under some workloads (e.g. when you do care about a lot of the intermediate results but process them in the reverse of the order you created them), but it’s rarely a substantial loss and usually produces dramatic speedups by converting accidentally quadratic cases into the desired linear behaviour.

There are additional tricks you can build on top of this:

  • You can precompute some data so you don’t always have to force the structure. E.g. you can always calculate the length of the string in the above example without forcing it and still have the operations be O(1).
  • You can sometimes have operations that only require partially forcing the data structure (e.g. if you index into a string you might only have to force one half of it, or neither if the index is out of bounds!).
  • If you have more complex operations then you can do a sort of “query optimization” to rewrite the expression tree into a more efficient execution plan. For example, a thing I’ve done in the past is when the operation is intersection you can rewrite it so that intersections are processed in order of increasing size, which often ends up with you being able to terminate early because you’ve discovered that the end result is going to be empty regardless of what happens next.

Depending on circumstances, any of the above can be worth doing, but most of the big wins come from the basic trick which is almost always a valuable one if you’re running into this sort of problem.

Using this technique won’t always be an improvement – e.g. you’ll end up doing some duplicated work if you do something like x + x, because forcing the result will effectively force x twice – but most of the workloads on which it behaves particularly badly are ones you should probably have avoided anyway with the normal approach. The only place where you always suffer a hit from using this is that the laziness adds an additional check to each operation, which can be anywhere between a small and a modest performance penalty depending on how expensive the operation normally is. Additionally, if you want the operations to be thread safe then you’ll need a memory barrier of some sort (e.g. making the relevant field volatile) to get the visibility right, which adds another small hit.

So it’s not a universal win, but the cost is light enough and there are enough work loads where it improves behaviour substantially that it is often worth considering.

To finish off, and make this more concrete, here’s some Python code implementing this idea for strings:
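A minimal sketch of the idea (class and method names are my own; the original implementation may differ in its details):

```python
class LazyString:
    __slots__ = ('_forced', '_left', '_right')

    def __init__(self, value=''):
        self._forced = value   # str once forced, else None
        self._left = None
        self._right = None

    def __add__(self, other):
        # O(1): just record the pending concatenation as a split node
        result = LazyString()
        result._forced = None
        result._left = self
        result._right = other
        return result

    def force(self):
        if self._forced is None:
            # Walk the expression tree iteratively, collecting buffers in
            # order. Crucially, already-forced children are treated as
            # leaves rather than re-walked, which is what keeps summing linear.
            parts = []
            stack = [self]
            while stack:
                node = stack.pop()
                if node._forced is not None:
                    parts.append(node._forced)
                else:
                    stack.append(node._right)
                    stack.append(node._left)
            self._forced = ''.join(parts)
            self._left = self._right = None   # drop the expression tree
        return self._forced

    def __len__(self):
        return len(self.force())

    def __str__(self):
        return self.force()
```

(As mentioned in the bullets above, you could also store the length eagerly on each node so that `len` never forces anything; this sketch keeps it simple and forces.)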

(Like this post? Want to see more like it? Why not support my Patreon! You’ll get to see drafts of upcoming posts and also increase the amount I write)

This entry was posted in programming, Python.

Rigging elections with integer linear programming

No, this isn’t a post about politics, sorry, it’s just a post about voting theory.

As you might have noticed (and if not, this is the post announcing it), I have a book out! It’s called Voting by Example and is about the complexities of voting and why different voting systems might be interesting. You should go buy it.

But this isn’t just an ad post for the book, it’s about something else: How I came up with the examples in this book.

At its core is the following example. We have a student election where four candidates are running for class president. Each student casts a vote ranking the four candidates in order of most to least preferred.

The votes are as follows:

  • 32% of students voted Alex, Kim, Charlie, Pat
  • 27% of students voted Charlie, Pat, Kim, Alex
  • 25% of students voted Pat, Charlie, Kim, Alex
  • 16% of students voted Kim, Pat, Alex, Charlie

The point of this example is that each of four different ranked voting systems – plurality, Condorcet’s method, instant-runoff voting, and the Borda count – gives a different answer.

It’s also constructed to have a minimal number of voting blocs, though it’s minimal amongst a slightly more specific set of elections than just those satisfying the above conditions.

Roughly what this means is:

  • Alex has the most first-choice votes
  • When compared pairwise with any other candidate, the majority prefer Charlie
  • Pat wins a complicated procedure in which candidates iteratively drop out depending on which of the remaining candidates is ranked first by the fewest voters
  • If you give candidates a score of three for each voter who ranks them first, two for each who ranks them second, and one for each who ranks them third, then Kim has the highest score.
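All four claims are easy to check mechanically. Here is a sketch that recomputes each winner directly from the blocs above (function names are my own):

```python
blocs = [
    (32, ("Alex", "Kim", "Charlie", "Pat")),
    (27, ("Charlie", "Pat", "Kim", "Alex")),
    (25, ("Pat", "Charlie", "Kim", "Alex")),
    (16, ("Kim", "Pat", "Alex", "Charlie")),
]
candidates = ["Alex", "Charlie", "Kim", "Pat"]

def plurality_winner(blocs):
    # most first-choice votes
    scores = {c: 0 for c in candidates}
    for n, ballot in blocs:
        scores[ballot[0]] += n
    return max(scores, key=scores.get)

def condorcet_winner(blocs):
    # beats every other candidate in a pairwise majority vote
    def beats(a, b):
        return sum(n for n, ballot in blocs
                   if ballot.index(a) < ballot.index(b)) > 50
    for c in candidates:
        if all(beats(c, other) for other in candidates if other != c):
            return c
    return None

def borda_winner(blocs):
    # 3 points for first place, 2 for second, 1 for third, 0 for last
    scores = {c: 0 for c in candidates}
    for n, ballot in blocs:
        for position, c in enumerate(ballot):
            scores[c] += n * (3 - position)
    return max(scores, key=scores.get)

def irv_winner(blocs):
    # repeatedly eliminate whoever is ranked first by the fewest voters
    remaining = set(candidates)
    while len(remaining) > 1:
        scores = {c: 0 for c in remaining}
        for n, ballot in blocs:
            for c in ballot:
                if c in remaining:
                    scores[c] += n
                    break
        remaining.remove(min(scores, key=scores.get))
    return remaining.pop()
```

Running these on the blocs gives Alex, Charlie, Kim, and Pat respectively – four systems, four different winners.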

If you want to know more than that I recommend the relevant Wikipedia links above, which are all very approachable.

The significance of this is that each of these is a popular (either in practice or amongst electoral theorists) way of saying who should be the winner. So the situation ends up being rather complex to decide.

But this isn’t a post about the significance. This is a post about how I constructed the example.

Historically I’ve generally created example elections with Hypothesis or a similar approach – randomly fuzzing until you get a result you want – but that wasn’t really going to work very well here due to the rather complicated set of constraints that are hard to satisfy by accident.

So I instead turned to my old friend, integer linear programming.

The idea of integer linear programming (ILP) is that we have a number of variables which are forced to take integer values. We can then impose linear constraints between them, and give a linear objective function to optimise for. For example we might have three variables \(v_1, v_2, v_3\) and the following constraints:

  • \(v_1, v_2, v_3 \geq 0\)
  • \(v_1 + 2 v_2 + 3 v_3 \leq 5\)

And then try to maximise \(v_1 + 2 v_2 + 4 v_3\).

Given a problem like this we can feed it to an ILP solver and it will spit out a solution. If we do that, it will tell us that an optimal solution for this problem is \(v_1 = 2, v_2 = 0, v_3 = 1\).
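For a toy problem this small you can sanity-check the solver's answer by brute force (this is not how an ILP solver works, just an exhaustive check):

```python
from itertools import product

# v1 + 2*v2 + 3*v3 <= 5 with all v_i >= 0 bounds each variable by 5
solutions = [
    (v1 + 2 * v2 + 4 * v3, (v1, v2, v3))
    for v1, v2, v3 in product(range(6), repeat=3)
    if v1 + 2 * v2 + 3 * v3 <= 5
]
best_value = max(value for value, _ in solutions)
optima = sorted(vs for value, vs in solutions if value == best_value)
```

Note that the optimum value of 6 is achieved by more than one assignment – \((0, 1, 1)\) also works – and a solver is free to return any optimal solution, such as \((2, 0, 1)\).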

A lot of problems can be nicely turned into ILP problems, and there are some fairly good open source ILP solvers, so it’s often a nice way to solve complex combinatorial problems.

So how do we turn our election rigging problem into an integer linear program?

The key idea is this: 4! isn’t very large. It’s only 24. 24 is tiny from an ILP point of view.

This means that there’s no problem with creating a variable for each of the possible votes that someone could cast, representing the number of people who cast that vote. That is, we create integer variables \(v_p\) indexed by permutations with the constraints:

  • \(\sum v_p = 100\)
  • \(v_p \geq 0\)

Then the value of e.g. \(v_{(0, 1, 2, 3)}\) is the number of people who cast the exact vote \((0, 1, 2, 3)\) (or, by name, Alex, Charlie, Pat, Kim).

We then use a trick to get nicer examples, which is that we try to minimise the number of non-zero votes. The idea is to create variables which, when minimised, just look like markers that say whether a vote is non-zero or not.

So we create supplementary 0/1 valued integer variables \(u_p\) with the constraints that \(v_p \leq 100 u_p\), and set the objective to minimise \(\sum u_p\). Then \(u_p\) will be set to \(0\) wherever it can be, and the only places where it can’t are where \(v_p\) is non-zero. Thus this minimises the number of voting blocs.

So that’s how we create our basic ILP problem, but right now it will just stick 100 votes on some arbitrary possible ballot. How do we then express the voting conditions?

Well, let’s start with the plurality and Borda scores. These are pretty easy, because they just calculate a score for each candidate from each permutation and add the scores up. This means that the scores are just a linear function of the variables, which is exactly what an ILP is built on.

Victory is then just a simple matter of one candidate’s score exceeding another. You need to set some epsilon for the gap (linear programming can’t express \(<\), only \(\leq\)), but that’s OK – the scores are just integers, so we can just insist on a gap of \(1\).

The following code captures all of the above using Pulp, which is a very pleasant to use Python interface to a variety of ILP solvers:

The idea is that the additive_scores parameter takes a list of (scoring function, desired ranking) pairs and returns an election producing those rankings.

So if we run this asking for the plurality winners to be (0, 1, 2, 3) and the Borda winners to be (3, 2, 1, 0) we get the following:

>>> build_election(4, 100, [
    (lambda p, c: int(p[0] == c), (0, 1, 2, 3)),
    (lambda p, c: 3 - p.index(c), (3, 2, 1, 0))])
[((0, 3, 1, 2), 35), ((1, 2, 3, 0), 34), ((2, 3, 0, 1), 31)]

So this is already looking quite close to our starting example.

Creating a Condorcet winner is similarly easy: whether the majority prefers a candidate to another is again just an additive score. So we just need to add the requisite \(N – 1\) constraints that our desired Condorcet candidate beats each of the others.

if condorcet_winner is not None:
    victories = {
        (i, j): lpSum(
            v for p, v in variables if p.index(i) < p.index(j)
        )
        for i in candidates
        for j in candidates
    }
    for c in candidates:
        if c != condorcet_winner:
            problem.addConstraint(
                victories[(condorcet_winner, c)] >=
                victories[(c, condorcet_winner)] + 1
            )

If we run this to force the Condorcet winner to be \(1\) we now get the following:

>>> build_election(4, 100, [
    (lambda p, c: int(p[0] == c), (0, 1, 2, 3)),
    (lambda p, c: 3 - p.index(c), (3, 2, 1, 0))],
    condorcet_winner=1,
)
[((0, 3, 1, 2), 28),
 ((1, 2, 3, 0), 27),
 ((2, 1, 3, 0), 24),
 ((3, 2, 0, 1), 21)]

This is pretty close to the desired result. We just need to figure out how to set the IRV winner.

This is a bit more fiddly because IRV isn’t a simple additive procedure, so we can’t simply set up scores for who wins it.

But determining who drops out next, given who has already dropped out, is a simple additive procedure: it’s just a matter of calculating a modified plurality score with some of the candidates ignored.

So what we can do is specify the exact dropout order: This means we know who has dropped out at any point, so we can calculate the scores for who should drop out next and add the appropriate constraints.

The following code achieves this:

    if irv_dropout_order is not None:
        remaining_candidates = set(candidates)
        for i in irv_dropout_order:
            if len(remaining_candidates) <= 1:
                break
            assert i in remaining_candidates
            allocations = {j: [] for j in remaining_candidates}
            for p, v in variables:
                for c in p:
                    if c in remaining_candidates:
                        allocations[c].append(v)
                        break
            loser_allocations = sum(allocations.pop(i))
            remaining_candidates.remove(i)
            for vs in allocations.values():
                problem.addConstraint(loser_allocations + 1 <= sum(vs))

And running this we get the following:

>>> build_election(4, 100, [
    (lambda p, c: int(p[0] == c), (0, 1, 2, 3)),
    (lambda p, c: 3 - p.index(c), (3, 2, 1, 0))
], condorcet_winner=1, irv_dropout_order=(3, 1, 0, 2))
[((0, 3, 1, 2), 31),
 ((1, 2, 3, 0), 27),
 ((2, 1, 3, 0), 25),
 ((3, 2, 0, 1), 17)]

This isn’t quite the same as the example in the book (here the first bloc has one fewer vote, which has gone to the last bloc instead), because the book example had a bit more code for optimizing it into a canonical form, but that code isn’t very interesting so we’ll skip it.

Here’s the full code:

I’m constantly surprised how nice integer linear programming ends up being for constructing examples. I knew I could do the Borda and plurality scores – that’s in fact the example that motivated me to try this out at all – but although I find it obvious in retrospect that you can also fix the Condorcet winner, it definitely wasn’t obvious a priori. The fact that it’s easy to also calculate the IRV dropout order was genuinely surprising.

This is also a nice example of how much fun small-n programming can be. This isn’t just an O(n!) solution – it generates a problem of size O(n!) and then feeds it to a solver for an NP-hard problem! In principle that should be catastrophic. In practice, n is 4, so who cares? (n = 4 is about the limit for this too – it just about works for n = 5, and for n = 6 it doesn’t really.)

I also find it somewhat surprising how much freedom there is to get different results from different voting systems. I was motivated to do some of this by The ultimate of chaos resulting from weighted voting systems, so I knew there was a fair bit of freedom, but I was somewhat expecting that to be a pathology of weighted voting systems, so I’m still a little surprised. I guess there’s just quite a lot of room to play around with in a 24-dimensional simplex.

Which brings me to my final point: When you’re trying to understand what is possible in a domain, this sort of example generation focused programming is very useful for exploring the limits of the possible and building your intuition. I feel like this isn’t a thing people do enough, and I think we should all do more of it.

I also think you should buy my book.

(Or support my writing on Patreon)

This entry was posted in programming, Python, voting.

Faster random sampling by handling rejection well

Attention conservation notice: You probably don’t have the problem this post solves. It might be interesting anyway.

As part of some ongoing development for Hypothesis, I find myself in need of a well optimised solution to the discrete sampling problem: We have weights \(w_1, \ldots, w_n\) and we want to produce a number in \(\{1, \ldots, n\}\) such that \(i\) is drawn with probability proportional to \(w_i\).

All solutions to this problem are intrinsically \(O(n)\): If you have \(w_i = 1\) for \(i \neq j\) and \(w_j = 10^6\) then you need to inspect \(j\) to know whether to generate \(i \neq j\) often or rarely.

You can however use that \(O(n)\) work once to build a data structure for repeated drawing. For example you can use the alias method to build a data structure that lets you do future draws in \(O(1)\).
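For reference, here is a compact sketch of the alias method's \(O(n)\) setup (the Vose variant) and \(O(1)\) draw. This is the generic textbook construction, not Hypothesis's code:

```python
import random

def build_alias_table(weights):
    # Vose's alias method: O(n) setup producing two tables such that a
    # draw needs one uniform index plus one biased coin flip.
    n = len(weights)
    total = sum(weights)
    prob = [w * n / total for w in weights]
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1]
    large = [i for i, p in enumerate(prob) if p >= 1]
    while small and large:
        s = small.pop()
        l = large.pop()
        alias[s] = l                    # s's spare capacity is filled by l
        prob[l] = prob[l] + prob[s] - 1
        (small if prob[l] < 1 else large).append(l)
    for i in large + small:             # leftovers are exactly full
        prob[i] = 1
    return prob, alias

def draw(prob, alias, rng):
    # O(1) per draw: pick a column, then either keep it or take its alias
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

Each column \(i\) holds probability mass \(prob_i\) for \(i\) itself and \(1 - prob_i\) for its alias, so the implied probabilities reconstruct the original weights exactly.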

Which is great and all if you’re expecting to repeatedly draw from the same distribution over and over again. Unfortunately that’s not what my workload looks like – the distributions used are many and varied. Distributions will be reused, but not in any easy to predict manner.

You can of course do cached lookup to an alias method sampler. This works, but if your cache miss rate gets too high (which it often will in my workload) this doesn’t actually improve matters. Additionally, depending on how good your caching is and how much is in there, it might not be much cheaper to look up something in the cache than it is to just create a new one from scratch.

So what we’d like is a method that is significantly cheaper to initialise. Fortunately there is one, it’s called rejection sampling!

You just need to calculate \(W = \max w_i\). Now you iterate the following procedure:

  1. Pick \(i\) uniformly at random
  2. With probability \(\frac{w_i}{W}\) return \(i\)
  3. Go to 1
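The three steps above, as a short sketch (using a strict `<` so that zero weights are never accepted):

```python
import random

def rejection_sample(weights, rng):
    """Draw i with probability proportional to weights[i] by rejection.

    Initialisation is a single O(n) max; each loop iteration accepts
    index i with probability weights[i] / W.
    """
    w_max = max(weights)  # assumes at least one weight is positive
    while True:
        i = rng.randrange(len(weights))
        if rng.random() < weights[i] / w_max:
            return i
```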

This is very cheap to initialise but not so great in terms of run time: If the weights are uniform then it’s optimal (because you always return \(i\) in step 2), but if they’re very far from uniform then you’re going to make a lot of trips (potentially \(O(n)\) in the case where one weight dominates everything else) through that loop.

But there turns out to be a hybrid solution that works pretty well and gets us most of the cheapness of initialising rejection sampling with most of the performance of our better sampling methods, with a tuning parameter deciding how much of each we want.

Pick an integer parameter \(R > 0\) (good choices for \(R\) are probably \(1\) or \(2\)) and then run the following algorithm:

  1. Pick \(i\) uniformly at random
  2. With probability \(\frac{w_i}{W}\) return \(i\)
  3. If we’ve reached this point fewer than \(R\) times, go to 1.
  4. Look up or initialise a new sampler for these weights and draw from that.

So we perform the cheap rejection sampling until it’s obvious that it’s not helping much and then move to a proper sampling method.
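The hybrid can be sketched as follows. Here a simple \(O(\log n)\) cumulative-weights sampler stands in for the fancier \(O(1)\) structures mentioned in the post, and all the names are illustrative:

```python
import bisect
import random
from itertools import accumulate

def make_exact_sampler(weights):
    # a properly initialised sampler: O(n) setup, O(log n) per draw
    cumulative = list(accumulate(weights))
    total = cumulative[-1]

    def sample(rng):
        return bisect.bisect_right(cumulative, rng.random() * total)

    return sample

def hybrid_sample(weights, cache, rng, retries=2):
    # Phase 1: up to `retries` rounds of cheap rejection sampling.
    w_max = max(weights)
    for _ in range(retries):
        i = rng.randrange(len(weights))
        if rng.random() < weights[i] / w_max:
            return i
    # Phase 2: rejection isn't helping, fall back to a cached exact sampler.
    key = tuple(weights)
    if key not in cache:
        cache[key] = make_exact_sampler(weights)
    return cache[key](rng)
```

With near-uniform weights phase 2 is rarely reached and the cache stays cold; with very skewed weights you pay the initialisation once and amortise it across later draws.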

For cases where we’re near the optimum case for rejection sampling we’ll do well and mostly use it; for the rest we’ll pay a small extra cost. Potentially we could also do some extra book keeping – if we find rejection sampling rarely works we could just turn off that phase of the algorithm entirely.

Experiments so far in the thing I’m implementing suggest that this idea provides a modest improvement compared to always initializing the sampler (I’m actually using a data structure from Succinct Sampling from Discrete Distributions, on the grounds that it is substantially cheaper to initialize at the cost of being slightly more expensive to run. Plausibly this would give a much bigger improvement if I were using the alias method).

This probably isn’t that useful a trick. The work load I’m optimising for is a bit odd, and the performance improvements under that workload are only modest. Additionally, the baseline is already fairly fast. But it’s not something I’d seen before and it seemed worth noting down.

(Something something Patreon something. It feels a bit cheap to plug it here given the relatively niche nature of this post, but hey I was doing the Hypothesis research that lead to it for free, so I guess it balances out?)

This entry was posted in programming.

Proving or refuting regular expression equivalence

Suppose we are interested in the following problem: We have two regular expressions. Do these match exactly the same strings, and if not what is the smallest (shortlex minimal) string that one matches but not the other?

i.e. we either want to prove they are equivalent or provide the best possible refutation that they are not.

How do we do this?

More specifically, let’s consider this problem for extended regular expressions. That is, we have the following types of expressions:

  • \(\epsilon\), which matches only the empty string
  • \(c\), for \(c\) in our alphabet, which matches only the character \(c\)
  • \(\emptyset\), which matches no strings
  • \(\omega\) which matches every string
  • \(x^*\) which matches \(a_1\ldots a_n\) for any strings \(a_i\) matched by \(x\).
  • \(xy\) where \(x\) and \(y\) are extended regular expressions, which matches any strings \(ab\) where \(x\) matches \(a\) and \(y\) matches \(b\)
  • \(x \vee y\) which matches any string matched by either or both of \(x\) or \(y\)
  • \(x \wedge y\) which matches any string matched by both \(x\) and \(y\)
  • \(\neg x\) which matches any string not matched by \(x\).

That is, we extend the normal regular expressions with intersection and negation. This doesn’t actually increase the set of languages we can match, but it can lead to significantly more concise representations (I think, but am not sure, that there are extended regular expressions which require exponentially larger standard regular expressions to represent).

This reduces to the problem of finding the lexmin smallest member of the regular language for an expression or showing that it’s empty: Given two extended regular expressions \(x, y\), we can define \(x \oplus y = (x \vee y) \wedge (\neg (x \wedge y))\) – the symmetric difference. This matches any string which is matched by one but not both of \(x\) and \(y\), so a string is a member of this regular language precisely if it is a refutation of the claim that \(x\) and \(y\) are equivalent.

If we can compile extended regular expressions into a deterministic finite automaton (which we can), then the problem is easy:  To find the lexmin smallest element of a regular language or show that it’s empty given a deterministic finite automaton for it, you just use Dijkstra’s Algorithm to search the graph for an accepting node. You stop when you either find one (in which case you’ve constructed a lexmin path to it along the way) or you’ve explored all the states (in which case there is no path to an accepting node and the language is empty). Job’s done.
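Since every transition has the same cost, Dijkstra here degenerates to breadth-first search with the alphabet explored in sorted order. A sketch over an explicit DFA, represented as nested dicts (names illustrative):

```python
from collections import deque

def lexmin_member(start, transitions, accepting):
    """Shortlex-smallest string accepted by a DFA, or None if its language
    is empty.

    transitions: dict mapping state -> dict mapping char -> state.
    BFS visits strings in shortlex order, so the first accepting state we
    reach is reached via the shortlex-minimal accepted string.
    """
    queue = deque([(start, "")])
    seen = {start}
    while queue:
        state, prefix = queue.popleft()
        if state in accepting:
            return prefix
        for ch in sorted(transitions[state]):
            nxt = transitions[state][ch]
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, prefix + ch))
    return None
```

For example, for a three-state DFA accepting strings over {a, b} that contain "ab", this returns "ab".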

That was easy. Are we done?

Well, no. There are a lot of evils hidden in that sentence “just construct a deterministic finite automaton”. The corresponding automaton to a regular language may be exponentially larger than the language, and even when it’s not it’s still an expensive operation. Can we avoid that?

First, let’s take a detour: There are a number of obvious rewrite rules we can use to simplify our lives, by replacing regular expressions with straightforwardly equivalent versions of them. We use \(x \approx y\) to indicate that we should consider \(x\) and \(y\) the same under these rewrite rules:

  • \(x \wedge x \approx x\)
  • \(x \wedge y \approx y \wedge x\)
  • \(x \wedge (y \wedge z) \approx (x \wedge y) \wedge z\)
  • \(x \wedge \emptyset \approx \emptyset\)
  • \(x \wedge \omega \approx x\)
  • \(x \vee x \approx x\)
  • \(x \vee y \approx y \vee x\)
  • \(x \vee (y \vee z) \approx (x \vee y) \vee z\)
  • \(x \vee \emptyset \approx x \)
  • \(x \vee \omega \approx \omega \)
  • \(x(yz) \approx (xy)z\)
  • \(x\emptyset \approx \emptyset\)
  • \(\emptyset x \approx \emptyset\)
  • \(x \epsilon \approx x\)
  • \(\epsilon x \approx x\)
  • \((x^*)^* \approx x^*\)
  • \(\epsilon^* \approx \epsilon\)
  • \(\emptyset^* \approx \epsilon\)
  • \(\omega^* \approx \omega\)
  • \(\neg \emptyset \approx \omega\)
  • \(\neg \omega \approx \emptyset\)
  • \(\neg(\neg x) \approx x\)

If \(x \approx y\) we say that \(x\) and \(y\) are similar.

All of these correspond to rules that apply for the resulting languages, so if \(x \approx y\) then \(x\) and \(y\) match the same language. In general though \(x\) and \(y\) might match the same languages without being similar.

Why do we care about this? Well the first reason is that it lets us have a fast test for when two regular expressions definitely are equivalent: If they are similar then they are equivalent. If they’re not we have to do more work to test this.

In particular, we can use smart constructors for our regular expression data type to apply all of these rewrite rules at construction time, which then makes similarity simply a matter of testing for structural equality. If we use hash consing (essentially: memoizing all constructors so that structurally equal values are represented by the exact same object) we can even make testing for similarity O(1).
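A toy sketch of what hash consing plus smart constructors can look like in Python. This handles only a couple of the rewrite rules, with commutativity handled by ordering arguments; a real implementation would intern every constructor:

```python
_interned = {}

def intern_node(*key):
    # hash consing: structurally equal nodes become the very same object,
    # so similarity testing is just an identity check
    return _interned.setdefault(key, key)

EMPTY = intern_node('empty')

def char(c):
    return intern_node('char', c)

def union(x, y):
    # smart constructor: apply rewrite rules, then intern
    if x is EMPTY:
        return y            # ∅ ∨ y ≈ y
    if y is EMPTY:
        return x            # x ∨ ∅ ≈ x
    if x is y:
        return x            # x ∨ x ≈ x
    if y < x:               # canonical argument order for commutativity
        x, y = y, x
    return intern_node('or', x, y)
```

(Associativity needs a little more work, e.g. flattening nested unions into a canonical list, but the principle is the same.)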

Which is good, because we’re going to need it a lot: the real reason to care about similarity of regular expressions is that we can construct a deterministic finite automaton using the Brzozowski derivative of regular expressions, with each of our states labelled by a regular expression. If we have some character in our alphabet \(c\) and a regular language \(L\) then the Brzozowski derivative is \(d(L, c) = \{v: cv \in L\}\) – the set of suffixes of strings starting with \(c\) in \(L\). We can extend this to regular expressions with the following set of rules:

  • \(d(\epsilon, c) = \emptyset\)
  • \(d(b, c) = \emptyset\) if \(b \neq c\), else \(d(c, c) = \epsilon\)
  • \(d(\emptyset, c) = \emptyset\)
  • \(d(\omega, c) = \omega\)
  • \(d(x^*, c) = d(x, c) x^*\)
  • \(d(xy, c) = d(x, c)y \vee \nu(x) d(y, c)\)
  • \(d(x \vee y, c) = d(x, c) \vee d(y, c)\)
  • \(d(x \wedge y, c) = d(x, c) \wedge d(y, c)\)
  • \(d(\neg x, c) = \neg d(x, c)\)

Where \(\nu(x)\) is \(\emptyset\) if the corresponding regular language doesn’t match the empty string, or \(\epsilon\) otherwise (this can be calculated directly from the expressions by similar rules).

The idea is that we can construct an automaton for a regular language by labelling states by expressions. States are accepting if the corresponding expression matches the empty string, and the transition from state \(x\) given character \(c\) goes to \(d(x, c)\).
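The derivative rules above transcribe almost directly into code. A sketch with expressions as plain tuples – no smart constructors or hash consing, so this is fine for matching but, as discussed below, would blow up if used to enumerate states:

```python
EMPTY = ('empty',)   # ∅: matches nothing
EPS = ('eps',)       # ε: matches only the empty string
ALL = ('all',)       # ω: matches every string

def nullable(x):
    # does x match the empty string? (this is ν(x) == ε)
    tag = x[0]
    if tag in ('eps', 'all', 'star'):
        return True
    if tag in ('empty', 'char'):
        return False
    if tag in ('cat', 'and'):
        return nullable(x[1]) and nullable(x[2])
    if tag == 'or':
        return nullable(x[1]) or nullable(x[2])
    if tag == 'not':
        return not nullable(x[1])
    raise ValueError(tag)

def deriv(x, c):
    # the Brzozowski derivative d(x, c)
    tag = x[0]
    if tag in ('empty', 'eps'):
        return EMPTY
    if tag == 'char':
        return EPS if x[1] == c else EMPTY
    if tag == 'all':
        return ALL
    if tag == 'star':
        return ('cat', deriv(x[1], c), x)
    if tag == 'cat':
        d = ('cat', deriv(x[1], c), x[2])
        if nullable(x[1]):
            return ('or', d, deriv(x[2], c))
        return d
    if tag == 'or':
        return ('or', deriv(x[1], c), deriv(x[2], c))
    if tag == 'and':
        return ('and', deriv(x[1], c), deriv(x[2], c))
    if tag == 'not':
        return ('not', deriv(x[1], c))
    raise ValueError(tag)

def matches(x, s):
    # walk the automaton: one derivative per character, then check nullability
    for ch in s:
        x = deriv(x, ch)
    return nullable(x)
```

For example, \((ab)^*\) matches "" and "abab" but not "aba", and \(\omega \wedge \neg \epsilon\) matches any non-empty string.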

If our similarity rule was enough to guarantee equivalence then this would be a minimal deterministic finite automaton for the language. However, even without that there are only finitely many states reachable from a given regular expression (this is where similarity becomes important! Without it this is not true. e.g. if you consider iterating the derivatives of \(c^*\) under \(c\) you’ll get infinitely deep nested chains of similar but unequal expressions).

So far it isn’t obvious why this is interesting, but it has one very important property: You can walk this automaton without having to compute any of the states in advance. This means that you can build it as you explore, so we can repeat the above approach of just using Dijkstra’s algorithm to find a refutation but we don’t have to build the automaton in advance.

It’s particularly easy to calculate the derivatives of the disjoint sum language too:

\(d(x \oplus y, c) = d(x \vee y, c) \wedge d(\neg (x \wedge y), c) =  (d(x, c) \vee d(y, c)) \wedge (\neg (d(x, c) \wedge d(y, c))) = d(x, c) \oplus d(y, c)\).

So if we’ve got the derivatives of our original regular expressions cached in some way it’s very cheap to calculate derivatives of our compound regular expression.

So that’s a major improvement on our original algorithm – in a number of cases it will be exponentially faster!

There are a number of ways you can improve it further: You can prune down the possible set of starting characters for a string in an imprecise way to reduce the number of branches you explore when doing Dijkstra. You can also precalculate when two characters will lead to the same state and always consider only the smallest. This is a worthwhile set of optimisations to the algorithm, but doesn’t affect it in any really fundamental way.

Can we do better yet?

The problem is that although it’s easy to calculate, our new automaton potentially has a lot more states than we want. In general there will be fewer, but if there are \(M\) states in \(x\) and \(N\) in \(y\) then there might be as many as \(MN\) in the automaton for the symmetric difference regular expression. We likely won’t have to explore all of them, but it’s unclear how many we might have to.

But you can do equivalence testing in nearly O(M + N) using an algorithm due to Hopcroft and Karp based on union-find (note: the algorithm called the Hopcroft–Karp algorithm is, confusingly, something completely different). The idea is to assume that the two automata are equivalent and merge states until we find a contradiction; by merging states aggressively we end up considering far fewer of them:

Given two automata \(S\) and \(T\) with initial states \(s\) and \(t\), we test equivalence as follows:

def equivalent(S, T):
    s = S.start
    t = T.start
    merges = UnionFind()
    merges.union(s, t)
 
    stack = [(s, t)]
    while stack:
        p, q = stack.pop()
        for a in ALPHABET:
            pa = merges.find(S.transition(p, a))
            qa = merges.find(T.transition(q, a))
            if pa.accepting != qa.accepting:
                return False
            if pa != qa:
                merges.union(pa, qa)
                stack.append((pa, qa))
    return True

The idea is that if two states are equivalent then the states they transition to under the same character must also be equivalent, and if we’ve “proved” that an accepting and a non-accepting state are equivalent then we’ve reached a contradiction, so our initial assumption must have been false. The algorithm merges two states if and only if there is some string that leads to one in one automaton and to the other in the other, so if there is some string that one matches but not the other, it will eventually merge those two states (if it doesn’t stop earlier).

Because of the way we aggressively merge states, this runs in time O(k(M + N)), where k is the size of the alphabet and M and N are the number of states in the respective automata (there’s actually an inverse Ackermann factor in there too, but the inverse Ackermann function is so close to constant it might as well be ignored).

The problem with this algorithm is that it doesn’t produce a refutation, only a yes or no. We can trivially modify it by pushing paths along with the states, but the paths generated aren’t necessarily useful; in particular, they don’t necessarily lead to a state that is accepting in one automaton but not in the other. We’ve replaced paths to states we believed to be equivalent, but because the core assumption is wrong this doesn’t necessarily work: there’s no guarantee that either of the states we use to refute the result is accepting, only that we falsely believed them to be equivalent to two states of which one is accepting and the other is not.

We can still make use of this as a fast initial check: run Hopcroft and Karp’s algorithm first, and if it returns false, construct a refutation separately once we know one must exist. This approach also works well with the lazy construction.

But it would be nice to be able to produce a witness as an integrated part of running this algorithm.

And it turns out we can: The idea is to hybridise the two algorithms – use Dijkstra’s algorithm to build the witness, but instead of operating on a graph corresponding to a single automaton, we build a graph on pairs of states, one from each automaton – an edge then involves following the corresponding edge in each automaton. Rather than searching for an accepting node, we are now searching for a node where one state is accepting and the other is not.

The key thing we then borrow from the algorithm of Hopcroft and Karp is that we can skip states where we have merged the pairs, because there is nothing interesting down that route even if we haven’t seen it.

Here’s some code for the resulting algorithm:

from collections import deque

def witness_difference(left, right):
    if left is right:
        return None
    if left.matches_empty != right.matches_empty:
        return b''
    merges = UnionFind()
    merges.union(left, right)
 
    queue = deque()
    queue.append((left, right, ()))
    while len(queue) > 0:
        p, q, s = queue.popleft()
        for a in ALPHABET:
            pa = derivative(p, a)
            qa = derivative(q, a)
            if pa is qa:
                continue
            t = (a, s)
            if merges.find(pa) != merges.find(qa):
                if pa.matches_empty != qa.matches_empty:
                    result = bytearray()
                    while t:
                        a, t = t
                        result.append(a)
                    result.reverse()
                    return bytes(result)
                merges.union(pa, qa)
                queue.append((pa, qa, t))
    return None

One important difference here is that we always put the actual pairs of states reached onto the queue rather than their collapsed representatives. This ensures that the paths we take are real paths to real states in the corresponding automata. Also, because we stop as soon as we would ever merge an accepting and a non-accepting state, the problem we worried about above can’t happen.

Also it’s worth noting that I’m not sure this algorithm produces a minimal refutation. It should produce a pretty small one, but there could be a smaller one reachable through two states which the algorithm has merged but that are not actually equivalent.

The requirement for witnesses is a slightly unusual one – most algorithms just focus on testing for equivalence – so it would be reasonable to ask why we care.

The reason I care is that I’m interested in the following more general question: I want to maintain a set of N regular languages as witnessed by regular expressions, and I want to be able to test for membership of this set and add new elements to it.

I’m still not sure what the best way to do this is, but all of the best ideas I have require refutations to prevent you from having to do O(N) equivalence tests. For example, the following works: Maintain a binary tree, where each branch node has a string and each child node is a regular expression. When testing for membership, walk the tree as follows: At a branch, if the language contains the string then go to the left child else go to the right child. At a leaf, perform an equivalence test. If it returns true then the language is a member.

Insertion is then nearly the same: We walk until we get a leaf. At the leaf we perform an equivalence test with a refutation. If the two languages are equivalent then there’s no insertion to perform. If they’re not equivalent, then we create a split node with the refutation string as the key and put whichever one contains it as a leaf for the left child and the other one as a right.

This isn’t a great algorithm, in that it doesn’t have any easy way to balance the tree, but it’s always better than doing N equivalence checks: Matching a short string is very cheap, and we always perform at most N – 1 matches (potentially many fewer) and exactly one equivalence check.

In fact, it is not in general possible to balance the tree. Consider N languages, where language k matches only the string consisting of a sequence of k 0s. Then each test string is matched by exactly one language, so every split must have a leaf as one of its children and the lookup degenerates to testing the languages one at a time.
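The tree scheme can be sketched with a toy stand-in for regular languages. Purely for illustration, a “language” here is a frozenset of the strings it matches, so equivalence is set equality and a refutation is any element of the symmetric difference; the class and helper names are mine, not from the post’s code:

```python
class Node:
    """A leaf holds a language; a branch holds a refutation string."""
    def __init__(self, language=None, key=None, left=None, right=None):
        self.language = language  # set at leaves, None at branches
        self.key = key            # refutation string at branches
        self.left = left          # child whose languages contain key
        self.right = right        # child whose languages do not

def refutation(x, y):
    """Some string in exactly one of x and y, or None if equivalent."""
    diff = x ^ y
    return min(diff, key=lambda s: (len(s), s)) if diff else None

def contains(root, language):
    node = root
    while node.key is not None:
        node = node.left if node.key in language else node.right
    return refutation(node.language, language) is None

def insert(root, language):
    node = root
    while node.key is not None:
        node = node.left if node.key in language else node.right
    r = refutation(node.language, language)
    if r is None:
        return  # equivalent language already present
    old = node.language
    # Turn the leaf into a branch keyed by the refutation string.
    node.language = None
    node.key = r
    if r in old:
        node.left, node.right = Node(language=old), Node(language=language)
    else:
        node.left, node.right = Node(language=language), Node(language=old)
```

Each lookup does one cheap string-membership test per branch and exactly one (expensive) equivalence test at the leaf, which is the whole point of keeping the refutations around.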

Another algorithm is to maintain a trie of all the refutations found so far. You can then simultaneously walk the regular expression’s deterministic finite automaton and the trie to produce a bit vector where bit i is set if the language contains the i’th string. This bitvector is then the same as that of at most one existing language, which can be looked up using e.g. a hash table. So to test membership you produce the bitvector and check for presence in the hash table. If it’s not present then the language isn’t present. If it is present, then check equivalence with the unique existing language with that bitvector. Insertion is similar: If the bitvector isn’t present, add the language to the hash table and leave the set of refutations alone. Else, check for equivalence with a refutation to the unique existing language with the same bitvector. If they’re equivalent, do nothing. If they aren’t, add the refutation to the trie and update the bitvector for all existing languages (you can actually do this last step incrementally as you go, but it may not be worth it).

The second algorithm is probably strictly worse than the first because it involves O(N) computation on a lot of inserts, as well as making membership testing more expensive. It really depends on how much cheaper equivalence is than matching for your language workload, and how likely existing refutations are to witness future ones.
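The fingerprinting scheme can be sketched the same way. As in the previous sketch, frozensets of strings stand in for regular languages (an illustrative assumption, not the post’s code); a plain list of test strings stands in for the trie, and a tuple of booleans plays the role of the bit vector:

```python
def signature(language, tests):
    """Bit vector (as a tuple) of which test strings the language matches."""
    return tuple(s in language for s in tests)

class LanguageSet:
    def __init__(self):
        self.tests = []   # refutation strings discovered so far
        self.by_sig = {}  # signature -> the unique language with it

    def add(self, language):
        """Add a language; returns False if an equivalent one is present."""
        sig = signature(language, self.tests)
        other = self.by_sig.get(sig)
        if other is None:
            self.by_sig[sig] = language
            return True
        diff = language ^ other
        if not diff:
            return False  # equivalent to an existing member
        # New refutation found: extend every signature with the new test.
        self.tests.append(min(diff, key=lambda s: (len(s), s)))
        self.by_sig = {signature(l, self.tests): l
                       for l in self.by_sig.values()}
        self.by_sig[signature(language, self.tests)] = language
        return True

    def __contains__(self, language):
        other = self.by_sig.get(signature(language, self.tests))
        return other is not None and not (language ^ other)
```

The rebuild of by_sig on a new refutation is the O(N) cost mentioned above; with a real trie it could be done incrementally.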

One reason why being able to do the above is interesting is that it allows us to extend the lazy construction of a deterministic finite automaton to a lazy construction of a minimal deterministic finite automaton: If we can maintain such a set relatively efficiently, then instead of merely conflating similar expressions into the same states we can label states with regular languages, giving us the proper derivative automaton rather than just our approximation to it.

So that’s where my ability to answer this leaves off a bit. There may be a good data structure for this, but if there is I haven’t found the right corner of the literature to tell me what it is.

If you want to have a play with some of this, here’s some (mostly untested, so probably badly wrong in places) Python code implementing a lot of these ideas:

If you’d like to learn more about this, most of the information in this post either came from or was based on material from Regular-expression derivatives reexamined, Testing the Equivalence of Regular Languages, and Algorithms for testing equivalence of finite automata, with a grading tool for JFLAP (which are in turn based on a lot of other papers you can get to from their citations).

(And, as always, if you’d like to see more posts like this and feel able to spare the price of less than a coffee per month, I encourage you to donate to my Patreon for supporting my writing.)

This entry was posted in Automata Theory, programming.