Suppose we are interested in the following problem: We have two regular expressions. Do these match exactly the same strings, and if not what is the smallest (shortlex minimal) string that one matches but not the other?

i.e. we either want to prove they are equivalent or provide the best possible refutation that they are not.

How do we do this?

More specifically, let’s consider this problem for *extended* regular expressions. That is, we have the following types of expressions:

- \(\epsilon\), which matches only the empty string
- \(c\), for \(c\) in our alphabet, which matches only the character \(c\)
- \(\emptyset\), which matches no strings
- \(\omega\) which matches every string
- \(x^*\) which matches \(a_1\ldots a_n\) for any strings \(a_i\) matched by \(x\).
- \(xy\) where \(x\) and \(y\) are extended regular expressions, which matches any strings \(ab\) where \(x\) matches \(a\) and \(y\) matches \(b\)
- \(x \vee y\) which matches any string matched by either or both of \(x\) or \(y\)
- \(x \wedge y\) which matches any string matched by
*both*\(x\) and \(y\) - \(\neg x\) which matches any string not matched by \(x\).

That is, we extend the normal regular expressions with intersection and negation. This doesn’t actually increase the set of languages we can match, but it can lead to significantly more concise representations (I think, but am not sure, there are extended regular expressions which require exponentially larger regular expressions to represent).

This reduces to the problem of finding the lexmin smallest member of the regular language for an expression or showing that it’s empty: Given two extended regular expressions \(x, y\), we can define \(x \oplus y = (x \vee y) \wedge (\neg (x \wedge y))\) – the symmetric difference. This matches any string which is matched by one but not both of \(x\) and \(y\), so a string is a member of this regular language precisely if it is a refutation of the claim that \(x\) and \(y\) are equivalent.

If we can compile extended regular expressions into a deterministic finite automaton (which we can), then the problem is easy: To find the lexmin smallest element of a regular language or show that it’s empty given a deterministic finite automaton for it, you just use Dijkstra’s Algorithm to search the graph for an accepting node. You stop when you either find one (in which case you’ve constructed a lexmin path to it along the way) or you’ve explored all the states (in which case there is no path to an accepting node and the language is empty). Job’s done.

That was easy. Are we done?

Well, no. There are a lot of evils hidden in that sentence “just construct a deterministic finite automaton”. The corresponding automaton to a regular language may be exponentially larger than the language, and even when it’s not it’s still an expensive operation. Can we avoid that?

First, let’s take a detour: There are a number of obvious rewrite rules we can use to simplify our lives, by replacing regular expressions with straightforwardly equivalent versions of them. We use \(x \approx y\) to indicate that we should consider \(x\) and \(y\) the same under these rewrite rules:

- \(x \wedge x \approx x\)
- \(x \wedge y \approx y \wedge x\)
- \(x \wedge (y \wedge z) \approx (x \wedge y) \wedge z\)
- \(x \wedge \emptyset \approx \emptyset\)
- \(x \wedge \omega \approx x\)
- \(x \vee x \approx x\)
- \(x \vee y \approx y \vee x\)
- \(x \vee (y \vee z) \approx (x \vee y) \vee z\)
- \(x \vee \emptyset \approx x \)
- \(x \vee \omega \approx \omega \)
- \(x(yz) \approx (xy)z\)
- \(x\emptyset \approx \emptyset\)
- \(\emptyset x \approx \emptyset\)
- \(x \epsilon \approx x\)
- \(\epsilon x \approx x\)
- \((x^*)^* \approx x^*\)
- \(\epsilon^* \approx \epsilon\)
- \(\emptyset^* \approx \epsilon\)
- \(\omega^* \approx \omega\)
- \(\neg \emptyset \approx \omega\)
- \(\neg \omega \approx \emptyset\)
- \(\neg(\neg x) \approx x\)

If \(x \approx y\) we say that \(x\) and \(y\) are similar.

All of these correspond to rules that apply for the resulting languages, so if \(x \approx y\) then \(x\) and \(y\) match the same language. In general though \(x\) and \(y\) might match the same languages without being similar.

Why do we care about this? Well the first reason is that it lets us have a fast test for when two regular expressions definitely *are* equivalent: If they are similar then they are equivalent. If they’re not we have to do more work to test this.

In particular, we can use smart constructors for our regular expression data type to apply all of these rewrite rules at construction time, which then makes similarity simply a matter of testing for structural equality. If we use hash consing (essentially: memoizing all constructors so that structurally equal values are represented by the exact same object) we can even make testing for similarity O(1).

Which is good, because we’re going to need it a lot: The real reason to care about similarity of regular expressions is that we can construct the a deterministic finite automaton using the Brzozowski derivative of regular expressions, with each of our states labelled by a regular expression – if we have some character in our alphabet \(c\) and a regular language \(L\) then the Brzozowski derivative is \(d(L, c) = \{v: cv \in L\}\) – the set of suffixes of strings starting with \(c\) in \(L\). We can extend this to regular *expressions* with the following set of rules:

- \(d(\epsilon, c) = \emptyset\)
- \(d(b, c) = \emptyset\) if \(b \neq c\), else \(d(c, c) = \epsilon\)
- \(d(\emptyset, c) = \emptyset\)
- \(d(\omega, c) = \omega\)
- \(d(x^*, c) = d(x, c) x^*\)
- \(d(xy, c) = d(x, c)y \vee \nu(x) d(y, c)\)
- \(d(x \vee y, c) = d(x, c) \vee d(y, c)\)
- \(d(x \wedge y, c) = d(x, c) \wedge d(y, c)\)
- \(d(\neg x, c) = \neg d(x, c)\)

Where \(\nu(x)\) is \(\emptyset\) if the corresponding regular language doesn’t match the empty string, or \(\epsilon\) otherwise (this can be calculated directly from the expressions by similar rules).

The idea is that we can construct an automaton for a regular language by labelling states by expressions. States are accepting if the corresponding expression matches the empty string, and the transition from state \(x\) given character \(c\) goes to \(d(x, c)\).

If our similarity rule was enough to guarantee equivalence then this would be a minimal deterministic finite automaton for the language. However, even without that there are only finitely many states reachable from a given regular expression (this is where similarity becomes important! Without it this is not true. e.g. if you consider iterating the derivatives of \(c^*\) under \(c\) you’ll get infinitely deep nested chains of similar but unequal expressions).

So far it isn’t obvious why this is interesting, but it has one very important property: You can walk this automaton without having to compute any of the states in advance. This means that you can build it as you explore, so we can repeat the above approach of just using Dijkstra’s algorithm to find a refutation but we don’t have to build the automaton in advance.

It’s particularly easy to calculate the derivatives of the disjoint sum language too:

\(d(x \oplus y, c) = d(x \vee y, c) \wedge d(\neg (x \wedge y), c) = (d(x, c) \vee (y, c)) \wedge (\neg (d(x, c) \wedge d(y, c))) = d(x, c) \oplus d(y, c)\).

So if we’ve got the derivatives of our original regular expressions cached in some way it’s very cheap to calculate derivatives of our compound regular expression.

So that’s a major improvement on our original algorithm – in a number of cases it will be exponentially faster!

There are a number of ways you can improve it further: You can prune down the possible set of starting characters for a string in an imprecise way to reduce the number of branches you explore when doing Dijkstra. You can also precalculate when two characters will lead to the same state and always consider only the smallest. This is a worthwhile set of optimisations to the algorithm, but doesn’t affect it in any really fundamental way.

Can we do better yet?

The problem is that although it’s easy to calculate, our new automaton potentially has a lot more states than we want. In general it will be fewer, but if there are \(M\) states in \(x\) and \(N\) in \(y\) then there might be as many as \(MN\) in the automaton for the symmetric difference regular expression. We’ll likely not have to explore all of them, but it’s unclear how many you might have to explore.

But you can do equivalence testing in nearly O(M + N) using an algorithm due to Hopcroft and Karp based on union-find (Note: the algorithm called the Hopcroft-Karp algorithm is confusingly something completely different). The idea is that assume that they are equivalent and we merge states until we find a contradiction, and that by aggressively merging states we can consider a lot fewer of them:

Given two automata \(S\) and \(T\) with initial states \(s\) and \(t\), we test equivalence as follows:

def equivalent(S, T): s = S.start t = T.start merges = UnionFind() merges.union(s, t) stack = [(s, t)] while stack: p, q = stack.pop() for a in ALPHABET: pa = merges.find(S.transition(p, a)) qa = merges.find(S.transition(q, a)) if pa.accepting != qa.accepting: return False if pa != qa: merges.union(pa, qa) stack.append((pa, qa)) return True |

The idea is that if two states are equivalent then the states they transition to under the same character must be equivalent, and if we’ve proved that an accepting and non-accepting state are equivalent then we’ve reached a contradiction and our initial assumption must have been false. This merges two states if and only if there is some string that leads to one in one automaton and the other in the other, so if there is some string that one matches but not the other it will eventually merge those two states (if it doesn’t stop earlier).

Because of the way we aggressively merge states, this runs in time O(k(M + N)) where k is the size of the alphabet and M and N are the size of the respective states (there’s actually an inverse Ackermann factor in there too, but the inverse Ackermann function is so close to constant it might as well be ignored).

The problem with this algorithm is that it doesn’t produce a refutation, only a yes or no. We can trivially modify it by pushing paths along with the states, but the paths generated aren’t necessarily useful and in particular don’t necessarily go to a state that is accepting in one and not in the other – we’ve replaced paths to states we believed to be equivalent, but because the core assumption is wrong this didn’t necessarily work. In particular there’s no guarantee that either of the states that we use to refute the result are accepting, only that we falsely believe them to be equivalent to two states where one is accepting and the other is not.

We can still make use of this by using this as a fast initial test; Use Hopcroft and Karp’s algorithm as a fast initial test for equivalence and if it returns false *then* construct a refutation once we know one must exist. This algorithm also still works well with the lazy construction.

But it would be nice to be able to produce a witness as an integrated part of running this algorithm.

And it turns out we can: The idea is to hybridise the two algorithms – use Dijkstra’s algorithm to build the witness, but instead of operating on a graph corresponding to a single automaton, we build a graph on pairs of states, one from each automaton – an edge then involves following the corresponding edge in each automaton. Rather than searching for an accepting node, we are now searching for a node where one state is accepting and the other is not.

The key thing we then borrow from the algorithm of Hopcroft and Karp is that we can skip states where we have merged the pairs, because there is nothing interesting down that route even if we haven’t seen it.

Here’s some code for the resulting algorithm:

def witness_difference(left, right): if left is right: return None if left.matches_empty != right.matches_empty: return b'' merges = UnionFind() merges.union(left, right) queue = deque() queue.append((left, right, ())) while len(queue) > 0: p, q, s = queue.popleft() for a in ALPHABET: pa = derivative(p, a) qa = derivative(q, a) if pa is qa: continue t = (a, s) if merges.find(pa) != merges.find(qa): if pa.matches_empty != qa.matches_empty: result = bytearray() while t: a, t = t result.append(a) result.reverse() return bytes(result) merges.union(pa, qa) queue.append((pa, qa, t)) return None |

One important difference here is that we always put the actual reached pairs of states onto the queue rather than the collapsed pair. This ensures that the paths we take are real paths to real states in the corresponding automaton. Also, because we stop as soon as we ever would merge an accepting and no-accepting state, the problem we worried about above can’t happen.

Also it’s worth noting that I’m not sure this algorithm produces a *minimal* refutation. It should produce a pretty small one, but there could be a smaller one reachable through two states which the algorithm has merged but that are not actually equivalent.

The requirement for witnesses is a slightly unusual one – most algorithms just focus on testing for equivalence – so it would be reasonable to ask why we care.

The reason *I *care is that I’m interested in the following more general question: I want to maintain a set of N regular languages as witnessed by regular expressions, and I want to be able to test for membership of this set and add new elements to it.

I’m still not sure what the best way to do this is, but all of the best ideas I have require refutations to prevent you from having to do O(N) equivalence tests. For example, the following works: Maintain a binary tree, where each branch node has a string and each child node is a regular expression. When testing for membership, walk the tree as follows: At a branch, if the language contains the string then go to the left child else go to the right child. At a leaf, perform an equivalence test. If it returns true then the language is a member.

Insertion is then nearly the same: We walk until we get a leaf. At the leaf we perform an equivalence test with a refutation. If the two languages are equivalent then there’s no insertion to perform. If they’re not equivalent, then we create a split node with the refutation string as the key and put whichever one contains it as a leaf for the left child and the other one as a right.

This isn’t a *great* algorithm, in that it doesn’t have any easy way to balance the tree, but it’s always better than doing N equivalence checks: Matching a short string is very cheap, and we always perform at most N – 1 matches (potentially many fewer) and exactly one equivalence check.

In fact, it is not in general possible to balance the tree; Consider N strings where string k matches only the string consisting of a sequence of k 0s. Then each test string has exactly one language which matches it, so every split must have one of its children be a leaf and the look up problem degenerates to always testing N strings.

Another algorithm is to maintain a trie of all the refutations found so far. You can then simultaneously walk the regular expression’s deterministic finite automaton and the trie to produce a bit vector where bit i is set if the language contains the i’th string. This bitvector is then the same as that of at most one existing language, which can be looked up using e.g. a hash table. So to test membership you produce the bitvector and check for presence in the hash table. If it’s not present then the language isn’t present. If it is present, then check equivalence with the unique existing language with that bitvector. Insertion is similar: If the bitvector isn’t present, add the language to the hash table and leave the set of refutations alone. Else, check for equivalence with a refutation to the unique existing language with the same bitvector. If they’re equivalent, do nothing. If they aren’t, add the refutation to the trie and update the bitvector for all existing languages (you can actually do this last step incrementally as you go, but it may not be worth it).

The second algorithm is probably strictly worse than the first because it involves an O(N) computations on a lot of inserts, as well as testing for membership being more expensive. It really depends on how much cheaper equivalence is than matching for your language workload, and how likely existing refutations are to witness future ones.

One reason why being able to do the above is interesting is that it allows us to extend the lazy construction of a deterministic finite automaton to a lazy construction of a *minimal* deterministic finite automaton: If we can maintain such a set relatively efficiently, then instead of merely conflating similar expressions into the same states we can label states with regular languages, giving us the proper derivative automaton rather than just our approximation to it.

So that’s where my ability to answer this leaves off a bit. There may be a good data structure for this, but if there is I haven’t found the right corner of the literature to tell me what it is.

if you want to have a play with some of this, here’s some (mostly untested so probably badly wrong in places) Python code implementing a lot of these ideas:

If you’d like to learn more about this, most of the information in this post either came from or was based on material from Regular-expression derivatives reexamined, Testing the Equivalence of Regular Languages and Algorithms for testing equivalence of finite automata, with a grading tool for JFLAP (which are in turn based on a lot of other papers you can get to from their citations).

(And, as always, if you’d like to see more posts like this and feel able to spare the price of less than a coffee per month, I encourage you to donate to my Patreon for supporting my writing.)

nick jhttps://mobile.twitter.com/mattmight has done some work with derivatives if you haven’t come across his work.

davidPost authorYeah, I’m familiar with his work and it’s what got me interested in derivative approaches to languages in general. It’s got a bit of a different focus though – in particular essentially nothing I’ve written about in this post generalises to his work, because he’s using it for context free languages and testing the equivalence of context free languages is undecidable!