David R. MacIver's Blog: Ordering and mapping regular languages

Ordering and mapping regular languages

22 April 2017

I’ve had the following question for a while: How do I create a mapping of keys to values where the keys are regular expressions, and two regular expressions are considered equivalent if they correspond to the same language?

One example of why you might want to do this: when constructing a minimal deterministic finite automaton for a regular language, you end up labelling states by regular expressions that represent the language matched when starting from that state. In order for the automaton to be minimal you need to have any two equivalent regular expressions correspond to the same state, so you need a way of taking a new regular expression and finding out which state it should go to.

It’s easy (if potentially expensive) to test regular expression equivalence once you know how, so the naive implementation of this is just to do a linear scan of all the regular expressions you’ve found so far. It’s O(n) lookup but it’s at least an existence proof.

In the past I’ve ended up implementing a crude hash function for regular languages and then just used a hash table. It works, but collisions are common unless you’re quite clever with your hashing, so it doesn’t perform well.

But it turns out that there is a better way! Rather than using hashed data structures you can use ordered ones, because there is a natural and easy-to-compute (or at least not substantially harder than testing equivalence) total ordering over the set of regular languages.

That way is this: If you have two regular languages \(L\) and \(M\) that are not equivalent, there is some \(x \in L \triangle M\), the symmetric difference. That is, we can find an \(x\) which is in one but not the other. Let \(x\) be the shortlex minimal such word (i.e. the lexicographically first word amongst those of minimal length). Then \(L < M\) if \(x \in L\), else \(M < L\).
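To make the definition concrete, here is a minimal sketch of the comparison. It is not the algorithm from the previous post: it assumes both languages are already given as complete DFAs over a shared alphabet (the `DFA` tuple, `shortlex_witness` and `compare` are names made up for this illustration) and does a breadth-first search over pairs of states, trying characters in sorted order, so the first pair that disagrees on acceptance is reached by the shortlex minimal word in the symmetric difference.

```python
from collections import deque, namedtuple

# Hypothetical representation (an assumption of this sketch): a complete DFA
# whose transition table maps (state, character) -> state.
DFA = namedtuple("DFA", ["start", "accepting", "delta"])


def shortlex_witness(a, b, alphabet):
    """Return the shortlex minimal word accepted by exactly one of a and b,
    or None if the two automata accept the same language."""
    letters = sorted(alphabet)
    start = (a.start, b.start)
    seen = {start}
    # The queue holds (state pair, shortlex minimal word reaching that pair).
    # FIFO order plus sorted letters means words come out in shortlex order.
    queue = deque([(start, "")])
    while queue:
        (p, q), word = queue.popleft()
        if (p in a.accepting) != (q in b.accepting):
            return word
        for c in letters:
            nxt = (a.delta[p, c], b.delta[q, c])
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + c))
    return None


def compare(a, b, alphabet):
    """Return -1 if L(a) < L(b), 1 if L(a) > L(b), and 0 if they are equal."""
    word = shortlex_witness(a, b, alphabet)
    if word is None:
        return 0
    # Run the deciding word through a: the language containing it is smaller.
    state = a.start
    for c in word:
        state = a.delta[state, c]
    return -1 if state in a.accepting else 1


# Three small example automata over the alphabet {a, b}.
EVEN_A = DFA(0, {0}, {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1})
ENDS_B = DFA(0, {1}, {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1})
CONTAINS_B = DFA(0, {1}, {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 1})
```

For example, `shortlex_witness(ENDS_B, CONTAINS_B, "ab")` is `"ba"`, which contains a `b` but does not end in one, so the "contains a b" language sorts before the "ends in b" language and `compare(ENDS_B, CONTAINS_B, "ab")` is `1`.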

The work in the previous post on regular language equivalence is thus enough to calculate the shortlex minimal element of the symmetric difference of an inequivalent pair of languages, so we can readily compute this ordering between two regular expressions. (I still don’t know if the faster of the two algorithms gives the minimal one, but you can use the fast algorithm for equivalence checking and then the slightly slower algorithm to get a minimal refutation.) This, combined with any sort of ordered collection type (e.g. a balanced binary tree of some sort), gives us our desired mapping.
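Here is a rough sketch of what such a mapping could look like. Everything in it is an assumption made for illustration: `OrderedMap` is a made-up class that keeps its keys in a sorted list (standing in for a balanced tree) and only ever looks at them through a three-way comparison function, and the toy `compare_finite` treats a key as a list of words describing a finite language, so two lists with the same words in a different order are "equivalent expressions".

```python
class OrderedMap:
    """A map whose keys are only ever examined through compare(x, y),
    which returns a negative number, zero, or a positive number."""

    def __init__(self, compare):
        self._compare = compare
        self._keys = []
        self._values = []

    def _find(self, key):
        # Binary search: O(log n) calls to the (expensive) comparison.
        lo, hi = 0, len(self._keys)
        while lo < hi:
            mid = (lo + hi) // 2
            c = self._compare(self._keys[mid], key)
            if c == 0:
                return mid, True
            if c < 0:
                lo = mid + 1
            else:
                hi = mid
        return lo, False

    def __setitem__(self, key, value):
        i, found = self._find(key)
        if found:
            self._values[i] = value
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def __getitem__(self, key):
        i, found = self._find(key)
        if not found:
            raise KeyError(key)
        return self._values[i]

    def __len__(self):
        return len(self._keys)


def compare_finite(xs, ys):
    """Toy comparison: xs and ys are lists of words describing finite
    languages, ordered by the shortlex minimal word in exactly one of them."""
    left, right = set(xs), set(ys)
    diff = left ^ right
    if not diff:
        return 0
    deciding = min(diff, key=lambda w: (len(w), w))
    return -1 if deciding in left else 1


states = OrderedMap(compare_finite)
states[["a", "b"]] = "state 0"
states[["b"]] = "state 1"
# A different description of the same language as the first key.
same_state = states[["b", "a", "a"]]
```

The sorted list still pays O(n) to shift elements on insertion, but lookups and insertions each make only O(log n) comparisons, which is the part that is expensive for regular languages.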

But why does this definition provide a total order?

Well, consider the enumeration of all words in increasing shortlex order as \(w_0, \ldots, w_n, \ldots\). Let \(l_n = 1\) if \(w_n \in L\), else \(l_n = 0\). Define \(m_n\) similarly for \(M\).

Then the above definition is equivalent to the reverse of the lexicographical ordering between \(l\) and \(m\)! If \(w_k\) is the smallest word in the symmetric difference then \(k\) is the first index at which \(l\) and \(m\) differ. If \(w_k \in L\) then \(l_k = 1\) and \(m_k = 0\), so \(l > m\) lexicographically while \(L < M\), and vice versa. The lexicographical order is a total order, and the reverse of a total order is a total order, so the above definition is also a total order.
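The equivalence can be checked by brute force on small finite languages. The sketch below is only an illustration under that restriction: it truncates the enumeration \(w_0, w_1, \ldots\) at a maximum word length, builds the indicator sequences as tuples, and compares them with Python's built-in lexicographic tuple ordering. All of the function names are made up for the example.

```python
from itertools import product


def shortlex_words(alphabet, max_len):
    """Yield every word of length <= max_len in increasing shortlex order."""
    for n in range(max_len + 1):
        for letters in product(sorted(alphabet), repeat=n):
            yield "".join(letters)


def indicator(language, alphabet, max_len):
    """The sequence l_0, l_1, ... truncated at max_len: l_n = 1 iff w_n is in the language."""
    return tuple(int(w in language) for w in shortlex_words(alphabet, max_len))


def compare_by_witness(L, M):
    """The definition from the text: look at the shortlex minimal word of L symmetric-difference M."""
    diff = set(L) ^ set(M)
    if not diff:
        return 0
    x = min(diff, key=lambda w: (len(w), w))
    return -1 if x in L else 1


def compare_by_indicator(L, M, alphabet, max_len):
    """Reverse lexicographic order on indicator sequences: L < M exactly when l > m."""
    l = indicator(L, alphabet, max_len)
    m = indicator(M, alphabet, max_len)
    return (l < m) - (l > m)
```

For instance, over the alphabet `"ab"` with `max_len=2`, the language `{"a", "bb"}` has indicator `(0, 1, 0, 0, 0, 0, 1)`, and the two comparison functions agree on every pair of languages whose words have length at most 2.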

This definition has a number of nice properties:

Any language containing the empty word sorts before any language that doesn’t
The function \(L \mapsto \overline{L}\) is order reversing.
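\(L \cup M \leq L, M \leq L \cap M\) (the deciding word between a language and a superset of it always lies in the superset, so adding words can only move a language earlier in the order)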

I originally thought that union was coordinate-wise monotonic, but it’s not. Suppose we have four words \(a < b < c < d\), and consider the languages \(L = \{a, d\}, M = \{b, c\}\). Then \(L < M\) because the deciding value is \(a\). But now consider \(P = \{a, b\}\). Then \(L \cup P > M \cup P\) because the deciding element now ends up being \(c\), which is in \(M\).
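The counterexample is small enough to check directly. This is just a sanity check with the four words taken to be the single letters `a` to `d`; `compare` here is a throwaway helper for finite languages given as sets, not a general implementation.

```python
def compare(L, M):
    """Return -1 if L < M, 1 if L > M, 0 if equal, for finite sets of words."""
    diff = L ^ M
    if not diff:
        return 0
    # The deciding word is the shortlex minimal element of the symmetric difference.
    deciding = min(diff, key=lambda w: (len(w), w))
    return -1 if deciding in L else 1


L = {"a", "d"}
M = {"b", "c"}
P = {"a", "b"}

before = compare(L, M)          # deciding word is "a", which is in L
after = compare(L | P, M | P)   # deciding word is "c", which is in M | P
```

Here `before` is `-1` and `after` is `1`, so taking the union with \(P\) on both sides flips the order.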

I’ve yet to try this in practice, so it might turn out that there are some interestingly pathological failure modes for this comparison function, but I don’t think there are likely to be any that aren’t also present in testing regular expression equivalence itself.

Another open question here is which sorted map data structure to use? The comparison is relatively expensive, so it might be worth putting a bit of extra effort in to balance it. As such an AVL tree might be a fairly reasonable choice. I’m not sure.
