# DFA Minimization as a Fixed Point

Attention conservation notice: Unless you fit into a very similar academic niche to mine, this post likely uses techniques you don’t know to solve problems you don’t care about. Even if you are in that niche this is more of a theoretical curiosity than a useful technique at present, but it was useful for me to think through this to see how some stuff connects up.

I’ve been reading a bit about probabilistic bisimilarity distances recently, and I wanted to explore how it relates to things I’ve previously studied about minimizing deterministic finite automata.

The basic trick they use is to take the object we want (an equivalence relationship or a pseudometric in this case) and consider it as an element of some complete lattice, define a monotonic function on that lattice such that a fixed point of it will have the desired properties, and then use the Knaster-Tarski fixed point theorem to show that the set of such fixed points is non-empty. The desired object is then either the smallest or largest fixed point (both of which the theorem guarantees to exist, and indeed shows us how to construct).

This turns out to work with deterministic finite automata too!

Suppose we have some deterministic finite automaton $$M$$ with states $$S = \{s_1, \ldots, s_n\}$$, alphabet $$A$$, transition function $$\tau : A \times S \to S$$ and acceptance function $$\alpha: S \to \{0, 1\}$$. We want to construct the minimal deterministic finite automaton which matches the same language by merging equivalent states. This requires us to define an equivalence relationship $$\sim$$ on $$S$$ such that $$s_i \sim s_j$$ if and only if the languages starting from $$s_i$$ and $$s_j$$ are the same. This is called the Myhill-Nerode relationship.

We can realise this equivalence relationship as follows:

We use the observation that the set of equivalence relationships over any set forms a complete lattice (to get the least upper bound of a set of equivalence relationships, just take their union as sets of pairs and then take the transitive closure to get back to an equivalence relationship).

What we are looking for is an equivalence relationship with the following two properties:

1. If $$s_i \sim s_j$$ then $$\alpha(s_i) = \alpha(s_j)$$.
2. If $$s_i \sim s_j$$ and $$a \in A$$ then $$\tau(a, s_i) \sim \tau(a, s_j)$$.

i.e. no accepting state is equivalent to a non-accepting state, and the equivalence relationship works with the transition function so that if two states are equivalent they transition to equivalent states given the same character.

Any such equivalence relationship is called a bisimilarity.

We can find such an equivalence relationship as follows:

Let $$E$$ be the set of equivalence relations over $$S$$ and define $$F: E \to E$$ as $$F(R) = \{(s, t) \in S^2: \alpha(s) = \alpha(t) \wedge \forall a \in A, (\tau(a, s), \tau(a, t)) \in R\}$$.

(The fact that if $$R$$ is an equivalence relationship then so is $$F(R)$$ is not immediately obvious, but follows from some fairly straightforward mechanical checking that I will leave as an exercise for the interested reader)

This definition may seem somewhat non-obvious, so let’s unpack it:

1. We require that the resulting equivalence relationship always respects $$\alpha$$. i.e. we throw away anything we would otherwise get where it disagrees on whether the state should be accepting.
2. We take the new relationship to be the set of pairs whose transitions the old relationship cannot distinguish. So for example if we had states $$s_i, s_j$$ such that every transition from them went to an $$R$$-equivalent state, $$F(R)$$ would treat them as equivalent. Equally, if we had two $$R$$-equivalent states with $$(\tau(a, s_i), \tau(a, s_j)) \not\in R$$ for some $$a$$ (i.e. $$R$$ distinguishes some transition from them), $$F(R)$$ would not treat them as equivalent.
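As a concrete rendering of the definition, here is $$F$$ in Python, acting on relations represented as sets of state pairs (the encoding of the automaton as a `tau` function and an `alpha` predicate is mine, for illustration):

```python
def F(R, states, alphabet, tau, alpha):
    """One refinement step: keep the pairs that agree on acceptance and
    whose transitions land in R-related states for every character."""
    return {(s, t)
            for s in states for t in states
            if alpha(s) == alpha(t)
            and all((tau(a, s), tau(a, t)) in R for a in alphabet)}

# A tiny hypothetical DFA: states 1 and 2 are equivalent accepting sinks.
states, alphabet = [0, 1, 2], "ab"
delta = {("a", 0): 1, ("b", 0): 2,
         ("a", 1): 1, ("b", 1): 1,
         ("a", 2): 2, ("b", 2): 2}
tau = lambda a, s: delta[(a, s)]
alpha = lambda s: s in (1, 2)

# One application of F to the largest equivalence relationship already
# separates state 0 from the accepting states.
R = F({(s, t) for s in states for t in states}, states, alphabet, tau, alpha)
```

In this example a second application of $$F$$ changes nothing, so we have already hit a fixed point.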

So you can think of $$F$$ as taking an equivalence relationship and returning something that is in some sense “a bit closer” to being a bisimilarity. If that intuition is correct, then it should be the case that if $$R$$ is a fixed point of $$F$$ then it is a bisimilarity.

Suppose $$R = F(R)$$. Then by construction of $$F$$ we have the first property satisfied (that if $$s \sim t$$ then $$\alpha(s) = \alpha(t)$$).

For the second, suppose that we have $$s, t, a$$ with $$(\tau(a, s), \tau(a, t)) \not\in R$$. Then by definition of $$F$$ we must have that $$(s, t) \not\in F(R) = R$$. i.e. if $$(s, t) \in R$$ then $$(\tau(a, s), \tau(a, t)) \in R$$.

So in order to find a bisimilarity we just need to find fixed points of $$F$$, but this is exactly what the Knaster-Tarski fixed point theorem lets us do: $$F$$ is monotonic, i.e. if $$R \subseteq R'$$ then $$F(R) \subseteq F(R')$$ (another “not totally obvious but a mechanical exercise for the interested reader to check” claim), and the set of equivalence relationships forms a complete lattice, so the set of fixed points of $$F$$ also forms a complete lattice.

In particular there is a greatest fixed point of $$F$$, called the bisimulation relationship. This is exactly the $$\sim$$ relationship we were looking for (any bisimilarity only ever relates states whose languages agree, and language equality is itself a bisimilarity, so it is necessarily the greatest one).

Additionally, the proof of the Knaster-Tarski fixed point theorem shows us how to construct this relationship: We just start with the largest equivalence relationship on $$S$$ and repeatedly apply $$F$$ to it until we get a fixed point (in general “repeatedly” might require transfinite induction, but because $$S$$ is finite so is the set of equivalence relationships over $$S$$, and as a result we only have to iterate finitely many times).
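This iteration is short enough to sketch directly. Here each equivalence relationship is represented as a map from states to class ids rather than a set of pairs (that representation, and the DFA encoding, are my own choices for illustration):

```python
def bisimulation_classes(states, alphabet, tau, alpha):
    """Iterate F starting from the largest equivalence relationship
    (everything in one class) until we reach a fixed point.
    Returns a map from each state to its equivalence class id."""
    cls = {s: 0 for s in states}
    while True:
        # Applying F: two states stay related iff they agree on alpha and
        # their transitions land in currently-related classes.
        sig = {s: (alpha(s), tuple(cls[tau(a, s)] for a in alphabet))
               for s in states}
        ids, new = {}, {}
        for s in states:
            new[s] = ids.setdefault(sig[s], len(ids))
        if new == cls:
            return cls
        cls = new

# A toy DFA in which states 1 and 2 are equivalent accepting sinks.
delta = {("a", 0): 1, ("b", 0): 2,
         ("a", 1): 1, ("b", 1): 1,
         ("a", 2): 2, ("b", 2): 2}
classes = bisimulation_classes([0, 1, 2], "ab",
                               lambda a, s: delta[(a, s)],
                               lambda s: s in (1, 2))
```

The resulting class map is exactly the Myhill-Nerode relationship for the toy automaton: states 1 and 2 merge, state 0 stays separate.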

But if you squint at it right, this is actually just Hopcroft’s algorithm for DFA minimization! We start with an equivalence relationship which conflates everything, and then we split states apart as we find that they have the wrong label or lead to states that are now known to be inequivalent, iterating this until we reach a point where there are no more states to split (i.e. a fixed point of $$F$$). The steps are slightly different in that we actually just split the states by labels once at the beginning and then don’t have to do so again, but because of the way $$F$$ behaves in these circumstances the difference is pretty inessential (in particular we always have $$F(R) \subseteq R$$ for $$R$$ obtained by iterating $$F$$ from the largest equivalence relationship, even though this does not hold for arbitrary $$R$$).

Is this observation useful?

Well, not really. It’s useful in the sense that it helped me understand why fixed points on lattices are such a useful technique, and it was nice to relate this back to some things I already understood. So it’s a useful exercise in pedagogy.

Are there any applications of this? Well, maybe. In the context of the probabilistic work the next step here is to look at other interesting lattices which can use similar structure, and that’s pretty interesting, but I’m not sure how useful it is in the language case. Also most of the ways in which it’s useful are likely better served by transforming it into the probabilistic case.

This entry was posted in Automata Theory.

# Refining L* search

Attention conservation notice: If you don’t care about fine grained algorithmic improvements to inference algorithms for regular languages, you don’t care about this post.

Also, apologies, I know MathJax isn’t currently working so you will see some raw LaTeX. If you turn on unsafe scripts (i.e. “not loaded over https”) it will work, but due to some technical hiccups this is currently difficult for me to fix. I’m on the case.

L* search (Angluin, D. Learning regular sets from queries and counterexamples. Inf. Comput. 75, 2 (1987), 87–106) is a really neat algorithm that lets you infer a deterministic finite automaton (DFA) for an unknown language.

I’ve long suspected it would be useful in Hypothesis. Unfortunately, to date reality has disagreed with me. The key problem is that L* search is really expensive and any attempt to use it would almost immediately blow Hypothesis’s normal budget for number of tests run.

Starting from a trivial automaton that accepts no strings, it proceeds in two stages:

1. It uses membership queries for the language to complete the DFA, finding transitions from states that provably do not go to any of the existing known states, and stopping only when it has a consistent transition table for every known state and every member of the alphabet.
2. It then asks for a counter-example to its conjectured DFA. If there is none, we have correctly inferred the language. If there is one, it is used to improve our ability to distinguish states, enlarging the number of states by at least one.

Because L* search is maintaining a minimal DFA, where the only states that are distinguished are ones that it can actually prove are inequivalent, this guarantees that the algorithm terminates with a correct DFA after at most $$n$$ counter-example steps, where $$n$$ is the size of the minimal DFA of the language.

It does this by maintaining a set of experiments. These are strings that can be used to distinguish states – if by following the experiment from two states you end up in an accepting state in one and a non-accepting state in another, the two states must be inequivalent.

The rest of this post assumes you understand L* search. If you don’t, I recommend reading my prior more detailed explanation of it.

I use an improvement proposed in “Rivest, R. L., and Schapire, R. E. Inference of finite automata using homing sequences. Inf. Comput. 103, 2 (1993), 299–347” that lets you take the cost of the counter-example step down to logarithmic in the length of the counter-example and ensures that each counter-example only adds one experiment to the list.

The problem is that the completion step is still extremely expensive: It requires evaluating $$|A||S||E|$$ membership queries, where $$A$$ is the alphabet, $$S$$ is the set of states we currently have, and $$E$$ is the set of experiments we currently have (you can skip some of these because some state transitions are provable from the representation of the states as strings – $$s$$ always transitions to $$sa$$ under $$a$$ if both strings are in the state set – but you can’t remove more than $$|S| - 1$$ evaluations this way).

One of the problems I have with using this in Hypothesis is that $$|A| = 256$$, which is both intrinsically fairly large and also actually larger than Hypothesis’s typical testing budget. Thus there’s no way to fit even one L* pass into a normal test run!

So, can we get that $$|A|$$ factor down?

It turns out, yes, at the cost of potentially needing more counter-examples (which is a cost I’m entirely willing to pay).

The observation follows one from “Owens, S., Reppy, J. H., and Turon, A. Regular-expression derivatives re-examined. J. Funct. Program. 19, 2 (2009), 173–190.”, which is that when you have a large alphabet you will typically find that at a given state in a DFA many characters are equivalent to each other (that is, they result in a transition to the same state). For example, when $$|A| \geq |S|$$, even though there are $$|A|$$ possible valid characters at a given state there are at most $$|S|$$ up to equivalence.

If we take advantage of this, we could reduce the number of queries substantially, and it turns out that it is easy to do.

The idea is simple: When we create a state, we initialise a partition refinement data structure alongside it. Initially this considers all members of the alphabet to be in the same equivalence class.

When expanding our table, we now assume that any characters in the same partition are equivalent, and we only actually perform membership queries with, say, the smallest element of each partition. Thus instead of $$|A||S|$$ queries we only perform $$\sum |P_i|$$, where $$P_i$$ is the partition associated with state $$i$$. In particular this is $$O(|S| \min(|A|, |S|))$$.
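A per-state partition structure along these lines is enough (a sketch; the names are mine, and a real partition refinement structure would be more efficient):

```python
class AlphabetPartition:
    """The equivalence classes of characters at a single state.  All
    characters start out in one class; membership queries are only made
    for each class's canonical (smallest) element."""

    def __init__(self, alphabet):
        self.classes = [sorted(alphabet)]

    def canonical(self, c):
        """The representative character actually used in queries."""
        for cls in self.classes:
            if c in cls:
                return cls[0]
        raise KeyError(c)

    def refine(self, predicate):
        """Split each class by the truth value of predicate(character)."""
        new = []
        for cls in self.classes:
            yes = [c for c in cls if predicate(c)]
            no = [c for c in cls if not predicate(c)]
            new.extend(part for part in (yes, no) if part)
        self.classes = new

p = AlphabetPartition("abcd")
p.refine(lambda c: c in "cd")  # e.g. split on membership of sde in L
```

After the refinement the canonical element of a character’s class changes, so subsequent queries automatically use the new representative.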

Then when we find an experiment $$e$$ and a transition $$s \to t$$ via $$c$$, instead of immediately adding $$e$$ to our experiment set we check whether we just have too coarse a partition. We do this by checking whether $$sce \in \mathcal{L} \iff sc'e \in \mathcal{L}$$, where $$c'$$ is the “canonical” element of the partition that we selected when determining that this was the correct transition (if $$c = c'$$ we can skip this bit). If this is not the case then we use $$e$$ to refine the partition, splitting it in two according to whether $$sde \in \mathcal{L}$$ for each element $$d$$ of the partition. We then add the experiment to the list and rebuild the DFA. If this does not result in expanding our list of states (which can only happen if we did end up refining the partition), we then remove the experiment from the list! This can happen because we can end up with partitions that our existing experiments were sufficient to distinguish but where we never checked, and by not adding redundant experiments in that case we bound the number of experiments by the number of states.

This is correct for the same reasons that L* is correct – at any given stage, either our partitions are correct, or we refine them. However it will take an additional $$\sum (|P_i| - 1)$$ counter-example steps to converge (where the $$P_i$$ here are the partitions from our final state, i.e. the true partitions).

Counter-examples are somewhat costly – regardless of how much it costs to find one (which may be “free” in some sense), a counter-example $$u$$ takes $$O(\log |u|)$$ membership queries to process. If we assume all counter-examples are minimized, so that $$|u| \leq n$$ where $$n$$ is the number of states in the true minimal DFA, then that’s $$O(\log n)$$.

Suppose the true minimal DFA has $$n$$ states and that $$m = \sum |P_i|$$ in it. Then at each stage we either add one or more states, or we refine some partition, so we perform at most $$n + \sum (|P_i| - 1) = n + m - n = m$$ counter-example queries.

Each completion step takes $$O(m|E|)$$ to run, and we only add experiments when we increase the number of states, so each step runs in $$O(mn)$$, and the entire thing runs in at most $$O((mn + \log(n))m)$$. The $$mn$$ term dominates $$\log(n)$$, so we can discard the latter, and the whole thing runs in $$O(m^2 n)$$.

In contrast, in classic L* each step takes $$O(|A| n^2)$$ and we run at most $$n$$ steps, so the complexity is $$O(|A| n^3)$$. So this is a complexity improvement if and only if $$m^2 \leq |A| n^2$$ asymptotically, or in other words if $$\frac{m}{n} \leq \sqrt{|A|}$$ (technically this is an invalid way to reason about asymptotics, but those were pretty tight estimates).

$$\frac{m}{n}$$ is the average number of distinct equivalence classes over states. For my use case I expect this to be very low – typically the number of equivalence classes will be one!

For the more general use case this isn’t obviously a win – if $$|A|$$ is small it almost certainly isn’t. E.g. when $$|A| = 2$$ the condition amounts to saying that about 60% of states don’t depend on the character to determine their transition at all.

I haven’t attempted to reason about it much, but one possible fix would be an adaptive algorithm that tracks the average on the states and partitions found so far, and as soon as it exceeds $$\sqrt{|A|}$$ switches over to normal L*. My suspicion is that if you measure cost purely in terms of membership queries, this adaptive version is a substantial saving on almost any language.

This entry was posted in Automata Theory.

# Ordering and mapping regular languages

I’ve had the following question for a while: How do I create a mapping of keys to values where the keys are regular expressions, and two regular expressions are considered equivalent if they correspond to the same language?

An example of why you might want to do this is e.g. when constructing a minimal deterministic finite automaton for a regular language you end up labelling states by regular expressions that represent the language matched when starting from that state. In order for the automaton to be minimal you need to have any two equivalent regular expressions correspond to the same state, so you need a way of taking a new regular expression and finding out which state it should go to.

It’s easy (if potentially expensive) to test regular expression equivalence once you know how, so the naive implementation of this is just to do a linear scan of all the regular expressions you’ve found so far. It’s O(n) lookup but it’s at least an existence proof.

In the past I’ve ended up implementing a crude hash function for regular languages and then just used a hash table. It works, but collisions are common unless you’re quite clever with your hashing, so it doesn’t work well.

But it turns out that there is a better way! Rather than using hashed data structures you can use ordered ones, because it turns out that there is a natural and easy to compute (or at least not substantially harder than testing equivalence) total ordering over the set of regular languages.

That way is this: If you have two regular languages $$L$$ and $$M$$ that are not equivalent, there is some $$x \in L \triangle M$$, the symmetric difference. That is, we can find an $$x$$ which is in one but not the other. Let $$x$$ be the shortlex minimal such word (i.e. the lexicographically first word amongst those of minimal length). Then $$L < M$$ if $$x \in L$$, else $$M < L$$.

The work in the previous post on regular language equivalence is thus enough to calculate the shortlex minimal element of the symmetric difference of an inequivalent pair of languages (though I still don’t know if the faster of the two algorithms gives the minimal one; you can use the fast algorithm for equivalence checking and then the slightly slower algorithm to get a minimal refutation), so we can readily compute this ordering between two regular expressions. This, combined with any sort of ordered collection type (e.g. a balanced binary tree of some sort), gives us our desired mapping.
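One concrete way to compute the comparison, assuming both languages are already given as complete DFAs over a shared alphabet (a different route than the previous post’s algorithms, and the encoding is my own): breadth-first search over the product automaton, stopping at the first state pair that disagrees about acceptance.

```python
from collections import deque

def compare_languages(dfa1, dfa2):
    """Return -1 if L1 < L2, 1 if L1 > L2, 0 if the languages are equal,
    under the shortlex symmetric-difference ordering described above.
    Each DFA is (transitions: dict (state, char) -> state, start, accepting)."""
    (t1, start1, acc1), (t2, start2, acc2) = dfa1, dfa2
    alphabet = sorted({c for (_, c) in t1})
    seen = {(start1, start2)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        # BFS with characters explored in sorted order reaches state pairs
        # in shortlex order of the first word leading to them, so the first
        # disagreement found corresponds to the minimal element of L1 ^ L2.
        in1, in2 = p in acc1, q in acc2
        if in1 != in2:
            return -1 if in1 else 1
        for c in alphabet:
            nxt = (t1[(p, c)], t2[(q, c)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return 0  # no word distinguishes them: the languages are equal

# a*: only strings of a's; ALL: every string (both over the alphabet {a, b}).
A_STAR = ({(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 1}, 0, {0})
ALL = ({(0, "a"): 0, (0, "b"): 0}, 0, {0})
```

Here the shortlex minimal element of the symmetric difference is “b”, which only the second language contains, so the larger language sorts first.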

But why does this definition provide a total order?

Well, consider the enumeration of all words in increasing shortlex order as $$w_0, \ldots, w_n, \ldots$$. Let $$l_n = 1$$ if $$w_n \in L$$, else $$l_n = 0$$. Define $$m_n$$ similarly for $$M$$.

Then the above definition is equivalent to the reverse of the lexicographical ordering between $$l$$ and $$m$$! If $$w_k$$ is the smallest word in the symmetric difference then $$k$$ is the first index at which $$l$$ and $$m$$ differ. If $$w_k \in L$$ then $$l_k = 1$$ and $$m_k = 0$$, so $$l > m$$ while $$L < M$$, and vice versa. The lexicographical order is a total order, and the reverse of a total order is a total order, so the above definition is also a total order.

This definition has a number of nice properties:

• Any language containing the empty word sorts before any language that doesn’t
• The function $$L \to \overline{L}$$ is order reversing.
• $$L \cap M \leq L, M \leq L \cup M$$

I originally thought that union was coordinate-wise monotonic, but it’s not. Suppose we have four words $$a < b < c < d$$, and consider the languages $$L = \{a, d\}, M = \{b, c\}$$. Then $$L < M$$ because the deciding value is $$a$$. But now consider $$P = \{a, b\}$$. Then $$L \cup P > M \cup P$$ because the deciding element now ends up being $$c$$, which is in $$M$$.
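For finite languages the ordering can be computed directly from the definition, which makes this counterexample easy to machine-check (a toy verification, with the one-letter words represented as strings):

```python
def lang_compare(L, M):
    """-1 if L < M, 1 if L > M, 0 if equal: find the shortlex least
    word in the symmetric difference and see which side it lies on."""
    diff = L ^ M
    if not diff:
        return 0
    x = min(diff, key=lambda w: (len(w), w))
    return -1 if x in L else 1

# the counterexample from above, with a < b < c < d as single characters
L, M, P = {"a", "d"}, {"b", "c"}, {"a", "b"}
```

Taking the union with $$P$$ flips the comparison, confirming that union is not coordinate-wise monotonic.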

I’ve yet to try this in practice, so it might turn out that there are some interestingly pathological failure modes for this comparison function, but I don’t think there are likely to be any that aren’t also present in testing regular expression equivalence itself.

Another open question here is which sorted map data structure to use. The comparison is relatively expensive, so it might be worth putting in a bit of extra effort to keep the tree well balanced. As such an AVL tree might be a fairly reasonable choice. I’m not sure.

Want more blog posts like this? Join the 30 others who are supporting my writing on Patreon!

This entry was posted in Automata Theory, programming.

# The Easy Way to Compile Automata to Regular Expressions

I have a confession: despite knowing a reasonably large amount about formal language theory and automata, I’ve never bothered to learn how to prove that you can turn a finite automaton back into a regular expression.

My attitude has always been that I have a vague sense of how a proof would go and couldn’t be bothered to sort out the details. I’ve probably skimmed a proof in the past but it looked complicated and uninformative so I didn’t bother.

Due to reasons (a certain theorem about how you could represent regular expressions in a particular way required the details of a proof) I found I was forced to care recently, so I decided to actually sit down and figure it out.

The standard proofs did indeed seem complicated and uninformative, but fortunately it turns out that the general method of understanding regular language theory fundamentals applies in this case.

That method, of course, being to ask “What Would Janusz Brzozowski Do?”.

Starting from a Nondeterministic Finite Automaton (without $$\epsilon$$-transitions) there turns out to be a natural algebraic process which is fairly reminiscent of Gaussian Elimination that results in a fairly small regular expression matching the language.

The idea is this: Suppose we have a non-deterministic automaton without $$\epsilon$$-transitions. Label the states $$0, \ldots, n$$. We consider the languages $$R_i$$ of strings matched when starting from state $$i$$.

By looking at the automaton we get a set of algebraic equations for each of these, which we can progressively rewrite and simplify to remove loops in them (e.g. where $$R_i$$ is defined self-referentially, or the definitions of $$R_i$$ and $$R_j$$ depend on each other). Once we’ve successfully done that, we can just substitute in terms until we get an expression for $$R_0$$, which will be our final result.

Note: In what follows we’ll adopt the usual convention of equating regular expressions with their regular language. Variables which correspond to a language will be upper case, ones corresponding to an expression will be lower-case.

Let $$s_{ij}$$ be a regular expression matching the language of strings that cause a transition from state $$i$$ to $$j$$. This is a language consisting of strings of length $$1$$ – if the character $$c$$ causes a transition from $$i$$ to $$j$$ then $$c \in s_{ij}$$. If there are no transitions from $$i$$ to $$j$$ then $$s_{ij}$$ is $$\phi$$, the regular expression matching no strings, otherwise it is a union of finitely many basic regular expressions.

Note that in the general case there may be strings matched by both $$s_{ij}$$ and $$s_{ik}$$ with $$j \neq k$$. This is because the automaton is non-deterministic. In a deterministic automaton the expressions $$s_{ij}, s_{ik}$$ with $$j \neq k$$ will have no common matches.

Let $$t_i$$ be $$\phi$$ if state $$i$$ is not accepting and $$\epsilon$$ if it is. i.e. $$t_i$$ is a regular expression matching the set of empty strings matched from state $$i$$.

Now, we have that $$R_i = t_i \cup \bigcup\limits_j s_{ij} R_j$$ – a string in $$R_i$$ is either the empty string (if $$i$$ is accepting), or the result of transitioning to another state and matching from there.

We’ll now begin the process of substituting in terms to remove cycles.

The key idea is to find $$t_i', s_{ij}'$$ with $$R_i = t_i' \cup \bigcup\limits_j s_{ij}' R_j$$ such that whenever $$j \leq i$$, $$s_{ij}' = \phi$$. This means there can be no cycles because there are no backwards or self-references between the expressions.

It will no longer be the case that $$s_{ij}'$$ only matches strings of length $$1$$, but it will be the case that they don’t match the empty string (this is important).

How do we construct these modified terms? By induction!

Suppose we’ve constructed $$t_i'$$ and $$s_{ij}'$$ for $$i < k$$. If we just substitute all of those terms into $$R_k = t_k \cup \bigcup\limits_j s_{kj} R_j$$ and rearrange, then we almost get what we want: the only coefficients that can be non-empty are the ones where $$j \geq k$$, because we’ve eliminated all the terms that were $$< k$$ and replaced them with terms that are $$\geq k$$.

So all we need to do is figure out how to remove the $$R_k$$ term.

Fortunately, there’s a great tool for doing that. It’s called Arden’s theorem:

If we have languages $$X, A, B$$ with $$\epsilon \not\in A$$ and $$X = AX \cup B$$, then $$X = A* B$$.

Conceptually this is fairly intuitive because you can just rewrite it as $$X = A(AX \cup B) \cup B$$ and continue this infinitely to get $$X = B \cup AB \cup AAB \cup \ldots$$, which is precisely $$X = A*B$$. However this intuition breaks down somewhat – the condition that $$A$$ does not contain the empty string is essential, because if we picked $$A = \{\epsilon\}$$ and $$B = \emptyset$$ then this reduces to $$X = X$$, which every language satisfies.

The problem is essentially that the term in the $$\ldots$$ “does not converge” in this case, so we need to do some more work to actually prove this rigorously.

Proof:

First note that $$A*B$$ is a possible solution for $$X$$ because $$A* = (A A*) \cup \epsilon$$, so  $$A* B = (A A* B) \cup \epsilon B = A (A* B) \cup B$$.

Now note that any language satisfying the equation has the following property: $$x \in X$$ if and only if either $$x \in B$$ or $$x = uv$$ with $$u \in A$$ and $$v \in X$$.

This is simply by the definition: if $$x \not\in B$$ then $$x \in AX$$, so $$x$$ starts with a string of $$A$$, which is non-empty by assumption. Conversely, if $$x \in B$$ then $$x \in X$$ by definition, and if $$x = uv$$ with $$u \in A$$ and $$v \in X$$ then $$x \in AX \subseteq X$$.

Now, suppose we have two distinct solutions to the equation, say $$L$$ and $$M$$. Without loss of generality there exists $$x \in L \setminus M$$. Pick $$x$$ so that it has minimal length among such words.

Then certainly $$x \not\in B$$ or it would be in both, so by assumption we can write $$x = uv$$ with $$u \in A$$ and $$v \in L$$.

But then we must have $$v \in M$$, by the minimality of $$x$$. But if $$v \in M$$ then necessarily $$uv = x \in M$$. This contradicts the assumption. Therefore no such $$x$$ exists, and the two languages are equal.

QED
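Though not part of the proof, Arden’s theorem is easy to sanity-check by truncating all languages at a fixed length (a toy verification; the particular languages chosen are arbitrary):

```python
N = 6  # only consider words of length at most N

def cat(X, Y):
    """Concatenation of two languages, truncated at length N."""
    return {x + y for x in X for y in Y if len(x + y) <= N}

def star(A):
    """A*, truncated at length N: the least solution of S = {""} | AS."""
    S = {""}
    while True:
        S2 = S | cat(A, S)
        if S2 == S:
            return S
        S = S2

A = {"a", "ab"}  # note: does not contain the empty string
B = {"", "b"}
X = cat(star(A), B)
```

Up to the truncation length, $$X = A* B$$ really does satisfy $$X = AX \cup B$$.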

So it’s important that all the coefficients on the right hand side don’t match the empty string, but this is again true inductively: The initial $$s_{ij}$$ do not match the empty string, and every coefficient is either a concatenation or a union of two languages that don’t match the empty string, so in turn does not match the empty string.

This means that Arden’s theorem gives us the tool we need to remove the coefficient for $$R_k$$ on the right hand side: All we need to do is to prepend the star of that coefficient to the other terms and remove $$R_k$$ from the equation.

This completes our inductive step, which in turn completes our algorithm: We run the induction over the whole set of equations, we now no longer have loops, and we can just substitute back in to get $$R_0$$ as desired.

Let’s work through an example now to clear up the details.

Suppose we have the states 0, 1, 2. Only state 2 is accepting. Our characters are a, b and c. The allowed transitions are:

• From 0: $$a, b \to 1$$.
• From 1: $$a \to 1, b \to 2$$
• From 2: $$c \to 0, 1$$.

This gives us the following three initial equations:

• $$R_0 = (a | b) R_1$$
• $$R_1 = a R_1 \cup b R_2$$
• $$R_2 = \epsilon \cup c R_0 \cup c R_1$$

We now perform substitutions as follows:

1. Our equation for $$R_0$$ is in the desired form already so nothing needs to be done.
2. Our equation for $$R_1$$ has $$R_1$$ on the right hand side, so we use Arden’s theorem to rewrite it as $$R_1 = a* b R_2$$. It is now in the desired form.
3. We substitute in our previous equation for $$R_0$$ and get $$R_2 = \epsilon \cup c (a|b) R_1 \cup c R_1 = \epsilon \cup c(a|b|\epsilon) R_1$$.
4. We substitute in our previous equation for $$R_1$$ and get  $$R_2 = \epsilon \cup c(a|b|\epsilon) a* b R_2$$.
5. We apply Arden’s theorem one final time and get $$R_2 = (c(a|b|\epsilon) a* b)* | \epsilon = (c(a|b|\epsilon) a* b)*$$.
6. We now substitute this into our equation for $$R_1$$ and get $$R_1 = a* b (c(a|b|\epsilon) a* b)*$$
7. We now substitute this into our equation for $$R_0$$ and get  $$R_0 = (a | b) a* b (c(a|b|\epsilon) a* b)*$$

We now have a (not all that pleasant) regular expression matching our original automaton, as desired.

In general I’m unclear on whether this is the “best” method for doing this. The above explanation is a lightly modified version of the one presented here, which compares it with several other methods, where it seems to come out ahead of the others.

It certainly isn’t optimal, in the sense that it doesn’t always produce a minimal regular expression. However, neither do any of the other standard methods. This seems to be a hard problem, and there aren’t as far as I know any good algorithms for doing it.

However, regardless of whether it’s the best, it is certainly the one that seems clearest to me: Once you have the key mechanism of Arden’s theorem, you just perform simple algebraic manipulation and the result pops out at the end.

If you want to see this working in practice, I’ve rather arbitrarily added some code for doing this (albeit only with deterministic finite automata – the principle is the same though) to my FALBS project, which has basically become a dumping ground for all my formal languages code, along with a test that this does the right thing.

(Want more posts like this? Why not support my Patreon! It will cause me to write more, give you some influence over what I write, and give you access to early drafts of upcoming posts).

This entry was posted in Automata Theory, programming, Python.

# Proving or refuting regular expression equivalence

Suppose we are interested in the following problem: We have two regular expressions. Do these match exactly the same strings, and if not what is the smallest (shortlex minimal) string that one matches but not the other?

i.e. we either want to prove they are equivalent or provide the best possible refutation that they are not.

How do we do this?

More specifically, let’s consider this problem for extended regular expressions. That is, we have the following types of expressions:

• $$\epsilon$$, which matches only the empty string
• $$c$$, for $$c$$ in our alphabet, which matches only the character $$c$$
• $$\emptyset$$, which matches no strings
• $$\omega$$ which matches every string
• $$x^*$$ which matches $$a_1\ldots a_n$$ for any $$n \geq 0$$ and strings $$a_i$$ each matched by $$x$$.
• $$xy$$ where $$x$$ and $$y$$ are extended regular expressions, which matches any strings $$ab$$ where $$x$$ matches $$a$$ and $$y$$ matches $$b$$
• $$x \vee y$$ which matches any string matched by either or both of $$x$$ or $$y$$
• $$x \wedge y$$ which matches any string matched by both $$x$$ and $$y$$
• $$\neg x$$ which matches any string not matched by $$x$$.

That is, we extend the normal regular expressions with intersection and negation. This doesn’t actually increase the set of languages we can match, but it can lead to significantly more concise representations (I think, but am not sure, that there are extended regular expressions which require exponentially larger standard regular expressions to represent).

This reduces to the problem of finding the shortlex minimal member of the regular language for an expression or showing that it’s empty: Given two extended regular expressions $$x, y$$, we can define $$x \oplus y = (x \vee y) \wedge (\neg (x \wedge y))$$ – the symmetric difference. This matches any string which is matched by one but not both of $$x$$ and $$y$$, so a string is a member of this regular language precisely if it is a refutation of the claim that $$x$$ and $$y$$ are equivalent.

If we can compile extended regular expressions into a deterministic finite automaton (which we can), then the problem is easy: To find the shortlex minimal element of a regular language or show that it’s empty given a deterministic finite automaton for it, you just use Dijkstra’s Algorithm to search the graph for an accepting node. You stop when you either find one (in which case you’ve constructed a shortlex minimal path to it along the way) or you’ve explored all the states (in which case there is no path to an accepting node and the language is empty). Job’s done.
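Since all edges have the same cost, Dijkstra’s algorithm degenerates to breadth-first search here; provided characters are explored in sorted order, BFS visits words in exactly shortlex order. A sketch (the encoding of the DFA as a transition dict is my own):

```python
from collections import deque

def shortlex_member(delta, start, accepting):
    """Shortlex least word accepted by a complete DFA, or None if the
    language is empty.  BFS, exploring characters in sorted order,
    reaches each state first via the shortlex least word leading to it."""
    if start in accepting:
        return ""
    alphabet = sorted({c for (_, c) in delta})
    seen = {start}
    queue = deque([(start, "")])
    while queue:
        state, word = queue.popleft()
        for c in alphabet:
            nxt = delta[(state, c)]
            if nxt not in seen:
                seen.add(nxt)
                if nxt in accepting:
                    return word + c
                queue.append((nxt, word + c))
    return None

# hypothetical example: DFA over {a, b} for "contains ab as a substring"
CONTAINS_AB = {(0, "a"): 1, (0, "b"): 0,
               (1, "a"): 1, (1, "b"): 2,
               (2, "a"): 2, (2, "b"): 2}
```

Pruning already-seen states is safe because the first word reaching a state is shortlex minimal, and any later word through that state can only produce larger witnesses.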

That was easy. Are we done?

Well, no. There are a lot of evils hidden in that phrase “just construct a deterministic finite automaton”. The automaton corresponding to a regular expression may be exponentially larger than the expression, and even when it’s not, constructing it is still an expensive operation. Can we avoid that?

First, let’s take a detour: There are a number of obvious rewrite rules we can use to simplify our lives, by replacing regular expressions with straightforwardly equivalent versions of them. We use $$x \approx y$$ to indicate that we should consider $$x$$ and $$y$$ the same under these rewrite rules:

• $$x \wedge x \approx x$$
• $$x \wedge y \approx y \wedge x$$
• $$x \wedge (y \wedge z) \approx (x \wedge y) \wedge z$$
• $$x \wedge \emptyset \approx \emptyset$$
• $$x \wedge \omega \approx x$$
• $$x \vee x \approx x$$
• $$x \vee y \approx y \vee x$$
• $$x \vee (y \vee z) \approx (x \vee y) \vee z$$
• $$x \vee \emptyset \approx x$$
• $$x \vee \omega \approx \omega$$
• $$x(yz) \approx (xy)z$$
• $$x\emptyset \approx \emptyset$$
• $$\emptyset x \approx \emptyset$$
• $$x \epsilon \approx x$$
• $$\epsilon x \approx x$$
• $$(x^*)^* \approx x^*$$
• $$\epsilon^* \approx \epsilon$$
• $$\emptyset^* \approx \epsilon$$
• $$\omega^* \approx \omega$$
• $$\neg \emptyset \approx \omega$$
• $$\neg \omega \approx \emptyset$$
• $$\neg(\neg x) \approx x$$

If $$x \approx y$$ we say that $$x$$ and $$y$$ are similar.

All of these correspond to rules that hold for the resulting languages, so if $$x \approx y$$ then $$x$$ and $$y$$ match the same language. In general, though, $$x$$ and $$y$$ might match the same language without being similar.

Why do we care about this? Well, the first reason is that it gives us a fast test for when two regular expressions definitely are equivalent: If they are similar then they are equivalent. If they’re not, we have to do more work to test this.

In particular, we can use smart constructors for our regular expression data type to apply all of these rewrite rules at construction time, which then makes similarity simply a matter of testing for structural equality. If we use hash consing (essentially: memoizing all constructors so that structurally equal values are represented by the exact same object) we can even make testing for similarity O(1).
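Here’s a sketch of how such smart constructors might look, with hash consing done through a cache in `__new__` (all names are illustrative, and only a few of the rewrite rules are shown):

```python
class Regex:
    """Hash-consed regular expression nodes: structurally equal
    expressions are the same object, so similarity is an `is` check."""
    _cache = {}

    def __new__(cls, tag, *args):
        key = (tag, args)
        node = cls._cache.get(key)
        if node is None:
            node = super().__new__(cls)
            node.tag, node.args = tag, args
            cls._cache[key] = node
        return node


EMPTY = Regex("empty")      # matches no strings
EPSILON = Regex("epsilon")  # matches only the empty string


def char(c):
    return Regex("char", c)


def union(x, y):
    # Rewrite rules applied at construction time:
    if x is y:          # x ∨ x ≈ x
        return x
    if x is EMPTY:      # ∅ ∨ x ≈ x
        return y
    if y is EMPTY:      # x ∨ ∅ ≈ x
        return x
    # Normalise argument order so x ∨ y and y ∨ x become one node.
    if id(x) > id(y):
        x, y = y, x
    return Regex("union", x, y)
```

With all constructors written this way, two expressions are similar exactly when they are the same object.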

Which is good, because we’re going to need it a lot: The real reason to care about similarity of regular expressions is that we can construct a deterministic finite automaton using the Brzozowski derivative of regular expressions, with each of our states labelled by a regular expression. If we have some character $$c$$ in our alphabet and a regular language $$L$$ then the Brzozowski derivative is $$d(L, c) = \{v: cv \in L\}$$ – the set of suffixes of the strings in $$L$$ that start with $$c$$. We can extend this to regular expressions with the following set of rules:

• $$d(\epsilon, c) = \emptyset$$
• $$d(b, c) = \emptyset$$ if $$b \neq c$$, else $$d(c, c) = \epsilon$$
• $$d(\emptyset, c) = \emptyset$$
• $$d(\omega, c) = \omega$$
• $$d(x^*, c) = d(x, c) x^*$$
• $$d(xy, c) = d(x, c)y \vee \nu(x) d(y, c)$$
• $$d(x \vee y, c) = d(x, c) \vee d(y, c)$$
• $$d(x \wedge y, c) = d(x, c) \wedge d(y, c)$$
• $$d(\neg x, c) = \neg d(x, c)$$

Where $$\nu(x)$$ is $$\emptyset$$ if the corresponding regular language doesn’t match the empty string, or $$\epsilon$$ otherwise (this can be calculated directly from the expressions by similar rules).
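These rules transcribe almost directly into code. Here’s a sketch over a plain tuple representation of expressions (illustrative only – it doesn’t apply the similarity rewrites from above, which you’d want in practice):

```python
# Expressions are tuples: ("empty",), ("epsilon",), ("omega",),
# ("char", c), ("star", x), ("cat", x, y), ("or", x, y),
# ("and", x, y), ("not", x).


def nullable(x):
    """True iff the language of x contains the empty string (ν)."""
    tag = x[0]
    if tag in ("epsilon", "omega", "star"):
        return True
    if tag in ("empty", "char"):
        return False
    if tag in ("cat", "and"):
        return nullable(x[1]) and nullable(x[2])
    if tag == "or":
        return nullable(x[1]) or nullable(x[2])
    return not nullable(x[1])  # "not"


def derivative(x, c):
    tag = x[0]
    if tag in ("empty", "epsilon"):
        return ("empty",)
    if tag == "char":
        return ("epsilon",) if x[1] == c else ("empty",)
    if tag == "omega":
        return ("omega",)
    if tag == "star":
        return ("cat", derivative(x[1], c), x)
    if tag == "cat":
        d = ("cat", derivative(x[1], c), x[2])
        if nullable(x[1]):  # ν(x) = ε, so the second term survives
            return ("or", d, derivative(x[2], c))
        return d
    if tag == "or":
        return ("or", derivative(x[1], c), derivative(x[2], c))
    if tag == "and":
        return ("and", derivative(x[1], c), derivative(x[2], c))
    return ("not", derivative(x[1], c))  # "not"


def matches(x, s):
    """Match by repeated differentiation: x matches s iff the
    derivative of x by all of s is nullable."""
    for c in s:
        x = derivative(x, c)
    return nullable(x)
```

Matching a string is then just taking derivatives character by character and checking nullability at the end.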

The idea is that we can construct an automaton for a regular language by labelling states by expressions. States are accepting if the corresponding expression matches the empty string, and the transition from state $$x$$ given character $$c$$ goes to $$d(x, c)$$.

If our similarity rule was enough to guarantee equivalence then this would be a minimal deterministic finite automaton for the language. However, even without that there are only finitely many states reachable from a given regular expression (this is where similarity becomes important! Without it this is not true. e.g. if you consider iterating the derivatives of $$c^*$$ under $$c$$ you’ll get infinitely deep nested chains of similar but unequal expressions).

So far it isn’t obvious why this is interesting, but it has one very important property: You can walk this automaton without having to compute any of the states in advance. This means that you can build it as you explore, so we can repeat the above approach of just using Dijkstra’s algorithm to find a refutation but we don’t have to build the automaton in advance.

It’s particularly easy to calculate the derivative of the symmetric difference too:

$$d(x \oplus y, c) = d(x \vee y, c) \wedge d(\neg (x \wedge y), c) = (d(x, c) \vee d(y, c)) \wedge (\neg (d(x, c) \wedge d(y, c))) = d(x, c) \oplus d(y, c)$$.

So if we’ve got the derivatives of our original regular expressions cached in some way it’s very cheap to calculate derivatives of our compound regular expression.

So that’s a major improvement on our original algorithm – in a number of cases it will be exponentially faster!

There are a number of ways you can improve it further: You can prune down the possible set of starting characters for a string in an imprecise way to reduce the number of branches you explore when doing Dijkstra. You can also precalculate when two characters will lead to the same state and always consider only the smallest. This is a worthwhile set of optimisations to the algorithm, but doesn’t affect it in any really fundamental way.

Can we do better yet?

The problem is that although it’s easy to calculate, our new automaton potentially has a lot more states than we want. In general it will have fewer, but if there are $$M$$ states in the automaton for $$x$$ and $$N$$ in the automaton for $$y$$ then there might be as many as $$MN$$ in the automaton for the symmetric difference regular expression. We’ll likely not have to explore all of them, but it’s unclear how many you might have to explore.

But you can do equivalence testing in nearly O(M + N) using an algorithm due to Hopcroft and Karp based on union-find (note: the algorithm usually called the Hopcroft-Karp algorithm is, confusingly, something completely different). The idea is to assume that the automata are equivalent and merge states until we find a contradiction, and by aggressively merging states we can consider a lot fewer of them:

Given two automata $$S$$ and $$T$$ with initial states $$s$$ and $$t$$, we test equivalence as follows:

def equivalent(S, T):
    s = S.start
    t = T.start
    merges = UnionFind()
    merges.union(s, t)

    stack = [(s, t)]
    while stack:
        p, q = stack.pop()
        for a in ALPHABET:
            pa = merges.find(S.transition(p, a))
            qa = merges.find(T.transition(q, a))
            if pa.accepting != qa.accepting:
                # We have "proved" an accepting and a non-accepting
                # state equivalent: contradiction, so the automata
                # are not equivalent.
                return False
            if pa != qa:
                merges.union(pa, qa)
                stack.append((pa, qa))
    return True

The idea is that if two states are equivalent then the states they transition to under the same character must be equivalent, and if we’ve proved that an accepting and non-accepting state are equivalent then we’ve reached a contradiction and our initial assumption must have been false. This merges two states if and only if there is some string that leads to one in one automaton and the other in the other, so if there is some string that one matches but not the other it will eventually merge those two states (if it doesn’t stop earlier).

Because of the way we aggressively merge states, this runs in time O(k(M + N)) where k is the size of the alphabet and M and N are the number of states in the respective automata (there’s actually an inverse Ackermann factor in there too, but the inverse Ackermann function is so close to constant it might as well be ignored).

The problem with this algorithm is that it doesn’t produce a refutation, only a yes or no. We can trivially modify it by pushing paths along with the states, but the paths generated aren’t necessarily useful and in particular don’t necessarily go to a state that is accepting in one and not in the other – we’ve replaced paths to states we believed to be equivalent, but because the core assumption is wrong this didn’t necessarily work. In particular there’s no guarantee that either of the states that we use to refute the result are accepting, only that we falsely believe them to be equivalent to two states where one is accepting and the other is not.

We can still make use of this as a fast initial test: Run Hopcroft and Karp’s algorithm first, and if it returns false, construct a refutation separately, now that we know one must exist. This approach also still works well with the lazy construction.

But it would be nice to be able to produce a witness as an integrated part of running this algorithm.

And it turns out we can: The idea is to hybridise the two algorithms – use Dijkstra’s algorithm to build the witness, but instead of operating on a graph corresponding to a single automaton, we build a graph on pairs of states, one from each automaton – an edge then involves following the corresponding edge in each automaton. Rather than searching for an accepting node, we are now searching for a node where one state is accepting and the other is not.

The key thing we then borrow from the algorithm of Hopcroft and Karp is that we can skip states where we have merged the pairs, because there is nothing interesting down that route even if we haven’t seen it.

Here’s some code for the resulting algorithm:

def witness_difference(left, right):
    if left is right:
        return None
    if left.matches_empty != right.matches_empty:
        return b''
    merges = UnionFind()
    merges.union(left, right)

    queue = deque()
    queue.append((left, right, ()))
    while len(queue) > 0:
        p, q, s = queue.popleft()
        for a in ALPHABET:
            pa = derivative(p, a)
            qa = derivative(q, a)
            if pa is qa:
                continue
            t = (a, s)
            if merges.find(pa) != merges.find(qa):
                if pa.matches_empty != qa.matches_empty:
                    # Found a refutation: rebuild the string from the
                    # linked list of characters.
                    result = bytearray()
                    while t:
                        a, t = t
                        result.append(a)
                    result.reverse()
                    return bytes(result)
                merges.union(pa, qa)
                queue.append((pa, qa, t))
    return None

One important difference here is that we always put the actual reached pairs of states onto the queue rather than the collapsed pair. This ensures that the paths we take are real paths to real states in the corresponding automaton. Also, because we stop as soon as we would ever merge an accepting and non-accepting state, the problem we worried about above can’t happen.

Also it’s worth noting that I’m not sure this algorithm produces a minimal refutation. It should produce a pretty small one, but there could be a smaller one reachable through two states which the algorithm has merged but that are not actually equivalent.

The requirement for witnesses is a slightly unusual one – most algorithms just focus on testing for equivalence – so it would be reasonable to ask why we care.

The reason I care is that I’m interested in the following more general question: I want to maintain a set of N regular languages as witnessed by regular expressions, and I want to be able to test for membership of this set and add new elements to it.

I’m still not sure what the best way to do this is, but all of the best ideas I have require refutations to prevent you from having to do O(N) equivalence tests. For example, the following works: Maintain a binary tree, where each branch node has a string and each child node is a regular expression. When testing for membership, walk the tree as follows: At a branch, if the language contains the string then go to the left child else go to the right child. At a leaf, perform an equivalence test. If it returns true then the language is a member.
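Here’s a sketch of that tree. To keep it self-contained, finite sets of strings stand in for regular languages, and `matches` and `witness_difference` below model the corresponding operations on regular expressions described earlier:

```python
def matches(lang, s):
    """Stand-in for regular expression matching."""
    return s in lang


def witness_difference(lang1, lang2):
    """Shortlex-smallest string in exactly one language, else None."""
    diff = lang1 ^ lang2
    return min(diff, key=lambda s: (len(s), s)) if diff else None


def insert(tree, lang):
    """Insert lang into the tree; returns (new_tree, inserted?)."""
    if tree is None:
        return ("leaf", lang), True
    if tree[0] == "leaf":
        w = witness_difference(lang, tree[1])
        if w is None:
            return tree, False  # an equivalent language is present
        # Split on the refutation: left child contains w, right doesn't.
        if matches(lang, w):
            return ("branch", w, ("leaf", lang), tree), True
        return ("branch", w, tree, ("leaf", lang)), True
    _, key, left, right = tree
    if matches(lang, key):
        left, inserted = insert(left, lang)
    else:
        right, inserted = insert(right, lang)
    return ("branch", key, left, right), inserted


def member(tree, lang):
    if tree is None:
        return False
    if tree[0] == "leaf":
        # Exactly one equivalence test per lookup.
        return witness_difference(lang, tree[1]) is None
    _, key, left, right = tree
    return member(left if matches(lang, key) else right, lang)
```

The branch keys are exactly the refutations produced by earlier insertions, which is why refutation-producing equivalence tests are needed.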

Insertion is then nearly the same: We walk until we get a leaf. At the leaf we perform an equivalence test with a refutation. If the two languages are equivalent then there’s no insertion to perform. If they’re not equivalent, then we create a split node with the refutation string as its key, and make whichever language contains that string the left child and the other the right child.

This isn’t a great algorithm, in that it doesn’t have any easy way to balance the tree, but it’s always better than doing N equivalence checks: Matching a short string is very cheap, and we always perform at most N – 1 matches (potentially many fewer) and exactly one equivalence check.

In fact, it is not in general possible to balance the tree; Consider N strings where string k matches only the string consisting of a sequence of k 0s. Then each test string has exactly one language which matches it, so every split must have one of its children be a leaf and the look up problem degenerates to always testing N strings.

Another algorithm is to maintain a trie of all the refutations found so far. You can then simultaneously walk the regular expression’s deterministic finite automaton and the trie to produce a bit vector where bit i is set if the language contains the i’th string. This bitvector is then the same as that of at most one existing language, which can be looked up using e.g. a hash table. So to test membership you produce the bitvector and check for presence in the hash table. If it’s not present then the language isn’t present. If it is present, then check equivalence with the unique existing language with that bitvector.

Insertion is similar: If the bitvector isn’t present, add the language to the hash table and leave the set of refutations alone. Else, check for equivalence with a refutation to the unique existing language with the same bitvector. If they’re equivalent, do nothing. If they aren’t, add the refutation to the trie and update the bitvector for all existing languages (you can actually do this last step incrementally as you go, but it may not be worth it).
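Here’s a simplified sketch of this scheme, with a plain list of refutation strings standing in for the trie and finite sets of strings again standing in for regular languages:

```python
def matches(lang, s):
    """Stand-in for regular expression matching."""
    return s in lang


def witness_difference(lang1, lang2):
    """Shortlex-smallest string in exactly one language, else None."""
    diff = lang1 ^ lang2
    return min(diff, key=lambda s: (len(s), s)) if diff else None


class LanguageSet:
    def __init__(self):
        self.refutations = []   # strings distinguishing stored languages
        self.by_signature = {}  # signature ("bitvector") -> language

    def signature(self, lang):
        # Bit i is set iff the language contains the i'th refutation.
        return tuple(matches(lang, s) for s in self.refutations)

    def __contains__(self, lang):
        existing = self.by_signature.get(self.signature(lang))
        return (existing is not None
                and witness_difference(lang, existing) is None)

    def add(self, lang):
        existing = self.by_signature.get(self.signature(lang))
        if existing is None:
            self.by_signature[self.signature(lang)] = lang
            return
        w = witness_difference(lang, existing)
        if w is None:
            return  # an equivalent language is already stored
        # New refutation: recompute every stored signature, then insert.
        self.refutations.append(w)
        self.by_signature = {
            self.signature(l): l for l in self.by_signature.values()
        }
        self.by_signature[self.signature(lang)] = lang
```

The signature recomputation in `add` is the O(N) step mentioned below; distinct stored languages always end up with distinct signatures.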

The second algorithm is probably strictly worse than the first because it involves an O(N) computations on a lot of inserts, as well as testing for membership being more expensive. It really depends on how much cheaper equivalence is than matching for your language workload, and how likely existing refutations are to witness future ones.

One reason why being able to do the above is interesting is that it allows us to extend the lazy construction of a deterministic finite automaton to a lazy construction of a minimal deterministic finite automaton: If we can maintain such a set relatively efficiently, then instead of merely conflating similar expressions into the same states we can label states with regular languages, giving us the proper derivative automaton rather than just our approximation to it.

So that’s where my ability to answer this leaves off a bit. There may be a good data structure for this, but if there is I haven’t found the right corner of the literature to tell me what it is.

If you want to have a play with some of this, here’s some (mostly untested, so probably badly wrong in places) Python code implementing a lot of these ideas:

If you’d like to learn more about this, most of the information in this post either came from or was based on material from “Regular-expression derivatives reexamined”, “Testing the Equivalence of Regular Languages” and “Algorithms for testing equivalence of finite automata, with a grading tool for JFLAP” (which are in turn based on a lot of other papers you can get to from their citations).

(And, as always, if you’d like to see more posts like this and feel able to spare the price of less than a coffee per month, I encourage you to donate to my Patreon for supporting my writing.)

This entry was posted in Automata Theory, programming.