The Easy Way to Compile Automata to Regular Expressions

I have a confession: despite knowing a reasonably large amount about formal language theory and automata, I’ve never bothered to learn how to prove how to turn a finite automaton back into a regular expression.

My attitude has always been that I have a vague sense of how a proof would go and couldn’t be bothered to sort out the details. I’ve probably skimmed a proof in the past but it looked complicated and uninformative so I didn’t bother.

Due to reasons (a certain theorem about how you could represent regular expressions in a particular way required the details of a proof) I found I was forced to care recently, so I decided to actually sit down and figure it out.

The standard proofs did indeed seem complicated and uninformative, but fortunately it turns out that the general method of understanding regular language theory fundamentals applies in this case.

That method, of course, being to ask “What Would Janusz Brzozowski Do?”.

Starting from a Nondeterministic Finite Automaton (without \(\epsilon\)-transitions) there turns out to be a natural algebraic process which is fairly reminiscent of Gaussian Elimination that results in a fairly small regular expression matching the language.

The idea is this: Suppose we have a non-deterministic automaton without \(\epsilon\)-transitions. Label the states \(0, \ldots, n\). We consider the languages \(R_i\) of strings matched when starting from state \(i\).

By looking at the automaton we get a set of algebraic equations for each of these, which we can progressively rewrite and simplify to remove loops in them (e.g. where \(R_i\) is defined self-referentially, or the definitions of \(R_i\) and \(R_j\) depend on each other). Once we’ve successfully done that, we can just substitute in terms until we get an expression for \(R_0\), which will be our final result.

Note: In what follows we’ll adopt the usual convention of equating regular expressions with their regular language. Variables which correspond to a language will be upper case, ones corresponding to an expression will be lower-case.

Let \(s_{ij}\) be a regular expression matching the language of strings that cause a tradition from state \(i\) to \(j\). This is a language consisting of strings of length \(1\) – if the character \(c\) causes a transition from \(i\) to \(j\) then \(c \in s_{ij}\). If there are no transitions from \(i\) to \(j\) then \(s_{ij}\) is \(\phi\), the regular expression matching no strings, otherwise it is a union of finitely many basic regular expressions.

Note that in the general case there may be strings matched by both \(s_{ij}\) and \(s_{ik}\) with \(j \neq k\). This is because the automaton is non-deterministic. In a deterministic automaton the expressions \(s_{ij}, s_{ik}\) with \(i \neq k\) will have no common matches.

Let \(t_i\) be \(\phi\) if state \(i\) is not accepting and \(\epsilon\) if it is. i.e. \(t_i\) is a regular expression matching the set of empty strings matched from state \(i\).

Now, we have that \(R_i = t_i \cup \bigcup\limits_j s_{ij} R_j\) – a string in \(R_i\) is either the empty string (if \(i\) is accepting), or the result of transitioning to another state and matching from there.

We’ll now begin the process of substituting in terms to remove cycles.

The key idea is to find \(t_i’, s_{ij}’\) such that whenever \(j \leq i\), \(s_{ij}’ = \phi\). This means there can be no cycles because there are no backwards or self-references between the expressions.

It will no longer be the case that \(s_{ij}’\) only matches strings of length \(1\), but it will be the case that they don’t match the empty string (this is important).

How do we construct these modified terms? By induction!

Suppose we’ve constructed \(t_i\) and \(s_{ij}\) for \(i < k\). If we just substitute in and rearrange all of those terms into \(R_k = t_k \cup \bigcup\limits_j s_{kj} R_j\) then we almost get what we want: The only terms that can be non-empty are the ones where \(j \geq k\), because we’ve eliminated all the terms that were \(< k\) and replaced them with terms that are \(\geq k\).

So all we need to do is figure out how to remove the \(R_k\) term.

Fortunately, there’s a great tool for doing that. it’s called Arden’s theorem:

If we have languages \(X, A, B\) with \(\epsilon \not\in A\) and\(X = AX \cup B\), then \(X = A* B\) .

Conceptually this is fairly intuitive because you can just rerwite it as \(X = A(AX \cup B) \cup B\) and continue this infinitely to get \(X = B \cup AB \cup AAB \cup \ldots\), which is precisely \(X = A*B\). However this intuition breaks down somewhat – the condition that \(A\) does not contain the empty string is essential, because if we picked \(A = \{\epsilon\}\) and \(B = \emptyset\} then this reduces to \(X = X\), which every language satisfies.

The problem is essentially that the term in the \(\ldots\) “does not converge” in this case, so we need to do some more work to actually prove this rigorously.

Proof:

First note that \(A*B\) is a possible solution for \(X\) because \(A* =  (A A*) \cup \epsilon\), so  \(A* B = (A A* B) \cup \epsilon B = A (A* B) \cup B\).

Now note that any language satisfying the equation has the following property: \(x \in X\) if and only if either \(x \in B\) or \(x = uv\) with \(u \in A\) and \(v \in X\).

This is simply by the definition: If \(x \not\in B\) then \(x \in AX\), so \(x\) starts with a string of \(A\), which is non-empty by assumption. Conversely, if \(x \in B\) then \(x \in X\) by definition, and if \(x = uv\) with \(u \in A\) and \(v \in L\).

Now, suppose we have two non-identical solutions to the equation. Say \(L\) and \(M\). Suppose there exists \(x \in L \setminus M\). Pick \(x\) so that it has minimal length among such words..

Then certainly \(x \not\in B\) or it would be in both, so by assumption we can write \(x = uv\) with \(u \in A\) and \(v \in L\).

But then we must have \(v \in M\), by the minimality of \(x\). But if \(v \in M\) then necessarily \(uv = x \in M\). This contradicts the assumption. Therefore no such \(x\) exists, and the two languages are equal.

QED

So it’s important that all the coefficients on the right hand side don’t match the empty string, but this is again true inductively: The initial \(s_{ij}\) do not match the empty string, and every coefficient is either a concatenation or a union of two languages that don’t match the empty string, so in turn does not match the empty string.

This means that Arden’s theorem gives us the tool we need to remove the coefficient for \(R_k\) on the right hand: All we need to do is to prepend the star of that coefficient to the other terms and remove \(R_k\) from the equation.

This completes our inductive step, which in turn completes our algorithm: We run the induction over the whole set of equations, we now no longer have loops, and we can just substitute back in to get \(R_0\) as desired.

Let’s work through an example now to clear up the details.

Suppose we have the states 0, 1, 2. Only state 2 is accepting. Our characters are a, b and c. The allowed transitions are:

  • From 0: \(a, b \to 1\).
  • From 1: \(a \to 1, b \to 2\)
  • From 2: \(c \to 0, 1\).

This gives us the following three initial equations:

  • \(R_0 = (a | b) R_1\)
  • \(R_1 =a R_1 \cup b R_2\)
  • \(R_2 = \epsilon \cup c R_0 \cup c R_1\)

We now perform substitutions as follows:

  1. Our equation for \(R_0\) is in the desired form already so nothing needs to be done.
  2. Our equation for \(R_1\) has \(R_1\) on the right hand side, so we use Arden’s theorem to rewrite it as \(R_1 = a* b R_2\). It is now in the desired form.
  3. We substitute in our previous equation for \(R_0\) and get \(R_2 = \epsilon \cup c (a|b) R_1 \cup c R_1 = \epsilon \cup c(a|b|\epsilon) R_1\).
  4. We substitute in our previous equation for \(R_1\) and get  \(R_2  = \epsilon \cup c(a|b|\epsilon) a* b R_2 \).
  5. We apply Arden’s theorem one final time and get \(R_2 = (c(a|b|\epsilon) a* b)* | \epsilon = (c(a|b|\epsilon) a* b)*\).
  6. We now substitute this into our equation for \(R_1\) and get \(R_1 = a* b (c(a|b|\epsilon) a* b)*\)
  7. We now substitute this into our equation for \(R_0\) and get  \(R_0 = (a | b) a* b (c(a|b|\epsilon) a* b)*\)

We now have a (not all that pleasant) regular expression matching our original deterministic finite automaton, as desired.

In general I’m unclear on whether this is the “best” method for doing this. The above explanation is a lightly modified version of the one presented here, which compares it with several other methods, where it seems to come out ahead of the others.

It certainly isn’t optimal, in the sense that it doesn’t always produce a minimal regular expression. However, neither do any of the other standard methods. This seems to be a hard problem, and there aren’t as far as I know any good algorithms for doing it.

However, regardless of whether it’s the best, it is certainly the one that seems clearest to me: Once you have the key mechanism of Arden’s theorem, you just perform simple algebraic manipulation and the result pops out at the end.

If you want to see this working in practice, I’ve rather arbitrarily added in some code for doing this (albeit only with deterministic finite automata. The principle is the same though) to my FALBS project, which has basically become a dumping ground for all my formal languages code, along with a test that this does the right thing.


(Want more posts like this? Why not support my Patreon! It will cause me to write more, give you some influence over what I write, and give you access to early drafts of upcoming posts).

This entry was posted in Automata Theory, programming, Python on by .