# Refining L* search

Attention conservation notice: If you don’t care about fine-grained algorithmic improvements to inference algorithms for regular languages, you don’t care about this post.

Also, apologies, I know MathJax isn’t currently working, so you will see some raw LaTeX. If you turn on unsafe scripts (i.e. “not loaded over https”) it will work, but due to some technical hiccups this is currently difficult for me to fix. I’m on the case.

L* search (Angluin, D. Learning regular sets from queries and counterexamples. Inf. Comput. 75, 2 (1987), 87–106) is a really neat algorithm that lets you infer a deterministic finite automaton (DFA) for an unknown language.

I’ve long suspected that it would be useful in Hypothesis. Unfortunately, to date reality has disagreed with me. The key problem is that L* search is really expensive, and any attempt to use it would almost immediately blow Hypothesis’s normal budget for the number of tests run.

Starting from a trivial automaton that accepts no strings, it proceeds in two stages:

1. It uses membership queries for the language to complete the DFA: it finds transitions that provably cannot go to any of the existing known states (creating new states as needed), and stops only when it has a consistent transition table for every known state and every member of the alphabet.
2. It then asks for a counter-example to its conjectured DFA. If there is none, we have correctly inferred the language. If there is one, it is used to improve our ability to distinguish states, enlarging the number of states by at least one.

Because L* search maintains a minimal DFA, in which the only states that are distinguished are ones that it can actually prove inequivalent, this guarantees that the algorithm terminates with a correct DFA after at most $$n$$ counter-example steps, where $$n$$ is the size of the minimal DFA of the language.

It does this by maintaining a set of experiments. These are strings that can be used to distinguish states – if by following the experiment from two states you end up in an accepting state in one and a non-accepting state in another, the two states must be inequivalent.
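As a concrete sketch of that distinguishing step (the language and the `member` oracle here are invented examples of mine, not anything from Hypothesis):

```python
# Hypothetical membership oracle: strings over {a, b} with an even number
# of a's. In L* this is the expensive query we are trying to minimise.
def member(s):
    return s.count("a") % 2 == 0

# States are represented by access strings; an experiment e proves two
# states inequivalent if following e from one of them lands in an
# accepting state and from the other does not.
def distinguishes(e, state1, state2):
    return member(state1 + e) != member(state2 + e)
```

Here even the empty experiment separates the access strings `""` and `"a"`, since only the former is in the language.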

The rest of this post assumes you understand L* search. If you don’t, I recommend reading my prior more detailed explanation of it.

I use an improvement proposed in “Rivest, R. L., and Schapire, R. E. Inference of finite automata using homing sequences. Inf. Comput. 103, 2 (1993), 299–347” that lets you take the cost of the counter-example step down to logarithmic in the length of the counter-example and ensures that each counter-example only adds one experiment to the list.

The problem is that the completion step is still extremely expensive: it requires $$|A||S||E|$$ membership queries, where $$A$$ is the alphabet, $$S$$ is the set of states we currently have, and $$E$$ is the set of experiments we currently have. (You can skip some of these because some state transitions are provable from the representation of the states as strings – $$s$$ always transitions to $$sa$$ under $$a$$ if both strings are in the state set – but you can’t remove more than $$|S| - 1$$ evaluations this way.)

One of the problems I have with using this in Hypothesis is that $$|A| = 256$$, which is both intrinsically fairly large and also actually larger than Hypothesis’s typical testing budget. Thus there’s no way to fit even one L* pass into a normal test run!

So, can we get that $$|A|$$ factor down?

It turns out, yes, at the cost of potentially needing more counter-examples (which is a cost I’m entirely willing to pay).

The observation follows one from “Owens, S., Reppy, J. H., and Turon, A. Regular-expression derivatives re-examined. J. Funct. Program. 19, 2 (2009), 173–190.”, which is that when you have a large alphabet you will typically find that at a given state in a DFA many characters are equivalent to each other (that is, they result in a transition to the same state). For example, when $$|A| \geq |S|$$, even though there are $$|A|$$ possible valid characters at a given state, there are at most $$|S|$$ up to equivalence.

If we take advantage of this, we could reduce the number of queries substantially, and it turns out that it is easy to do.

The idea is simple: when we create a state, we initialise a partition refinement data structure alongside it. Initially this considers all members of the alphabet to be in the same equivalence class.

When expanding our table, we now assume that any characters in the same partition are equivalent, and we only actually perform membership queries with, say, the smallest element of the partition. Thus instead of $$|A||S|$$ queries we only perform $$\sum |P_i|$$, where $$P_i$$ is the set of partitions associated with state $$i$$. In particular this is $$O(|S| \min(|A|, |S|))$$.

Then when we find an experiment $$e$$ and a transition $$s \to t$$ via $$c$$, instead of immediately adding $$e$$ to our experiment set we check whether we just have too coarse a partition. We do this by checking whether $$sce \in \mathcal{L} \iff sc'e \in \mathcal{L}$$, where $$c'$$ is the “canonical” element of the partition that we selected when determining that this was the correct transition (if $$c = c'$$ we can skip this bit). If this does not hold then we refine the partition, splitting it into two based on whether $$sde \in \mathcal{L}$$ for each element $$d$$ of the partition.

We then add the experiment to the list and rebuild the DFA. If this does not result in expanding our list of states (which can only happen if we did end up refining the partition), we then remove the experiment from the list! This can happen because we can end up with partitions that our existing experiments were already sufficient to distinguish but where we never checked, and by not adding redundant experiments in that case we bound the number of experiments by the number of states.
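A minimal sketch of that refinement check, assuming a membership oracle `member`, states represented by access strings, and `partitions[s]` holding each state’s list of character sets (all names invented for illustration):

```python
def refine_if_too_coarse(partitions, member, s, c, e):
    # Find the block of state s's partition containing character c, and
    # the canonical element we actually queried with when completing.
    block = next(p for p in partitions[s] if c in p)
    canonical = min(block)
    if c == canonical or member(s + c + e) == member(s + canonical + e):
        return False  # the partition survives this experiment
    # Too coarse: split the block by behaviour under experiment e.
    accepted = {d for d in block if member(s + d + e)}
    partitions[s].remove(block)
    partitions[s].extend([accepted, block - accepted])
    return True
```

After this returns `True`, the caller would add `e` to the experiment list, rebuild the DFA, and drop `e` again if no new state appears.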

This is correct for the same reasons that L* is correct – at any given stage, either our partitions are correct, or we refine them. However, it will take up to $$\sum (|P_i| - 1)$$ additional counter-example steps to converge (where the $$P_i$$ here are the partitions from our final state, i.e. the true partitions).

Counter-examples are somewhat costly – regardless of how much it costs to find one (which may be “free” in some sense), a counter-example $$u$$ takes $$O(\log |u|)$$ membership queries to process. If we assume all counter-examples are minimized, so that $$|u| \leq n$$ where $$n$$ is the number of states in the true minimal DFA, then that’s $$O(\log n)$$ per counter-example.

Suppose the true minimal DFA has $$n$$ states and that $$m = \sum |P_i|$$ in it. Then at each stage we either add one or more states, or we increase the number of partitions, so we perform at most $$n + \sum (|P_i| - 1) = n + m - n = m$$ counter-example queries.

Each completion step takes $$O(m |E|)$$ to run, and we only added experiments in the case where we increased the number of states, so each step runs in $$O(m n)$$, and the entire thing runs in at most $$O((mn + \log(n)) m)$$. Since $$mn \geq \log(n)$$, we can discard the $$\log(n)$$ term and the whole thing runs in $$O(m^2 n)$$.

In contrast, in classic L* the steps take $$O(|A| n^2)$$ and we run at most $$n$$ steps, so the complexity is $$O(|A| n^3)$$. So this is a complexity improvement if and only if $$m^2 n \leq |A| n^3$$ asymptotically, or in other words if $$\frac{m}{n} \leq \sqrt{|A|}$$ (technically this is an invalid way to reason about asymptotics, but those were pretty tight estimates).

$$\frac{m}{n}$$ is the average number of distinct equivalence classes per state. For my use case I expect this to be very low – typically the number of equivalence classes will be one!

For the more general use case this isn’t obviously a win – if $$|A|$$ is small it almost certainly isn’t. For example, when $$|A| = 2$$ the condition is equivalent to saying that around 60% of states don’t depend on the character at all to determine their transition.

I haven’t attempted to reason about it much, but one possible fix would be an adaptive algorithm that tracks this average over the states and partitions found so far, and switches over to normal L* as soon as it exceeds $$\sqrt{|A|}$$. My suspicion is that if you measure cost purely in terms of membership queries, this adaptive version is a substantial saving on almost any language.

This entry was posted in Automata Theory.

# Ordering and mapping regular languages

I’ve had the following question for a while: How do I create a mapping of keys to values where the keys are regular expressions, and two regular expressions are considered equivalent if they correspond to the same language?

An example of why you might want to do this is e.g. when constructing a minimal deterministic finite automaton for a regular language you end up labelling states by regular expressions that represent the language matched when starting from that state. In order for the automaton to be minimal you need to have any two equivalent regular expressions correspond to the same state, so you need a way of taking a new regular expression and finding out which state it should go to.

It’s easy (if potentially expensive) to test regular expression equivalence once you know how, so the naive implementation of this is just to do a linear scan of all the regular expressions you’ve found so far. It’s O(n) lookup but it’s at least an existence proof.

In the past I’ve ended up implementing a crude hash function for regular languages and then just used a hash table. It works, but collisions are common unless you’re quite clever with your hashing, so it doesn’t work well.

But it turns out that there is a better way! Rather than using hashed data structures you can use ordered ones, because it turns out that there is a natural and easy to compute (or at least not substantially harder than testing equivalence) total ordering over the set of regular languages.

That way is this: if you have two regular languages $$L$$ and $$M$$ that are not equivalent, there is some $$x \in L \triangle M$$, the symmetric difference. That is, we can find an $$x$$ which is in one but not the other. Let $$x$$ be the shortlex minimal such word (i.e. the lexicographically first word amongst those of minimal length). Then $$L < M$$ if $$x \in L$$, else $$M < L$$.

The work in the previous post on regular language equivalence is thus enough to calculate the shortlex minimal element of an inequivalent pair of languages (though I still don’t know whether the faster of the two algorithms gives the minimal one; you can, however, use the fast algorithm for equivalence checking and then the slightly slower algorithm to get a minimal refutation), so we can readily compute this ordering between two regular expressions. This, combined with any sort of ordered collection type (e.g. a balanced binary tree of some sort), gives us our desired mapping.

But why does this definition provide a total order?

Well, consider the enumeration of all words in increasing shortlex order as $$w_0, \ldots, w_n, \ldots$$. Let $$l_n = 1$$ if $$w_n \in L$$, else $$l_n = 0$$. Define $$m_n$$ similarly for $$M$$.

Then the above definition is equivalent to the reverse of the lexicographical ordering between $$l$$ and $$m$$! If $$w_k$$ is the smallest word in the symmetric difference then $$k$$ is the first index at which $$l$$ and $$m$$ differ. If $$w_k \in L$$ then $$l_k = 1$$ and $$m_k = 0$$, so $$l > m$$, and vice versa. The lexicographical order is a total order, and the reverse of a total order is a total order, so the above definition is also a total order.

This definition has a number of nice properties:

• Any language containing the empty word sorts before any language that doesn’t
• The function $$L \to \overline{L}$$ is order reversing.
• $$L \cap M \leq L, M \leq L \cup M$$

I originally thought that union was coordinate-wise monotonic, but it’s not. Suppose we have four words $$a < b < c < d$$, and consider the languages $$L = \{a, d\}, M = \{b, c\}$$. Then $$L < M$$ because the deciding value is $$a$$. But now consider $$P = \{a, b\}$$. Then $$L \cup P > M \cup P$$ because the deciding element now ends up being $$c$$, which is in $$M$$.
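Both the ordering and this counterexample are easy to check directly if we stand in for regular languages with finite sets of words (a simplification of mine – for genuinely regular languages you would instead search the product automaton for the shortlex-least element of the symmetric difference):

```python
def compare(L, M):
    # Order languages by the shortlex-least word in their symmetric
    # difference: -1 if L < M, 0 if they are equal, 1 if L > M.
    diff = L ^ M
    if not diff:
        return 0
    witness = min(diff, key=lambda w: (len(w), w))
    return -1 if witness in L else 1

L, M, P = {"a", "d"}, {"b", "c"}, {"a", "b"}
assert compare(L, M) == -1         # L < M: the deciding word is "a"
assert compare(L | P, M | P) == 1  # but L ∪ P > M ∪ P: deciding word "c"
```

The same function also exhibits the first property in the list above: any language containing the empty word compares below any language that doesn’t.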

I’ve yet to try this in practice, so it might turn out that there are some interestingly pathological failure modes for this comparison function, but I don’t think there are likely to be any that aren’t also present in testing regular expression equivalence itself.

Another open question here is which sorted map data structure to use. The comparison is relatively expensive, so it might be worth putting in a bit of extra effort to keep the tree balanced. As such an AVL tree might be a fairly reasonable choice. I’m not sure.


This entry was posted in Automata Theory, programming.

# The Easy Way to Compile Automata to Regular Expressions

I have a confession: despite knowing a reasonably large amount about formal language theory and automata, I’ve never bothered to learn the proof that you can turn a finite automaton back into a regular expression.

My attitude has always been that I have a vague sense of how a proof would go and couldn’t be bothered to sort out the details. I’ve probably skimmed a proof in the past but it looked complicated and uninformative so I didn’t bother.

Due to reasons (a certain theorem about how you could represent regular expressions in a particular way required the details of a proof) I found I was forced to care recently, so I decided to actually sit down and figure it out.

The standard proofs did indeed seem complicated and uninformative, but fortunately it turns out that the general method of understanding regular language theory fundamentals applies in this case.

That method, of course, being to ask “What Would Janusz Brzozowski Do?”.

Starting from a Nondeterministic Finite Automaton (without $$\epsilon$$-transitions) there turns out to be a natural algebraic process which is fairly reminiscent of Gaussian Elimination that results in a fairly small regular expression matching the language.

The idea is this: Suppose we have a non-deterministic automaton without $$\epsilon$$-transitions. Label the states $$0, \ldots, n$$. We consider the languages $$R_i$$ of strings matched when starting from state $$i$$.

By looking at the automaton we get a set of algebraic equations for each of these, which we can progressively rewrite and simplify to remove loops in them (e.g. where $$R_i$$ is defined self-referentially, or the definitions of $$R_i$$ and $$R_j$$ depend on each other). Once we’ve successfully done that, we can just substitute in terms until we get an expression for $$R_0$$, which will be our final result.

Note: In what follows we’ll adopt the usual convention of equating regular expressions with their regular language. Variables which correspond to a language will be upper case, ones corresponding to an expression will be lower-case.

Let $$s_{ij}$$ be a regular expression matching the language of strings that cause a transition from state $$i$$ to $$j$$. This is a language consisting of strings of length $$1$$ – if the character $$c$$ causes a transition from $$i$$ to $$j$$ then $$c \in s_{ij}$$. If there are no transitions from $$i$$ to $$j$$ then $$s_{ij}$$ is $$\phi$$, the regular expression matching no strings; otherwise it is a union of finitely many basic regular expressions.

Note that in the general case there may be strings matched by both $$s_{ij}$$ and $$s_{ik}$$ with $$j \neq k$$. This is because the automaton is non-deterministic. In a deterministic automaton the expressions $$s_{ij}, s_{ik}$$ with $$j \neq k$$ will have no common matches.

Let $$t_i$$ be $$\phi$$ if state $$i$$ is not accepting and $$\epsilon$$ if it is. i.e. $$t_i$$ is a regular expression matching the set of empty strings matched from state $$i$$.

Now, we have that $$R_i = t_i \cup \bigcup\limits_j s_{ij} R_j$$ – a string in $$R_i$$ is either the empty string (if $$i$$ is accepting), or the result of transitioning to another state and matching from there.

We’ll now begin the process of substituting in terms to remove cycles.

The key idea is to find $$t_i', s_{ij}'$$ with $$R_i = t_i' \cup \bigcup\limits_j s_{ij}' R_j$$, such that whenever $$j \leq i$$, $$s_{ij}' = \phi$$. This means there can be no cycles, because there are no backwards or self-references between the expressions.

It will no longer be the case that $$s_{ij}'$$ only matches strings of length $$1$$, but it will be the case that they don’t match the empty string (this is important).

How do we construct these modified terms? By induction!

Suppose we’ve constructed $$t_i'$$ and $$s_{ij}'$$ for $$i < k$$. If we substitute all of those into $$R_k = t_k \cup \bigcup\limits_j s_{kj} R_j$$ and rearrange, then we almost get what we want: the only terms that can be non-empty are the ones where $$j \geq k$$, because we’ve eliminated all the terms that were $$< k$$ and replaced them with terms that are $$\geq k$$.

So all we need to do is figure out how to remove the $$R_k$$ term.

Fortunately, there’s a great tool for doing that. It’s called Arden’s theorem:

If we have languages $$X, A, B$$ with $$\epsilon \not\in A$$ and $$X = AX \cup B$$, then $$X = A* B$$.

Conceptually this is fairly intuitive because you can just rewrite it as $$X = A(AX \cup B) \cup B$$ and continue this infinitely to get $$X = B \cup AB \cup AAB \cup \ldots$$, which is precisely $$X = A*B$$. However this intuition breaks down somewhat – the condition that $$A$$ does not contain the empty string is essential, because if we picked $$A = \{\epsilon\}$$ and $$B = \emptyset$$ then this reduces to $$X = X$$, which every language satisfies.

The problem is essentially that the term in the $$\ldots$$ “does not converge” in this case, so we need to do some more work to actually prove this rigorously.

Proof:

First note that $$A*B$$ is a possible solution for $$X$$ because $$A* = (A A*) \cup \epsilon$$, so  $$A* B = (A A* B) \cup \epsilon B = A (A* B) \cup B$$.

Now note that any language satisfying the equation has the following property: $$x \in X$$ if and only if either $$x \in B$$ or $$x = uv$$ with $$u \in A$$ and $$v \in X$$.

This is simply by the definition: if $$x \not\in B$$ then $$x \in AX$$, so $$x$$ starts with a string of $$A$$, which is non-empty by assumption. Conversely, if $$x \in B$$ then $$x \in X$$ by definition, and if $$x = uv$$ with $$u \in A$$ and $$v \in X$$ then $$x \in AX \subseteq X$$.

Now, suppose we have two non-identical solutions to the equation, say $$L$$ and $$M$$. Suppose (without loss of generality) there exists $$x \in L \setminus M$$. Pick $$x$$ so that it has minimal length among such words.

Then certainly $$x \not\in B$$ or it would be in both, so by assumption we can write $$x = uv$$ with $$u \in A$$ and $$v \in L$$.

But then we must have $$v \in M$$, by the minimality of $$x$$. But if $$v \in M$$ then necessarily $$uv = x \in M$$. This contradicts the assumption. Therefore no such $$x$$ exists, and the two languages are equal.

QED
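You can also sanity-check Arden’s theorem numerically by truncating everything to words of bounded length (a toy verification of mine, not part of the proof; languages are Python sets of strings):

```python
def concat(X, Y, max_len):
    # Pairwise concatenation, truncated to words of length <= max_len.
    return {x + y for x in X for y in Y if len(x + y) <= max_len}

def arden_solution(A, B, max_len):
    # A*B restricted to words of length <= max_len, built iteratively as
    # B, AB, AAB, ... until no new words appear under the length bound.
    result, frontier = set(B), set(B)
    while frontier:
        frontier = concat(A, frontier, max_len) - result
        result |= frontier
    return result

A, B, n = {"a", "ba"}, {"b"}, 6  # note epsilon is not in A
X = arden_solution(A, B, n)
# X really does satisfy X = AX ∪ B (within the length bound).
assert X == concat(A, X, n) | B
```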

So it’s important that all the coefficients on the right hand side don’t match the empty string, but this is again true inductively: The initial $$s_{ij}$$ do not match the empty string, and every coefficient is either a concatenation or a union of two languages that don’t match the empty string, so in turn does not match the empty string.

This means that Arden’s theorem gives us the tool we need to remove the coefficient for $$R_k$$ on the right hand side: all we need to do is prepend the star of that coefficient to the other terms and remove $$R_k$$ from the equation.

This completes our inductive step, which in turn completes our algorithm: We run the induction over the whole set of equations, we now no longer have loops, and we can just substitute back in to get $$R_0$$ as desired.

Let’s work through an example now to clear up the details.

Suppose we have the states 0, 1, 2. Only state 2 is accepting. Our characters are a, b and c. The allowed transitions are:

• From 0: $$a, b \to 1$$.
• From 1: $$a \to 1, b \to 2$$
• From 2: $$c \to 0, 1$$.

This gives us the following three initial equations:

• $$R_0 = (a | b) R_1$$
• $$R_1 = a R_1 \cup b R_2$$
• $$R_2 = \epsilon \cup c R_0 \cup c R_1$$

We now perform substitutions as follows:

1. Our equation for $$R_0$$ is in the desired form already so nothing needs to be done.
2. Our equation for $$R_1$$ has $$R_1$$ on the right hand side, so we use Arden’s theorem to rewrite it as $$R_1 = a* b R_2$$. It is now in the desired form.
3. We substitute in our previous equation for $$R_0$$ and get $$R_2 = \epsilon \cup c (a|b) R_1 \cup c R_1 = \epsilon \cup c(a|b|\epsilon) R_1$$.
4. We substitute in our previous equation for $$R_1$$ and get  $$R_2 = \epsilon \cup c(a|b|\epsilon) a* b R_2$$.
5. We apply Arden’s theorem one final time and get $$R_2 = (c(a|b|\epsilon) a* b)* | \epsilon = (c(a|b|\epsilon) a* b)*$$.
6. We now substitute this into our equation for $$R_1$$ and get $$R_1 = a* b (c(a|b|\epsilon) a* b)*$$
7. We now substitute this into our equation for $$R_0$$ and get  $$R_0 = (a | b) a* b (c(a|b|\epsilon) a* b)*$$

We now have a (not all that pleasant) regular expression matching our original non-deterministic finite automaton, as desired.
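The elimination process mechanises nicely. Here is a sketch of mine in Python that solves the same system of equations by substitution and Arden’s theorem, representing expressions as strings; bracketing is handled naively, so coefficients containing alternation (like `(a|b)`) must come pre-parenthesised:

```python
def alt(x, y):
    # Union of two expressions; None stands for the empty language phi.
    if x is None: return y
    if y is None: return x
    return f"{x}|{y}"

def cat(x, y):
    # Concatenation; "" stands for epsilon.
    if x is None or y is None: return None
    return x + y

def solve(coeff, tail, n):
    # Equations: R_i = tail[i] | union over j of coeff[i][j] R_j.
    # Eliminate R_k for k = n-1 down to 0; by the time we reach R_0 it
    # has no references left and tail[0] is the answer.
    for k in range(n - 1, -1, -1):
        loop = coeff[k].pop(k, None)
        if loop is not None:  # Arden's theorem: R_k = loop* (rest)
            star = f"({loop})*"
            tail[k] = cat(star, tail[k])
            for j in coeff[k]:
                coeff[k][j] = cat(star, coeff[k][j])
        for i in range(k):  # substitute R_k away in the earlier equations
            s_ik = coeff[i].pop(k, None)
            if s_ik is None:
                continue
            tail[i] = alt(tail[i], cat(s_ik, tail[k]))
            for j, e in coeff[k].items():
                coeff[i][j] = alt(coeff[i].get(j), cat(s_ik, e))
    return tail[0]

# The example automaton: 0 --a,b--> 1; 1 --a--> 1, --b--> 2; 2 --c--> 0,1.
coeff = {0: {1: "(a|b)"}, 1: {1: "a", 2: "b"}, 2: {0: "c", 1: "c"}}
tail = {0: None, 1: None, 2: ""}
print(solve(coeff, tail, 3))  # an expression for R_0, differently
                              # bracketed but equivalent to the one above
```

The eliminations happen in a different order than in the worked example, so the output looks different, but brute-force comparison against the automaton confirms it matches the same language.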

In general I’m unclear on whether this is the “best” method for doing this. The above explanation is a lightly modified version of the one presented here, which compares it with several other methods, where it seems to come out ahead of the others.

It certainly isn’t optimal, in the sense that it doesn’t always produce a minimal regular expression. However, neither do any of the other standard methods. This seems to be a hard problem, and there aren’t as far as I know any good algorithms for doing it.

However, regardless of whether it’s the best, it is certainly the one that seems clearest to me: Once you have the key mechanism of Arden’s theorem, you just perform simple algebraic manipulation and the result pops out at the end.

If you want to see this working in practice, I’ve rather arbitrarily added in some code for doing this (albeit only with deterministic finite automata. The principle is the same though) to my FALBS project, which has basically become a dumping ground for all my formal languages code, along with a test that this does the right thing.


This entry was posted in Automata Theory, programming, Python.

# Proving or refuting regular expression equivalence

Suppose we are interested in the following problem: We have two regular expressions. Do these match exactly the same strings, and if not what is the smallest (shortlex minimal) string that one matches but not the other?

i.e. we either want to prove they are equivalent or provide the best possible refutation that they are not.

How do we do this?

More specifically, let’s consider this problem for extended regular expressions. That is, we have the following types of expressions:

• $$\epsilon$$, which matches only the empty string
• $$c$$, for $$c$$ in our alphabet, which matches only the character $$c$$
• $$\emptyset$$, which matches no strings
• $$\omega$$ which matches every string
• $$x^*$$ which matches $$a_1\ldots a_n$$ for any $$n \geq 0$$ and strings $$a_i$$ each matched by $$x$$.
• $$xy$$ where $$x$$ and $$y$$ are extended regular expressions, which matches any strings $$ab$$ where $$x$$ matches $$a$$ and $$y$$ matches $$b$$
• $$x \vee y$$ which matches any string matched by either or both of $$x$$ or $$y$$
• $$x \wedge y$$ which matches any string matched by both $$x$$ and $$y$$
• $$\neg x$$ which matches any string not matched by $$x$$.

That is, we extend the normal regular expressions with intersection and negation. This doesn’t actually increase the set of languages we can match, but it can lead to significantly more concise representations (I think, but am not sure, that there are extended regular expressions whose smallest equivalent standard regular expression is exponentially larger).

This reduces to the problem of finding the shortlex minimal member of the regular language for an expression, or showing that the language is empty: given two extended regular expressions $$x, y$$, we can define $$x \oplus y = (x \vee y) \wedge (\neg (x \wedge y))$$ – the symmetric difference. This matches any string which is matched by one but not both of $$x$$ and $$y$$, so a string is a member of this regular language precisely if it is a refutation of the claim that $$x$$ and $$y$$ are equivalent.

If we can compile extended regular expressions into a deterministic finite automaton (which we can), then the problem is easy: to find the shortlex minimal element of a regular language, or show that it’s empty, given a deterministic finite automaton for it, you just use Dijkstra’s Algorithm to search the graph for an accepting node. You stop when you either find one (in which case you’ve constructed a shortlex minimal path to it along the way) or you’ve explored all the states (in which case there is no path to an accepting node and the language is empty). Job done.
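Since every edge has the same cost, Dijkstra’s algorithm degenerates to breadth-first search here. A sketch, using a toy DFA encoding of mine (transition table as nested dicts):

```python
from collections import deque

def shortlex_witness(start, transition, accepting, alphabet):
    # BFS, expanding characters in sorted order: the first accepting
    # state we dequeue is reached by the shortlex-least accepted word.
    queue = deque([(start, "")])
    seen = {start}
    while queue:
        state, word = queue.popleft()
        if state in accepting:
            return word
        for c in sorted(alphabet):
            nxt = transition[state][c]
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + c))
    return None  # no accepting state reachable: the language is empty

# DFA for "odd number of a's" over {a, b}.
trans = {0: {"a": 1, "b": 0}, 1: {"a": 0, "b": 1}}
assert shortlex_witness(0, trans, {1}, "ab") == "a"
assert shortlex_witness(0, trans, set(), "ab") is None
```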

That was easy. Are we done?

Well, no. There are a lot of evils hidden in that sentence “just construct a deterministic finite automaton”. The automaton corresponding to a regular expression may be exponentially larger than the expression, and even when it’s not, constructing it is still an expensive operation. Can we avoid that?

First, let’s take a detour: There are a number of obvious rewrite rules we can use to simplify our lives, by replacing regular expressions with straightforwardly equivalent versions of them. We use $$x \approx y$$ to indicate that we should consider $$x$$ and $$y$$ the same under these rewrite rules:

• $$x \wedge x \approx x$$
• $$x \wedge y \approx y \wedge x$$
• $$x \wedge (y \wedge z) \approx (x \wedge y) \wedge z$$
• $$x \wedge \emptyset \approx \emptyset$$
• $$x \wedge \omega \approx x$$
• $$x \vee x \approx x$$
• $$x \vee y \approx y \vee x$$
• $$x \vee (y \vee z) \approx (x \vee y) \vee z$$
• $$x \vee \emptyset \approx x$$
• $$x \vee \omega \approx \omega$$
• $$x(yz) \approx (xy)z$$
• $$x\emptyset \approx \emptyset$$
• $$\emptyset x \approx \emptyset$$
• $$x \epsilon \approx x$$
• $$\epsilon x \approx x$$
• $$(x^*)^* \approx x^*$$
• $$\epsilon^* \approx \epsilon$$
• $$\emptyset^* \approx \epsilon$$
• $$\omega^* \approx \omega$$
• $$\neg \emptyset \approx \omega$$
• $$\neg \omega \approx \emptyset$$
• $$\neg(\neg x) \approx x$$

If $$x \approx y$$ we say that $$x$$ and $$y$$ are similar.

All of these correspond to rules that apply for the resulting languages, so if $$x \approx y$$ then $$x$$ and $$y$$ match the same language. In general though $$x$$ and $$y$$ might match the same languages without being similar.

Why do we care about this? Well the first reason is that it lets us have a fast test for when two regular expressions definitely are equivalent: If they are similar then they are equivalent. If they’re not we have to do more work to test this.

In particular, we can use smart constructors for our regular expression data type to apply all of these rewrite rules at construction time, which then makes similarity simply a matter of testing for structural equality. If we use hash consing (essentially: memoizing all constructors so that structurally equal values are represented by the exact same object) we can even make testing for similarity O(1).
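A toy sketch of such a smart constructor with hash consing, using nested tuples as the AST (the encoding and names are mine, for illustration):

```python
_interned = {}

def make(*node):
    # Hash consing: structurally equal nodes become the very same object,
    # so similarity testing is pointer comparison.
    return _interned.setdefault(node, node)

def union(x, y):
    # Smart constructor applying rewrite rules at construction time:
    # x ∨ x ≈ x, plus a canonical argument order for commutativity.
    if x == y:
        return x
    if repr(y) < repr(x):
        x, y = y, x
    return make("or", x, y)

a, b = make("char", "a"), make("char", "b")
assert union(a, b) is union(b, a)  # similar expressions: the same object
assert union(a, a) is a
```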

Which is good, because we’re going to need it a lot: the real reason to care about similarity of regular expressions is that we can construct a deterministic finite automaton using the Brzozowski derivative of regular expressions, with each of our states labelled by a regular expression. If we have some character $$c$$ in our alphabet and a regular language $$L$$ then the Brzozowski derivative is $$d(L, c) = \{v: cv \in L\}$$ – the set of suffixes of strings starting with $$c$$ in $$L$$. We can extend this to regular expressions with the following set of rules:

• $$d(\epsilon, c) = \emptyset$$
• $$d(b, c) = \emptyset$$ if $$b \neq c$$, else $$d(c, c) = \epsilon$$
• $$d(\emptyset, c) = \emptyset$$
• $$d(\omega, c) = \omega$$
• $$d(x^*, c) = d(x, c) x^*$$
• $$d(xy, c) = d(x, c)y \vee \nu(x) d(y, c)$$
• $$d(x \vee y, c) = d(x, c) \vee d(y, c)$$
• $$d(x \wedge y, c) = d(x, c) \wedge d(y, c)$$
• $$d(\neg x, c) = \neg d(x, c)$$

Where $$\nu(x)$$ is $$\emptyset$$ if the corresponding regular language doesn’t match the empty string, or $$\epsilon$$ otherwise (this can be calculated directly from the expressions by similar rules).

The idea is that we can construct an automaton for a regular language by labelling states by expressions. States are accepting if the corresponding expression matches the empty string, and the transition from state $$x$$ given character $$c$$ goes to $$d(x, c)$$.
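Here is a sketch of those rules for the plain (non-extended) fragment, with expressions as nested tuples. The encoding is my own toy one; it omits the smart constructors, so derivatives grow unsimplified, but it is enough to demonstrate the matching-by-derivatives idea:

```python
EMPTY, EPS = ("empty",), ("eps",)

def nullable(x):
    # Does the expression match the empty string? (nu(x) == epsilon.)
    tag = x[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "char"):
        return False
    if tag == "or":
        return nullable(x[1]) or nullable(x[2])
    return nullable(x[1]) and nullable(x[2])  # "cat"

def d(x, c):
    # The Brzozowski derivative rules listed above.
    tag = x[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "char":
        return EPS if x[1] == c else EMPTY
    if tag == "or":
        return ("or", d(x[1], c), d(x[2], c))
    if tag == "star":
        return ("cat", d(x[1], c), x)
    head = ("cat", d(x[1], c), x[2])  # "cat": d(xy) = d(x)y | nu(x)d(y)
    return ("or", head, d(x[2], c)) if nullable(x[1]) else head

def matches(x, w):
    # A word is matched iff the iterated derivative is nullable.
    for c in w:
        x = d(x, c)
    return nullable(x)

ab_star = ("star", ("cat", ("char", "a"), ("char", "b")))  # (ab)*
assert matches(ab_star, "abab") and matches(ab_star, "")
assert not matches(ab_star, "aba")
```

Extending it with $$\wedge$$, $$\neg$$ and $$\omega$$ is just a matter of adding the remaining cases from the rules above.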

If our similarity rule was enough to guarantee equivalence then this would be a minimal deterministic finite automaton for the language. However, even without that there are only finitely many states reachable from a given regular expression (this is where similarity becomes important! Without it this is not true. e.g. if you consider iterating the derivatives of $$c^*$$ under $$c$$ you’ll get infinitely deep nested chains of similar but unequal expressions).

So far it isn’t obvious why this is interesting, but it has one very important property: You can walk this automaton without having to compute any of the states in advance. This means that you can build it as you explore, so we can repeat the above approach of just using Dijkstra’s algorithm to find a refutation but we don’t have to build the automaton in advance.

It’s particularly easy to calculate the derivatives of the symmetric difference language too:

$$d(x \oplus y, c) = d(x \vee y, c) \wedge d(\neg (x \wedge y), c) = (d(x, c) \vee d(y, c)) \wedge (\neg (d(x, c) \wedge d(y, c))) = d(x, c) \oplus d(y, c)$$.

So if we’ve got the derivatives of our original regular expressions cached in some way it’s very cheap to calculate derivatives of our compound regular expression.

So that’s a major improvement on our original algorithm – in a number of cases it will be exponentially faster!

There are a number of ways you can improve it further: You can prune down the possible set of starting characters for a string in an imprecise way to reduce the number of branches you explore when doing Dijkstra. You can also precalculate when two characters will lead to the same state and always consider only the smallest. This is a worthwhile set of optimisations to the algorithm, but doesn’t affect it in any really fundamental way.

Can we do better yet?

The problem is that although it’s easy to calculate, our new automaton potentially has a lot more states than we want. In general it will be fewer, but if there are $$M$$ states in $$x$$ and $$N$$ in $$y$$ then there might be as many as $$MN$$ in the automaton for the symmetric difference regular expression. We’ll likely not have to explore all of them, but it’s unclear how many you might have to explore.

But you can do equivalence testing in nearly $$O(M + N)$$ using an algorithm due to Hopcroft and Karp based on union-find (note: the algorithm called the Hopcroft-Karp algorithm is, confusingly, something completely different). The idea is that we assume the two automata are equivalent and merge states until we find a contradiction; by aggressively merging states we can consider a lot fewer of them:

Given two automata $$S$$ and $$T$$ with initial states $$s$$ and $$t$$, we test equivalence as follows:

```python
def equivalent(S, T):
    s = S.start
    t = T.start
    merges = UnionFind()
    merges.union(s, t)
    stack = [(s, t)]
    while stack:
        p, q = stack.pop()
        for a in ALPHABET:
            pa = merges.find(S.transition(p, a))
            qa = merges.find(T.transition(q, a))
            if pa.accepting != qa.accepting:
                return False
            if pa != qa:
                merges.union(pa, qa)
                stack.append((pa, qa))
    return True
```

The idea is that if two states are equivalent then the states they transition to under the same character must be equivalent, and if we’ve proved that an accepting and non-accepting state are equivalent then we’ve reached a contradiction and our initial assumption must have been false. This merges two states if and only if there is some string that leads to one in one automaton and the other in the other, so if there is some string that one matches but not the other it will eventually merge those two states (if it doesn’t stop earlier).

Because of the way we aggressively merge states, this runs in time O(k(M + N)) where k is the size of the alphabet and M and N are the sizes of the respective automata (there’s actually an inverse Ackermann factor in there too, but the inverse Ackermann function is so close to constant it might as well be ignored).
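The code above assumes a UnionFind structure without defining it. Here’s a minimal sketch of one (my own, not from the post’s codebase; it uses path halving but no union by rank, which is enough to experiment with, though the full inverse-Ackermann bound needs both):

```python
class UnionFind:
    """Disjoint-set forest with path halving (no union by rank)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Elements are added lazily, each starting as its own root.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving: point x at its grandparent as we walk up.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)
```

Any hashable values work as elements, so the automaton states themselves can be used directly, and `find` returns one of the actual states as the representative of each merged group.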

The problem with this algorithm is that it doesn’t produce a refutation, only a yes or no answer. We can trivially modify it by pushing paths along with the states, but the paths generated aren’t necessarily useful and in particular don’t necessarily lead to a state that is accepting in one automaton and not in the other – we’ve replaced paths to states we believed to be equivalent, but because the core assumption was wrong this replacement didn’t necessarily preserve anything. In particular there’s no guarantee that either of the states we use to refute the result is accepting, only that we falsely believed them to be equivalent to two states where one is accepting and the other is not.

We can still make use of this as a fast initial check: run Hopcroft and Karp’s algorithm first, and if it returns false, construct a refutation once we know one must exist. This approach also still works well with the lazy construction.

But it would be nice to be able to produce a witness as an integrated part of running this algorithm.

And it turns out we can: The idea is to hybridise the two algorithms – use Dijkstra’s algorithm to build the witness, but instead of operating on a graph corresponding to a single automaton, we build a graph on pairs of states, one from each automaton – an edge then involves following the corresponding edge in each automaton. Rather than searching for an accepting node, we are now searching for a node where one state is accepting and the other is not.

The key thing we then borrow from the algorithm of Hopcroft and Karp is that we can skip states where we have merged the pairs, because there is nothing interesting down that route even if we haven’t seen it.

Here’s some code for the resulting algorithm:

```python
def witness_difference(left, right):
    if left is right:
        return None
    if left.matches_empty != right.matches_empty:
        return b''
    merges = UnionFind()
    merges.union(left, right)

    queue = deque()
    queue.append((left, right, ()))
    while len(queue) > 0:
        p, q, s = queue.popleft()
        for a in ALPHABET:
            pa = derivative(p, a)
            qa = derivative(q, a)
            if pa is qa:
                continue
            t = (a, s)
            if merges.find(pa) != merges.find(qa):
                if pa.matches_empty != qa.matches_empty:
                    result = bytearray()
                    while t:
                        a, t = t
                        result.append(a)
                    result.reverse()
                    return bytes(result)
                merges.union(pa, qa)
                queue.append((pa, qa, t))
    return None
```

One important difference here is that we always put the actual reached pairs of states onto the queue rather than the collapsed pair. This ensures that the paths we take are real paths to real states in the corresponding automaton. Also, because we stop as soon as we would ever merge an accepting and non-accepting state, the problem we worried about above can’t happen.

Also it’s worth noting that I’m not sure this algorithm produces a minimal refutation. It should produce a pretty small one, but there could be a smaller one reachable through two states which the algorithm has merged but that are not actually equivalent.
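To make the idea concrete, here’s a self-contained variant of the same hybrid search operating on explicit table-based DFAs rather than regular expression derivatives (the names and representation here are mine, not from the post’s code): states are arbitrary hashable values that must be disjoint between the two automata, acceptance is a combined set, and transitions are a dict.

```python
from collections import deque

class UnionFind:
    """Minimal disjoint-set forest with path halving."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def witness_difference(start_a, start_b, accepting, delta, alphabet):
    """Return a short string accepted by exactly one automaton, or None.

    `accepting` is the combined set of accepting states and `delta` maps
    (state, char) to the successor state for both automata at once.
    """
    if (start_a in accepting) != (start_b in accepting):
        return ''
    merges = UnionFind()
    merges.union(start_a, start_b)
    queue = deque([(start_a, start_b, '')])
    while queue:
        p, q, s = queue.popleft()
        for c in alphabet:
            pa, qa = delta[(p, c)], delta[(q, c)]
            if merges.find(pa) == merges.find(qa):
                continue  # nothing new down this route
            if (pa in accepting) != (qa in accepting):
                return s + c
            merges.union(pa, qa)
            # Queue the real pair of states, not their representatives,
            # so the recorded path is a real path in both automata.
            queue.append((pa, qa, s + c))
    return None
```

For example, comparing a two-state automaton for “even number of 1s” against one that accepts everything yields the refutation `'1'`. As noted above, the witness is small but not guaranteed minimal.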

The requirement for witnesses is a slightly unusual one – most algorithms just focus on testing for equivalence – so it would be reasonable to ask why we care.

The reason I care is that I’m interested in the following more general question: I want to maintain a set of N regular languages as witnessed by regular expressions, and I want to be able to test for membership of this set and add new elements to it.

I’m still not sure what the best way to do this is, but all of the best ideas I have require refutations to prevent you from having to do O(N) equivalence tests. For example, the following works: Maintain a binary tree, where each branch node has a string and each child node is a regular expression. When testing for membership, walk the tree as follows: At a branch, if the language contains the string then go to the left child else go to the right child. At a leaf, perform an equivalence test. If it returns true then the language is a member.

Insertion is then nearly the same: We walk until we get a leaf. At the leaf we perform an equivalence test with a refutation. If the two languages are equivalent then there’s no insertion to perform. If they’re not equivalent, then we create a split node with the refutation string as the key and put whichever one contains it as a leaf for the left child and the other as the right.
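Here’s a sketch of that tree (all names are mine). To keep it runnable, the test below models a “language” as a finite set of strings; any representation with a membership test and an equivalence-with-refutation operation works the same way:

```python
class LangTree:
    """Binary tree of languages keyed by distinguishing strings.

    matches(lang, s) -> bool is the membership test; refute(a, b)
    returns a string in exactly one of a and b, or None if equivalent.
    """

    def __init__(self, matches, refute):
        self.matches = matches
        self.refute = refute
        self.root = None  # leaf: {'lang': L}; split: {'key', 'yes', 'no'}

    def contains(self, lang):
        node = self.root
        while node is not None and 'key' in node:
            node = node['yes'] if self.matches(lang, node['key']) else node['no']
        return node is not None and self.refute(node['lang'], lang) is None

    def insert(self, lang):
        if self.root is None:
            self.root = {'lang': lang}
            return
        # Walk to a leaf, remembering the parent link for replacement.
        parent, side, node = None, None, self.root
        while 'key' in node:
            parent = node
            side = 'yes' if self.matches(lang, node['key']) else 'no'
            node = node[side]
        witness = self.refute(node['lang'], lang)
        if witness is None:
            return  # already present, up to equivalence
        new_leaf = {'lang': lang}
        if self.matches(lang, witness):
            split = {'key': witness, 'yes': new_leaf, 'no': node}
        else:
            split = {'key': witness, 'yes': node, 'no': new_leaf}
        if parent is None:
            self.root = split
        else:
            parent[side] = split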

This isn’t a great algorithm, in that it doesn’t have any easy way to balance the tree, but it’s always better than doing N equivalence checks: Matching a short string is very cheap, and we always perform at most N – 1 matches (potentially many fewer) and exactly one equivalence check.

In fact, it is not in general possible to balance the tree; Consider N languages where language k matches only the string consisting of a sequence of k 0s. Then each test string is matched by exactly one language, so every split must have one of its children be a leaf and the lookup problem degenerates to always testing N strings.

Another algorithm is to maintain a trie of all the refutations found so far. You can then simultaneously walk the regular expression’s deterministic finite automaton and the trie to produce a bit vector where bit i is set if the language contains the i’th string. This bitvector is then the same as that of at most one existing language, which can be looked up using e.g. a hash table. So to test membership you produce the bitvector and check for presence in the hash table. If it’s not present then the language isn’t present. If it is present, then check equivalence with the unique existing language with that bitvector. Insertion is similar: If the bitvector isn’t present, add the language to the hash table and leave the set of refutations alone. Else, check for equivalence with a refutation to the unique existing language with the same bitvector. If they’re equivalent, do nothing. If they aren’t, add the refutation to the trie and update the bitvector for all existing languages (you can actually do this last step incrementally as you go, but it may not be worth it).
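A sketch of the second scheme (again with names of my own choosing), simplified by keeping the refutations in a plain list rather than a trie – so each refutation is matched separately instead of in one simultaneous walk, which keeps the code short at the cost of the trie’s efficiency:

```python
class LangSet:
    """Set of languages keyed by their behaviour on known refutations.

    matches(lang, s) -> bool is the membership test; refute(a, b)
    returns a string in exactly one of a and b, or None if equivalent.
    """

    def __init__(self, matches, refute):
        self.matches = matches
        self.refute = refute
        self.refutations = []
        self.table = {}  # bitvector tuple -> unique language with it

    def _signature(self, lang):
        return tuple(self.matches(lang, r) for r in self.refutations)

    def contains(self, lang):
        existing = self.table.get(self._signature(lang))
        return existing is not None and self.refute(existing, lang) is None

    def add(self, lang):
        existing = self.table.get(self._signature(lang))
        if existing is None:
            self.table[self._signature(lang)] = lang
            return
        witness = self.refute(existing, lang)
        if witness is None:
            return  # equivalent to an existing language
        self.refutations.append(witness)
        # Rebuild the table with the extended signatures.
        langs = list(self.table.values()) + [lang]
        self.table = {self._signature(l): l for l in langs}
```

The rebuild in `add` is the O(N) step mentioned below; it cannot collide, because any two languages in the table are already separated by some earlier refutation.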

The second algorithm is probably strictly worse than the first, because it involves O(N) computation on a lot of inserts, as well as making testing for membership more expensive. It really depends on how much cheaper equivalence is than matching for your language workload, and how likely existing refutations are to witness future ones.

One reason why being able to do the above is interesting is that it allows us to extend the lazy construction of a deterministic finite automaton to a lazy construction of a minimal deterministic finite automaton: If we can maintain such a set relatively efficiently, then instead of merely conflating similar expressions into the same states we can label states with regular languages, giving us the proper derivative automaton rather than just our approximation to it.

So that’s about where my ability to answer this question leaves off. There may be a good data structure for this, but if there is I haven’t found the right corner of the literature to tell me what it is.

If you want to have a play with some of this, here’s some (mostly untested, so probably badly wrong in places) Python code implementing a lot of these ideas:

If you’d like to learn more about this, most of the information in this post either came from or was based on material from Regular-expression derivatives reexamined, Testing the Equivalence of Regular Languages, and Algorithms for testing equivalence of finite automata, with a grading tool for JFLAP (which are in turn based on a lot of other papers you can get to from their citations).

(And, as always, if you’d like to see more posts like this and feel able to spare the price of less than a coffee per month, I encourage you to donate to my Patreon for supporting my writing.)

This entry was posted in Automata Theory, programming on by .

# L* Search and Inference of Deterministic Finite Automata

This post follows on from my previous one about the Myhill–Nerode theorem. You should read that one first.

When trying to construct a deterministic finite automaton for a language in practice, one thing we need to know is how it’s represented. If all we have is a yes/no predicate, the problem is in general intractable (it could be any set of strings at all!). If all we have is a set of examples and non-examples and we want to construct a minimal deterministic finite automaton (DFA), the problem is NP hard.

This post is about what we can do if we augment the yes/no predicate only slightly: We have what is called a minimally adequate teacher. It can answer two types of questions about our language $$L$$:

1. Is this string in the language?
2. Given this DFA, does it correctly recognise the language and if not what is an example of a string where it gives the wrong answer?

In this scenario there is an algorithm, polynomial in the size of the minimal DFA (and logarithmic in the length of the longest counterexample the teacher returns), which will terminate with a correct DFA for the language.

This algorithm was originally presented by Dana Angluin in her paper “Learning Regular Sets from Queries and Counterexamples” under the name L* search. The version I’m going to present here is in fact a slightly later version of the algorithm presented by Ronald L. Rivest and Robert E. Schapire in “Inference of Finite Automata Using Homing Sequences” which improves on the algorithmic complexity and makes the underlying nature of the algorithm slightly clearer. Regardless, I’m still going to call it L* search.

The idea of L* search (although not how it was originally presented) is that we are going to try to construct the Myhill–Nerode automaton, $$Z$$, directly. We will do this as follows:

1. We will maintain a sequence $$w_1, \ldots, w_m$$ of witnesses and a sequence $$t_1, \ldots, t_n$$ of tests. The $$w_i$$ are strings no two of which are word equivalent, i.e. each is a member of a distinct equivalence class in $$Z$$, and the $$t_i$$ are tests that witness this: for any $$i, j \leq m$$ with $$i \ne j$$ there is some $$k$$ such that $$w_i t_k \in L \ne w_j t_k \in L$$.
2. We start with $$w_1 = t_1 = \epsilon$$.
3. At each stage we either terminate or we will extend the $$t_i$$ by one test and the $$w_i$$ by one or more strings.
4. This must terminate, because we’re creating a sequence of non word equivalent strings so we can’t have more elements in it than there are states in $$Z$$ (because the states are equivalence classes of strings). It’s not currently obvious that we can always make progress prior to that point, but that will emerge in the construction of the algorithm.

The core idea is to use the $$t_i$$ to ‘fingerprint’ every string. The $$t_i$$ give us a mapping $$\mathrm{row}: S \to \{0, 1\}^n$$ where $$\mathrm{row}(s)_i$$ is $$1$$ if $$s t_i \in L$$, else $$0$$. We can calculate this fingerprint because we are allowed to ask the teacher whether a string is in the language. We know for a fact that if $$\mathrm{row}(s) \ne \mathrm{row}(t)$$ then $$s$$ and $$t$$ aren’t word equivalent, because one of the $$t_i$$ must witness that.
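In code, the fingerprint is just the following (a sketch of my own, with `member` standing in for the teacher’s membership oracle):

```python
def row(s, tests, member):
    """Fingerprint of s: entry i records whether s + tests[i] is in L.

    `member` is the membership oracle for the unknown language L.
    """
    return tuple(member(s + t) for t in tests)
```

For instance, with L the even-length strings and tests `['', '0']`, the strings `''` and `'00'` get the same fingerprint while `'0'` gets a different one, correctly separating the two word-equivalence classes.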

Given our current sequence of tests $$t_i$$ and our witnesses $$w_i$$, we use the row function to construct the following automaton whose states are the $$w_i$$. In the course of constructing it we may have to extend $$w_i$$:

For each $$w_i$$ and each $$a \in A$$, we consider $$\mathrm{row}(w_i a)$$. Define $$\delta(w_i, a) = w_j$$, where $$w_j$$ is chosen such that $$\mathrm{row}(w_i a) = \mathrm{row}(w_j)$$. If no such $$w_j$$ exists, then extend the sequence by adding $$w_i a$$ to the end so that it does.

We continue this process until the transition function has been defined for all elements of the sequence (including the ones newly added).

We then define the initial state of this finite state machine to be $$w_1 = \epsilon$$, and a state $$w_i$$ to be accepting if and only if $$w_i \in L$$.
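The construction loop might look like this (a sketch under my own naming; `member` is the membership oracle, and `witnesses` is extended in place as new fingerprints appear):

```python
def build_hypothesis(witnesses, tests, member, alphabet):
    """Build the conjectured transition table over witness indices.

    Returns a dict mapping (state index, character) -> state index.
    New witnesses are appended whenever a successor's fingerprint
    matches no known state.
    """
    def row(s):
        return tuple(member(s + t) for t in tests)

    rows = {row(w): i for i, w in enumerate(witnesses)}
    delta = {}
    i = 0
    while i < len(witnesses):
        for a in alphabet:
            r = row(witnesses[i] + a)
            if r not in rows:
                # No known state has this fingerprint: add a new one.
                rows[r] = len(witnesses)
                witnesses.append(witnesses[i] + a)
            delta[(i, a)] = rows[r]
        i += 1
    return delta
```

The initial state is index 0 (the witness $$\epsilon$$), and state $$i$$ is accepting if and only if `member(witnesses[i])`.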

The resulting state machine is more or less the same as the Myhill–Nerode automaton. The only differences are:

1. We have explicitly chosen an instance of each equivalence class (the $$w_i$$)
2. Instead of considering whether states are distinguishable by any test, we only ask if they are distinguishable by our current tests $$t_i$$. This can cause us to conflate more states than we should, and fixing that will be the core point of what we do next.

So we now have a DFA to present to the teacher. At this point we perform a counter-example query. Either our teacher says that the DFA is correct, in which case we’re done, or it presents us with a counter-example in the form of a string that the DFA predicts the wrong answer for. Call our current DFA $$Q$$, and that counter-example $$c$$.

What we want to do now is to take that mispredicted string and extract from it an experiment that shows us where we went wrong: At some point we must have made a transition $$\delta(w_i, a) = w_j$$ when in actual fact $$w_ia$$ and $$w_j$$ are not word equivalent (if we never made such a transition then the path we walked is the same as the path we would have walked in $$Z$$, which does correctly recognise the language). So what we want to do is find an experiment $$t$$ such that $$w_i a t \in L \ne w_j t \in L$$, and add $$t$$ as our next experiment.

If we do that, then when we build the next version of the DFA with the new set of experiments the number of states must increase: We already know that $$w_i a$$ is not word-equivalent to $$w_k$$ for any $$k \ne j$$. We have just shown that it is also not word equivalent to $$w_j$$. Thus, in order to find a state to transition to when we reach $$w_i$$ and read $$a$$, we must either have already added a new state or we add one there and then.

The idea to find such a transition, and an experiment that would distinguish it, is this: We use the suffixes of $$c$$ as our experiment candidates.

We’ll look at a sequence of strings that should be equivalent if our DFA were correct, and look for the exact moment that something goes wrong to find our experiment.

Consider splitting $$c$$ up into a sequence of strings $$u_i v_i$$, with $$0 \leq i \leq |c|$$ and $$|u_i| = i$$. i.e. $$u_i$$ is the first $$i$$ characters of $$c$$ and $$v_i$$ is the remaining characters.

(Notation: $$|s|$$ is the length of the string $$s$$).

Now, we can replace $$u_i$$ with $$\delta_Q(q_0, u_i)$$ – the string that represents the state that you get to after reading the first $$i$$ characters – and not have this change whether the result is in the language recognized by $$Q$$. However, the same is not true of the language $$L$$: If we consider $$i = |c|$$ then this string is $$\delta_Q(q_0, c)$$, and by the fact that $$c$$ is a counter-example we know that $$\delta_Q(q_0, c) \in L \ne c \in L$$.

Call a value of $$i$$ bad if it has the property that $$\delta_Q(q_0, u_i) v_i \in L \ne c \in L$$. i.e. if replacing the state we expect to be in after $$i$$ characters with the representative for that state changes whether the string is recognised as being in the desired language. Necessarily, $$|c|$$ is bad and $$0$$ is good (because $$u_0 = \epsilon$$, so the replacement string is just $$c$$ again).
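Spelled out in code (a sketch with hypothetical helpers of my own: `state_after(u)` returns the hypothesis state reached by reading `u`, `representative` maps each state to its witness string, and `member` is the membership oracle):

```python
def bad(i, c, member, state_after, representative):
    """Is index i bad? True if substituting the representative string
    for the state the hypothesis reaches after c's first i characters
    changes membership in L."""
    u, v = c[:i], c[i:]
    return member(representative[state_after(u)] + v) != member(c)
```

For example, if L is the even-length strings and our (wrong) hypothesis is a single accept-everything state represented by $$\epsilon$$, then for the counter-example `'0'` index 0 is good and index 1 is bad.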

What we want to do is find an index $$i$$ such that $$i$$ is good and $$i + 1$$ is bad. We’ll see how to do that in a moment, but first let’s see why it would help:

What we want to show is that if we have such an $$i$$ then the transition from $$\delta_Q(q_0, u_i)$$ to $$\delta_Q(q_0, u_{i + 1})$$ via $$c_{i + 1}$$ is the bad transition we were looking for, and that $$v_{i+1}$$ is a test that witnesses this.

To see this, note that because $$i + 1$$ is bad we must have $$\delta_Q(q_0, u_{i + 1}) v_{i + 1} \in L \ne c \in L$$, and because $$i$$ is good we must have $$\delta_Q(q_0, u_{i}) v_{i} \in L = c \in L$$, so $$\delta_Q(q_0, u_{i}) v_{i} \in L \ne \delta_Q(q_0, u_{i + 1}) v_{i + 1} \in L$$. But $$v_i = c_{i + 1} v_{i + 1}$$, so this means we have $$\delta_Q(q_0, u_{i}) c_{i + 1} v_{i + 1} \in L \ne \delta_Q(q_0, u_{i + 1}) v_{i + 1} \in L$$.

i.e. $$v_{i + 1}$$ is an experiment distinguishing $$\delta_Q(q_0, u_{i}) c_{i + 1}$$ from $$\delta_Q(q_0, u_{i + 1})$$. But by construction, $$\delta_Q(q_0, u_{i + 1}) = \delta_Q(\delta_Q(q_0, u_i), c_{i + 1})$$, which makes $$v_{i + 1}$$ our desired experiment to add to the list.

Now, to complete the process: How do we find such an $$i$$?

The most direct approach would be to just do a linear scan, but we can do better: You can use binary search, by observing that the general binary search algorithm works just fine without any sorting at all if you regard it as looking for a point at which a function changes in a particular way.

In python pseudocode, we can find such an $$i$$ as follows:

```python
# Invariants: lo is good, hi is bad
lo = 0
hi = len(c)
while lo + 1 < hi:
    mid = (lo + hi) // 2
    if bad(mid):
        hi = mid
    else:
        lo = mid
```

This then takes $$\log_2(|c|)$$ queries and when it terminates we can take $$i = \mathrm{lo}$$.

That was the final step in our puzzle for the construction: As long as our automaton does not recognise $$L$$, we can find a new experiment, add it to the list, and thus force our automaton to keep growing. When it can grow no longer it must be equivalent to $$Z$$, because it is an automaton of the same size which recognises $$L$$.

We’re almost done, but let’s do some analysis of how many queries we made here (in this analysis I will assume that all answers to membership queries are cached, so we only make each once).

Suppose our final set of tests is $$T$$. Note that $$|T| \leq |Z|$$ because every time we added to $$T$$ we increased the number of states by at least one, and the algorithm terminated when the number of states was equal to $$|Z|$$.

We have performed precisely $$|T|$$ counter-example queries because every time we performed a counter-example query we added exactly $$1$$ element to $$T$$.

Now let’s see how many membership queries we’ve performed.

We performed membership queries in two places: First, when building the transition table. Secondly, when trying to construct the tests.

For the first: For every witness $$w_i$$, every $$a \in A$$ and every $$t \in T$$ we have calculated whether $$w_i a t \in L$$ in order to compute the rows and the transitions. So that comes to $$|Z||A||T| \leq |A| |Z|^2$$ membership queries.

In calculating the test, we had to perform one membership query at every point in the binary search in order to determine whether mid was bad. Therefore we performed $$O(\log |c|)$$ membership queries per counter-example. Letting $$m$$ be the maximum length of any counter-example provided in the course of our run, we thus performed a total of $$O(|T| \log m) = O(|Z| \log m)$$ membership queries.

So, in our final analysis, running L* search with a minimally adequate teacher for our language requires:

1. $$O(|A||Z|^2 + |Z| \log(m))$$ membership queries
2. $$O(|Z|)$$ counterexample queries

In fact these estimates are a bit too conservative. As we’ll see in a later post, it is sometimes going to be the case that $$|T| \ll |Z|$$.

Why do we care?

Personally, I mostly care because I think L* is a super-neat algorithm, and the concept of a minimally adequate teacher as a learning environment is surprisingly general.

You can also put this into a stochastic setting to relax the requirement of the existence of a minimally adequate teacher (which may be hard to obtain), in which case you instead get to build DFAs which give the correct answer with at least some chosen probability when the string is drawn from a fixed choice of distribution.

However, the evidence seems to be that most people in the field of language learning don’t care about this algorithm. There hasn’t been a great deal of research on it in recent years that I’ve been able to find, and most of the research on it from back in the day was less concerned with languages than automata: You frequently got this framed in the context of a robot learning to navigate a world.

I recently read Grammatical Inference: Learning Automata and Grammars, which I can highly recommend if you found this post interesting, though there won’t be many more posts like it in this series. The next post will use some of the ideas from it, but will be moving away from the area of inference.

This entry was posted in Automata Theory on by .