
L* Search and Inference of Deterministic Finite Automata

This post follows on from my previous one about the Myhill–Nerode theorem. You should read that one first.

When trying to construct a deterministic finite automaton for a language in practice, one thing we need to know is how the language is represented. If all we have is a yes/no predicate, the problem is in general intractable (it could be any set of strings at all!). If all we have is a set of examples and non-examples and we want to construct a minimal deterministic finite automaton (DFA), the problem is NP-hard.

This post is about what we can do if we augment the yes/no predicate only slightly: We have what is called a minimally adequate teacher. It can answer two types of questions about our language \(L\):

  1. Is this string in the language?
  2. Given this DFA, does it correctly recognise the language, and if not, what is an example of a string where it gives the wrong answer?

In this scenario there is an algorithm, polynomial in the size of the minimal DFA (and logarithmic in the length of the longest counterexample the teacher returns), which will terminate with a correct DFA for the language.

This algorithm was originally presented by Dana Angluin in her paper “Learning Regular Sets from Queries and Counterexamples” under the name L* search. The version I’m going to present here is in fact a slightly later version of the algorithm presented by Ronald L. Rivest and Robert E. Schapire in “Inference of Finite Automata Using Homing Sequences” which improves on the algorithmic complexity and makes the underlying nature of the algorithm slightly clearer. Regardless, I’m still going to call it L* search.

The idea of L* search (although not how it was originally presented) is that we are going to try to construct the Myhill–Nerode automaton, \(Z\), directly. We will do this as follows:

  1. We will maintain a sequence \(w_1, \ldots, w_m\) of witnesses and a sequence \(t_1, \ldots, t_n\) of tests. The \(w_i\) are strings, no two of which are word equivalent – i.e. each is a member of a distinct equivalence class in \(Z\) – and the \(t_k\) are tests that witness this: for any \(i, j \leq m\) with \(i \ne j\) there is some \(k\) such that \(w_i t_k \in L \ne w_j t_k \in L\).
  2. We start with \(w_1 = t_1 = \epsilon\).
  3. At each stage we either terminate or we extend the tests by one element and the witnesses by one or more strings.
  4. This must terminate, because we’re creating a sequence of non word equivalent strings so we can’t have more elements in it than there are states in \(Z\) (because the states are equivalence classes of strings). It’s not currently obvious that we can always make progress prior to that point, but that will emerge in the construction of the algorithm.

The core idea is to use the \(t_i\) to ‘fingerprint’ every string. The \(t_i\) give us a mapping \(\mathrm{row}: S \to \{0, 1\}^n\) where \(\mathrm{row}(s)_i\) is \(1\) if \(st_i \in L\), else \(0\). We can calculate this fingerprint because we are allowed to ask the teacher whether a string is in the language. We know for a fact that if \(\mathrm{row}(s) \ne \mathrm{row}(t)\) then \(s\) and \(t\) aren’t word equivalent, because one of the \(t_i\) must witness that.
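
The fingerprint is easy to compute. Here is a sketch in Python, where `member` is a hypothetical wrapper around the teacher’s membership query (strings are Python strings and concatenation is `+`; the names are mine, for illustration):

def row(s, tests, member):
    # The fingerprint of s: which of the current tests send it into L.
    # member(x) asks the teacher whether x is in the language.
    return tuple(1 if member(s + t) else 0 for t in tests)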

Given our current sequence of tests \(t_i\) and our witnesses \(w_i\), we use the row function to construct the following automaton, whose states are the \(w_i\). In the course of constructing it, we may have to extend the witness sequence:

For each \(w_i\) and each \(a \in A\), we consider \(\mathrm{row}(w_i a)\). Define \(\delta(w_i, a) = w_j\), where \(w_j\) is chosen such that \(\mathrm{row}(w_i a) = \mathrm{row}(w_j)\). If no such \(w_j\) exists, then extend the sequence by adding \(w_i a\) to the end so that one does.

We continue this process until the transition function has been defined for all elements of the sequence (including the ones newly added).

We then define the initial state of this finite state machine to be \(w_1 = \epsilon\), and a state \(w_i\) to be accepting if and only if \(w_i \in L\).
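
Putting that together, here is a sketch of the construction in Python, using the `row` function above (again the names and calling convention are mine; note that the list of witnesses is extended in place as new states are discovered):

def build_hypothesis(witnesses, tests, alphabet, member):
    # Map each known fingerprint to the index of its witness.
    index = {row(w, tests, member): i for i, w in enumerate(witnesses)}
    transitions = {}
    i = 0
    while i < len(witnesses):  # witnesses may grow as we go
        for a in alphabet:
            fingerprint = row(witnesses[i] + a, tests, member)
            if fingerprint not in index:
                # No existing state has this row, so w_i a becomes a
                # new witness at the end of the sequence.
                index[fingerprint] = len(witnesses)
                witnesses.append(witnesses[i] + a)
            transitions[(i, a)] = index[fingerprint]
        i += 1
    # A state is accepting if and only if its witness is in L.
    accepting = {i for i, w in enumerate(witnesses) if member(w)}
    return transitions, accepting  # the initial state is 0, i.e. epsilon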

The resulting state machine is more or less the same as the Myhill–Nerode automaton. The only differences are:

  1. We have explicitly chosen an instance of each equivalence class (the \(w_i\))
  2. Instead of considering whether states are distinguishable by any test, we only ask if they are distinguishable by our current tests \(t_i\). This can cause us to conflate more states than we should, and fixing that will be the core point of what we do next.

So we now have a DFA to present to the teacher. At this point we perform a counter-example query. Either our teacher says that the DFA is correct, in which case we’re done, or it presents us with a counter-example in the form of a string that the DFA predicts the wrong answer for. Call our current DFA \(Q\), and that counter-example \(c\).

What we want to do now is to take that mispredicted string and extract from it an experiment that shows us where we went wrong: At some point we must have made a transition \(\delta(w_i, a) = w_j\) when in actual fact \(w_ia\) and \(w_j\) are not word equivalent (if we never made such a transition then the path we walked is the same as the path we would have walked in \(Z\), which does correctly recognise the language). So what we want to do is find an experiment \(t\) such that \(w_i a t \in L \ne w_j t \in L\), and add \(t\) as our next experiment.

If we do that, then when we build the next version of the DFA with the new set of experiments, the number of states must increase: we already know that \(w_i a\) is not word equivalent to \(w_k\) for any \(k \ne j\), and we have just shown that it is also not word equivalent to \(w_j\). Thus, in order to define the transition for \(w_i a\), we must either have already added a new state or we add one there and then.

The idea to find such a transition, and an experiment that would distinguish it, is this: We use the suffixes of \(c\) as our experiment candidates.

We’ll look at a sequence of strings that should be equivalent if our DFA were correct, and look for the exact moment that something goes wrong to find our experiment.

Consider splitting \(c\) up into a sequence of strings \(u_i v_i\), with \(0 \leq i \leq |c|\) and \(|u_i| = i\), i.e. \(u_i\) is the first \(i\) characters of \(c\) and \(v_i\) is the remaining characters.

(Notation: \(|s|\) is the length of the string \(s\)).

Now, we can replace \(u_i\) with \(\delta_Q(q_0, u_i)\) – the string that represents the state you get to after reading the first \(i\) characters – without changing whether the result is in the language recognised by \(Q\). However, the same is not true of the language \(L\): if we consider \(i = |c|\) then the replaced string is \(\delta_Q(q_0, c)\), and by the fact that \(c\) is a counter-example we know that \(\delta_Q(q_0, c) \in L \ne c \in L\).

Call a value of \(i\) bad if it has the property that \(\delta_Q(q_0, u_i) v_i \in L \ne c \in L\), i.e. if replacing the state we expect to be in after \(i\) characters with the representative for that state changes whether the string is recognised as being in the desired language. Necessarily, \(|c|\) is bad, and \(0\) is good (because \(u_0 = \epsilon\), so the replacement string is just \(c\) again).
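
In code, the badness check might look like this (a sketch: `member` is as before, and `state_string(u)` is a hypothetical helper returning the witness string for \(\delta_Q(q_0, u)\) in the current DFA):

def bad(i, c, member, state_string):
    # i is bad if replacing the first i characters of c with the witness
    # for the state they lead to changes membership in L.
    return member(state_string(c[:i]) + c[i:]) != member(c)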

What we want to do is find an index \(i\) such that \(i\) is good and \(i + 1\) is bad. We’ll see how to do that in a moment, but first let’s see why it would help:

What we want to show is that if we have such an \(i\) then the transition from \(\delta_Q(q_0, u_i)\) to \(\delta_Q(q_0, u_{i + 1})\) via the character \(c_{i + 1}\) is the bad transition we were looking for, and that \(v_{i+1}\) is a test that witnesses this.

To see this, note that because \(i + 1\) is bad we must have \(\delta_Q(q_0, u_{i + 1}) v_{i + 1} \in L \ne c \in L\), and because \(i\) is good we must have \(\delta_Q(q_0, u_{i}) v_{i} \in L = c \in L\), so \(\delta_Q(q_0, u_{i}) v_{i} \in L \ne \delta_Q(q_0, u_{i + 1}) v_{i + 1} \in L\). But \(v_i = c_{i + 1} v_{i + 1}\), so this means we have \(\delta_Q(q_0, u_{i}) c_{i + 1} v_{i + 1} \in L \ne \delta_Q(q_0, u_{i + 1}) v_{i + 1} \in L\).

i.e. \(v_{i + 1}\) is an experiment distinguishing \(\delta_Q(q_0, u_{i}) c_{i + 1}\) from \(\delta_Q(q_0, u_{i + 1})\). But by construction, \(\delta_Q(q_0, u_{i + 1}) = \delta_Q(\delta_Q(q_0, u_i), c_{i + 1})\), which makes \(v_{i + 1}\) our desired experiment to add to the list.

Now, to complete the process: How do we find such an \(i\)?

The most direct approach would be a linear scan, but we can do better: you can use binary search, relying on the observation that general binary search works just fine without any sorting if you regard it as looking for the point at which a boolean function changes value.

In Python, we can find such an \(i\) as follows (taking the badness predicate from above as an argument):

def find_boundary(c, bad):
    # Invariant: lo is always good, hi is always bad.
    lo, hi = 0, len(c)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if bad(mid):
            hi = mid
        else:
            lo = mid
    return lo

This takes \(\log_2(|c|)\) membership queries, and the returned \(\mathrm{lo}\) is our desired \(i\).

That was the final step in the construction: as long as our automaton does not recognise \(L\), we can find a new experiment, add it to the list, and thus force our automaton to keep growing. When it can grow no longer it must be equivalent to \(Z\), because it is an automaton of the same size which recognises \(L\).
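
Tying the pieces together, the top-level loop is short. This is a sketch in terms of the hypothetical helpers above, plus an `equivalence` function wrapping the teacher’s counter-example query (returning None if the DFA is correct, and a counter-example string otherwise):

def lstar(alphabet, member, equivalence):
    witnesses, tests = [""], [""]      # w_1 = t_1 = epsilon
    while True:
        # build_hypothesis extends witnesses in place as needed.
        transitions, accepting = build_hypothesis(
            witnesses, tests, alphabet, member)
        c = equivalence(witnesses, transitions, accepting)
        if c is None:                  # the teacher accepts our DFA
            return witnesses, transitions, accepting

        def state_string(u):           # witness for delta_Q(q_0, u)
            q = 0
            for ch in u:
                q = transitions[(q, ch)]
            return witnesses[q]

        i = find_boundary(c, lambda j: bad(j, c, member, state_string))
        tests.append(c[i + 1:])        # add the new experiment v_{i+1}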

We’re almost done, but let’s do some analysis of how many queries we made (I will assume that all answers to membership queries are cached, so we only make each one once).

Suppose our final set of tests is \(T\). Note that \(|T| \leq |Z|\), because every time we added to \(T\) we increased the number of states by at least one, and the algorithm terminated when the number of states was equal to \(|Z|\).

We have performed precisely \(|T|\) counter-example queries: every failed query added exactly one element to \(T\), the first test \(t_1 = \epsilon\) was free, and the final, successful query added nothing.

Now let’s see how many membership queries we’ve performed.

We performed membership queries in two places: First, when building the transition table. Secondly, when trying to construct the tests.

For the first: for every witness, every \(a \in A\), and every \(t \in T\), we have calculated whether \(w_i a t \in L\) in order to compute the rows and the transitions. That comes to \(|Z||A||T| \leq |A||Z|^2\) membership queries.

In calculating the test, we had to perform one membership query at each step of the binary search to determine whether mid was bad, so each counter-example costs \(O(\log |c|)\) membership queries. Letting \(m\) be the maximum length of any counter-example provided in the course of our run, we thus performed a total of \(O(|T| \log m) = O(|Z| \log m)\) membership queries.

So, in our final analysis, running L* search with a minimally adequate teacher for our language requires:

  1. \(O(|A||Z|^2 + |Z| \log(m))\) membership queries
  2. \(O(|Z|)\) counterexample queries

In fact these estimates are a bit too conservative. As we’ll see in a later post, it is sometimes going to be the case that \(|T| \ll |Z|\).


Why do we care?

Personally, I mostly care because I think L* is a super-neat algorithm, and the concept of a minimally adequate teacher as a learning environment is surprisingly general.

You can also put this into a stochastic setting to relax the requirement of a minimally adequate teacher (which may be hard to obtain), in which case you instead get to build DFAs which give the correct answer with at least some chosen probability when the string is drawn from a fixed choice of distribution.

However, the evidence seems to be that most people working on language learning don’t care about this algorithm. There hasn’t been a great deal of research on it in recent years that I’ve been able to find, and most of the research on it from back in the day was less concerned with languages than with automata: you frequently see it framed in the context of a robot learning to navigate a world.

I recently read Grammatical Inference: Learning Automata and Grammars, which I can highly recommend if you found this post interesting, as there won’t be many more posts like this one in this series. The next post will use some of the ideas from it, but will be moving away from the area of inference.

 


The Derivative as an Alternative Minimal Deterministic Automaton

This is another post about formal language theory. It builds on yesterday’s post about the Myhill–Nerode theorem, and you should read that one first. It’s not one of my originally intended posts for this series, so there are still at least two more to go.

There turns out to be another construction for the minimal deterministic automaton of a language that is often nicer to work with in practice. This is to use the Brzozowski derivative.

The idea of the Brzozowski derivative is that given a language and some prefix, you can construct a new language consisting of the suffixes of strings in the language that start with that prefix. i.e. for a language \(L\) and a string \(u\), the derivative is \(\delta(L, u) = \{v \in S: uv \in L\}\). The intuition is that we read some amount of the input for the string and then have a new language matching only what’s left.

You can use this to construct a different deterministic automaton matching the language as follows:

  • Let \(Q\) be the set of all languages on \(A\). Note that this is a very large set, but in practice the subset of it which is reachable will usually be much smaller.
  • Let \(q_0 = L\).
  • Let \(\delta(q, a)\) be the Brzozowski derivative as defined above
  • Let \(F = \{M \subseteq S : \epsilon \in M\}\).

i.e. we label states with languages, we evolve a state by taking the derivative, and a state is accepting if and only if the corresponding language matches the empty string.

In order to see that this really is an automaton matching the language, the first thing to notice is that the Brzozowski derivative and the transition function agree on all strings, not just on single character ones (you can prove this by induction on the length of the string and the definition of how we extend the transition function).

This then lets us make sense of the definition of whether a state is accepting: a state is accepting if and only if it matches the empty string, i.e. a string \(s\) is accepted if and only if \(\epsilon \in \delta(L, s)\). But this happens precisely when \(s\epsilon \in L\), by the definition of the derivative, and \(s\epsilon = s\) because \(\epsilon\) is the empty string. Thus \(s\) is accepted if and only if \(s \in L\), as desired.

We define the derivative automaton, \(D\), to be the automaton constructed by restricting the above to its set of reachable states.

This turns out to be a minimal deterministic automaton too, which we can see by showing that it’s isomorphic to the Myhill–Nerode automaton.

We’ll do this by showing that the reduction function \(r: D \to Z\) must be injective. i.e. if \(S \ne T\) then \(r(S) \ne r(T)\).

We have some string \(u\) in one of \(S, T\) and not the other. Let’s say it’s in \(S\) but not \(T\), swapping \(S\) and \(T\) if necessary.

Then \(\delta(S, u)\) is an accepting state and \(\delta(T, u)\) is not. Thus \(\delta(r(S), u)\) is accepting and \(\delta(r(T), u)\) is not, and so \(r(S) \ne r(T)\).

Because the derivative automaton is isomorphic to the Myhill–Nerode automaton, we thus get for free that it must in fact be a minimal finite automaton itself.


Why do we care?

Well, the main reason we care is that in practice this is a much more tractable construction than the Myhill–Nerode automaton. Given standard definitions of a regular or context-free language, you can calculate the derivative directly, and you can build the (potentially infinite) automaton lazily as you go rather than having to do an up-front construction. This turns out to be very useful.
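
To make that concrete, here is a minimal sketch of the derivative for regular expressions, using a toy tuple representation of my own (this illustrates the standard rules rather than any particular library’s API):

# Expressions: ("empty",) matches nothing, ("eps",) matches only the
# empty string, plus ("chr", a), ("alt", r, s), ("cat", r, s), ("star", r).

def nullable(r):
    # Does r match the empty string? i.e. is this state accepting?
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "chr"):
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # cat

def derivative(r, a):
    # The Brzozowski derivative of r with respect to the character a.
    tag = r[0]
    if tag in ("empty", "eps"):
        return ("empty",)
    if tag == "chr":
        return ("eps",) if r[1] == a else ("empty",)
    if tag == "alt":
        return ("alt", derivative(r[1], a), derivative(r[2], a))
    if tag == "cat":
        head = ("cat", derivative(r[1], a), r[2])
        if nullable(r[1]):
            return ("alt", head, derivative(r[2], a))
        return head
    return ("cat", derivative(r[1], a), r)  # star

def matches(r, s):
    # Walk the automaton lazily: derive along s, then check whether the
    # resulting state (a language) contains the empty string.
    for a in s:
        r = derivative(r, a)
    return nullable(r)

# e.g. matches(("star", ("chr", "a")), "aaa") is True.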

One limitation is that determining the equivalence of even context-free languages is undecidable, so when used in practice this will not always produce a minimal automaton, because some states that should be conflated will not be. However, a lot of simple reductions are possible for keeping the number of duplicated states relatively tractable.

I’ve never actually seen someone suggest using this approach for DFA minimization. It might be tractable, because there are decent algorithms for testing DFA equivalence without minimization, but it seems sufficiently obvious from adjacent theory that I assume that if it worked someone would have tried it already.


The Myhill–Nerode theorem and the Minimal Deterministic Automaton

I recently learned a whole bunch about a random corner of language theory that I hadn’t previously known much about. I thought it might be interesting to write up some of it. So this is the first post amongst several (I currently have two others planned but they may spawn more).

This is a post about the Myhill–Nerode theorem, behind which is a nice construction (in the abstract sense – you couldn’t perform it directly on an actual machine) of the minimal deterministic automaton matching any language.

Let’s start with some terms: we’ve got some finite non-empty alphabet \(A\). A string on \(A\) is any finite sequence of elements of \(A\). A language on \(A\) is any set of strings on \(A\).

For this post we’ll fix \(A\) and then \(S\) will be the set of strings on \(A\).

String notation:

  • \(\epsilon\) is the string of length \(0\)
  • If \(u, v\) are strings then \(uv\) is the string made by concatenating them in order.
  • Similarly if \(a \in A\) then \(au\) is the string that consists of prepending \(a\) to \(u\) and \(ua\) is the string that consists of appending \(a\) to \(u\).

In general we are interested in languages which have nice compact definitions rather than having to specify the whole set. One nice way of defining languages which often provides this is a deterministic automaton. These also have the advantage of being very simple to work with computationally once you have one.

A deterministic automaton on \(A\) consists of:

  • A set of states \(Q\), (possibly infinite – if it’s a finite set we call it a deterministic finite automaton)
  • An initial state \(q_0 \in Q\)
  • An update function \(\delta : Q \times A \to Q\)
  • A set of accepting states \(F \subseteq Q\).

We can extend the update function to take any finite string: Let \(\delta(q, \epsilon) = q\) and \(\delta(q, au) = \delta(\delta(q, a), u)\). i.e. we just follow the update function along the string.

We will additionally assume that every state in \(Q\) is reachable, that is, every state is \(\delta(q_0, s)\) for some string \(s\). This is a fairly harmless restriction, because for any automaton we can just create the automaton that replaces \(Q\) and \(F\) with their reachable subsets.

A deterministic automaton recognises a language \(L\) if \(s \in L\)  if and only if \(\delta(q_0, s) \in F\). i.e. if you follow the automaton along the string then you end up in an accepting state if and only if the string is in the desired language.
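
As a quick sketch of that definition in Python (the names here are mine, for illustration):

def recognises(q0, delta, accepting, s):
    # Follow the extended transition function along the string.
    q = q0
    for a in s:
        q = delta(q, a)
    return q in accepting

# Example: strings over {"a", "b"} with an even number of "a"s,
# as a two-state automaton.
delta = lambda q, a: (q + (a == "a")) % 2
print(recognises(0, delta, {0}, "abba"))  # True: two "a"s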

Every language can be recognised by a deterministic automaton:

  • Let \(Q = S\)
  • Let \(q_0 = \epsilon\)
  • Let \(\delta(s, a) = sa\)
  • Let \(F = L\)

This works because every state in this automaton has a unique string that reaches it from the initial state (which is the string representing the state), and the state is accepting if and only if that string is in the language, so by construction recognises the language.

This simplifies matters not even slightly: we’ve just reframed the problem, since determining whether a state is accepting is equivalent to determining whether its string is in our language, and the automaton constructed this way is always infinite (though it may have finitely many accepting states if \(L\) is finite), even when a very small finite automaton might recognise our language. If anything we’ve made the problem worse, because we’ve taken a potentially finite language and moved to an always infinite representation.

So, can we find a smaller automaton that recognises a given language?

In fact we can: for any language there is a unique minimal deterministic automaton. We’ll see exactly what that means in a bit, but first we’ll construct the automaton we want.

The idea for constructing it is this: We take the above automaton, and we try to collapse it by conflating equivalent states. We look for equivalent states with the notion of a test.

A test is just a string. The idea is that if we have some test \(t\) and two strings \(u\) and \(v\) with \((ut \in L) \ne (vt \in L)\), then we must have \(\delta(q_0, u) \ne \delta(q_0, v)\) in any automaton recognising \(L\) – if not, we would have \(\delta(q_0, ut) = \delta(q_0, vt)\), which is impossible because one is an accepting state and the other is not.

Two strings are said to be word equivalent if there is no test that distinguishes them. The idea is that we can then shrink our above automaton by conflating all states that are word equivalent.
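
As a tiny sketch of these definitions (with `member` a hypothetical membership predicate for \(L\)):

def distinguishes(t, u, v, member):
    # The test t witnesses that u and v are not word equivalent.
    return member(u + t) != member(v + t)

Word equivalence quantifies over all tests, so we can only ever refute it by exhibiting such a \(t\); this is part of why the construction below is abstract rather than something you could run directly.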

So if \(s\) is a string let \([s]\) be the set of all strings that are word equivalent to \(s\). We can then define the following automaton:

  • Let \(Q = \{[s]: s \in S\}\)
  • Let \(q_0 = [\epsilon]\)
  • Let \(\delta([s], a) = [sa]\)
  • Let \(F = \{[s]: s \in L\}\)

We’ll call this the Myhill–Nerode automaton and label it \(Z\).

We first need to check that \(\delta\) is well defined, i.e. that if \(u\) is word equivalent to \(v\) then \(ua\) is word equivalent to \(va\). This follows from the fact that if \(t\) is a test distinguishing \(ua\) and \(va\), then \(at\) is a test distinguishing \(u\) and \(v\).

So this is a well defined state machine, but does it recognise \(L\)?

First note:

  • \(\delta(q_0, s) = [s]\) (more or less by definition)
  • If \(u, v\) are word-equivalent then \(u \in L = v \in L\) because otherwise \(\epsilon\) is a test distinguishing them

So \(\delta(q_0, s)\) is an accepting state if and only if it is \([t]\) for some \(t \in L\), which is true if and only if \(s \in L\). i.e. this automaton recognises \(L\).

So we’ve got another automaton recognising \(L\), and it sure looks smaller than the first one, but is it minimal?

And what do we even mean by minimal?

For that we’ll need the idea of an automaton reduction. Let \(P, Q\) be automata (call the initial state of \(P\) \(p_0\), and we’ll use subscripts to distinguish their \(\delta\) and \(F\)). \(Q\) is a reduction of \(P\) if there is a function \(r: P \to Q\) such that:

  1. \(r(p_0) = q_0\)
  2. If \(p \in P\) and \(a \in A\) then \(r(\delta_P(p, a)) = \delta_Q(r(p), a)\).
  3. \(r(p) \in F_Q\) if and only if \(p \in F_P\)
  4. Every \(q \in Q\) is \(r(p)\) for some \(p\)

i.e. we label every state in \(P\) with a state in \(Q\) such that everything lines up correctly for transitions.

Our claim of minimality is this: If \(Q\) is any automaton recognising \(L\), then \(Z\) is a reduction of \(Q\).

The key property for proving this is that if \(u\) and \(v\) are two strings with \(\delta_Q(q_0, u) = \delta_Q(q_0, v) = q\), then they must be word equivalent: otherwise there would be some test \(t\) with \((ut \in L) \ne (vt \in L)\), but \(\delta_Q(q_0, ut) = \delta_Q(q_0, vt) = \delta_Q(q, t)\), which would have to be both accepting and non-accepting. So we choose as our reduction function \(r\) the function \(r(q) = [s]\), where \(s\) is any string such that \(\delta_Q(q_0, s) = q\) (recall we’re assuming all states are reachable).

Let’s check that this has the desired properties:

  1. \(\delta_Q(q_0, \epsilon) = q_0\), so \(r(q_0) = [\epsilon]\), the initial state of \(Z\)
  2. If \(\delta_Q(q_0, s) = q\) then \(\delta_Q(q_0, sa) = \delta_Q(q, a)\), so \(r(\delta_Q(q, a)) = [sa] = \delta_Z([s], a) = \delta_Z(r(q), a)\) as desired
  3. This follows automatically because both \(Q\) and \(Z\) recognise \(L\)
  4. \([s] = r(\delta(q_0, s))\)

So for any automaton we can reduce it to \(Z\), as desired.

And finally we need to prove uniqueness. The claim is that if \(Q\) is any other automaton with this property, then a reduction \(r: Z \to Q\) must also have the property that if \(s \ne t\) then \(r(s) \ne r(t)\), i.e. \(r\) is an isomorphism, just relabelling the vertices.

Proof: Suppose \([u] \ne [v]\) are two distinct states of \(Z\). Then there is some \(t\) such that \(\delta_Z([u], t) \in F_Z \ne \delta_Z([v], t) \in F_Z\). But then by the properties of the reduction we must have \(\delta_Q(r([u]), t) \in F_Q \ne \delta_Q(r([v]), t) \in F_Q\), and thus \(r([u]) \ne r([v])\) as desired.

We’ll finally finish with the title of this post: The Myhill–Nerode theorem.

In the language we have so far, this states that \(L\) is regular if and only if the \(Z\) we constructed is finite. But this follows immediately from what we’ve done above: \(Z\) is an automaton recognizing \(L\), so by definition if \(Z\) is finite then \(L\) must be regular. Conversely, if there is any finite automaton that recognizes \(L\) then it reduces to \(Z\), so \(Z\) must also be finite.


Why do we care about all of this?

Well, as far as I know, the Myhill–Nerode theorem itself isn’t actually very important. It’s mostly an intellectual curiosity. I do like that it shows that every language, regardless of structural properties, has a minimal deterministic automaton, but I’m not sure I have any actual use for that fact.

But the machinery used to prove it is independently quite interesting. In particular, minimal (or at least small) deterministic finite automata are very useful in practice, and we can use a similar construction starting from an existing finite automaton to produce one (this is Hopcroft’s algorithm).

Moreover, the idea of a test that we started from proves quite interesting, and you can use it to construct alternative (often more compact) representations of a DFA, and it forms the basis for the inference of regular languages in certain circumstances. More on that in later posts.
