Attention conservation notice: If you don’t care about fine grained algorithmic improvements to inference algorithms for regular languages, you don’t care about this post.

Also, apologies, I know mathjax isn’t currently not working so you will see some raw LaTeX. If you turn on unsafe scripts (i.e. “not loaded over https”) it will work, but due to some technical hiccups this is currently difficult for me to fix. I’m on the case.

L* search (Angluin, D. Learning regular sets from queries and counterexamples. Inf. Comput. 75, 2 (1987), 87–106) is a really neat algorithm that lets you infer a deterministic finite automaton (DFA) for an unknown language.

I’ve long been of the suspicion that it would be useful in Hypothesis. Unfortunately to date reality has disagreed with me. The key problem is that L* search is *really* expensive and any attempt to use it would almost immediately blow Hypothesis’s normal budget for number of tests run.

Starting from a trivial automaton that accepts no strings, it proceeds in two stages:

- It uses
*membership queries*for the language to complete the DFA, finding transitions from states that provably do not go to any of the existing known states, and stopping only when it has a consistent transition table for every known state and every member of the alphabet. - It then asks for a counter-example to its conjectured DFA. If there is none, we have correctly inferred the language. If there is one, it is used to improve our ability to distinguish states, enlarging the number of states by at least one.

Because L* search is maintaining a *minimal* DFA, where the only states that are distinguished are ones that it can actually prove are inequivalent, this guarantees that the algorithm terminates with a correct DFA in time at most O(n), where n is the size of the minimal DFA of the language.

It does this by maintaining a set of *experiments*. These are strings that can be used to distinguish states – if by following the experiment from two states you end up in an accepting state in one and a non-accepting state in another, the two states must be inequivalent.

The rest of this post assumes you understand L* search. If you don’t, I recommend reading my prior more detailed explanation of it.

I use an improvement proposed in “Rivest, R. L., and Schapire, R. E. Inference of finite automata using homing sequences.

Inf. Comput. 103, 2 (1993), 299–347″ that lets you take the cost of the counter-example step down to logarithmic in the length of the counter-example and ensures that each counter-example only adds one experiment to the list.

The problem is that the *completion* step is still extremely expensive: It requires evaluating \(|A||S||E|\) membership queries, where \(A\) is the alphabet, \(S\) are the states we currently have, and \(E\) is the number of experiments we currently have (you can remove some of these because some state transitions are provable because of the representation of the states as strings – \(s\) always transitions to \(sa\) under \(a\) if both strings are in the state set, but you can’t remove more than \(|S| – 1\) evaluations this way).

One of the problems I have with using this in Hypothesis is that \(|A| = 256\), which is both intrinsically fairly large and also actually larger than Hypothesis’s typical testing budget. Thus there’s no way to fit even one L* pass into a normal test run!

So, can we get that \(|A|\) factor down?

It turns out, yes, at the cost of potentially needing more counter-examples (which is a cost I’m entirely willing to pay).

The observation follows one from “Owens, S., Reppy, J. H., and Turon, A. Regular-expression derivatives re-examined. J. Funct. Program. 19, 2 (2009), 173–190.”, which is that when you have a large alphabet you will typically find at a given state in a DFA many characters are equivalent to each-other (that is, they result in a transition to the same state). For example, when \(|A| \geq |S|\), even though there are \(|A|\) possible valid characters at a given state there are only \(|S|\) up to equivalence.

If we take advantage of this, we could reduce the number of queries substantially, and it turns out that it is easy to do.

The idea is simple: When we create a state, we initialise a partition refinement data structure along-side it. Initially this is set to consider *all* members of the alphabet to be in the same equivalence class.

When expanding our table, we now assume that any characters in the same partition are equivalent, and we only actually perform membership queries with, say, the smallest element of the partition. Thus instead of the \(|A||S|\) term we only perform \(\sum |P_i|\) term, where \(P_i\) is the set of partitions associated with state \(i\). In particular this is \(O(|S| \min(|A|, |S|)\).

Then when we find an experiment \(e\) and a transition \(s \to t\) via \(c\), instead of immediately adding to our experiment set we check whether we just have too coarse a partition. We do this by checking if \(sce \in \mathcal{L} = sc’e \in \mathcal{L}\), where \(c’\) is the “canonical” element of the partition that we selected when determining that this was the correct transition (if \(c = c’\) we can skip this bit). If this is *not* the case then we use it to refine the partition by splitting it into two based on whether \(sde \in \mathcal{L}\). We then add the experiment to the list and rebuild the DFA. If this does not result in expanding our list of states (this can only happen if we did end up refining the partition), we then remove the experiment from the list! This can happen because we can end up with partitions that our existing experiments were sufficient to distinguish but where we never checked, and by not adding redundant experiments in that case we bound the number of experiments to the number of states.

This is correct for the same reasons that L* is correct – at any given stage, either our partitions are correct, or we refine them. However it will take an additional \(\sum (|P_i| – 1)\) counter-example steps to converge (where the \(P_i -\) here are the partitions from our final state, i.e. the true partitions).

Counter-examples are somewhat costly – regardless of how much it costs to find (which may be “free” in some sense), a counter example \(u\) takes \(O(\log(|u|)\) membership queries to process. If we assume all counter-examples are minimized and so \(|u| \leq n\) where \(n\) is the number of states in the true minimal DFA, then that’s \(O(\log(n)\)

Suppose the true minimal DFA has \(n\) states and that \(m = \sum |P_i|\) in it. Then at each stage we either add one or more states, or we increase the number of partitions, thus we perform at most \(n + \sum (|P_i – 1) = n + m – n = m\) counter-example queries.

Each completion step takes \(O(m |E|\) to run, and we only added experiments in the case where we increased the number of states, so each step runs in \(O(m n)\), and the entire thing runs in at most \(O((mn + \log(n)) m)\). \(m \geq n\), so we can discard the \(\log(m)\) term and the whole thing runs in \(O(m^2 n)\).

In contrast, classic L* the steps take \(O(|A| n^2)\) and we run in at most \(n\) steps, so the complexity is \(O(|A| n^3\). So this is a complexity improvement if and only if \(n^2 |A| \geq m^2\) asymptotically, or in other words if \(\frac{m}{n} \geq \sqrt{|A|}\) (technically this is an invalid way to reason about asymptotics, but those were pretty tight estimates).

\(\frac{m}{n}\) is the average number of distinct equivalent classes over states. For my use case I expect this to be very low – typically the number of equivalence classes will be one!

For the more general use case this isn’t obviously a win – if \(|A|\) is small it almost certainly isn’t. e.g. when \(|A| = 2\) this condition is equivalent to saying that 60% of states don’t depend on the character to determine their transition at all.

I haven’t attempted to reason about it much, but one possible fix would be an adaptive algorithm that tracks the average on the states and partitions found so far, and as soon as it exceeds \(\sqrt{|A|}\) switches over to normal L*. My suspicion is that if you measure cost purely in terms of membership queries, this adaptive version is a substantial saving on almost any language model.