# The Easy Way to Compile Automata to Regular Expressions

I have a confession: despite knowing a reasonably large amount about formal language theory and automata, I’ve never bothered to learn how to prove how to turn a finite automaton back into a regular expression.

My attitude has always been that I have a vague sense of how a proof would go and couldn’t be bothered to sort out the details. I’ve probably skimmed a proof in the past but it looked complicated and uninformative so I didn’t bother.

Due to reasons (a certain theorem about how you could represent regular expressions in a particular way required the details of a proof) I found I was forced to care recently, so I decided to actually sit down and figure it out.

The standard proofs did indeed seem complicated and uninformative, but fortunately it turns out that the general method of understanding regular language theory fundamentals applies in this case.

That method, of course, being to ask “What Would Janusz Brzozowski Do?”.

Starting from a Nondeterministic Finite Automaton (without $$\epsilon$$-transitions) there turns out to be a natural algebraic process which is fairly reminiscent of Gaussian Elimination that results in a fairly small regular expression matching the language.

The idea is this: Suppose we have a non-deterministic automaton without $$\epsilon$$-transitions. Label the states $$0, \ldots, n$$. We consider the languages $$R_i$$ of strings matched when starting from state $$i$$.

By looking at the automaton we get a set of algebraic equations for each of these, which we can progressively rewrite and simplify to remove loops in them (e.g. where $$R_i$$ is defined self-referentially, or the definitions of $$R_i$$ and $$R_j$$ depend on each other). Once we’ve successfully done that, we can just substitute in terms until we get an expression for $$R_0$$, which will be our final result.

Note: In what follows we’ll adopt the usual convention of equating regular expressions with their regular language. Variables which correspond to a language will be upper case, ones corresponding to an expression will be lower-case.

Let $$s_{ij}$$ be a regular expression matching the language of strings that cause a tradition from state $$i$$ to $$j$$. This is a language consisting of strings of length $$1$$ – if the character $$c$$ causes a transition from $$i$$ to $$j$$ then $$c \in s_{ij}$$. If there are no transitions from $$i$$ to $$j$$ then $$s_{ij}$$ is $$\phi$$, the regular expression matching no strings, otherwise it is a union of finitely many basic regular expressions.

Note that in the general case there may be strings matched by both $$s_{ij}$$ and $$s_{ik}$$ with $$j \neq k$$. This is because the automaton is non-deterministic. In a deterministic automaton the expressions $$s_{ij}, s_{ik}$$ with $$i \neq k$$ will have no common matches.

Let $$t_i$$ be $$\phi$$ if state $$i$$ is not accepting and $$\epsilon$$ if it is. i.e. $$t_i$$ is a regular expression matching the set of empty strings matched from state $$i$$.

Now, we have that $$R_i = t_i \cup \bigcup\limits_j s_{ij} R_j$$ – a string in $$R_i$$ is either the empty string (if $$i$$ is accepting), or the result of transitioning to another state and matching from there.

We’ll now begin the process of substituting in terms to remove cycles.

The key idea is to find $$t_i’, s_{ij}’$$ such that whenever $$j \leq i$$, $$s_{ij}’ = \phi$$. This means there can be no cycles because there are no backwards or self-references between the expressions.

It will no longer be the case that $$s_{ij}’$$ only matches strings of length $$1$$, but it will be the case that they don’t match the empty string (this is important).

How do we construct these modified terms? By induction!

Suppose we’ve constructed $$t_i$$ and $$s_{ij}$$ for $$i < k$$. If we just substitute in and rearrange all of those terms into $$R_k = t_k \cup \bigcup\limits_j s_{kj} R_j$$ then we almost get what we want: The only terms that can be non-empty are the ones where $$j \geq k$$, because we’ve eliminated all the terms that were $$< k$$ and replaced them with terms that are $$\geq k$$.

So all we need to do is figure out how to remove the $$R_k$$ term.

Fortunately, there’s a great tool for doing that. it’s called Arden’s theorem:

If we have languages $$X, A, B$$ with $$\epsilon \not\in A$$ and$$X = AX \cup B$$, then $$X = A* B$$ .

Conceptually this is fairly intuitive because you can just rerwite it as $$X = A(AX \cup B) \cup B$$ and continue this infinitely to get $$X = B \cup AB \cup AAB \cup \ldots$$, which is precisely $$X = A*B$$. However this intuition breaks down somewhat – the condition that $$A$$ does not contain the empty string is essential, because if we picked $$A = \{\epsilon\}$$ and $$B = \emptyset\} then this reduces to \(X = X$$, which every language satisfies.

The problem is essentially that the term in the $$\ldots$$ “does not converge” in this case, so we need to do some more work to actually prove this rigorously.

Proof:

First note that $$A*B$$ is a possible solution for $$X$$ because $$A* = (A A*) \cup \epsilon$$, so  $$A* B = (A A* B) \cup \epsilon B = A (A* B) \cup B$$.

Now note that any language satisfying the equation has the following property: $$x \in X$$ if and only if either $$x \in B$$ or $$x = uv$$ with $$u \in A$$ and $$v \in X$$.

This is simply by the definition: If $$x \not\in B$$ then $$x \in AX$$, so $$x$$ starts with a string of $$A$$, which is non-empty by assumption. Conversely, if $$x \in B$$ then $$x \in X$$ by definition, and if $$x = uv$$ with $$u \in A$$ and $$v \in L$$.

Now, suppose we have two non-identical solutions to the equation. Say $$L$$ and $$M$$. Suppose there exists $$x \in L \setminus M$$. Pick $$x$$ so that it has minimal length among such words..

Then certainly $$x \not\in B$$ or it would be in both, so by assumption we can write $$x = uv$$ with $$u \in A$$ and $$v \in L$$.

But then we must have $$v \in M$$, by the minimality of $$x$$. But if $$v \in M$$ then necessarily $$uv = x \in M$$. This contradicts the assumption. Therefore no such $$x$$ exists, and the two languages are equal.

QED

So it’s important that all the coefficients on the right hand side don’t match the empty string, but this is again true inductively: The initial $$s_{ij}$$ do not match the empty string, and every coefficient is either a concatenation or a union of two languages that don’t match the empty string, so in turn does not match the empty string.

This means that Arden’s theorem gives us the tool we need to remove the coefficient for $$R_k$$ on the right hand: All we need to do is to prepend the star of that coefficient to the other terms and remove $$R_k$$ from the equation.

This completes our inductive step, which in turn completes our algorithm: We run the induction over the whole set of equations, we now no longer have loops, and we can just substitute back in to get $$R_0$$ as desired.

Let’s work through an example now to clear up the details.

Suppose we have the states 0, 1, 2. Only state 2 is accepting. Our characters are a, b and c. The allowed transitions are:

• From 0: $$a, b \to 1$$.
• From 1: $$a \to 1, b \to 2$$
• From 2: $$c \to 0, 1$$.

This gives us the following three initial equations:

• $$R_0 = (a | b) R_1$$
• $$R_1 =a R_1 \cup b R_2$$
• $$R_2 = \epsilon \cup c R_0 \cup c R_1$$

We now perform substitutions as follows:

1. Our equation for $$R_0$$ is in the desired form already so nothing needs to be done.
2. Our equation for $$R_1$$ has $$R_1$$ on the right hand side, so we use Arden’s theorem to rewrite it as $$R_1 = a* b R_2$$. It is now in the desired form.
3. We substitute in our previous equation for $$R_0$$ and get $$R_2 = \epsilon \cup c (a|b) R_1 \cup c R_1 = \epsilon \cup c(a|b|\epsilon) R_1$$.
4. We substitute in our previous equation for $$R_1$$ and get  $$R_2 = \epsilon \cup c(a|b|\epsilon) a* b R_2$$.
5. We apply Arden’s theorem one final time and get $$R_2 = (c(a|b|\epsilon) a* b)* | \epsilon = (c(a|b|\epsilon) a* b)*$$.
6. We now substitute this into our equation for $$R_1$$ and get $$R_1 = a* b (c(a|b|\epsilon) a* b)*$$
7. We now substitute this into our equation for $$R_0$$ and get  $$R_0 = (a | b) a* b (c(a|b|\epsilon) a* b)*$$

We now have a (not all that pleasant) regular expression matching our original deterministic finite automaton, as desired.

In general I’m unclear on whether this is the “best” method for doing this. The above explanation is a lightly modified version of the one presented here, which compares it with several other methods, where it seems to come out ahead of the others.

It certainly isn’t optimal, in the sense that it doesn’t always produce a minimal regular expression. However, neither do any of the other standard methods. This seems to be a hard problem, and there aren’t as far as I know any good algorithms for doing it.

However, regardless of whether it’s the best, it is certainly the one that seems clearest to me: Once you have the key mechanism of Arden’s theorem, you just perform simple algebraic manipulation and the result pops out at the end.

If you want to see this working in practice, I’ve rather arbitrarily added in some code for doing this (albeit only with deterministic finite automata. The principle is the same though) to my FALBS project, which has basically become a dumping ground for all my formal languages code, along with a test that this does the right thing.

(Want more posts like this? Why not support my Patreon! It will cause me to write more, give you some influence over what I write, and give you access to early drafts of upcoming posts).

This entry was posted in Automata Theory, programming, Python on by .

# Fully Automated Luxury Boltzmann Sampling for Regular Languages

Suppose you have some regular language. That is, some language you can define through some mix of single characters, concatenation, repetition and alternation. e.g. the language defined by the regular expression (0|1)*2 which consists of any number of 0s and 1s followed by a single 2.

Suppose further instead of matching you want to generate an instance of such a regular language. This can be quite useful as a building block for building interesting data generators, so I’ve been looking into doing that in Hypothesis. Lucas Wiman’s revex is an example of an existing project doing similarly.

It’s easy to do this naively from our building blocks:

• To generate a single character, just return that character.
• To generate one of x|y, pick one of x or y uniformly at random and then generate from that.
• To generate xy, generate from x and y independently and concatenate the results.
• To generate from x* pick the number of repetitions (e.g. as a geometric distribution) and then generate that many instances and concatenate them.

This almost works but it has a major problem: It’s quite biased.

Consider the regular expression (00)|(1[0-9]). That is, we match either the string 00 or the string 1 followed by any digit. There are 11 strings matched by this language, but under our above model we’ll pick 00 half the time!

Additionally, it’s biased in a way that depends on the structure of our regular expression. We could just have easily have written this as (10)|((00)|(1[1-9)). Now we produce 10 half the time, 00 a quarter of the time, and one of the remaining 9 strings the remaining quarter of the time.

So how do we define a distribution that is independent of our particular choice of regular expression?

One goal that might be reasonable is to be unbiased: To produce every string in the language with equal probability. Unfortunately, this doesn’t make sense if our language is infinite: There is no uniform distribution on an infinite countable set.

We can however hope for a way that has the weaker property of being unbiased only among strings with the same length. So in the above example we’d have got a uniform distribution, but if we had say the language 0*1 of arbitrarily many zeros followed by a single 1, some lengths of string would be more likely than others.

There turns out to be a nice way of doing this!  They’re called Boltzmann Samplers (warning: The linked paper is quite information dense and I only understand about a third of it myself). The Boltzmann samplers for a language define a family of probability distributions over it, controlled by a single real valued parameter, $$x \in [0, 1]$$.

The idea is that you pick each string of length $$n$$ with probability proportional to $$x^n$$ (note: each string. So if there are $$k$$ strings of length $$n$$ then you pick some string of length $$n$$ with probability proportionate to $$k x^n$$).

In general there is not a defined Boltzmann sampler for every possible value of $$x$$. The closer $$x$$ is to $$1$$, the slower the probability of picking a string drops off. If $$x = 1$$ then we have a uniform distribution (so this is only possible when the language is finite). If $$x = 0$$ then we can only generate the empty string. The Boltzmann sampler will be defined as long as the corresponding infinite sum converges, which will definitely happen when $$x < \frac{1}{|A|}$$ (where $$A$$ is the alphabet for our language) but may happen for larger $$x$$ if the language is relatively constrained, and the average size of the strings will increase as $$x$$ increases.

It’s not a priori obvious that simulating a Boltzmann sampler is any easier than our original problem, but it turns out to be. The reason is that for certain primitive languages the Boltzmann sampler is easy to compute (e.g. for a language consisting only of a single fixed string it just returns that string), and for certain natural operations for combining languages (especially for our purposes disjoint unions) we can then simulate the Boltzmann sampler using the simulated Boltzmann samplers for the base languages.

We’ll see how this works in a moment, but first a digression.

An idea that turns out to be intimately connected to Boltzmann samplers is that of the counting generating function. If you have a language $$L$$, you can define $$g_L(x) = \sum\limits_{n=0}^\infty |L_n| x^n$$, where $$L_n$$ is the subset of the language consisting of strings.

Counting generating functions have a number of important properties about how they combine when you combine languages. The one that will be most relevant for us is that if $$L$$ and $$M$$ are two disjoint languages then $$g_{L \cup M}(x) = g_L(x) + g_M(x)$$. This is because $$(L \cup M)_n = L_n \cup L_m$$. Because these are disjoint this means that $$|(L \cup M)_n| = |L_n| + |M_n|$$. (It’s important to note that this doesn’t work when the languages aren’t disjoint: You over-count the intersection).

Counting generating functions will be useful because we can treat them as a sort of “mass function” that tells us how to weight our decisions: If we can go one of two ways, but one of them has a lot more strings below it, we should pick that one more often. This is the basis of the connection between counting generating functions and Boltzmann samplers.

In particular, if we have two disjoint languages $$L$$ and $$M$$ and we can simulate the Boltzmann sampler of parameter $$x$$ for each, we can simulate the Boltzmann sampler with parameter $$x$$ for $$L \cup M$$ as follows: We pick $$L$$ with probability $$\frac{g_L(x)}{g_L(x) + g_M(x)}$$, else we pick $$M$$, then we simulate the Boltzmann sampler for the language we ended up with.

This is almost exactly the same as our simulation from the beginnings, we’ve just changed the weightings! It’s important to remember that this only works if $$L$$ and $$M$$ are disjoint though.

This property, it turns out, is enough to compute a Boltzmann sampler for any regular language.

The key idea is this: If you compile a regular language to a deterministic finite automaton (DFA), you can now look at the languages matched from each state in that DFA. Each of these languages can be represented as a disjoint union of other languages, which gives us a system of equations for the counting generating functions that we can just solve.

If $$L[i]$$ is the language matched when starting from state i, then $$L[i] = \bigcup\limits_{c \in A} c L[\tau(i, c)] \cup E[i]$$, where $$\tau$$ is the transition function and $$E[i]$$ is the language that just matches the empty string if $$i$$ is an accepting state, else is the empty language.

That is, every string matched when starting from $$i$$ is either the empty string, or it is a single character prepended to a string from the language you get starting from the state that character causes a transition to. This is a disjoint union! Given a string in the union it is either empty, or you can look at its first character to identify which one of the branches it belongs to.

This means we can calculate the generating function as $$g_{L[i]}(x) = x \sum\limits_{c \in A} g_{L[\tau(i, c)]} + \epsilon[i]$$, where $$\epsilon[i]$$ is $$1$$ if this is an accepting state or $$0$$ otherwise.

(The extra factor of $$x$$ comes from the fact that we’ve added an extra character to the beginning language, so every string has its length increased by $$1$$).

But this is a linear system of equations in the generating functions! There’s that unknown variable $$x$$, but you can just treat that as meaning it’s a linear system of equations whose parameters live in the field of rational functions.

In particular, and importantly, we can use sympy to solve this rather than having to write our own linear solving routines (we can’t use a standard numpy-esque solver because of the symbolic parameter):

def compute_generating_functions(accepting, transitions): assert len(accepting) == len(transitions) n = len(accepting) z = sympy.Symbol('z', real=True) weights = {} for i in range(n): weights[(i, i)] = 1 for _, j in transitions[i].items(): key = (i, j) weights[key] = weights.get(key, 0) - z matrix = SparseMatrix( n, n, weights ) vector = sympy.Matrix(n, 1, list(map(int, accepting))) return z, matrix.LUsolve(vector)

This simultaneously solves the linear equations for all of the states of the DFA. So a result looks like this:

In []: rd.compute_generating_functions([False, False, True], [{0: 1}, {0: 2}, {0: 2}]) Out[]: (z, Matrix([ [z**2/(-z + 1)], [ z/(-z + 1)], [ 1/(-z + 1)]]))

This calculates the generating functions for the language that matches any string of two or more 0 bytes.

Now that we’ve calculated the generating functions, we can randomly walk the DFA to run the desired Boltzmann sampler! The counting generating function provides the weights for our random walks, and moving to one of the languages in our disjoint union is precisely either stopping the walk or transitioning to another state.

We start at the initial state, and at each stage we choose between one of $$|A| + 1$$ actions: We can either emit a single character and transition to the corresponding new state, or we can stop. Each of these are weighted by the generating function of the language applied to our Boltzmann sampling parameter: The option “halt now” is given weight $$1$$, and the action “emit character c” is given weight $$x g_{L[\tau(i, c)]}(x)$$ – i.e. $$x$$ times the state weight we’ve already calculated for the target state.

This is just a discrete distribution over a finite number of possibilities, so we can use the alias method to build a sampler for it.

Thus our algorithm becomes:

1. Build a DFA for the regular expression
2. Use sympy to calculate the counting generation function for the language matched starting from each state in our DFA
3. Pick a Boltzmann parameter
4. Calculate all the state weights
5. For each state (possibly lazily on demand) initialize an alias sampler for the actions that can be taken at that state
6. Repeatedly draw from the sampler, either drawing a stop action or drawing an emit character action. Where we emit a character, make the corresponding state transition.

I’m not going to claim that this approach is efficient by any stretch. The DFA may be exponential in the size of our starting regular expression, and we’re then solving an $$n \times n$$ linear equation on top of that. However for a lot of languages the resulting DFA won’t be too large, and this approach is fairly reasonable. It’s also nice as an existence proof.

Additionally, we can cache the first 5 steps and reuse them. Only the 6th step needs to be redone for new values of the parameter, and it at least is generally very fast – It’s O(n) in the length of the generated string, with pretty good constant factors as drawing from a pre-built alias sampler is generally very fast. You can also initialise the alias samplers lazily as you walk the graph.

If you want to see all of this fitting together, I created a small project with these concepts in it. It’s far from production ready, and is only lightly tested, but it seems to work pretty well based on initial experimentation and should be enough to understand the ideas.

(And if you want to see more posts like this, why not support my blog on Patreon?)

This entry was posted in programming, Python on by .

# How and why to learn about data structures

There’s a common sentiment that 99% of programmers don’t need to know how to build basic data structures, and that it’s stupid to expect them to.

There’s certainly an element of truth to that. Most jobs don’t require knowing how to implement any data structures at all, and a lot of this sentiment is just backlash against using them as part of the interview process. I agree with that backlash. Don’t use data structures as part of your interview process unless you expect the job to routinely involve writing your own data structures (or working on ones somebody has already written). Bad interviewer. No cookie.

But setting aside the interview question, there is still a strong underlying sentiment of this not actually being that useful a thing to spend your time on. After all, you wouldn’t ever implement a hash table when there’s a great one in the standard library, right?

This is like arguing that you don’t need to learn to cook because you can just go out to restaurants.

A second, related, point of view is that if you needed to know how this worked you’d just look it up.

That is, you don’t need to learn how to invent your own recipes because you can just look it up in a cook book.

In principle both of these arguments are fine. There are restaurants, there are cook books, not everybody needs to know how to cook and they certainly don’t need to become a gourmet chef.

But nevertheless, most people’s lives will be improved by acquiring at least a basic facility in the kitchen. Restaurants are expensive and may be inconvenient. You run out of ingredients and can’t be bothered to go to the store so you need to improvise or substitute. Or you’re just feeling creative and want to try something new for the hell of it.

The analogy breaks down a bit, because everybody needs to eat but most people don’t really need to implement custom data structures. It’s not 99%, but it might be 90%. Certainly it’s more than 50%.

But “most” isn’t “all”, and there’s a lot of grey areas at the boundary. If you’re not already sure you need to know this, you can probably get on fine without learning how to implement your own data structures, but you might find it surprisingly useful to do so anyway. Even if you don’t, there are some indirect benefits.

I’m not using this analogy just to make a point about usefulness, I also think it’s a valuable way of looking at it: Data structures are recipes. You have a set of techniques and tools and features, and you put them together in an appropriate way to achieve the result you want.

I think a lot of the problem is that data structures are not usually taught this way. I may be wrong about this – I’ve never formally taken a data structures course because my academic background is maths, not computer science, but it sure doesn’t look like people are being taught this way based on the books I’ve read and the people I’ve talked to.

Instead people are taught “Here’s how you implement an AVL tree. It’s very clever” (confession: I have no idea how you implement an AVL tree. If I needed to know I’d look it up, right?). It’s as if you were going to cookery school and they were taking you through a series of pages from the recipe book and teaching you how to follow them.

Which is not all bad! Learning some recipes is a great way to learn to cook. But some of that is because you already know how to eat food, so you’ve got a good idea what you’re trying to achieve. It’s also not sufficient in its own right – you need to learn to adapt, to combine the things you’ve already seen and apply the basic skills you’ve learned to solve new constraints or achieve new results.

Which is how I would like data structures to be taught. Not “Here is how to implement this named data structure” but “Here is the set of operations I would like to support, with these sorts of complexities as valid. Give me some ideas.”

Because this is the real use of learning to implement data structures: Sometimes the problem you’re given doesn’t match the set of data structures you have in the standard library or any of the standard ones. Maybe you need to support some silly combination of operations that you don’t normally do, or you have an unusual workload where some operations are very uncommon and so you don’t mind paying some extra cost there but some operations are very common so need to be ultra fast.

At that point, knowing the basic skills of data structure design becomes invaluable, because you can take what you’ve learned and put it together in a novel way that supports what you want.

And with that, I’ll finish by teaching you a little bit about data structures.

First lets start with a simple problem: Given a list of N items, I want to sample from them without replacement. How would I do that with an O(N) initialisation and O(1) sample?

Well, it’s easy: You create a copy of the list as an array. Now when you want to sample, you pick an index into the array at random.

Now that you have that index that gives you the value to return. Replace the value at that index with the value that’s at the end of the array, and reduce the array length by one.

Here’s some python:

def sample(ls, random): i = random.randint(0, len(ls) - 1) result = ls[i] ls[i] = ls[-1] ls.pop() return result

Now I’ve given you a recipe to build on, lets see you improve upon it!

1. If you assume the list you are given is immutable and you can hang onto it, can you improve the initialisation to O(1)? (you may need to make sampling only O(1) amortised and/or expected time to do this. Feel free to build on other standard data structures rather than inventing them from scratch).
2. How would I extend that data structure to also support a “Remove the smallest element” operation in O(log(n))? (You may wish to read about how binary heaps work). You’ll probably have to go back to O(n) initialisation, but can you avoid that if you assume the input list is already sorted?
3. How would you create a data structure to support weighted sampling with rejection? i.e. you start with a list of pairs of values and weights, and each value is sampled with probability proportionate to its weight. You may need to make sample O(log(n)) to do this (you can do it in expected O(1) time, but I don’t know of a data structure that does so without quite a lot of complexity). You can assume the weights are integers and/or just ignore questions of numerical stability.
4. How would add an operation to give a key selected uniformly at random to a hash table? (If you haven’t read about how pypy dicts work you may wish to read that first)
5. How would you extend a hash table to add an O(log(n)) “remove and return the smallest key” operation with no additional storage but increasing the insert complexity to O(log(n))? Can you do it without adding any extra storage to the hash table?

These aren’t completely arbitrary examples. Some of them are ones I’ve actually needed recently, others are just applications of the tricks I figured out in the course of doing so. I do recommend working through them in order though, because each will give you hints for how to do later ones.

You may never need any of these combinations, but that doesn’t matter. The point is not that these represent some great innovations in data structures. The point is to learn how to make your own data structures so that when you need to you can.

If you want to learn more, I recommend just playing around with this yourself. Try to come up with odd problems to solve that could be solved with a good data structure. It can also be worth learning about existing ones – e.g. reading about how the standard library in your favourite language implements things. What are the tricks and variations that it uses?

If you’d like to take a more formal course that is structured like this, I’m told Tim Roughgarden’s Coursera specialization on algorithms follows this model, and the second course in it will cover the basics of data structures. I’ve never taken it though, so this is a second hand recommendation. (Thanks @pozorvlak for the recommendation).

(And if you want to learn more things like this by reading more about it from me, support me on Patreon and say so! Nine out of ten cats prefer it, and you’ll get access to drafts of upcoming blog posts)

This entry was posted in programming, Python on by .

# Looking into doing a PhD

As regular readers of this blog have probably figured out, I’m a researchy sort of person.

A lot of my hobbies – maths, voting theory, weird corners of programming, etc – are research oriented, and most of my work has had some sort of research slant to it.

The last two years I’ve basically been engaged in a research project working on Hypothesis. It’s come quite far in that time, and I feel reasonably comfortable saying that it’s the best open source property based testing library on most metrics you’d care to choose. It has a number of novel features and implementation details that advance the state of the art.

It’s been pretty great working on Hypothesis like this, but it’s also been incredibly frustrating.

The big problem is that I do not have an academic background. I have a masters in mathematics (more technically I have a BA, an MA, and a CASM. Cambridge is weird. It’s entirely equivalent to a masters in mathematics though), but that’s where I stopped. Although it says “DR” in my online handle and the domain of this blog, those are just my initials and not my qualification.

As a result, I have little to no formal training or experience in doing academic research, and a similarly low understanding of who’s who and what’s what within the relevant fields. So I’ve been reading papers and trying to figure out the right people to talk to all on my own, and while it’s gone OK it’s still felt like fumbling around in the dark.

Which leads to the obvious solution that I spoilered in the title: If the problem is that I’m trying to do research outside of an academic context, the solution is to do research in an academic context.

So I’d like to do a PhD that is either about Hypothesis, or about something close enough to Hypothesis that each can benefit from the other.

There’s probably enough novel work in Hypothesis already that I could “just” clean it up, factor it out, and turn it into a PhD thesis as it is, but I’m not really expecting to do that (though I’d like that to be part of it). There are a number of additional directions that I think it would be worth exploring, and I expect most PhD funding will come with a focus subject attached which I would be happy to adapt to (a lot of the most interesting innovations in Hypothesis came because some external factor forced me to think about things in ways I wouldn’t otherwise have!). If you’d like to know more, I’ve written up a fairly long article about Hypothesis and why I think it’s interesting research on the main Hypothesis site.

Which, finally, brings me to the main point of the post: What I want from you.

I’m already looking into and approaching potential universities and interesting researchers there who might be good supervisors or able to recommend people who are. I’ve been in touch with a couple (some of whom might be reading this post. Hi), but I would also massively appreciate suggestions and introductions.

So, if you work in relevant areas or know of people who do and think it would be useful for me to talk to, please drop me an email at [email protected]. Or just leave a comment on this blog post, tweet at me, etc.

This entry was posted in Hypothesis, life, programming, Python on by .

# Laziness is better when it’s visible

This is a trick I invented a while ago. I’m morally certain it’s a reinvention rather than an invention, but I’ve not really seen it in use and at the very least it doesn’t seem to be widely known. I recently ran into a situation where a library would benefit from it greatly and doesn’t use it, so I thought I would write it up.

Suppose we have a bunch of strings and we want to concatenate them.

def cat(my_strings): result = '' for s in my_strings: result += s return result

That’s how you concatenate strings, right?

You can fix this using a special string buffer type, or in python by just using ”.join(my_strings), but wouldn’t it be nice if you didn’t have to? It’s often very convenient to build things up using expressions. Although it’s no great hardship for strings to do bulk operations, you run into the same problem in e.g. pulp where you have more complex expressions (and no corresponding sum method in the library). It would be great if this all just worked.

One way to do this sort of thing is to switch to an immutable tree based representation like a rope where the concatenation operation has a more reasonable complexity (usually log(n)).

But that then comes with its own costs. Using a tree structure slows down access and iteration – only by an O(log(n)) factor, but with relatively high constants. As a result you end up paying a relatively high performance penalty for what was mostly just wanted as a convenience (ropes do have other advantages, but that’s not what we’re looking at here).

But what if you didn’t have to do any of that? What if you could get O(1) concatenation and all the benefits of the compact representation? It sounds implausible, but I’m here to tell you you can!

Or, rather, that you almost can. Both of the above are true: You do get O(1) concatenation and you do get all the benefits of the compact representation, but you do end up paying some additional cost, because the idea is to use laziness. So although concatenation is O(1) you end up paying an additional O(n) cost the first time you want to use the result. Fortunately this still avoids the problem of sums being quadratic.

The key idea is to use a data structure that can swap out its implementation on demand. It starts by just storing the abstract expression tree that lead to it, and then switches to a more efficient representation as soon as you need it to.

e.g. the version of this for the above is that a string becomes a binary tree where the leaves are buffers and the branches indicate that the string is a concatenation of its left and right parts. Concatenation is then just creating a new split node, which is O(1).

Then, once we want the compact representation (which will tend to be as soon as we start doing interesting operations on the data – because the expression tree is entirely unnormalized there is very little we can usefully do to it that isn’t an O(n) operation!), we calculate that, store the result on the string and throw away the expression data that brought us here.

That is, as soon as we have forced the string, the string’s switches to a new representation using the forced buffer, essentially replacing the split node with a leaf node.

This feels like we’re back where we started – if you’re doing this lazily like that then you’re just summing together two string children so you’re quadratic again – but we’re not, for one very important reason: Because the implementation of the laziness is under our control, we can tell whether a string has already been forced or not. When forcing a node we then don’t force its child nodes, but instead just walk the tree and behave appropriately when we get to the leaves.

This sort of thing can be very useful, because the common cases where this runs into issues is that you have a complex expression graph and only actually care about a very small fraction of the subexpressions (e.g. in the sum case).

This isn’t always a win, in that it does behave suboptimally under some workloads (e.g. when you do care about a lot of the intermediate results but process them in the reverse of the order you created them), but it’s rarely a substantial loss and usually produces dramatic speedups by converting accidentally quadratic cases into the desired linear behaviour.

There are additional tricks you can build on top of this:

• You can precompute some data so you don’t always have to force the structure. e.g. you can always calculate the length of the string in the above example without forcing it and still have the operations be O(1)
• you can sometimes have operations that only require partially forcing the data structure (e.g. if you index into a string you might only have to force one half of it (or neither if the index is out of bounds!)
• If you have more complex operations then you can do a sort of “query optimization” to rewrite the expression tree into a more efficient execution plan. For example, a thing I’ve done in the past is when the operation is intersection you can rewrite it so that intersections are processed in order of increasing size, which often ends up with you being able to terminate early because you’ve discovered that the end result is going to be empty regardless of what happens next.

Depending on circumstances, any of the above can be worth doing, but most of the big wins come from the basic trick which is almost always a valuable one if you’re running into this sort of problem.

Using this technique won’t always be an improvement – e.g. you’ll end up doing some duplicated work if you do something like x + x because it will effectively recompute forcing x twice – but most of the work loads on which it will behave particularly badly are ones that you should probably have avoided anyway with the normal approach. The only real downsides where you do suffer a hit from using this is that the laziness adds an additional check to each operation, which can be anywhere between a small and modest performance hit depending on how expensive the operation normally is. Additionally, if you want the operations to be thread safe then you’ll need a memory barrier of some sort (e.g. making the relevant field volatile) to get the visibility right, which adds another small hit.

So it’s not a universal win, but the cost is light enough and there are enough work loads where it improves behaviour substantially that it is often worth considering.

To finish off, and make this more concrete, here’s some Python code implementing this idea for strings:

(Like this post? Want to see more like it? Why not support my Patreon! You’ll get to see drafts of upcoming posts and also increase the amount I write)

This entry was posted in programming, Python on by .