Category Archives: programming

Rigging elections with integer linear programming

No, this isn’t a post about politics, sorry, it’s just a post about voting theory.

As you might have noticed (and if not, this is the post announcing it), I have a book out! It’s called Voting by Example and is about the complexities of voting and why different voting systems might be interesting. You should go buy it.

But this isn’t just an ad post for the book, it’s about something else: How I came up with the examples in this book.

At its core is the following example. We have a student election where there are four candidates are running for class president. Each student casts a vote ranking the four candidates in order of most to least preferred.

The votes are as follows:

  • 32% of students voted Alex, Kim, Charlie, Pat
  • 27% of students voted Charlie, Pat, Kim, Alex
  • 25% of students voted Pat, Charlie, Kim, Alex
  • 16% of students voted Kim, Pat, Alex, Charlie

The point of this example is that each of four different ranked voting systems give a different answer:

It’s also constructed to have a minimal number of voting blocs, though it’s minimum amongst a slightly more specific set of elections than just those satisfying the above conditions.

Roughly what this means is:

  • Alex has the most first choice votes
  • When compared pairwise with any other candidate, the majority prefer Charlie
  • Pat wins a complicated procedure where people iteratively drop out depending on who is the favourite of the remaining candidates amongst the fewest voters
  • If you give people a score of 3 for each voter who ranks them first, two for each who ranks them second, and one for each who ranks them third, then Kim has the highest score.

If you want to know more than that I recommend the relevant Wikipedia links above, which are all very approachable.

The significance of this is that each of these is a popular (either in practice or amongst electoral theorists) way of saying who should be the winner. So the situation ends up being rather complex to decide.

But this isn’t a post about the significance. This is a post about how I constructed the example.

Historically I’ve generally created example elections with Hypothesis or a similar approach – randomly fuzzing until you get a result you want – but that wasn’t really going to work very well here due to the rather complicated set of constraints that are hard to satisfy by accident.

So I instead turned to my old friend, integer linear programming.

The idea of integer linear programming (ILP) is that we have a number of variables which are forced to take integer values. We can then impose linear constraints between them, and give a linear objective function to optimise for. For example we might have three variables \(v_1, v_2, v_3\) and the following constraints:

  • \(v_1, v_2, v_3 \geq 0\)
  • \(v_1 + 2 v_2 + 3 v_3 \leq 5\)

And then try to maximise \(v_1 + 2 v_2 + 4 v_3\).

Given a problem like this we can feed it to an ILP solver and it will spit out a solution. If we do that, it will tell us that an optimal solution for this problem is \(v_1 = 2, v_2 = 0, v_3 = 1\).

A lot of problems can be nicely turned into ILP problems, and there are some fairly good open source ILP solvers, so it’s often a nice way to solve complex combinatorial problems.

So how do we turn our election rigging problem into an integer linear program?

The key idea is this: 4! isn’t very large. It’s only 24. 24 is tiny from an ILP point of view.

This means that there’s no problem with creating a variable for each of the possible votes that someone could cast, representing the number of people who cast that vote. That is, we create integer variables \(v_p\) indexed by permutations with the constraints:

  • \(\sum v_p = 100\)
  • \(v_p \geq 0\)

Then the value of e.g. \(v_{(0, 1, 2, 3)}\) is the number of people who cast the exact vote \((0, 1, 2, 3)\) (or, by name, Alex,, Charlie, Pat, Kim).

We then use a trick to get nicer examples, which is that we try to minimise the number of non-zero votes. The idea is to create variables which, when minimised, just look like markers that say whether a vote is non-zero or not.

So we create supplementary 0/1 valued integer variables \(u_p\) with the constraints that \(v_p \leq 100 u_p\), and set the objective to minimise \(\sum u_p\). Then \(u_p\) will be set to \(0\) wherever it can be, and the only places where it can’t are where \(v_p\) is non-zero. Thus this minimises the number of voting blocs.

So that’s how we create our basic ILP problem, but right now it will just stick 100 votes on some arbitrary possible ballot. How do we then express the voting conditions?

Well, lets start with the plurality and Borda scores. These are pretty easy, because they constitute just calculating a score for each candidate for each permutation and adding up the scores. This means that the scores are just a linear function of the variables, which is exactly what an ILP is built on.

Victory is then just a simple matter of one candidate’s score exceeding another. You need to set some epsilon for the gap (linear programming can’t express \(<\), only \(\leq\)), but that’s OK – the scores are just integers, so we can just insist on a gap of \(1\).

The following code captures all of the above using Pulp, which is a very pleasant to use Python interface to a variety of ILP solvers:

The idea is that the additive_scores parameter takes a list of scoring functions and a partial list of winners given those functions and returns an election producing those orders.

So if we run this asking for the plurality winners to be (0, 1, 2, 3) and the Borda winners to be (3, 2, 1, 0) we get the following:

>>> build_election(4, 100, [
    (lambda p, c: int(p[0] == c), (0, 1, 2, 3)),
    (lambda p, c: 3 - p.index(c), (3, 2, 1, 0))])
[((0, 3, 1, 2), 35), ((1, 2, 3, 0), 34), ((2, 3, 0, 1), 31)]

So this is already looking quite close to our starting example.

Creating a Condorcet winner is similarly easy: Whether the majority prefers a candidate to another is again just an additive score. So we just need to add the requisite \(N\) constraint that our desired Condorcet candidate wins.

if condorcet_winner is not None:
    victories = {
        (i, j): lpsum(
            v for p, v in variables if p.index(i) > p.index(j) )
            for i in candidates
            for j in candidates
    for c in candidates:
        if c != condorcet_winner:
            problem.addConstraint( victories[(condorcet_winner, c)] >=
                victories[(c, condorcet_winner)] + 1

If we run this to force the Condorcet winner to be \(1\) we now get the following:

>>> build_election(4, 100, [
    (lambda p, c: int(p[0] == c), (0, 1, 2, 3)),
    (lambda p, c: 3 - p.index(c), (3, 2, 1, 0))],
[((0, 3, 1, 2), 28),
 ((1, 2, 3, 0), 27),
 ((2, 1, 3, 0), 24),
 ((3, 2, 0, 1), 21)]

This is pretty close to the desired result. We just need to figure out how to set the IRV winner.

This is a bit more fiddly because IRV isn’t a simple additive procedure, so we can’t simply set up scores for who wins it.

But where it is a simple additive procedure is to determine who drops out given who has already dropped out, because that’s simply a matter of calculating a modified plurality score with some of the candidates ignored.

So what we can do is specify the exact dropout order: This means we know who has dropped out at any point, so we can calculate the scores for who should drop out next and add the appropriate constraints.

The following code achieves this:

    if irv_dropout_order is not None:
        remaining_candidates = set(candidates)
        for i in irv_dropout_order:
            if len(remaining_candidates) <= 1:
            assert i in remaining_candidates
            allocations = {j: [] for j in remaining_candidates}
            for p, v in variables:
                for c in p:
                    if c in remaining_candidates:
            loser_allocations = sum(allocations.pop(i))
            for vs in allocations.values():
                problem.addConstraint(loser_allocations + 1 <= sum(vs))

And running this we get the following:

>>> build_election(4, 100, [
    (lambda p, c: int(p[0] == c), (0, 1, 2, 3)),
    (lambda p, c: 3 - p.index(c), (3, 2, 1, 0))
], condorcet_winner=1, irv_dropout_order=(3, 1, 0, 2))
[((0, 3, 1, 2), 31),
 ((1, 2, 3, 0), 27),
 ((2, 1, 3, 0), 25),
 ((3, 2, 0, 1), 17)]

This isn’t quite the same example in the book (the first bloc has one fewer vote which the last bloc got in this), because the book example had a bit more code for optimizing it into a canonical form, but that code isn’t very interesting so we’ll skip it.

Here’s the full code:

I’m constantly surprised how nice integer linear programming ends up being for constructing examples. I knew I could do the Borda and plurality scores – that’s in fact the example that motivated me trying this out at all – but although I find it obvious in retrospect that you can also fix the Condorcet winner it definitely wasn’t a priori. The fact that it’s easy to also calculate the IRV dropout order was genuinely surprising.

This is also a nice example of how much fun small n programming can be. This isn’t just an O(n!) solution – it generates a problem of size O(n!) and then feeds it to a solver for an NP-hard problem! in principle that should be catastrophic. In practice, n is 4, so who cares? (n=4 is about the limit for this too – it just about works for n=5, and for n=6 it doesn’t really).

I also find it somewhat surprising how much freedom there is to get different results from different voting systems. I was motivated to do some of this by The ultimate of chaos resulting from weighted voting systems, so I knew there was a fair bit of freedom, but I was somewhat expecting that to be a pathology of weighted voting systems, so I’m still a little surprised. I guess there’s just quite a lot of room to play around with in a 24-dimensional simplex.

Which brings me to my final point: When you’re trying to understand what is possible in a domain, this sort of example generation focused programming is very useful for exploring the limits of the possible and building your intuition. I feel like this isn’t a thing people do enough, and I think we should all do more of it.

I also think you should buy my book.

(Or support my writing on Patreon)

This entry was posted in programming, Python, voting on by .

Faster random sampling by handling rejection well

Attention conservation notice: You probably don’t have the problem this post solves. It might be interesting anyway.

As part of some ongoing development for Hypothesis, I find myself in need of a well optimised solution to the discrete sampling problem: We have weights \(w_1, \ldots, w_n\) and we want to produce a number in \(\{1, \ldots, n\}\) such that \(i\) is drawn with probability proportional to \(w_i\).

All solutions to this problem are intrinsically \(O(n)\): If you have \(w_i = 1\) for \(i \neq j\) and \(w_j = 10^6\) then you need to inspect \(j\) to know whether to generate \(i \neq j\) often or rarely.

You can however use that \(O(n)\) work once to build a data structure for repeated drawing. For example you can use the alias method to build a data structure that lets you do future draws in \(O(1)\).

Which is great and all if you’re expecting to repeatedly draw from the same distribution over and over again. Unfortunately that’s not what my workload looks like – the distributions used are many and varied. Distributions will be reused, but not in any easy to predict manner.

You can of course do cached lookup to an alias method sampler. This works, but if your cache hit rate gets too high (which it often will be in my workload) this doesn’t actually improve matters. Additionally, depending on how good your caching is and how much is in there, it might not be much cheaper to look up something in the cache  than it is to just create a new one from scratch.

So what we’d like is a method that is significantly cheaper to initialise. Fortunately there is one, it’s called rejection sampling!

You just need to calculate \(W = \max w_i\). Now you iterate the following procedure:

  1. Pick \(i\) uniformly at random
  2. With probability \(\frac{w_i}{W}\) return \(i\)
  3. Go to 1

This is very cheap to initialise but not so great in terms of run time: If the weights are uniform then it’s optimal (because you always return \(i\) in step 2), but if they’re very far from uniform then you’re going to make a lot of trips (potentially \(O(n)\) in the case where one weight dominates everything else) through that loop.

But there turns out to be a hybrid solution that works pretty well and gets us most of the cheapness of initialising rejection sampling with most of the performance of our better sampling methods, with a tuning parameter deciding how much of each we want.

Pick an integer parameter \(R > 0\) (good choices for \(R\) are probably \(1\) or \(2\)) and then run the following algorithm:

  1. Pick \(i\) uniformly at random
  2. With probability \(\frac{w_i}{W}\) return \(i\)
  3. If we’ve reached this point fewer than \(R\) times, go to 1.
  4. Look up or initialise a new sampler for these weights and draw from that.

So we perform the cheap rejection sampling until it’s obvious that it’s not helping much and then move to a proper sampling method.

For cases where we’re near the optimum case for rejection sampling we’ll do well and mostly use it, for the rest of the cases we’ll spend a bit of an extra cost. Potentially we could also do some extra book keeping – if we tend to find rejection sampling never works we could just turn off that phase of the algorithm entirely.

Experiments so far in the thing I’m implementing suggests that this idea provides a modest improvement compared to initializing the sampler (I’m actually using a data structure from Succinct Sampling from Discrete Distributions, on grounds of it being substantially cheaper to initialize at the cost of being slightly more expensive to run. Plausibly this would give a much bigger improvement if I were using the alias method).

This probably isn’t that useful a trick. The work load I’m optimising for is a bit odd, and the performance improvements under that workload are only modest. Additionally, the baseline is already fairly fast. But it’s not something I’d seen before and it seemed worth noting down.

(Something something Patreon something. It feels a bit cheap to plug it here given the relatively niche nature of this post, but hey I was doing the Hypothesis research that lead to it for free, so I guess it balances out?)

This entry was posted in programming on by .

Proving or refuting regular expression equivalence

Suppose we are interested in the following problem: We have two regular expressions. Do these match exactly the same strings, and if not what is the smallest (shortlex minimal) string that one matches but not the other?

i.e. we either want to prove they are equivalent or provide the best possible refutation that they are not.

How do we do this?

More specifically, let’s consider this problem for extended regular expressions. That is, we have the following types of expressions:

  • \(\epsilon\), which matches only the empty string
  • \(c\), for \(c\) in our alphabet, which matches only the character \(c\)
  • \(\emptyset\), which matches no strings
  • \(\omega\) which matches every string
  • \(x^*\) which matches \(a_1\ldots a_n\) for any strings \(a_i\) matched by \(x\).
  • \(xy\) where \(x\) and \(y\) are extended regular expressions, which matches any strings \(ab\) where \(x\) matches \(a\) and \(y\) matches \(b\)
  • \(x \vee y\) which matches any string matched by either or both of \(x\) or \(y\)
  • \(x \wedge y\) which matches any string matched by both \(x\) and \(y\)
  • \(\neg x\) which matches any string not matched by \(x\).

That is, we extend the normal regular expressions with intersection and negation. This doesn’t actually increase the set of languages we can match, but it can lead to significantly more concise representations (I think, but am not sure, there are extended regular expressions which require exponentially larger regular expressions to represent).

This reduces to the problem of finding the lexmin smallest member of the regular language for an expression or showing that it’s empty: Given two extended regular expressions \(x, y\), we can define \(x \oplus y = (x \vee y) \wedge (\neg (x \wedge y))\) – the symmetric difference. This matches any string which is matched by one but not both of \(x\) and \(y\), so a string is a member of this regular language precisely if it is a refutation of the claim that \(x\) and \(y\) are equivalent.

If we can compile extended regular expressions into a deterministic finite automaton (which we can), then the problem is easy:  To find the lexmin smallest element of a regular language or show that it’s empty given a deterministic finite automaton for it, you just use Dijkstra’s Algorithm to search the graph for an accepting node. You stop when you either find one (in which case you’ve constructed a lexmin path to it along the way) or you’ve explored all the states (in which case there is no path to an accepting node and the language is empty). Job’s done.

That was easy. Are we done?

Well, no. There are a lot of evils hidden in that sentence “just construct a deterministic finite automaton”. The corresponding automaton to a regular language may be exponentially larger than the language, and even when it’s not it’s still an expensive operation. Can we avoid that?

First, let’s take a detour: There are a number of obvious rewrite rules we can use to simplify our lives, by replacing regular expressions with straightforwardly equivalent versions of them. We use \(x \approx y\) to indicate that we should consider \(x\) and \(y\) the same under these rewrite rules:

  • \(x \wedge x \approx x\)
  • \(x \wedge y \approx y \wedge x\)
  • \(x \wedge (y \wedge z) \approx (x \wedge y) \wedge z\)
  • \(x \wedge \emptyset \approx \emptyset\)
  • \(x \wedge \omega \approx x\)
  • \(x \vee x \approx x\)
  • \(x \vee y \approx y \vee x\)
  • \(x \vee (y \vee z) \approx (x \vee y) \vee z\)
  • \(x \vee \emptyset \approx x \)
  • \(x \vee \omega \approx \omega \)
  • \(x(yz) \approx (xy)z\)
  • \(x\emptyset \approx \emptyset\)
  • \(\emptyset x \approx \emptyset\)
  • \(x \epsilon \approx x\)
  • \(\epsilon x \approx x\)
  • \((x^*)^* \approx x^*\)
  • \(\epsilon^* \approx \epsilon\)
  • \(\emptyset^* \approx \epsilon\)
  • \(\omega^* \approx \omega\)
  • \(\neg \emptyset \approx \omega\)
  • \(\neg \omega \approx \emptyset\)
  • \(\neg(\neg x) \approx x\)

If \(x \approx y\) we say that \(x\) and \(y\) are similar.

All of these correspond to rules that apply for the resulting languages, so if \(x \approx y\) then \(x\) and \(y\) match the same language. In general though \(x\) and \(y\) might match the same languages without being similar.

Why do we care about this? Well the first reason is that it lets us have a fast test for when two regular expressions definitely are equivalent: If they are similar then they are equivalent. If they’re not we have to do more work to test this.

In particular, we can use smart constructors for our regular expression data type to apply all of these rewrite rules at construction time, which then makes similarity simply a matter of testing for structural equality. If we use hash consing (essentially: memoizing all constructors so that structurally equal values are represented by the exact same object) we can even make testing for similarity O(1).

Which is good, because we’re going to need it a lot: The real reason to care about similarity of regular expressions is that we can construct the a deterministic finite automaton using the Brzozowski derivative of regular expressions, with each of our states labelled by a regular expression – if we have some character in our alphabet \(c\) and a regular language \(L\) then the Brzozowski derivative is \(d(L, c) = \{v: cv \in L\}\) – the set of suffixes of strings starting with \(c\) in \(L\). We can extend this to regular expressions with the following set of rules:

  • \(d(\epsilon, c) = \emptyset\)
  • \(d(b, c) = \emptyset\) if \(b \neq c\), else \(d(c, c) = \epsilon\)
  • \(d(\emptyset, c) = \emptyset\)
  • \(d(\omega, c) = \omega\)
  • \(d(x^*, c) = d(x, c) x^*\)
  • \(d(xy, c) = d(x, c)y \vee \nu(x) d(y, c)\)
  • \(d(x \vee y, c) = d(x, c) \vee d(y, c)\)
  • \(d(x \wedge y, c) = d(x, c) \wedge d(y, c)\)
  • \(d(\neg x, c) = \neg d(x, c)\)

Where \(\nu(x)\) is \(\emptyset\) if the corresponding regular language doesn’t match the empty string, or \(\epsilon\) otherwise (this can be calculated directly from the expressions by similar rules).

The idea is that we can construct an automaton for a regular language by labelling states by expressions. States are accepting if the corresponding expression matches the empty string, and the transition from state \(x\) given character \(c\) goes to \(d(x, c)\).

If our similarity rule was enough to guarantee equivalence then this would be a minimal deterministic finite automaton for the language. However, even without that there are only finitely many states reachable from a given regular expression (this is where similarity becomes important! Without it this is not true. e.g. if you consider iterating the derivatives of \(c^*\) under \(c\) you’ll get infinitely deep nested chains of similar but unequal expressions).

So far it isn’t obvious why this is interesting, but it has one very important property: You can walk this automaton without having to compute any of the states in advance. This means that you can build it as you explore, so we can repeat the above approach of just using Dijkstra’s algorithm to find a refutation but we don’t have to build the automaton in advance.

It’s particularly easy to calculate the derivatives of the disjoint sum language too:

\(d(x \oplus y, c) = d(x \vee y, c) \wedge d(\neg (x \wedge y), c) =  (d(x, c) \vee (y, c)) \wedge (\neg (d(x, c) \wedge d(y, c))) = d(x, c) \oplus d(y, c)\).

So if we’ve got the derivatives of our original regular expressions cached in some way it’s very cheap to calculate derivatives of our compound regular expression.

So that’s a major improvement on our original algorithm – in a number of cases it will be exponentially faster!

There are a number of ways you can improve it further: You can prune down the possible set of starting characters for a string in an imprecise way to reduce the number of branches you explore when doing Dijkstra. You can also precalculate when two characters will lead to the same state and always consider only the smallest. This is a worthwhile set of optimisations to the algorithm, but doesn’t affect it in any really fundamental way.

Can we do better yet?

The problem is that although it’s easy to calculate, our new automaton potentially has a lot more states than we want. In general it will be fewer, but if there are \(M\) states in \(x\) and \(N\) in \(y\) then there might be as many as \(MN\) in the automaton for the symmetric difference regular expression. We’ll likely not have to explore all of them, but it’s unclear how many you might have to explore.

But you can do equivalence testing in nearly O(M + N) using an algorithm due to Hopcroft and Karp based on union-find (Note: the algorithm called the Hopcroft-Karp algorithm is confusingly something completely different). The idea is that assume that they are equivalent and we merge states until we find a contradiction, and that by aggressively merging states we can consider a lot fewer of them:

Given two automata \(S\) and \(T\) with initial states \(s\) and \(t\), we test equivalence as follows:

def equivalent(S, T):
    s = S.start
    t = T.start
    merges = UnionFind()
    merges.union(s, t)
    stack = [(s, t)]
    while stack:
        p, q = stack.pop()
        for a in ALPHABET:
            pa = merges.find(S.transition(p, a))
            qa = merges.find(S.transition(q, a))
            if pa.accepting != qa.accepting:
               return False
            if pa != qa:
                merges.union(pa, qa)
                stack.append((pa, qa))
    return True

The idea is that if two states are equivalent then the states they transition to under the same character must be equivalent, and if we’ve proved that an accepting and non-accepting state are equivalent then we’ve reached a contradiction and our initial assumption must have been false. This merges two states if and only if there is some string that leads to one in one automaton and the other in the other, so if there is some string that one matches but not the other it will eventually merge those two states (if it doesn’t stop earlier).

Because of the way we aggressively merge states, this runs in time O(k(M + N)) where k is the size of the alphabet and M and N are the size of the respective states (there’s actually an inverse Ackermann factor in there too, but the inverse Ackermann function is so close to constant it might as well be ignored).

The problem with this algorithm is that it doesn’t produce a refutation, only a yes or no. We can trivially modify it by pushing paths along with the states, but the paths generated aren’t necessarily useful and in particular don’t necessarily go to a state that is accepting in one and not in the other – we’ve replaced paths to states we believed to be equivalent, but because the core assumption is wrong this didn’t necessarily work. In particular there’s no guarantee that either of the states that we use to refute the result are accepting, only that we falsely believe them to be equivalent to two states where one is accepting and the other is not.

We can still make use of this by using this as a fast initial test; Use Hopcroft and Karp’s algorithm as a fast initial test for equivalence and if it returns false then construct a refutation once we know one must exist. This algorithm also still works well with the lazy construction.

But it would be nice to be able to produce a witness as an integrated part of running this algorithm.

And it turns out we can: The idea is to hybridise the two algorithms – use Dijkstra’s algorithm to build the witness, but instead of operating on a graph corresponding to a single automaton, we build a graph on pairs of states, one from each automaton – an edge then involves following the corresponding edge in each automaton. Rather than searching for an accepting node, we are now searching for a node where one state is accepting and the other is not.

The key thing we then borrow from the algorithm of Hopcroft and Karp is that we can skip states where we have merged the pairs, because there is nothing interesting down that route even if we haven’t seen it.

Here’s some code for the resulting algorithm:

def witness_difference(left, right):
    if left is right:
        return None
    if left.matches_empty != right.matches_empty:
        return b''
    merges = UnionFind()
    merges.union(left, right)
    queue = deque()
    queue.append((left, right, ()))
    while len(queue) > 0:
        p, q, s = queue.popleft()
        for a in ALPHABET:
            pa = derivative(p, a)
            qa = derivative(q, a)
            if pa is qa:
            t = (a, s)
            if merges.find(pa) != merges.find(qa):
                if pa.matches_empty != qa.matches_empty:
                    result = bytearray()
                    while t:
                        a, t = t
                    return bytes(result)
                merges.union(pa, qa)
                queue.append((pa, qa, t))
    return None

One important difference here is that we always put the actual reached pairs of states onto the queue rather than the collapsed pair. This ensures that the paths we take are real paths to real states in the corresponding automaton. Also, because we stop as soon as we ever would merge an accepting and no-accepting state, the problem we worried about above can’t happen.

Also it’s worth noting that I’m not sure this algorithm produces a minimal refutation. It should produce a pretty small one, but there be a smaller one reachable through two states which the algorithm has merged but that are not actually equivalent.

The requirement for witnesses is a slightly unusual one – most algorithms just focus on testing for equivalence – so it would be reasonable to ask why we care.

The reason I care is that I’m interested in the following more general question: I want to maintain a set of N regular languages as witnessed by regular expressions, and I want to be able to test for membership of this set and add new elements to it.

I’m still not sure what the best way to do this is, but all of the best ideas I have require refutations to prevent you from having to do O(N) equivalence tests. For example, the following works: Maintain a binary tree, where each branch node has a string and each child node is a regular expression. When testing for membership, walk the tree as follows: At a branch, if the language contains the string then go to the left child else go to the right child. At a leaf, perform an equivalence test. If it returns true then the language is a member.

Insertion is then nearly the same: We walk until we get a leaf. At the leaf we perform an equivalence test with a refutation. If the two languages are equivalent then there’s no insertion to perform. If they’re not equivalent, then we create a split node with the refutation string as the key and put whichever one contains it as a leaf for the left child and the other one as a right.

This isn’t a great algorithm, in that it doesn’t have any easy way to balance the tree, but it’s always better than doing N equivalence checks: Matching a short string is very cheap, and we always perform at most N – 1 matches (potentially many fewer) and exactly one equivalence check.

In fact, it is not in general possible to balance the tree; Consider N strings where string k matches only the string consisting of a sequence of k 0s. Then each test string has exactly one language which matches it, so every split must have one of its children be a leaf and the look up problem degenerates to always testing N strings.

Another algorithm is to maintain a trie of all the refutations found so far. You can then simultaneously walk the regular expression’s deterministic finite automaton and the trie to produce a bit vector where bit i is set if the language contains the i’th string. This bitvector is then the same as that of at most one existing language, which can be looked up using e.g. a hash table. So to test membership you produce the bitvector and check for presence in the hash table. If it’s not present then the language isn’t present. If it is present, then check equivalence with the unique existing language with that bitvector. Insertion is similar: If the bitvector isn’t present, add the language to the hash table and leave the set of refutations alone. Else, check for equivalence with a refutation to the unique existing language with the same bitvector. If they’re equivalent, do nothing. If they aren’t, add the refutation to the trie and update the bitvector for all existing languages (you can actually do this last step incrementally as you go, but it may not be worth it).

The second algorithm is probably strictly worse than the first because it involves an O(N) computations on a lot of inserts, as well as testing for membership being more expensive. It really depends on how much cheaper equivalence is than matching for your language workload, and how likely existing refutations are to witness future ones.

One reason why being able to do the above is interesting is that it allows us to extend the lazy construction of a deterministic finite automaton to a lazy construction of a minimal deterministic finite automaton: If we can maintain such a set relatively efficiently, then instead of merely conflating similar expressions into the same states we can label states with regular languages, giving us the proper derivative automaton rather than just our approximation to it.

So that’s where my ability to answer this leaves off a bit. There may be a good data structure for this, but if there is I haven’t found the right corner of the literature to tell me what it is.

if you want to have a play with some of this, here’s some (mostly untested so probably badly wrong in places) Python code implementing a lot of these ideas:

If you’d like to learn more about this, most of the information in this post either came from or was based on material from Regular-expression derivatives reexamined,  Testing the Equivalence of Regular Languages and Algorithms for testing equivalence of finite automata, with a grading tool for JFLAP (which are in turn based on a lot of other papers you can get to from their citations).

(And, as always, if you’d like to see more posts like this and feel able to spare the price of less than a coffee per month, I encourage you to donate to my Patreon for supporting my writing.)

This entry was posted in Automata Theory, programming on by .

Randomized exhaustive exploration of a many branching tree

Attention conservation notice: This post is largely an obvious in retrospect solution to an obscure problem.

It’s a problem I spent a lot of time considering wrong or complicated solutions too though (this is what the persistent data structure I was considering was for, but it’s not needed in the end), so I thought it was worth writing up the correct solution.

The problem is this: I have a tree of unknown structure. Each node in the tree is either a leaf or has an ordered sequence of N (variable per node) children (technically the leaves are just a special case of this with N=0, but leaves have special properties as we’ll see in a moment).

I want to do an exhaustive search of the leaves of this tree, but there’s a limitation: when walking this tree I may only walk complete branches starting from the root and finishing at a leaf. I cannot back up and choose another branch.

The tree is potentially very large, to the point where an exhaustive search might be impossible. So this should actually be treated as an anytime algorithm, where we’re iteratively yielding new paths to a leaf which some external algorithm is then doing something with. When yielding those we want to guarantee:

  • We never yield the same path twice
  • We know when we have explored the entire tree and stop

It’s easy to do a lexicographic exploration of the branches in sorted order – just build a trail of the choices you took. At each step, increment the last choice. If that causes it to exceed the number of choices available, pop it from the end and increment the previous next choice. When making an unknown choice always choose zero.

But there’s a reason you might not want to do this: Under certain reasonable assumptions, you may end up spending a lot of time exploring very deep leaves.

Those reasonable assumptions are this:

  • All else being equal, we should prefer leaves which are closer to the root (e.g. because each walk to a child is expensive, so it takes less time to explore them and we can explore more leaves / unit time)
  • The expected number depth of a tree below any node given uniformly at random choices is usaully much lower than the maximum depth of the tree elow that point.

Under these assumptions the optimal solution for exploring the tree is instead to just randomly walk it: You will avoid the long paths most of the time because of the expectation assumption, and will do a better job of exploring close to the root.

Unfortunately, pure random exploration does not satisfy our requirements: We might generate duplicate paths, and we don’t know when we’ve explored all the nodes.

So what we want is a hybrid solution that exhaustively searches the tree without duplicates, but chooses its paths as randomly as possible.

Any way, there turns out to be a very simple way of doing this. It’s as follows:

We maintain a list of paths to unexplored nodes, bucketed by their length. This is initially the empty path leading to the root.

We then repeatedly iterate the following steps until there are no unexplored nodes in the list:

  1. Pick an unexplored node of minimal length uniformly at random and remove it from the list
  2. Walk to that node
  3. Walk randomly below that node. Every time we make a random decision on a child, add the path we took to get to this point plus each other decision we could have taken to the list of unexplored nodes.
  4. When we reach a leaf node, yield that path to the controlling algorithm.

Every path we walk goes through a node we have not previously explored, so must be unique. When the list is empty we’ve explored all the nodes and are done.

The correct storage format for the paths is just an immutable singly linked list stored with the last node in the path first – appending a node to the path is just consing it to the front of the list. You’re going to have to do O(n) work to reverse it before walking upwards, but that’s OK: You’re going to have to do O(n) work with n the same and a higher constant factor to walk back up the tree along that path anyway.

The reason for my original need for a data structure with better forward iteration than that was because the algorithm I was originally considering always picked the unexplored node with the shortlex minimal (lexicographically first amongst those with the shortest length) path to it, but there’s no real reason you need to do that and it makes things much more expensive.

If you want to instead do weighted random walks of the tree and are prepared to relax the requirements around exploring shorter paths first a bit, there’s an alternative algorithm you can use (and I’m actually more likely to use this one in practice because I’ll need to do weighted draws) which works as follows:

We build a shadow tree in parallel every time we walk the real tree. The shadow tree is mutable, and each node in it is in one of three states: Unexplored, Active or Finished. It starts with a single root node which is Unexplored.

We maintain the following invariant: Any node in the shadow tree which is not Finished has at least one child node that is not Finished. In particular, leaf nodes are always Finished.

We then decide how to walk the tree as follows: We always walk the shadow tree in parallel, matching our walk of the real tree. Whenever we walk to a node that is Unexplored, we do one of two things:

  1. If the corresponding real node is a leaf, we immediately mark it as Finished (and our walk is now done)
  2. If the corresponding real node has N children, we mark it as Active and create N Unexplored child nodes for it.

Once our walk has complete (i.e. we’ve reached a leaf), we look at the list of nodes in the shadow tree that we encountered  and walk it backwards. If all of the children of the current node are Finished, we mark the node as Finished as well. If not, we stop the iteration.

We can now use the shadow tree to guide our exploration of the real tree as follows: Given some random process for picking which child of the real tree we want to navigate to, we simply modify the process to never pick a child whose corresponding shadow node is Finished.

We can also use the shadow tree to determine when to stop: Once the root node is marked as Finished, all nodes have been explored and we’re done.

Note that these two approaches don’t produce at all the same distribution: If one child has many more branches below it than another, the initial algorithm will spend much more time below that child, while the shadow tree algorithm will explore both equally until it has fully exhausted one of them (even if this results in much deeper paths). It probably depends on the use case which of these are better.

You can use the shadow tree to guide the exploration too if you like. e.g. if there are any unexplored child nodes you could always choose one of them. You could also store additional information on it (e.g. you could track the average depth below an explored node and use that to guide the search).

Anyway, both of these algorithms are pretty obvious, but it took me a long time to see that they were pretty obvious so hopefully sharing this is useful.

As an aside: You might be wondering why I care about this problem.

The answer is that it’s for Hypothesis.

As described in How Hypothesis Works, Hypothesis is essentially an interactive byte stream fuzzer plus a library for doing structured testing on top of this. The current approach for generating that bytestream isn’t great, and has a number of limitations. In particular:

  • At the moment it fairly often generates duplicates
  • It cannot currently detect when it can just exhaustively enumerate all examples and then stop (e.g. if you do @given(booleans()) at the moment the result isn’t going to be great).
  • The API for specifying the byte distribution is fairly opaque to Hypothesis and as a result it limits a number of opportunities for improved performance.

So I’m exploring an alternative implementation of the core where you specify an explicit weighted distribution for what the next byte should be and that’s it.

This then opens it up to using something like the above for exploration. The nodes in the tree are prefixes of the bytestream, the leaves are complete test executions. Each non-leaf node has up to 256 (it might be fewer because not every byte can be emitted at every point) child nodes which correspond to which byte to emit next, and then we can treat it as a tree exploration problem to avoid all of the above issues.

I’ve currently got a branch exploring just changing the API and keeping the existing implementation to see how feasible rebuilding Hypothesis this way is. So far the answer is that it’ll be a fair bit of work but looks promising. If that promise pans out, one of the above (probably the shadow tree implementation) is likely coming to a Hypothesis near you at some point in 2017.

(Why yes, it is a programming post! I do still do those. I know these are a bit thin on the ground right now. If you want more, consider donating to my Patreon and encouraging me to write more).

This entry was posted in Hypothesis, programming, Python on by .

A simple purely functional data structure for append and forward traversal

For reasons I’ll get into in a later post I’m interested in a data structure with the following properties:

  • It represents a sequence of values
  • It is extremely cheap to append a value to the end in a way that returns an entirely new sequence. The same sequence will typically be extended many times, so no copy on write or append tricks allowed here (no vlists for you) it has to be genuinely cheap
  • Forward traversal is very cheap, especially when the traversal stops after only a small number of elements.
  • It needs to be very simple to implement in a variety of languages including C

The fact that you’re appending to the end and traversing from the beginning is the annoying bit.

The data structure to beat here is to just use a normal linked list, interpreted as storing the list backwards, and then reverse on traversal. If we expected to do a full traversal each time that would probably be pretty close to optimal, but the fact that we often want to traverse only a small number of elements makes it annoying.

(Spoilers for later post: these represent partial paths through a rose tree. The reason why we want fast traversal forwards over a couple of elements is that we want to compare them lexicographically. The reason why we want fast appends is that every time we make a choice we create path prefixes for all the choices we could have taken and didn’t. The reason it needs to be simple is that I might want to use this in Hypothesis, which has “be easily portable to new languages” as an explicit design goal).

The “right” data structure here is probably a finger tree. I don’t want to use a finger tree for one very simple reason: Finger trees are a complete pain to implement in Haskell, let alone another language.

It turns out there’s a data structure that achieves these goals. I haven’t benchmarked it, so unclear at present how useful it is in practice, but for the specific requirements I have it’s probably pretty good. It’s also pretty simple to implement.

The initial idea is that we store our elements as a left-leaning binary tree. That is, a binary tree where the left hand branch always points to a complete binary tree.

Appending then consists of either creating a new split node with the current tree on the left and the new leaf on the right (if the existing tree is already complete), or appending the element to the right hand branch (if the tree is not yet complete then the right hand branch must not yet be complete). i.e. we just do the natural thing for adding it to the rightmost side of a left-leaning binary tree.

This is already pretty good. Append is log(n) in principle, but in practice that side of the binary tree will usually be very cheap. Traversing a binary tree from left to right is pretty straightforward.

There are only two problems with it:

  • Forward traversal isn’t that good, because we have to descend into the deepest part of the tree to get to the first element.
  • We need to decide what sort of book keeping we want to do to check the size of the tree.

We’ll solve the second problem first. We could do this by size annotating every tree node (that was my original idea), but we actually don’t need to because we only need to know the total size of the tree to work out the number of elements below any node: The left hand node always contains the largest power of two strictly smaller than the total number of elements in the tree, the right hand node contains whatever is left.

So instead of size annotating every node we work with bare trees and define a wrapper type which is simply a tree and a count of elements. For some work loads this won’t always be an improvement (because it may reduce sharing), but for most it should.

The first problem is also straightforward to solve (the solution is not one I had seen before, but is certainly not novel to me): We “hoist” the first element of the left subtree up to the split node. So where a binary tree is normally either key list or stores an element between its two subtrees, here we store the leftmost element that “should” be in the left subtree. Note that this means that our balance condition is actually that the number of elements in the left subtree is one fewer than the number of elements in our right subtree (because the element on the current node is implicitly part of the left subtree).

The nice feature of this is that it means that the first element is accessible in O(1). Indeed, the i’th element is accessible in O(min(i, log(n))) – if i < log(n) then finding the i’th element is just a question of reading down the left hand side of the tree, same as it would be if this were a normal cons list.

The following Haskell code (yes I can still write Haskell, thank you very much) provides a minimal implementation of this concept:

This turns out to be very close to an existing purely functional data structure which I hadn’t previously known about – Okasaki’s Purely Functional Random-Access Lists. It’s somewhat simpler in places, mostly because of the operations it doesn’t need to bother supporting, but the binary nodes which contain the first child element is the same and the power-of-two structure is similar but different (in ways where Okasaki’s might just be better. I’m not totally sure, but this is definitely simpler).

I’m still hoping that the CS Theory Stack Exchange will give me a better answer here, but to be honest this is probably more than good enough for my purposes and any further work on this question is basically yak shaving.

This entry was posted in programming on by .