This is a trick I figured out this morning. It seems to work (in that I have a prototype that works just well enough to pass a single test), and I think it might have some interesting applications. I haven’t yet decided whether it’s more than a fun toy though.

It’s entirely possible (likely, even) that this is a well known idea that has some fancy name that I don’t know. I was doing some digging on related areas and couldn’t see anything like it, but that just means I wasn’t looking in the right places. It may also be that it’s never been written up because it’s a bad idea. Usual disclaimers about having clever ideas apply.

The idea is this: We’re working with boolean expressions over a set of variables. We want a canonical representation of them. We could use reduced ordered binary decision diagrams, but we dislike exponential blow-up. So what do we do?

There are a whole pile of extensions to the idea of BDDs which relax constraints, add features, etc. We’re not going to do that. Instead we’re going to canonicalize as we go using the awesome power of mutability.

For this trick we will need:

- A SAT solver (this problem is intrinsically equivalent to solving SAT, so there’s no getting away from that. We might as well let someone else do the hard work). I used minisat for my prototype. This would presumably work better with one of the various patched minisat extensions.
- A single mutable binary tree with leaves as boolean expressions and splits labelled by sets of variables.
- Union find.
- Hash consing (with weak references if you prefer, but that will hurt speed).

Our working representation for boolean expressions will be that an expression can be one of:

- A literal true
- A literal false
- A single variable
- The negation of another boolean expression
- The conjunction (and) of two other boolean expressions

We hash cons the representation of our expressions so that any two structurally equal expressions are reference equal. We then make them the items in our union find, so every expression additionally holds a reference to another expression, or to itself if it is our current canonical representative for a set of equivalent expressions.
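
As a rough illustration of that representation, here is a minimal Python sketch. The names (`Expr`, `intern`, `var`, `not_`, `and_`) are made up for this example and are not from the prototype, and a plain dict stands in for the hash cons table, so nothing is ever collected.

```python
class Expr:
    """A hash-consed boolean expression that is also a union-find item."""

    _table = {}  # structural key -> the one Expr with that structure

    def __init__(self, key):
        self.key = key      # e.g. ("and", left_expr, right_expr)
        self.parent = self  # union-find link; a root points at itself

    @classmethod
    def intern(cls, key):
        # Children inside the key are already interned Expr objects, which
        # hash and compare by identity, so this lookup stays shallow.
        expr = cls._table.get(key)
        if expr is None:
            expr = cls._table[key] = cls(key)
        return expr


TRUE = Expr.intern(("true",))
FALSE = Expr.intern(("false",))


def var(name):
    return Expr.intern(("var", name))


def not_(child):
    return Expr.intern(("not", child))


def and_(left, right):
    return Expr.intern(("and", left, right))
```

With this, building the same structure twice gives back the same object, so structural equality is just an `is` check. Note that `and_(x, y)` and `and_(y, x)` are still different objects at this stage; only the later equivalence check ties them together.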

Now, every time we introduce a new expression that is not previously in our hash cons we want to find out if it is equivalent to any expression we have previously seen. If it is, we perform a merge operation in our union find, rooting the more complicated of the two in the simpler of the two. If not, we add it as a new root.
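
The merge step might look something like the sketch below. It assumes expressions are nested tuples such as `("and", a, b)`, and it uses a hypothetical `cost` function (a count of distinct subexpressions) to decide which expression is "simpler"; both are choices made for this example rather than anything fixed by the idea.

```python
def cost(expr):
    """Count the structurally distinct subexpressions of a nested-tuple expression."""
    seen, stack = set(), [expr]
    while stack:
        e = stack.pop()
        if e not in seen:
            seen.add(e)
            if e[0] in ("not", "and"):
                stack.extend(e[1:])  # visit the child expressions
    return len(seen)


class Item:
    """A union-find item wrapping one expression."""

    def __init__(self, expr):
        self.expr = expr
        self.parent = self  # a root points at itself


def find(item):
    # Locate the root, then point everything on the path straight at it.
    root = item
    while root.parent is not root:
        root = root.parent
    while item.parent is not root:
        item.parent, item = root, item.parent
    return root


def merge(a, b):
    """Union two classes, rooting the more complicated root in the simpler one."""
    ra, rb = find(a), find(b)
    if ra is rb:
        return ra
    simpler, other = (ra, rb) if cost(ra.expr) <= cost(rb.expr) else (rb, ra)
    other.parent = simpler
    return simpler
```

Picking the root by simplicity rather than by rank gives up the usual union-by-rank guarantee, though path compression still keeps lookups reasonably cheap.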

Obviously we don’t do that by searching through every expression we’ve seen so far. This is where the binary tree comes in.

The binary tree starts with a single split labelled with the empty set, with the literal false as its left leaf and the literal true as its right leaf. When we insert a new expression, we walk the tree as follows:

At a split node, look at the set of variables and determine what value this expression has when those variables are set to true and every other variable is set to false. If that result is true, walk to the right child. Otherwise, walk to the left child.
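
Here is a small sketch of that walk, assuming expressions are nested tuples like `("and", a, b)`. The `Split` and `Leaf` classes and the `walk` function are names invented for this example.

```python
def evaluate(expr, true_vars):
    """Evaluate expr with the variables in true_vars true and all others false."""
    tag = expr[0]
    if tag == "true":
        return True
    if tag == "false":
        return False
    if tag == "var":
        return expr[1] in true_vars
    if tag == "not":
        return not evaluate(expr[1], true_vars)
    if tag == "and":
        return evaluate(expr[1], true_vars) and evaluate(expr[2], true_vars)
    raise ValueError(f"unknown expression tag: {tag!r}")


class Leaf:
    def __init__(self, expr):
        self.expr = expr


class Split:
    def __init__(self, true_vars, left, right):
        self.true_vars = true_vars  # the set of variables labelling this split
        self.left = left            # taken when the expression evaluates to false
        self.right = right          # taken when the expression evaluates to true


def walk(tree, expr):
    """Follow expr down from the root until it reaches a leaf."""
    node = tree
    while isinstance(node, Split):
        node = node.right if evaluate(expr, node.true_vars) else node.left
    return node
```

On the initial tree, `("var", "x")` is false when every variable is false, so it lands on the leaf holding the literal false, while its negation lands on the leaf holding the literal true.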

Once you get to a leaf, look at the expression that is currently there. Using a SAT solver, attempt to find a variable assignment that produces different results for the two expressions. If the solver doesn’t find one, the expressions are equivalent. Merge them. If the solver *does* find one, replace the leaf with a split labelled with the set of variables that are true in that assignment (every other variable being false), with the two expressions as its leaves.
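
Putting the walk and the leaf step together, an insert could be sketched as below. To keep the example self-contained, `find_difference` is a brute-force stand-in that tries every assignment; it is *not* a SAT solver, and a real version would hand the question to one. Expressions are assumed to be nested tuples, all names are invented for this example, and the union-find merge is left out (the function just returns the expression already in the tree).

```python
from itertools import combinations

TRUE, FALSE = ("true",), ("false",)


def evaluate(expr, true_vars):
    tag = expr[0]
    if tag == "true":
        return True
    if tag == "false":
        return False
    if tag == "var":
        return expr[1] in true_vars
    if tag == "not":
        return not evaluate(expr[1], true_vars)
    return evaluate(expr[1], true_vars) and evaluate(expr[2], true_vars)  # "and"


def variables(expr):
    if expr[0] == "var":
        return {expr[1]}
    if expr[0] in ("not", "and"):
        return set().union(*(variables(child) for child in expr[1:]))
    return set()


def find_difference(a, b):
    """Stand-in for the SAT call: a set of true variables on which a and b
    disagree, or None if they agree everywhere. Exponential; illustration only."""
    names = sorted(variables(a) | variables(b))
    for size in range(len(names) + 1):
        for chosen in combinations(names, size):
            if evaluate(a, frozenset(chosen)) != evaluate(b, frozenset(chosen)):
                return frozenset(chosen)
    return None


class Node:
    """A leaf has expr set; a split has true_vars, left and right set."""

    def __init__(self, expr=None, true_vars=None, left=None, right=None):
        self.expr, self.true_vars = expr, true_vars
        self.left, self.right = left, right


def new_tree():
    return Node(true_vars=frozenset(), left=Node(FALSE), right=Node(TRUE))


def insert(root, expr):
    """Return the equivalent expression already in the tree, or expr itself
    after adding it as a new leaf."""
    node = root
    while node.true_vars is not None:  # still at a split
        node = node.right if evaluate(expr, node.true_vars) else node.left
    witness = find_difference(node.expr, expr)
    if witness is None:
        return node.expr  # equivalent: this is where the merge would happen
    old_leaf, new_leaf = Node(node.expr), Node(expr)
    if evaluate(expr, witness):
        node.left, node.right = old_leaf, new_leaf
    else:
        node.left, node.right = new_leaf, old_leaf
    node.expr, node.true_vars = None, witness  # mutate the leaf into a split
    return expr
```

For example, after inserting `("var", "x")`, inserting its double negation walks to the same leaf and comes back as `("var", "x")`.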

Anyway, once you have this data structure, you can always cheaply convert an expression that you’ve already constructed to the simplest equivalent expression you’ve seen so far. Testing equivalence of expressions then becomes equally cheap – just simplify both and see if you get the same answer.

This is of course a value of cheap that involves having run a large number of calls to a SAT solver to get to this point, but for the case where you’ve got a large number of small expressions which you’re going to want to then pass to a SAT solver all at once later I think that’s probably not too bad – passing the individual expressions to a SAT solver is extremely cheap, and then you just convert the whole batch to CNF together rather than trying to insert them into the data structure.

Notes:

- The tree will almost certainly get unbalanced. Occasional rebalancing probably becomes necessary. I’m not sure what the best way to do this is going to be, as it’s not actually an ordered tree in the traditional sense. I think something splay-tree-like could be made to work.
- You have to decide what constitutes a “better” expression in an implementation of this, but some sort of heuristic about number of reachable distinct subexpressions is a good start.
- It is probably worth putting some effort into minimizing and reusing the expressions you get back from the SAT solver. It depends a bit on what the hit rate is likely to be.
- There are some optimizations you can do to make the evaluation a bit more efficient that I haven’t gone into here.
- There are some simple algebraic rewrite rules that are worth performing before farming anything out to the SAT solver.
- You can also optimize a little based on the fact that literals, variables and their negations are always distinct from each other.
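
To give a flavour of the algebraic rewrite rules mentioned in the notes, here is a sketch of "smart constructors" that apply a few standard identities before anything reaches the solver. The particular set of rules, and ordering the operands of a conjunction by `repr`, are choices made for this example; expressions are again assumed to be nested tuples.

```python
TRUE, FALSE = ("true",), ("false",)


def not_(a):
    if a == TRUE:
        return FALSE
    if a == FALSE:
        return TRUE
    if a[0] == "not":
        return a[1]  # double negation
    return ("not", a)


def and_(a, b):
    if a == FALSE or b == FALSE:
        return FALSE  # anything and false is false
    if a == TRUE:
        return b      # true is the identity for and
    if b == TRUE:
        return a
    if a == b:
        return a      # idempotence
    if a == ("not", b) or b == ("not", a):
        return FALSE  # contradiction
    # Conjunction is commutative, so pick one fixed operand order.
    a, b = sorted((a, b), key=repr)
    return ("and", a, b)
```

None of this replaces the solver; it just means cases like `and_(x, not_(x))` never get as far as the tree.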
