Domain-specific evaluation orders

Advance warning: I’m super pleased with this idea. I apologise if I’m unnecessarily smug about it as a result.

My extremely basic benchmark for intmap is inserting 10 million integers counting down and then printing out its length. I’ve been trying to get it to beat a similar benchmark on Haskell code. I was doing pretty well, getting incremental improvements with custom memory allocators (I’m about 80% sure the main benefit Haskell has here is its garbage collector), and would probably have eventually got it down to about the same speed as Haskell by more or less emulating its memory allocation patterns (and yes, I’m aware of the humour of trying to optimise C code to be as fast as Haskell code).

Then I had a clever idea. Screw “If you can’t beat ‘em, join ‘em”. If you can’t beat ‘em, cheat.

Fundamentally the problem with this benchmark is that building up an intmap through progressive insertions is the wrong evaluation order (note though that it’s the evaluation order of IntMap.fromList in Haskell, so this isn’t as contrived as it sounds). Ideally what you want to do is always be merging small maps together.

I decided to steal a merge strategy from timsort. The way timsort handles merges is that it maintains a stack of things to merge, \(x_1\) through \(x_n\), which satisfies the following two invariants:

  1. \(|x_i| \geq |x_{i+1}|\)
  2. \(|x_i| \geq |x_{i+1}| + |x_{i+2}|\)

Whenever pushing something onto the stack would violate these invariants, we perform stack-collapsing merges: if the first invariant is violated, we merge the head into its predecessor; if the second is violated, we merge the 2nd and 3rd elements together and move the head up one.

This means that the size of the largest element grows at least as fast as the Fibonacci number of the stack height, so you only need quite a small stack to hold more than you can possibly fit. I use a stack of size 128, which is excessive (this could fit an intmap of size up to \(2^{88}\), and intmap only supports 64-bit keys), but that’s partly because I want to be able to build up larger intermediate stacks.
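As a sketch of the push/collapse logic above — tracking only sizes rather than actual intmaps, with all names invented for illustration (the real builder stores maps and calls the union operation where this just adds sizes) — it looks something like this:

```c
#include <assert.h>
#include <stddef.h>

#define STACK_MAX 128

typedef struct {
    size_t sizes[STACK_MAX]; /* stand-ins for intmaps: we track only |x_i| */
    size_t height;
} merge_stack;

/* Merge x_i with x_{i+1}. Here "merging" just adds the sizes (an upper
   bound: a real union can be smaller when keys overlap), then closes
   the gap in the stack. */
static void merge_at(merge_stack *s, size_t i) {
    s->sizes[i] += s->sizes[i + 1];
    for (size_t j = i + 1; j + 1 < s->height; j++)
        s->sizes[j] = s->sizes[j + 1];
    s->height--;
}

/* Restore the two invariants from the post after each push. */
static void collapse(merge_stack *s) {
    for (;;) {
        size_t n = s->height;
        if (n >= 3 && s->sizes[n - 3] < s->sizes[n - 2] + s->sizes[n - 1])
            merge_at(s, n - 3);  /* 2nd invariant violated: merge 2nd and 3rd */
        else if (n >= 2 && s->sizes[n - 2] < s->sizes[n - 1])
            merge_at(s, n - 2);  /* 1st invariant violated: merge head down */
        else
            break;
    }
}

static void push(merge_stack *s, size_t size) {
    s->sizes[s->height++] = size; /* collapse keeps height well below STACK_MAX */
    collapse(s);
}

/* Force the final value: merge whatever is left, from the top down. */
static size_t finish(merge_stack *s) {
    while (s->height > 1)
        merge_at(s, s->height - 2);
    return s->height ? s->sizes[0] : 0;
}
```

The Fibonacci growth shows up directly: pushing a thousand unit-sized elements never gets the stack much past fifteen entries high.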

I’ve introduced an intmap_builder type which basically maintains this stack so you can build up a whole bunch of unions like this without worrying about the evaluation order. Changing my performance benchmark so that it used this strategy immediately blew the Haskell implementation out of the water by a factor of 2 (I was previously about 60% slower).

But that’s properly cheating. I’m now winning the benchmark because I’m not executing the same program! That’s not remotely reasonable.

But… what if I could make it so that when I do execute the same program, behind the scenes it executes the faster version with intmap_builder?

How could I perform this black magic?

It turns out that I have an advantage that the Haskell version doesn’t: I can tell when I don’t care about intermediate results.

There are two features of the intmap API which are quite nice: firstly, everything is reference counted; secondly, all union operations implicitly decref their arguments (the way you’re supposed to treat this is that every intmap has a uniqueness type which is absorbed by arithmetic operations, and if you want to reuse a value you must pass it to copy. The abstraction is a little leakier than this).

Why is this relevant? Because it means that if an intmap with a refcount of 1 is passed to a union operation we know that we do not care about this intmap for any other reason than the result of this union. In particular we are free to mutate it arbitrarily because any other references to it are now invalid. Woo!
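A minimal sketch of that observation, with a stand-in struct rather than the real intmap (the names, and “union” being a length addition, are illustrative only):

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for an intmap: a refcount plus a length, which is all this
   sketch tracks. Not the real intmap API. */
typedef struct {
    int refcount;
    size_t length;
} intmap;

static intmap *intmap_new(size_t length) {
    intmap *m = malloc(sizeof *m);
    m->refcount = 1;
    m->length = length;
    return m;
}

/* A union that consumes both of its arguments (the implicit decref from
   the post). The interesting branch: a refcount of 1 means nobody else
   can observe the left argument, so we are free to mutate it in place. */
static intmap *intmap_union(intmap *left, intmap *right) {
    intmap *result;
    if (left->refcount == 1) {
        result = left;                /* uniquely owned: reuse it */
        result->length += right->length;
    } else {
        result = intmap_new(left->length + right->length);
        left->refcount--;             /* was > 1, so left is still alive */
    }
    if (--right->refcount == 0)
        free(right);
    return result;
}
```

The shared case takes the slow path: if someone else still holds a reference to the left argument, we must allocate a fresh result and merely release our share.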

And this means that if we can disguise an intmap_builder as an intmap then when we know that it’s an owned argument of a union we can just push the right hand side onto it and the evaluation order would be sorted out. That’s pretty cool.

We then need to make sure that as soon as you incref it or need to do anything to it that observes its actual value it turns into a real intmap and not a stack of intmaps, but that’s not too bad. There’s a bunch of book-keeping to do around copying and redirects (we can’t always replace an intmap with the result of its union, because we might not own the result of its union). If you care about the details you can read the commit.
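One way this disguise might look — a toy sketch, not the actual commit: the tagged-union layout, the names, and the size-only “merge” are all invented for illustration, and the real book-keeping around copying and redirects is elided entirely:

```c
#include <assert.h>
#include <stddef.h>

/* Every value is either a real map or a disguised builder holding a
   stack of pending unions. */
enum kind { CONCRETE, BUILDER };

typedef struct {
    enum kind kind;
    size_t length;        /* valid once kind == CONCRETE */
    size_t pending[128];  /* stand-in merge stack: sizes of pending maps */
    size_t pending_count;
} intmap_value;

/* Union into an owned (refcount 1) left-hand side: instead of merging
   now, just push the right-hand side onto the disguised builder. */
static void union_into(intmap_value *left, size_t right_size) {
    if (left->kind == CONCRETE) {
        /* first union: quietly turn the map into a builder */
        left->pending[0] = left->length;
        left->pending_count = 1;
        left->kind = BUILDER;
    }
    left->pending[left->pending_count++] = right_size;
    /* a real version would run the timsort-style collapse here */
}

/* Anything that observes the value (size, equal, incref, ...) forces it
   back into a concrete map first. */
static size_t force_length(intmap_value *v) {
    if (v->kind == BUILDER) {
        size_t total = 0;
        for (size_t i = 0; i < v->pending_count; i++)
            total += v->pending[i];  /* stand-in for merging the stack */
        v->length = total;
        v->kind = CONCRETE;
    }
    return v->length;
}
```

From the caller’s point of view nothing changes: unions go in, a length comes out, and the value only ever looks like a builder in the window where nobody is watching.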

It works, too. It’s not quite as fast as just using the intmap_builder directly, mainly because there’s a lot of additional checking that has to be performed (I expect to be able to optimise this somewhat, though it will always be slower), but it’s still significantly faster than the Haskell version.

But details aside, this is I think an instance of a very powerful general technique: When you have a uniqueness type system and a bunch of algebraic rules, you can tell when you don’t care about the result of a sub-expression and then you can use your algebraic rules to massively rearrange the expression for a more efficient evaluation order. This is essentially a query optimiser: We have an algebraic expression which we want to evaluate in the most efficient way possible and we apply a bunch of rules to achieve this effect. The neat thing is that because of the way the API is structured we can apply this query optimiser without the user having to know about it: Every intmap value can be either a (somewhat reduced) abstract set of operations for combining intmaps, or it can be a concrete intmap with all the benefits thereof. It remains the former until you need it to be the latter, at which point the query is fully and transparently evaluated.

I’m not sure this idea works without a great deal of low-level access to your implementation of laziness. If every operation is destructive and the copy operation returns a pair rather than leaving the original untouched then you can make it work, but the fact that you can force within a non-destructive function like size or equal is quite powerful and makes the API much nicer to use. I’d very much like to see an implementation of this sort of idea in a more functional language, as I’m currently not really sure how it could work.

In this case the algebraic rule I used was the associativity of union (note that order is preserved: intmap unions are not commutative because they have values and the union is right biased), but there are others. In particular I plan to implement similar strategies for intersection and difference: intersection is associative, but additionally, in a collection of n things being intersected, all but the leftmost one can be commuted (because we don’t care about their values, but we do care about the values of the leftmost one). This in particular lets us do intersection by adopting a smallest-first evaluation order – we always intersect the leftmost item with the smallest of the remaining items. This is both intrinsically more efficient and potentially lets us shortcut much earlier, because the smallest one may reduce the leftmost item to empty, at which point we just release the remaining values, return empty, and delight in our O(lots) speed-up.
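A sketch of that smallest-first order, again with sizes standing in for actual maps (all names invented; `intersect_sizes` models the worst case, where the result is as large as the smaller operand):

```c
#include <assert.h>
#include <stddef.h>

/* Intersecting maps of sizes a and b yields at most min(a, b); this
   stand-in models exactly that worst case. */
static size_t intersect_sizes(size_t a, size_t b) {
    return a < b ? a : b;
}

/* Intersect the accumulator (the leftmost item, whose values we keep)
   with the smallest remaining operand each round, bailing out as soon
   as the accumulator becomes empty. */
static size_t intersect_all(size_t acc, size_t *sizes, size_t n) {
    while (n > 0) {
        size_t best = 0;
        for (size_t i = 1; i < n; i++)       /* find smallest remaining */
            if (sizes[i] < sizes[best]) best = i;
        acc = intersect_sizes(acc, sizes[best]);
        sizes[best] = sizes[--n];            /* swap-remove the operand */
        if (acc == 0) break;                 /* O(lots) shortcut: release the rest */
    }
    return acc;
}
```

The reordering is legal precisely because of the commutativity argument above: only the leftmost operand’s values matter, so the rest can be visited in whatever order shrinks the accumulator fastest.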

I haven’t thought through the consequences of this in detail yet, but I’m pretty excited about this idea. It’s essentially a general purpose query optimiser for a certain class of purely functional datastructures, and I think it might have the potential to be a seriously good one.

This entry was posted in Code.
