Laziness is better when it’s visible

This is a trick I invented a while ago. I’m morally certain it’s a reinvention rather than an invention, but I’ve not really seen it in use and at the very least it doesn’t seem to be widely known. I recently ran into a situation where a library would benefit from it greatly and doesn’t use it, so I thought I would write it up.

Suppose we have a bunch of strings and we want to concatenate them.

def cat(my_strings):
    result = ''
    for s in my_strings:
        result += s
    return result

That’s how you concatenate strings, right?

Oh no, accidentally quadratic!

You can fix this using a special string buffer type, or in Python by just using "".join(my_strings), but wouldn’t it be nice if you didn’t have to? It’s often very convenient to build things up using expressions. Although it’s no great hardship for strings to do bulk operations, you run into the same problem in e.g. pulp, where you have more complex expressions (and no corresponding sum method in the library). It would be great if this all just worked.

One way to do this sort of thing is to switch to an immutable tree-based representation like a rope, where the concatenation operation has a more reasonable complexity (usually O(log(n))).

But that then comes with its own costs. Using a tree structure slows down access and iteration – only by an O(log(n)) factor, but with relatively high constants. As a result you end up paying a fairly high performance penalty for what you mostly just wanted as a convenience (ropes do have other advantages, but that’s not what we’re looking at here).

But what if you didn’t have to do any of that? What if you could get O(1) concatenation and all the benefits of the compact representation? It sounds implausible, but I’m here to tell you you can!

Or, rather, that you almost can. Both of the above are true: You do get O(1) concatenation and you do get all the benefits of the compact representation, but you do end up paying some additional cost, because the idea is to use laziness. So although concatenation is O(1) you end up paying an additional O(n) cost the first time you want to use the result. Fortunately this still avoids the problem of sums being quadratic.

The key idea is to use a data structure that can swap out its implementation on demand. It starts by just storing the abstract expression tree that led to it, and then switches to a more efficient representation as soon as you need it to.

E.g. the version of this for the above is that a string becomes a binary tree where the leaves are buffers and the branches indicate that the string is the concatenation of its left and right parts. Concatenation is then just creating a new split node, which is O(1).
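Sketched in Python (the Node and concat names here are mine, purely illustrative):

```python
class Node:
    """Either a leaf holding a flat buffer, or a branch (split node)
    recording that the string is the concatenation of left and right."""

    def __init__(self, buffer=None, left=None, right=None):
        self.buffer = buffer  # set on leaves (and later, once forced)
        self.left = left      # set on unforced branches
        self.right = right


def concat(a, b):
    # O(1): no characters are copied, we only allocate a branch node.
    return Node(left=a, right=b)
```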

Then, once we want the compact representation (which will tend to be as soon as we start doing interesting operations on the data – because the expression tree is entirely unnormalized there is very little we can usefully do to it that isn’t an O(n) operation!), we calculate that, store the result on the string and throw away the expression data that brought us here.

That is, as soon as we have forced the string, the string switches to a new representation using the forced buffer, essentially replacing the split node with a leaf node.

This feels like we’re back where we started – if you’re doing this lazily like that then you’re just summing together two string children so you’re quadratic again – but we’re not, for one very important reason: Because the implementation of the laziness is under our control, we can tell whether a string has already been forced or not. When forcing a node we then don’t force its child nodes, but instead just walk the tree and behave appropriately when we get to the leaves.

This sort of thing can be very useful, because the common case where the quadratic blow-up bites is when you have a complex expression graph and only actually care about a very small fraction of the subexpressions (e.g. in the sum case).

This isn’t always a win, in that it does behave suboptimally under some workloads (e.g. when you do care about a lot of the intermediate results but process them in the reverse of the order you created them), but it’s rarely a substantial loss and usually produces dramatic speedups by converting accidentally quadratic cases into the desired linear behaviour.

There are additional tricks you can build on top of this:

  • You can precompute some data so you don’t always have to force the structure. E.g. you can always calculate the length of the string in the above example without forcing it, and still have the operations be O(1).
  • You can sometimes have operations that only require partially forcing the data structure (e.g. if you index into a string you might only have to force one half of it, or neither if the index is out of bounds!).
  • If you have more complex operations then you can do a sort of “query optimization” to rewrite the expression tree into a more efficient execution plan. For example, a thing I’ve done in the past is when the operation is intersection you can rewrite it so that intersections are processed in order of increasing size, which often ends up with you being able to terminate early because you’ve discovered that the end result is going to be empty regardless of what happens next.
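The partial-forcing point can be made concrete. A sketch, with illustrative names of my own choosing (for simplicity the leaves here are already flat buffers): because lengths are precomputed, indexing only ever descends into the half of the tree that contains the index, and an out-of-range index needs no forcing at all.

```python
class Node:
    """Minimal tree node with a precomputed length."""

    def __init__(self, buffer=None, left=None, right=None):
        self.buffer = buffer
        self.left = left
        self.right = right
        # Length is known at construction time either way, so asking
        # for it never forces anything.
        if buffer is not None:
            self.length = len(buffer)
        else:
            self.length = left.length + right.length


def char_at(node, i):
    """Index into the string, descending only into the half of the
    tree containing position i and leaving the other half untouched."""
    if not (0 <= i < node.length):
        raise IndexError(i)  # rejected using lengths alone
    while node.buffer is None:
        if i < node.left.length:
            node = node.left
        else:
            i -= node.left.length
            node = node.right
    return node.buffer[i]
```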

Depending on circumstances, any of the above can be worth doing, but most of the big wins come from the basic trick which is almost always a valuable one if you’re running into this sort of problem.

Using this technique won’t always be an improvement – e.g. you’ll end up doing some duplicated work if you do something like x + x, because forcing the result will effectively flatten x twice – but most of the workloads on which it behaves particularly badly are ones you should probably have avoided with the normal approach anyway. The only real unavoidable cost is that the laziness adds an extra check to each operation, which can be anywhere from a small to a modest performance hit depending on how expensive the operation normally is. Additionally, if you want the operations to be thread safe then you’ll need a memory barrier of some sort (e.g. making the relevant field volatile) to get the visibility right, which adds another small hit.

So it’s not a universal win, but the cost is light enough and there are enough workloads where it improves behaviour substantially that it is often worth considering.

To finish off, and make this more concrete, here’s some Python code implementing this idea for strings:
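(The listing below is a sketch in the spirit of the description above; the LazyString name and its exact details are illustrative rather than canonical.)

```python
class LazyString:
    """A string-like value with O(1) concatenation.

    An instance is either forced (a flat buffer in _value) or an
    unforced concatenation (_left and _right children). Forcing
    happens at most once, the first time the contents are needed.
    """

    def __init__(self, value=""):
        self._value = value        # flat buffer once forced, else None
        self._left = None
        self._right = None
        self._length = len(value)  # always known without forcing

    def __add__(self, other):
        # O(1): build a split node, copying no character data.
        result = LazyString()
        result._value = None
        result._left = self
        result._right = other
        result._length = self._length + other._length
        return result

    def __len__(self):
        return self._length  # precomputed, never forces

    def __str__(self):
        if self._value is None:
            # Walk the expression tree iteratively. Crucially, a child
            # that is already forced (or is a leaf) is treated as an
            # opaque buffer rather than recursed into, so already
            # flattened subtrees are never flattened again.
            parts = []
            stack = [self]
            while stack:
                node = stack.pop()
                if node._value is not None:
                    parts.append(node._value)
                else:
                    stack.append(node._right)  # left ends up on top, so
                    stack.append(node._left)   # leaves emit left to right
            self._value = "".join(parts)
            self._left = self._right = None  # drop the expression tree
        return self._value
```

With this, the naive cat loop from the start of the post becomes linear: each += is O(1), and the single forcing pass at the end copies every character exactly once.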

(Like this post? Want to see more like it? Why not support my Patreon! You’ll get to see drafts of upcoming posts and also increase the amount I write)

This entry was posted in programming, Python.

5 thoughts on “Laziness is better when it’s visible”

  1. Tony Garnock-Jones

    Erlang’s IO Lists (no good URL, weirdly) are perhaps similar to this idea, but with the implicit forcing baked into consumers of the structure. Also perhaps more distantly related is ye olde lisp `flatten`. Another interesting connection might be Smalltalk’s `become:` and `becomeForward:`, which would allow an append node to replace all references to itself with the flattened version at any time.

    1. david Post author

      Visible in the sense that you can tell the difference between whether an instance of the type has already been evaluated or not (at least internally, it doesn’t have to be part of the public API).
