David R. MacIver's Blog: Left folds with early termination

Left folds with early termination

18 March 2009

Consider what should be the basis of a general purpose collections API. In Scala it is Iterable, which uses an Iterator to define a foreach method. Most other things are defined in terms of this (note: This statement is untrue. Many of them use the iterator directly. But it is morally true).

This is all icky and imperative. Yuck.

As every good functional programmer knows, the true notion behind this silly “iteration” thing is actually folding.

The JVM is mean to us and doesn’t support lazy evaluation or variable sized stacks, so foldRights tend to be an exercise in pain: They stack overflow distressingly soon. So let’s base our collections API on foldLeft: Most designs of it are tail recursive (or iterative. Sigh). As well as dealing with JVM limitations this lets us develop a very high performance interface, and we can do pleasant things like dealing with very large lazy collections in constant space[1]

So:

trait Foldable[T]{
 def foldLeft[S](s : S)(f : (S, T) => S) : S;
}

There. That’s a nice basic design.

What methods can we add to this?

Well the obvious thing we can do to placate the Java programmers:

def foreach(f : T => Unit) = foldLeft[Unit](())((s, t) => f(t))

Now let’s define some stuff we care about. One handy method on Iterable is find. So let’s add that.

def find(f : T => Boolean) : Option[T] =
   foldLeft[Option[T]](None)((s, t) =>
     s match {
       case Some(s) => Some(s);
       case None if f(t) => Some(t);
       _ => None
  })

Nice, isn’t it?

So what happens if we do Stream.from(1).find(_ > 10).

It loops forever. :-(

Unfortunately our nice efficient foldLeft based implementation always eats the entire collection. This is sad: We can’t deal with infinite collections.It’s also really inefficient for large collections where we don’t need the whole thing.

So let’s add a way of stopping when we’re done:

def foldLeft[S](s : S)(f : (S, T) => Option[S]) : S;

The idea is that the Option is used to signal a stop condition. We return a None when we’re done consuming the collection, and the foldLeft stops.

Rewriting find in terms of this:

def find(f : T => Boolean) : Option[T] =
   foldLeft[Option[T]](None)((s, t) =>
     s match {
       case Some(s) => None;
       case None if f(t) => Some(Some(t));
       _ => Some(None)
  })

Ok. That’s a bit gross. But that’s purely a function of the fact that we were already trying to return an Option[T]. I’m sure for normal code it will look ok.

What happens is that once we’ve already found something, we now know we can stop, so we return a None to indicate stopping.

Hmm. It would be nice if we could incorporate some of this functionality into find itself. Let’s consider the following example:

(1 to 100000000 by 3).find(_ == 11)

We know this collection is ordered, so once we’ve gotten to 13 we know we’re never going to find 11. But find keeps on trucking till it hits the end of the collection. :-(

So, let’s modify find to support early termination as well. First we change its signature to f => Option[Boolean], and then...

No, sorry. My sense of good taste just revolted.

The problem is we now have to modify all signatures that might possibly want to support early termination of the collection. And they then have to explicitly deal with passing that on to foldLeft in the right way. All just to allow callers to pretend that the collection is smaller than it really is. It would be nice if we could write this functionality once and then have all the higher order functions based on foldLeft inherit it for free.

Agreed?

Well, there fortunately *is* a way to do that. It involves some loss of type safety, but as long as foldLeft is always properly implemented this should be acceptable.

case object Stop extends Throwable;
def stop : Nothing = throw Stop;

Implementations of foldLeft then have to catch Stop and treat that point as the end of the collection.[2]

Now, let’s implement find as follows:

def find(f : T => Boolean) : Option[T] =
   foldLeft[Option[T]](None)((s, t) =>
     s match {
       case Some(s) => stop;
       case None if f(t) => Some(t);
       _ => None
  })

Only a simple modification to our original implementation, but now it correctly consumes only a minimal amount of the collection (actually it consumes one element too many, but that’s because I’ve modified the implementation from the real one we’d use for pedagogical reasons).

And, for free, we can use stop in the argument to find as well:

(1 to 100000000 by 3).find{x => val c = x compare 11; if(c > 0) stop else (c == 0)}

And we no longer continue past the point where we need to, even though we don’t find our element. Yay!

So, hopefully I have persuaded you that left folds with an early termination forms a nice basis for a collection library: Clean, functional, and efficient. What more could you hope for?

Oh, before I finish. One more little bone to throw to our Java friends to make them feel more comfortable with this design:

def break = stop;

[1] See “Towards the best collections API” for a discussion of building an API around this design (note that their solution does involve a bit more than what we talk about here, some of which is linguistically difficult in Scala without continuations).

[2] You may be making squawking noises about performance now. Don’t. It’s fine. But now is not the time for discussion of this.

Comments

Daniel Spiewak on 2009-03-18 23:25:02:

In order to support early termination, you’re actually using an exception to implement control flow. In my humble opinion, this is just as bad as implementing everything in terms of `elements` (using Iterator). Without their checked form, exceptions are really just a way of jumping to an earlier point on the stack. Not only is the jump anti-functional, but it’s almost anti-procedural, depending on your persuasion on the respective definitions of these terms. Checked exceptions are really the only truly functional exception forms (since they are just a slight adaptation of the monadic concept), but then we have to deal with all of the ugly type signatures which immediately follow, not to mention the fact that Scala doesn’t support checked exceptions.

The whole reason to go with Foldable rather than Iterable is to avoid imperative implementations. However, the only way to efficiently (and even practically, as with Stream) implement things in terms of Foldable is to sacrifice a lot of the resulting purity by adding arbitrary stack jumps. Choose your poison.

david on 2009-03-18 23:36:21:

It may help understand this post if I provide some context. I’m not actually intending to argue that Foldable is the right API to use here (though I am in favour of something similar). It’s a rebuttal to various people arguing that the use of break is evil and non-functional, demonstrating its use in support of a mostly functional API. Hence a refactor from a purely functional API to a more pleasant version involving an exception.

Personally, I’m fine with letting functional and imperative features interact and support eachother. Others seem not to be.

nuttycom on 2009-03-19 02:32:21:

Why not simply overload foldLeft to take an additional parameter, a function of the accumulator and the value which tells it when to stop?

trait Foldable[T]{
def foldLeft[S](s : S)(f : (S, T) => S)(stop: (S,T) => Boolean) : S
def foldLeft[S](s : S)(f : (S,T) => S) : s = foldLeft[S](s)(f)((xs : S, xt : T) => false)
}

Tony Garnock-Jones on 2009-03-19 10:41:21:

Hi David: you might enjoy this article which deals with a similar idea.

Martin on 2009-03-19 11:37:18:

These are some intriguing ideas, David. As you know, the current 2.8 collection design is based on foreach, with explicit breaks for early termination. To better be able to compare the two designs, I have two questions:

First, have you measured what the performance difference between the two schemes is likely to be?
foldLeft looks slightly more expensive than foreach because a result has to be propagated. On the other hand, the foreach code sometimes requires an extra indirection to store the result in an accumulator variable. So it’s not clear to me what the overall effect on peformance will be.

Second, do you propose to make `stop’ availabe to clients, or propose to use it only internally to collections? (This could be achieved by defining it as a protected in Foldable, for instance). I am a bit nervous to make `stop’ available to users in this way, because the scope of the code that’s exited is not clearly delimited. For instance someone might write:

(xs.foldLeft(z)) { ... find ( ... stop ... ) }

and expect the stop to go out to the foldLeft, not knowing that find is implemented in terms of foldLeft and will consume the stop instead. So we’d have to document very carefully which operations are `stoppable’, and this in turn will limit our implementation freedom for specialized collections. I am not sure yet the added flexibility is worth that.

Cheers

-- Martin

david on 2009-03-19 11:56:18:

nuttycom: And then overload all the other methods which I want to support premature termination too?

david on 2009-03-19 12:05:25:

Martin:

There’s some slight dishonesty in this post. It’s not actually how I propose the end result should look, but designed this way for pedagogical reasons to show that early termination is not in fact evil and imperative. In particular I think an implementation where the users implement a foreach method and foldLeft is implemented based on that can have exactly the same properties, be easier to implement and have better performance characteristics.

My intent is definitely that stop (which I do recommend calling break really) is available to user code. I’ve got quite a few examples where adding premature termination to an existing higher order function is exceedingly useful . We’d need to document the exact semantics of it in those cases of course so as to not have them merely be implementation details, but I think that’s acceptable. In cases where we’re worried about specialised collections we can simply add a comment to the effect “The default behaviour of breaking in this method is as follows, however subclasses are not required to honour this”.

My biggest concern about this is the use of it inside for comprehensions where you’ve got nested foreaches. It’s non-obvious that the stop only stops the inner loop. I don’t know what the best way around this is: I have some ideas on how to support labelled break, but I need to think about them some more.

There are definitely some usability concerns. I think most of them can be dealt with with careful design and documentation though.

Steve on 2009-03-21 17:01:51:

Wow... After reading this long article and the questionable workaround (throwing an exception, really?), the “icky and imperative” solution is looking pretty darn good to me.

david on 2009-03-21 17:09:22:

Steve, meet clue. Clue, steve. I don’t believe the two of you are acquainted.

(Hint: When someone above the age of 6 uses the word “icky” the chances are fairly high that they’re not serious)

david on 2009-03-21 17:28:53:

I additionally wish to note that the common aversion to using exceptions for control flow (particularly in library code) is nonsensical and without grounds.

Steve on 2009-03-21 18:11:49:

Responding to criticism with ad hominem attacks... how classy, David.

And speaking of clue, controlling the flow of execution with exceptions has been frowned upon for as long as exceptions have been popular (let’s say twenty years or so), and some of your commenters have already mentioned this to you. Please read up on the abundant literature on the topic and you will understand why your solution is not showing functional programming in a very good light.

But if you want to educate yourself, make your next blog post about “why exceptions are good for control flow” and I will get some popcorn while I read the comments.

david on 2009-03-21 18:28:48:

[citation fucking well needed]. Make an argument or provide a reference. Don’t airily wave your hands and claim that the argument has already been won. Claiming the existence of “abundant literature” does not make it so.

The paranoia about exceptions for control flow is yet another symptom of people picking up a piece of advice that is good enough in one particular context (Using exceptions for control flow in C++ is not a good idea) and then considering it to be writ in stone for all eternity. Languages that don’t force you to bend over backwards to write exception safe code have no such difficulties. Further, exceptions on the JVM (and many other implementations) are very cheap, so the “don’t use exceptions for control flow because they’re slow” arguments are just as invalid.

Steve on 2009-03-21 18:58:50:

Mmh can’t post comments any more?

Steve again on 2009-03-21 19:00:02:

For some reason, I can’t seem to post the more lengthy response I composed.

Steve again on 2009-03-21 19:00:22:

[Trying again]

Can you please not swear? It really doesn’t help the debate.

Who said that the reason to not use exceptions for control flow is because they’re slow? Exceptions are not slow in C++ and when they are used for control flow, the exception is thrown exactly once, which has close to no performance impact on the overall execution. You really, really need to do some homework on the subject.

Using exceptions for control flow is widely accepted as a code smell because exceptions are designed to be used when something exceptional happens. A loop ending is not an exception condition.

There are a lot of other reasons why you should in general not do that and I’m not going to enumerate them in a blog comment, but since you wanted a few references, here is a good discussion:

http://stackoverflow.com/questions/465953/throwing-exceptions-to-control-flow-code-smell

A few other references:

http://www.sturmnet.org/blog/2005/03/29/exceptions-flow
http://www.subbu.org/blog/2005/05/why-not-use-exceptions-to-control-flow

And probably the best way to do some catch up:

http://tinyurl.com/c2kncy

What makes me angry about your article is that it gives us (functional programmers) a bad name by making everyone think we don’t know much about computer science fundamentals.

David R. MacIver » Blog Archive » Exceptions for control flow considered perfectly acceptable, thanks very much on 2009-03-21 19:19:31:

[...] flow mechanism for collections makes me a bad person. This idiocy touches on a sore point for me. I have been invited to write a post on “Why exceptions for control flow are a good idea”. So. Here it [...]

david on 2009-03-21 19:19:45:

Your post got trapped by the spam filter because of the large number of URLs. I’ve approved it now.

Exceptions are in fact quite slow in C++, and the arguments about exceptions for control flow constantly harp on about performance. It is in fact a very common argument. In particular, note that it is mentioned in two of the three links you’ve provided.

You will further note that you said that it appeared in “literature”. I promptly googled. It doesn’t. It appears in a bunch of blog posts and random internet blithering. Nowhere is a useful argument presented. This includes all three of your links, all of which I had already found and dismissed. In particular searching scholar.google.com produces no valid results.

What makes me angry about your comments is that it gives us (programmers) a bad name by making everyone think that we’re superstitious idiots who repeatedly assert a fear of techniques that work just fine without being able to back it up with a valid argument.

Fetch your popcorn: /2009/03/exceptions-for-control-flow-considered-perfectly-acceptable-thanks-very-much/

Eric on 2009-03-23 01:39:23:

Hi David,

I’m wondering if Nuttycom suggestion is not the right way to go. Why not indeed overload every collection method with a termination function where needed?

I don’t necessarily mind using exceptions for control flow but as an API, it feels a bit weird:

- the “stop” or “break” method has to be imported separately from the collection import. So each time, I’m using a collection, I have an additional import, just for flow control.

- there may be some scope confusion as Martin says. That may not a big deal but IIRC, programming Groovy on and off, I’ve had trouble remembering what would be the effect of a “return” inside a “collect” operation.

So adding a supplementary termination function where this is useful is a bit more work on the library designer side but, as a client, it gives me a more “functional feeling” when using the library, with less to think about.

Eric.

david on 2009-03-23 08:55:49:

Frankly, because it’s an awful idea. It’s less flexible and more work for both the library designer and the end user. You’ll get over squeamishness in the long run, you won’t get over a bad API design.