A glorified to do list

I’m pretty bad about remembering stuff I need to do. At work, having an issue tracker is a really useful way of keeping track of that, so I’ve decided to emulate that at home. I’ve set up a Trac instance for stuff I need to get done, both site related and random personal projects related.

This is mostly for my own use, but feel free to add stuff to it if there are things you need me to get done. I allow anonymous ticket creation and commenting. I’ll fix that if it gets too spammy.

This entry was posted in Admin on by .

New collections in Scala 2.7.2

Astute observers might have noticed my name on the release note for the latest version of Scala. Alas, not because I’ve been involved in reengineering the compiler from the ground up and making Scala into a dependently typed purely functional language with backwards in time debugging, but because I’ve added a bunch of collections implementations to the standard library. This post is just a brief introduction to them.

Performance disclaimer: All numbers I mention in this post should be assumed to be referring to specific results in specific tests run on one computer. Consequently they’re at most a guideline and shouldn’t be assumed to be hard and fast rules about performance

Immutable collections

There are a number of immutable map implementations included. The principle inspiration for this is that I have a bit of a hatred of the standard immutable HashMap implementation. It’s mutable behind the scenes, which means that it has to synchronize in weird ways and is thus unsafe for passing between multiple threads. Additionally it means that it has a lot of correctness problems, and has performance characteristics more like that of a mutable map (in particular a very low degree of sharing between distinct instances)

All three immutable map implementations I’ve provided are truly immutable. They don’t use mutation behind the scenes and will have a fairly high degree of sharing between distinct instances. They don’t synchronize and should generally be rather more reliable.

Unfortunately they’re also somewhat slower. The numbers for this are all a bit suspect as they’re extremely dependent on number of cores, cache size, etc. so vary from computer to computer, but because of the tree based implementation instead of the array based one they may be about a factor of two slower on get and update (on the other hand, Ismael Juma has reported them being somewhat faster  on get in some tests on his machine. Like I said, the numbers are hard to make precise). Curiously, and I’m not entirely sure why in some cases, they appear to be dramatically faster on bulk operations like map, foreach and filter. Additionally in algorithms which can benefit heavily from sharing they may give you a better algorithmic complexity.

When should you use these? Well, I’m probably going to use them by default when I want immutable maps, and not just because I’m biased. :-) They’re never dramatically slower than the standard immutable HashMap and are occasionally dramatically faster. Additionally there are a lot of nice correctness advantages (at least, I hope so! I certainly don’t promise that they’re bug free, but I’ve tested them reasonably heavily).

As a general point of advice: None of the immutable collections performance is that great. If you need to build a collection once and then never modify it and you have fairly stringent performance requirements, you may be better building a mutable one and then calling readOnly on it. It won’t give you an order of magnitude speedup, but it will definitely be faster.

immutable.IntMap & immutable.LongMap

This is a pair of data structures specialised for integer and long keys respectively, implementing Okasaki and Gill’s “Fast Mergeable Integer Maps”. As well as the standard operations for immutable maps they offer a number of additional methods for merging two maps together (most of which I think should probably be added to immutable.Map).

immutable.TreeHashMap

TreeHashMap should be a drop in replacement for the standard immutable.HashMap. Its implementation is as an IntMap of hash codes to lists of key value pairs (well, it’s not literally a List[(K, V)], but it’s the same idea).

Mutable collections

By and large there aren’t correctness problems with Scala’s mutable collections. The main problem with them is that they’re a bit slow. So I added two alternative implementations which should be significantly faster.

ArrayStack

The most natural way to build an iterator over a tree is by using a stack to simulate recursion. So that’s what I did when I built the iterator for IntMap. To my great surprise it was incredibly slow. About 5 times slower than that of HashMap’s. So I rewrote it to use a special purposed implementation of a fixed depth stack and it promptly sped up by an order of magnitude (so it’s now about twice as fast as HashMap’s).
This was a little shocking, so I investigated and discovered that the mutable stack in the standard library was not exactly brilliantly implemented and was based on a linked list. This was a bit sad, so I provided this implementation based on a growable array. It also contains some useful methods for stack shuffing that the existing one doesn’t.

OpenHashMap

When I was benchmarking the immutable map implementations I also tried them against the mutable implementations and discovered that jcl.HashMap was significantly faster than mutable.HashMap. This was a bit sad, so I decided to fix it by providing a faster mutabe map implementation. After some experimenting I settled on this implementation which is based on an open hashing scheme approximately borrowed from Python’s dictionary implementation.

Interfacewise, it’s just another mutable map. However it does have the nice feature of providing guarantees about certain types of mutations you can do during iteration. Specifically, you are allowed to delete keys and modify the values of existing mappings (but not add new mappings) while iterating.

This entry was posted in programming and tagged , on by .

A follow up to yesterday’s article

There seems to have been a lot of misunderstanding about what yesterday’s article was about. Mostly from people who didn’t read past the topic. So, here’s a summary version more in line with your attention spans.

FUNCTIONAL PROGRAMMING: UR DOIN’ IT RONG

This entry was posted in programming and tagged , on by .

Functional code != Good code

There’s a dangerous trap to fall into: The belief that functional code is automatically good code. Hopefully not too many people would come out and actually claim this, but it seems to be an unstated common belief. It’s also utter bollocks.

Functional programming gives you tools for writing good code. Good functional code can be very good (good imperative code can be very good too! But that’s not the point of this post). Bad functional code can be just as bad as its imperative cousin.

Now, what post about functional programming would be complete without some trivial one liners? Let’s start by summing a collection of integer elements.

Imperatively:

   var x = 0;
   for (j <- myCollection) x += j;

Functionally:

  val x = myCollection.foldLeft(0)(_+_);

The functional code is a little shorter. You’ve avoided making the sum variable mutable, which is nice. It introduces one fewer local variable which helps with reading the code (but it wouldn’t if Scala didn’t have the cut notation for anonymous functions, you’d have had to write (j, i) => j + i, which introduces *more* variables). Is the fact that you’ve used a higher order function an improvement? No, not really. The loop is a higher order function too. We could have written it:

   var x = 0;
   myCollection.foreach(x += _);

and it would have been exactly the same thing (literally. The Scala compiler emits something nearly identical to the second version when given the first version). So the only difference here is the lack of mutability. Both examples are trivial, so there isn’t a whole load of difference. But most code is just a sequence of trivialities, so considering how they’re written is valuable.

One advantage of the fold here is that you can use it as an expression without binding it to a variable. The loop doesn’t let you do that. But is that really a good thing for you to be doing?

Thing is, both of these examples are the wrong way to do it in the middle of other code. The right way to do it?

  val x = sum(myCollection);

(Unfortunately there is no sum method in the Scala standard library. A minor wart. But there’s one in e.g. scalax, or your personal utility classes, or wherever you care to look really).

Inline calculations like this should be factored out into simple methods. This is a no brainer, everyone knows it’s a good idea. And the fact that you’ve written your code in a single expression with a fold doesn’t exempt you from that.

Let’s make our trivial example very slightly less trivial. Instead of a collection of integers, I have a collection of collections. I want to find the total length.

Imperative version:

  var x = 0;
  for (c <- myCollection) x += c.size;

Now we’ll see how the functional code really shines in this example:

Functional version:

  val x = myCollection.foldLeft(0)((y, c) => y + c.size)

See how much better abstracted that is than the imperative version?

Um, no. Me neither.

When you’re doing folding like this if you don’t have an already defined function to be reusing (which doesn’t have to be a named constant, it could be the result of some other computation. It probably shouldn’t be a lambda), this really isn’t any better than imperative version. At least to my eyes I find it a bit worse, but that’s largely a matter of opinion. It’s sure as hell not better.

So how *should* you be writing it? Am I now going to say you should define a method sumLengthsOfCollections?

Fuck no.

You should solve it, as with so many other things, by reducing it to a problem you’ve already solved.

   val x = sum(myCollection.map(_.size));

Here you actually *have* reused logic properly, and the code is at least somewhat better. And if you wanted to later change the implementations of sum (Think this is a ludicrous comment? Really? What if this were acting on BigInts rather than integers. Still think it is?).

In these trivial examples it doesn’t make much of a difference, but as the number of variations and steps in the pipeline increases this style of code really starts to make a difference. Earlier today I had to deal with a method definition that involved a really large number of folds, most of them doing trivial things. I was not amused. The fact that you can fold, doesn’t mean that you should, and the fact that you’ve been given higher order functions isn’t a license to use the wrong ones and assume your code will be good anyway.

Reusing Java’s tools: Cobertura and Scala

I’ve recently been writing some collection implementations for Scala which will hopefully go in the standard library in the upcoming release. Naturally I want to make sure that I tested the hell out of them, so I’ve been writing a reasonably large number of ScalaCheck tests to test the code and make sure it does what I want it to. Earlier I realised I should probably be doing some sort of coverage checking to make sure I had tested things adequately so thought I’d see how easy it would be to plug Cobertura into a Scala project. And I have good news! The answer is: Essentially trivial.

There were a few minor issues that cropped up while doing it:

  • Cobertura strips scala specific metadata out of the class files. This means that you must compile everything first, including your unit tests, before instrumenting code.
  • Scala does a lot of name mangling and generates a lot of classes with weird names. Cobertura naturally doesn’t understand the mapping back to Scala code, so you’ll see the mangled names. They’re not usually that hard to figure out.
  • Cobertura won’t be able to find your source files if you don’t follow Java directory conventions for packages.

Other than that, it all works fine. You get per line coverage information for your source files, decent per package and per class reports, etc. No problem.

Unfortunately my coverage wasn’t nearly as good as I hoped, so I’ll probably have to write a bunch more tests before I submit a patch.

Update: Actually, the line number reporting for Cobertura + Scala appears to be really bad, and often really hard to decipher. But it’s still a nice tool, and actually helped me catch some bugs.

This entry was posted in programming and tagged , , on by .