David R. MacIver's Blog: New collections in Scala 2.7.2

New collections in Scala 2.7.2

23 August 2008

Astute observers might have noticed my name on the release note for the latest version of Scala. Alas, not because I’ve been involved in reengineering the compiler from the ground up and making Scala into a dependently typed purely functional language with backwards in time debugging, but because I’ve added a bunch of collections implementations to the standard library. This post is just a brief introduction to them.

Performance disclaimer: All numbers I mention in this post should be assumed to be referring to specific results in specific tests run on one computer. Consequently they’re at most a guideline and shouldn’t be assumed to be hard and fast rules about performance

Immutable collections

There are a number of immutable map implementations included. The principle inspiration for this is that I have a bit of a hatred of the standard immutable HashMap implementation. It’s mutable behind the scenes, which means that it has to synchronize in weird ways and is thus unsafe for passing between multiple threads. Additionally it means that it has a lot of correctness problems, and has performance characteristics more like that of a mutable map (in particular a very low degree of sharing between distinct instances)

All three immutable map implementations I’ve provided are truly immutable. They don’t use mutation behind the scenes and will have a fairly high degree of sharing between distinct instances. They don’t synchronize and should generally be rather more reliable.

Unfortunately they’re also somewhat slower. The numbers for this are all a bit suspect as they’re extremely dependent on number of cores, cache size, etc. so vary from computer to computer, but because of the tree based implementation instead of the array based one they may be about a factor of two slower on get and update (on the other hand, Ismael Juma has reported them being somewhat faster on get in some tests on his machine. Like I said, the numbers are hard to make precise). Curiously, and I’m not entirely sure why in some cases, they appear to be dramatically faster on bulk operations like map, foreach and filter. Additionally in algorithms which can benefit heavily from sharing they may give you a better algorithmic complexity.

When should you use these? Well, I’m probably going to use them by default when I want immutable maps, and not just because I’m biased. :-) They’re never dramatically slower than the standard immutable HashMap and are occasionally dramatically faster. Additionally there are a lot of nice correctness advantages (at least, I hope so! I certainly don’t promise that they’re bug free, but I’ve tested them reasonably heavily).

As a general point of advice: None of the immutable collections performance is that great. If you need to build a collection once and then never modify it and you have fairly stringent performance requirements, you may be better building a mutable one and then calling readOnly on it. It won’t give you an order of magnitude speedup, but it will definitely be faster.

immutable.IntMap & immutable.LongMap

This is a pair of data structures specialised for integer and long keys respectively, implementing Okasaki and Gill’s “Fast Mergeable Integer Maps”. As well as the standard operations for immutable maps they offer a number of additional methods for merging two maps together (most of which I think should probably be added to immutable.Map).

immutable.TreeHashMap

TreeHashMap should be a drop in replacement for the standard immutable.HashMap. Its implementation is as an IntMap of hash codes to lists of key value pairs (well, it’s not literally a List[(K, V)], but it’s the same idea).

Mutable collections

By and large there aren’t correctness problems with Scala’s mutable collections. The main problem with them is that they’re a bit slow. So I added two alternative implementations which should be significantly faster.

ArrayStack

The most natural way to build an iterator over a tree is by using a stack to simulate recursion. So that’s what I did when I built the iterator for IntMap. To my great surprise it was incredibly slow. About 5 times slower than that of HashMap’s. So I rewrote it to use a special purposed implementation of a fixed depth stack and it promptly sped up by an order of magnitude (so it’s now about twice as fast as HashMap’s).
This was a little shocking, so I investigated and discovered that the mutable stack in the standard library was not exactly brilliantly implemented and was based on a linked list. This was a bit sad, so I provided this implementation based on a growable array. It also contains some useful methods for stack shuffing that the existing one doesn’t.

OpenHashMap

When I was benchmarking the immutable map implementations I also tried them against the mutable implementations and discovered that jcl.HashMap was significantly faster than mutable.HashMap. This was a bit sad, so I decided to fix it by providing a faster mutabe map implementation. After some experimenting I settled on this implementation which is based on an open hashing scheme approximately borrowed from Python’s dictionary implementation.

Interfacewise, it’s just another mutable map. However it does have the nice feature of providing guarantees about certain types of mutations you can do during iteration. Specifically, you are allowed to delete keys and modify the values of existing mappings (but not add new mappings) while iterating.

Comments

Daniel on 2008-08-24 08:22:35:

That’s A.W.E.S.O.M.E!

Any intentions concerning Bags and BiMaps? :-D

These would be really useful (but I haven’t the experience to write that)

Cheers,
Landei

david on 2008-08-24 09:38:59:

Daniel:

Thanks. Glad you like them. :-)

At the moment I’m mainly going to be looking at improving the implementations of different existing types of collection. I’m going to put in some similar work for sets and then immutable sequences at some point in the not too distant future. Once the collections redesign is properly underway I’ll probably look into adding new types, but until then I’d like to hold off on it.

That being said, if you want such things, I strongly encourage you to work on them! Study the existing implementations in Java, etc, read some papers, generally have a play around. It’s not really that hard, and if you don’t do it because of lack of experience then you’ll never acquire the experience :-)