# New collections in Scala 2.7.2

Astute observers might have noticed my name on the release note for the latest version of Scala. Alas, not because I’ve been involved in reengineering the compiler from the ground up and making Scala into a dependently typed purely functional language with backwards in time debugging, but because I’ve added a bunch of collections implementations to the standard library. This post is just a brief introduction to them.

Performance disclaimer: All numbers I mention in this post refer to specific results in specific tests run on one computer. Consequently they’re at most a guideline and shouldn’t be taken as hard and fast rules about performance.

## Immutable collections

There are a number of immutable map implementations included. The principal inspiration for this is that I have a bit of a hatred of the standard immutable HashMap implementation. It’s mutable behind the scenes, which means that it has to synchronize in weird ways and is thus unsafe for passing between multiple threads. Additionally it means that it has a lot of correctness problems, and has performance characteristics more like those of a mutable map (in particular a very low degree of sharing between distinct instances).

All three immutable map implementations I’ve provided are truly immutable. They don’t use mutation behind the scenes and will have a fairly high degree of sharing between distinct instances. They don’t synchronize and should generally be rather more reliable.

Unfortunately they’re also somewhat slower. The numbers are all a bit suspect, as they depend heavily on the number of cores, cache size and so on, and so vary from computer to computer. But because the implementation is tree based rather than array based, they may be about a factor of two slower on get and update. (On the other hand, Ismael Juma has reported them being somewhat faster on get in some tests on his machine. Like I said, the numbers are hard to make precise.) Curiously, and in some cases I’m not entirely sure why, they appear to be dramatically faster on bulk operations like map, foreach and filter. Additionally, in algorithms which can benefit heavily from sharing they may give you a better algorithmic complexity.

When should you use these? Well, I’m probably going to use them by default when I want immutable maps, and not just because I’m biased. :-) They’re never dramatically slower than the standard immutable HashMap and are occasionally dramatically faster. Additionally there are a lot of nice correctness advantages (at least, I hope so! I certainly don’t promise that they’re bug free, but I’ve tested them reasonably heavily).

As a general point of advice: none of the immutable collections perform all that well. If you need to build a collection once and then never modify it, and you have fairly stringent performance requirements, you may be better off building a mutable one and then calling readOnly on it. It won’t give you an order of magnitude speedup, but it will definitely be faster.

### immutable.IntMap & immutable.LongMap

This is a pair of data structures specialised for integer and long keys respectively, implementing Okasaki and Gill’s “Fast Mergeable Integer Maps”. As well as the standard operations for immutable maps they offer a number of additional methods for merging two maps together (most of which I think should probably be added to immutable.Map).

### immutable.TreeHashMap

TreeHashMap should be a drop-in replacement for the standard immutable.HashMap. It is implemented as an IntMap from hash codes to lists of key-value pairs (well, it’s not literally a List[(K, V)], but it’s the same idea).

## Mutable collections

By and large there aren’t correctness problems with Scala’s mutable collections. The main problem with them is that they’re a bit slow. So I added two alternative implementations which should be significantly faster.

### ArrayStack

The most natural way to build an iterator over a tree is to use a stack to simulate recursion, so that’s what I did when I built the iterator for IntMap. To my great surprise it was incredibly slow: about 5 times slower than HashMap’s iterator. So I rewrote it to use a special-purpose implementation of a fixed depth stack and it promptly sped up by an order of magnitude (so it’s now about twice as fast as HashMap’s).

This was a little shocking, so I investigated and discovered that the mutable stack in the standard library was not exactly brilliantly implemented and was based on a linked list. This was a bit sad, so I provided this implementation based on a growable array. It also contains some useful methods for stack shuffling that the existing one doesn’t.
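To make the "growable array" idea concrete, here is a minimal sketch in Java. It is not the code of Scala's ArrayStack: the class name `GrowableArrayStack` and the `dup` and `swap` helpers are made up for illustration, as stand-ins for the kind of stack shuffling mentioned above. The point is only that push and pop touch a single array slot instead of allocating a list cell.

```java
import java.util.Arrays;
import java.util.NoSuchElementException;

// Hypothetical illustration of a stack backed by a growable array.
// Names and methods are assumptions, not the Scala library's API.
final class GrowableArrayStack<T> {
    private Object[] items = new Object[4];
    private int size = 0;

    public void push(T x) {
        // Double the backing array when full, so push is amortised O(1).
        if (size == items.length) items = Arrays.copyOf(items, size * 2);
        items[size++] = x;
    }

    @SuppressWarnings("unchecked")
    public T pop() {
        if (size == 0) throw new NoSuchElementException("empty stack");
        T top = (T) items[--size];
        items[size] = null; // drop the reference so it can be collected
        return top;
    }

    @SuppressWarnings("unchecked")
    public T peek() {
        if (size == 0) throw new NoSuchElementException("empty stack");
        return (T) items[size - 1];
    }

    // Duplicate the top element.
    public void dup() {
        push(peek());
    }

    // Exchange the top two elements in place.
    public void swap() {
        if (size < 2) throw new NoSuchElementException("need two elements to swap");
        Object top = items[size - 1];
        items[size - 1] = items[size - 2];
        items[size - 2] = top;
    }

    public int size() {
        return size;
    }
}
```

A linked-list stack allocates a node per push and chases a pointer per pop, which is a plausible reason for the gap described above, though I haven't measured this sketch.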

### OpenHashMap

When I was benchmarking the immutable map implementations I also tried them against the mutable implementations and discovered that jcl.HashMap was significantly faster than mutable.HashMap. This was a bit sad, so I decided to fix it by providing a faster mutable map implementation. After some experimenting I settled on this implementation, which is based on an open addressing scheme approximately borrowed from Python’s dictionary implementation.

Interfacewise, it’s just another mutable map. However it does have the nice feature of providing guarantees about certain types of mutations you can do during iteration. Specifically, you are allowed to delete keys and modify the values of existing mappings (but not add new mappings) while iterating.
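One way to see why deleting during iteration can be made safe is a sketch like the following, in Java. It is a deliberately simplified map that stores entries directly in an array with linear probing and tombstones. It is an assumption-laden illustration, not OpenHashMap's actual code (whose probing scheme differs), and every name in it is hypothetical. A removal only overwrites a slot with a marker, so nothing moves under the iterator. An insertion may trigger a resize, which is why adding mappings during iteration is not allowed.

```java
import java.util.function.Consumer;

// Hypothetical, simplified hash map with entries stored in the table itself.
// Null keys are not supported in this sketch.
final class OpenAddressingSketch<K, V> {
    private static final Object DELETED = new Object(); // tombstone marker

    private Object[] keys = new Object[8];   // null means "never used"
    private Object[] values = new Object[8];
    private int used = 0; // slots holding a key or a tombstone
    private int size = 0; // live mappings

    private static int spread(Object key) {
        int h = key.hashCode();
        return h ^ (h >>> 16);
    }

    // Returns the slot holding key, or -1. Terminates because the table
    // is kept at most half full, so an empty slot always exists.
    private int find(Object key) {
        int mask = keys.length - 1;
        int i = spread(key) & mask;
        while (keys[i] != null) {
            if (keys[i] != DELETED && keys[i].equals(key)) return i;
            i = (i + 1) & mask;
        }
        return -1;
    }

    // Inserts a key known to be absent, reusing a tombstone if one is hit.
    private void insertFresh(Object key, Object value) {
        int mask = keys.length - 1;
        int i = spread(key) & mask;
        while (keys[i] != null && keys[i] != DELETED) i = (i + 1) & mask;
        if (keys[i] == null) used++;
        keys[i] = key;
        values[i] = value;
        size++;
    }

    private void grow() {
        Object[] oldKeys = keys, oldValues = values;
        keys = new Object[oldKeys.length * 2];
        values = new Object[oldKeys.length * 2];
        used = 0;
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != null && oldKeys[i] != DELETED) insertFresh(oldKeys[i], oldValues[i]);
        }
    }

    public void put(K key, V value) {
        int i = find(key);
        if (i >= 0) { values[i] = value; return; } // update in place
        if ((used + 1) * 2 > keys.length) grow();
        insertFresh(key, value);
    }

    @SuppressWarnings("unchecked")
    public V get(K key) {
        int i = find(key);
        return i < 0 ? null : (V) values[i];
    }

    public void remove(K key) {
        int i = find(key);
        if (i >= 0) { keys[i] = DELETED; values[i] = null; size--; }
    }

    public int size() {
        return size;
    }

    // Visits every live key. The action may remove keys or update values,
    // but must not add new keys (that could resize the table mid-loop).
    @SuppressWarnings("unchecked")
    public void forEachKey(Consumer<K> action) {
        for (int i = 0; i < keys.length; i++) {
            Object k = keys[i];
            if (k != null && k != DELETED) action.accept((K) k);
        }
    }
}
```

With this shape, something like `m.forEachKey(k -> { if (k % 2 == 0) m.remove(k); })` is well defined, which is the flavour of guarantee described above.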


# Java collections and concurrency

This is a general tip about Java collections and concurrency. I’m not the best person to write about this, so I’m going to keep this post limited to a simple note, but it’s an important point which far too many people get wrong.

There are various methods in Collections such as synchronizedList, synchronizedMap, etc. These wrap non-thread-safe collections in a way that synchronizes each individual operation.

Don’t use them. Ever.

In a similar vein, never write code that looks like the following:

```java
synchronized (myMap) {
    doStuffTo(myMap);
}
```


Concurrency is not an afterthought. If you’re going to be doing concurrent programming you should be using data structures designed for concurrent use. java.util.concurrent has a number of good ones. Further, you should avoid explicitly synchronizing if at all possible and have your structures be internally thread-safe. If you try to ensure thread safety by synchronizing on the structures you’re mutating you will:

a) Make a mistake. Almost certainly. This will introduce bizarre bugs which you will have a serious headache tracking down.
b) Have worse concurrent performance than using a properly designed datastructure – e.g. a ConcurrentHashMap has finer grained locking, so it actually is possible for multiple threads to write to it in a safe manner.
c) Have really ugly code with synchronization logic spread all over the place. This is not a minor point – if your threading code is simple, it’s much easier to determine if it’s correct (although still not easy).
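
As a small illustration of the alternative, here is a sketch of a shared counter built on ConcurrentHashMap with no explicit synchronization anywhere in the calling code. The class and method names are made up for the example, and it uses `merge`, which needs Java 8 or later (on older versions `putIfAbsent` and `replace` can be combined in a retry loop to the same effect). The read-modify-write happens atomically inside the map, so there is no window in which two threads can both read the old value.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical example: several threads bump the same counter.
final class ConcurrentCounterSketch {

    static int countConcurrently(int threads, int incrementsPerThread) {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    // Atomic "insert 1, or add 1 to the existing value".
                    counts.merge("hits", 1, Integer::sum);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counts.getOrDefault("hits", 0);
    }

    public static void main(String[] args) {
        System.out.println(countConcurrently(4, 10_000)); // prints 40000
    }
}
```

Compare this with the tempting version that does `get` followed by `put` on a synchronizedMap: each call is individually locked, but the pair is not, so increments can be lost unless every caller remembers to wrap the pair in its own synchronized block. That is exactly the kind of mistake point a) is about.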
