# java.lang.String#hashCode

Apologies in advance. This is a really boring post.

For reasons you either know about by now or don’t care about, I was curious as to how well String’s hashCode was distributed (I suspected the answer was “not very”). I ran a few quick experiments to verify this.

For your amusement, here is a list of all hash collisions between alphanumeric strings of size 2: http://www.drmaciver.com/collisions.txt and here is a list of all which don’t collide with any others http://www.drmaciver.com/noncolliding.txt

Some statistics: There are 3844 alphanumeric strings of size 2. Of these 3570 collide with at least one other string. That is, 274 of these strings (or about 7% of them) *don’t* collide with something else.

Oh well. It’s a good thing no one would be stupid enough to rely on hashCode to distinguish the contents of two objects.

For what it’s worth, even fewer strings have unique hash codes for 3 characters. 3948 don’t collide, or about 1.6% of them.

This of course doesn’t mean that probability of a hash collision is really high. In reality it’s acceptably low. It’s just a demonstration that it’s not hard to find colliding pairs.

The following consist of all the String hash collisions contained in the sowpods scrabble dictionary:

• isohel, epistolaries
• righto, buzzards
• hierarch, crinolines
• inwork, hypercatalexes
• wainages, presentencing
• variants, gelato
• misused, horsemints
• trichothecenes, locular
• pomatoes, eructation
This entry was posted in programming and tagged , on by .

# Turn your toString methods inside out

All examples in this post will be written in a pseudo-dialect of Scala. Hopefully they should be easy to translate into your favourite programming language (or Java). I also haven’t bothered to compile any of them as they’re mostly not entirely valid. Feel free to point out errors.

Consider the following code:

class List[T]{
// list implementation

override def toString : String = {
val it = this.elements;
var result = "[";

while(it hasNext){
result = result + (it next);
if (it hasNext) result = result + ", ";
}
result + "]"
}
}


What’s wrong with it?

Well, as you presumably know, concatenating two strings of length m and n is an O(m + n) operation (In Haskell or ML it would be an O(m) operation, so this can be made more efficient, but the basic point will still remain). This means we’ve accidentally made an O(n^2) toString algorithm. Oops.

class List[T]{
import java.lang.StringBuilder;
// list implementation

override def toString : String = {
val it = this.elements;
var result = new StringBuilder();

while(it hasNext){
result.append(it next);
if (it hasNext) result.append(", ");
}
result.append("]").toString;
}
}


Great! We’ve removed all those expensive string concatenations.

Now, what happens if we call toString on a List[List[String]]? Umm…

Now, consider the following code snippet:

  println(myReallyLongList);


Let’s unpack what’s going on in it.

  val it = myReallyLongList.elements;
var result = new StringBuilder();

while(it hasNext){
result.append(it next);
if (it hasNext) result.append(", ");
}
println(result.append("]").toString);


So, we’ve created a big intermediate string via a StringBuilder, then printed it, discarding the string after that. Right?

Wouldn’t it be great if we’d written the following code instead?

  val it = myReallyLongList.elements;

while(it hasNext){
print(it next);
if (it hasNext) print(", ");
}
println("]");


No intermediate structures created at all. And note that the code used to print is almost exactly the same as the code used to append to the StringBuilder.

Conveniently there’s a useful little interface in java.lang which people tend to ignore. If not, we’d have had to write wrappers. In particular this is a superclass of Writer, PrintStream, StringBuilder and StringBuffer. So, let’s rewrite the above code:

class List[T]{
import java.lang.StringBuilder;
// list implementation

def appendTo(ap : Appendable){
val it = this.elements;

while(it hasNext){
ap.append(... // err. What do we do here?


We could just do ap.append(it next toString). But that doesn’t solve the first problem – when we nest these things we’re creating a lot of intermediate strings and then immediately throwing them away, not to mention having once again introduced a hidden O(n^2) factor. Sadness. :(

Let’s do the following:

  trait Append{
def appendTo(ap : Appendable) : Appendable;

override def toString = appendTo(new java.lang.StringBuilder()) toString;
}

object Appending{
def append(any : AnyRef, ap : Appendable){
if (any.isInstanceOf[Append]) any.asInstanceOf[Append].appendTo(ap);
else ap.append(any toString)
}
}


Now we can write it as:

class List[T] extends Append{
import Appending._;
import java.lang.StringBuilder;
// list implementation

def appendTo(ap : Appendable) = {
val it = this.elements;
while(it hasNext){
append(it next, ap);
if (it hasNext) ap append(", ");
}
ap.append("]");
}
}


Now, no matter how deeply we nest things, we’ll get things printed in a manner with completely consistent performance – no hidden gotchas.

There are also other benefits to structuring things this way. If you make everything work based on an API that looks like this you’ll tend to write things which work by injecting filters in reading and writing code. And, hey, suddenly all your code works completely transparently when you discover that you need to work with things that are e.g. read off the network, backed by something on the file system, etc. and really need a streaming version of the library.

Also note that I’m not saying “Strings are bad”. There are a lot of cases where what you need really is a persistently available string. Then, by all means, use toString! But even then this is helpful, as your toString code will work a lot better and more consistently than it might otherwise have done.

This entry was posted in programming and tagged , , , , on by .