java.lang.String#hashCode

July 16th, 2008

Apologies in advance. This is a really boring post.

For reasons you either know about by now or don’t care about, I was curious as to how well String’s hashCode was distributed (I suspected the answer was “not very”). I ran a few quick experiments to verify this.

For your amusement, here is a list of all hash collisions between alphanumeric strings of size 2: http://www.drmaciver.com/collisions.txt and here is a list of all which don’t collide with any others http://www.drmaciver.com/noncolliding.txt

Some statistics: There are 3844 alphanumeric strings of size 2. Of these 3570 collide with at least one other string. That is, 274 of these strings (or about 7% of them) *don’t* collide with something else.

Oh well. It’s a good thing no one would be stupid enough to rely on hashCode to distinguish the contents of two objects.

Edit: Additional facts I originally posted in the comments.

For what it’s worth, even fewer strings have unique hash codes for 3 characters. 3948 don’t collide, or about 1.6% of them.

This of course doesn’t mean that probability of a hash collision is really high. In reality it’s acceptably low. It’s just a demonstration that it’s not hard to find colliding pairs.

The following consist of all the String hash collisions contained in the sowpods scrabble dictionary:

  • isohel, epistolaries
  • righto, buzzards
  • hierarch, crinolines
  • inwork, hypercatalexes
  • wainages, presentencing
  • variants, gelato
  • misused, horsemints
  • trichothecenes, locular
  • pomatoes, eructation

13 Responses to “java.lang.String#hashCode”

  1. Justin lee Says:

    haha. I *wonder* where all *that* came from! P^)=

    –cheeser

  2. Knah Says:

    That’ll be 3844, not 3884.

    Mind you, 3570+274=3844….

  3. david Says:

    Oops. Thanks for catching the typo. Fixed now.

  4. chessguy Says:

    Sounds like you found what you were looking for. Which makes me a little suspicious… Now I’m not usually one to defend Java, by any means, but this experiment seems a little…strange. If you’ve gone this far, why not use the same infrastructure to do a quick calculation and see about 3-character strings? or 4? or….whatever. What about toString()’ing some non-string objects? Testing only 2-character strings can’t be a very good measure.

  5. david Says:

    Well part of the reason I didn’t do it for much larger numbers is that it takes a longish time to run for them as the run time for calculating this is exponential in the number of characters. :-)

    I did try it out for three character strings, but the results were uninterestingly similar so I didn’t bother posting them. If you really care I’ll run the results for larger numbers when I get home (I didn’t put the code anywhere central and can’t be bothered to recreate it here) and post them.

    For what it’s worth, this isn’t intended as evidence that java Strings are terrible and awful. I just wanted to demonstrate that hash collisions are not remotely a rare thing.

  6. Adam Goode Says:

    See also:
    http://developers.sun.com/learning/javaoneonline/2007/pdf/TS-2707.pdf
    Puzzle #6, page 37

  7. dave Says:

    I once worked at a bank where the former Cobol programmers who made up the Java development staff thought you had to override hashCode() for every class you defined — and almost universally defined it as “return 1;”

  8. Cetin Sert Says:

    I tested it with the same string size using a-zA-Z0-9. 32-bit Vista using .NET and mono.

    running on .NET 3.5:
    3844 two-char strings
    14776336 comparisons
    0 collisions with one or more items
    0 total collisions

    running on mono 1.9.1:
    3844 two-char strings
    14776336 comparisons
    **3570** collisions with one or more items
    5250 total collisions

    http://sert.homelinux.org/projects/mono-collisions.txt

    ** the number is the same ^_^

  9. Cetin Sert Says:

    even the colliding strings seem to be the same ones ^^

  10. david Says:

    For what it’s worth, even fewer strings have unique hash codes for 3 characters. 3948 don’t collide, or about 1.6% of them.

    Also, this of course doesn’t mean that probability of a hash collision is really high. In reality it’s acceptable low. It’s just a demonstration that it’s not hard to find colliding pairs.

  11. david Says:

    Utterly pointless fact of the day. The following consist of all the String hash collisions contained in sowpods:

    isohel, epistolaries
    righto, buzzards
    hierarch, crinolines
    inwork, hypercatalexes
    wainages, presentencing
    variants, gelato
    misused, horsemints
    trichothecenes, locular
    pomatoes, eructation

  12. Cetin Sert Says:

    david: “3948 don’t collide, or about 1.6% of them”

    3 characters mono 1.9.1:

    again the same situation with mono 1.9.1 so I conclude java and mono use the same hashing algorithm:
    http://sert.homelinux.org/projects/mono-collisions3.txt

    Oh by the way may I ask which implementation and version of java we are talking about here?

  13. david Says:

    It doesn’t actually matter which version I’m using. java.lang.String has a hash code implementation defined as part of the standard library specification, and it’s been the same since… not sure. 1.2 or 1.3 I think.

Leave a Reply