Apologies in advance. This is a really boring post.
For reasons you either know about by now or don’t care about, I was curious as to how well String’s hashCode was distributed (I suspected the answer was “not very”). I ran a few quick experiments to verify this.
For your amusement, here is a list of all hash collisions between alphanumeric strings of size 2: http://www.drmaciver.com/collisions.txt and here is a list of all which don’t collide with any others http://www.drmaciver.com/noncolliding.txt
Some statistics: There are 3844 alphanumeric strings of size 2. Of these 3570 collide with at least one other string. That is, 274 of these strings (or about 7% of them) *don’t* collide with something else.
Oh well. It’s a good thing no one would be stupid enough to rely on hashCode to distinguish the contents of two objects.
Edit: Additional facts I originally posted in the comments.
For what it’s worth, even fewer strings have unique hash codes for 3 characters. 3948 don’t collide, or about 1.6% of them.
This of course doesn’t mean that probability of a hash collision is really high. In reality it’s acceptably low. It’s just a demonstration that it’s not hard to find colliding pairs.
The following consist of all the String hash collisions contained in the sowpods scrabble dictionary:
- isohel, epistolaries
- righto, buzzards
- hierarch, crinolines
- inwork, hypercatalexes
- wainages, presentencing
- variants, gelato
- misused, horsemints
- trichothecenes, locular
- pomatoes, eructation
haha. I *wonder* where all *that* came from! P^)=
–cheeser
That’ll be 3844, not 3884.
Mind you, 3570+274=3844….
Oops. Thanks for catching the typo. Fixed now.
Sounds like you found what you were looking for. Which makes me a little suspicious… Now I’m not usually one to defend Java, by any means, but this experiment seems a little…strange. If you’ve gone this far, why not use the same infrastructure to do a quick calculation and see about 3-character strings? or 4? or….whatever. What about toString()’ing some non-string objects? Testing only 2-character strings can’t be a very good measure.
Well part of the reason I didn’t do it for much larger numbers is that it takes a longish time to run for them as the run time for calculating this is exponential in the number of characters. :-)
I did try it out for three character strings, but the results were uninterestingly similar so I didn’t bother posting them. If you really care I’ll run the results for larger numbers when I get home (I didn’t put the code anywhere central and can’t be bothered to recreate it here) and post them.
For what it’s worth, this isn’t intended as evidence that java Strings are terrible and awful. I just wanted to demonstrate that hash collisions are not remotely a rare thing.
See also:
http://developers.sun.com/learning/javaoneonline/2007/pdf/TS-2707.pdf
Puzzle #6, page 37
I once worked at a bank where the former Cobol programmers who made up the Java development staff thought you had to override hashCode() for every class you defined — and almost universally defined it as “return 1;”
I tested it with the same string size using a-zA-Z0-9. 32-bit Vista using .NET and mono.
running on .NET 3.5:
3844 two-char strings
14776336 comparisons
0 collisions with one or more items
0 total collisions
running on mono 1.9.1:
3844 two-char strings
14776336 comparisons
**3570** collisions with one or more items
5250 total collisions
http://sert.homelinux.org/projects/mono-collisions.txt
** the number is the same ^_^
even the colliding strings seem to be the same ones ^^
For what it’s worth, even fewer strings have unique hash codes for 3 characters. 3948 don’t collide, or about 1.6% of them.
Also, this of course doesn’t mean that probability of a hash collision is really high. In reality it’s acceptable low. It’s just a demonstration that it’s not hard to find colliding pairs.
Utterly pointless fact of the day. The following consist of all the String hash collisions contained in sowpods:
isohel, epistolaries
righto, buzzards
hierarch, crinolines
inwork, hypercatalexes
wainages, presentencing
variants, gelato
misused, horsemints
trichothecenes, locular
pomatoes, eructation
david: “3948 don’t collide, or about 1.6% of them”
3 characters mono 1.9.1:
again the same situation with mono 1.9.1 so I conclude java and mono use the same hashing algorithm:
http://sert.homelinux.org/projects/mono-collisions3.txt
Oh by the way may I ask which implementation and version of java we are talking about here?
It doesn’t actually matter which version I’m using. java.lang.String has a hash code implementation defined as part of the standard library specification, and it’s been the same since… not sure. 1.2 or 1.3 I think.
Audio books can be purchased online at a fraction of cost. Sometimes you can find them for free. Audible and ebook libraries and bookstores are both dedicated to their customers by providing the largest selection of audible books that are downloadable online or purchased in local stores. Some of the top providers are eBay, Amazon and several audible clubs. With millions of audio books available, various formats are used for the convenience of everyone ranging from children to adults. Some of those formats included are MP3 players, MP3 compatible mobile phones, iPods, iPhones, PDA’s, CD’s as well as kindles and android devices. Here is a list of places that offer audible books: