
The Right Data

As promised, I’m now making data dumps of The Right Tool available. You can download them here. The data there will be updated daily.

This contains all the statements and languages (including ones I’ve since removed from the site), and all the rankings of them, with an identifier to distinguish users. It doesn’t contain any of the calculated data (’cause there’s a lot of it; I may open source some of the calculation code later, but in the meantime ask me if you want to reproduce the results).
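
If you want to poke at the dump programmatically, it’s not much work. Here’s a minimal sketch in Python; note that the file name and column layout (user_id, statement, language, rank) are ones I’ve made up for illustration, so check the actual dump for the real format:

    import csv
    from collections import defaultdict

    # Hypothetical layout: one row per ranking with columns
    # user_id, statement, language, rank. Check the real dump's format.
    rankings = defaultdict(list)
    with open("right-tool-dump.csv", newline="") as f:
        for row in csv.DictReader(f):
            # Group each user's rankings under the statement they answer.
            rankings[row["statement"]].append(
                (row["user_id"], row["language"], int(row["rank"]))
            )

    for statement, rows in rankings.items():
        print(statement, "-", len(rows), "rankings")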

I’m releasing all of this under a “let me know if you do anything cool with it” license. :-) Basically, it’s yours to play with as you see fit, but if anything interesting emerges out of it I really would like to hear about it!


The right tool for the job

James Iry recently wrote a post about type systems in which he ranked them along two axes: safety and sophistication. We had an interesting discussion about it on IRC afterwards, and one of the questions that came up was whether these were really the right axes.

It occurred to me that we spend a lot of time talking about the major axes on which to compare programming languages, and that it would be interesting to gather some data on what those axes actually are, instead of just plucking things that seem like they might make sense out of the air. So I’ve created a little app to gather data for this: check out The Right Tool. I’d love it if you’d rank some languages on it.

Essentially the idea is this: I have a list of statements and a list of programming languages, and I want you to rank those languages in order of how well the statements apply to them.

At the moment I’m just gathering data: I don’t actually do anything with it yet. The end goal is a very high-dimensional representation of each programming language, in terms of what people think of them and why they use them, to which I’ll apply some sort of dimensionality reduction, hopefully giving rise to some interesting metrics with which to compare languages.
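
To give a flavour of what I mean, here’s a minimal sketch using PCA as a stand-in for whatever reduction I actually end up doing (the languages and mean ranks below are entirely made up):

    import numpy as np

    # Toy language-by-statement matrix of mean ranks (made-up numbers);
    # rows are languages, columns are statements.
    languages = ["Python", "Haskell", "C"]
    scores = np.array([
        [1.2, 2.5, 3.1],
        [2.8, 1.1, 2.9],
        [3.0, 2.7, 1.4],
    ])

    # PCA by hand: centre the columns, then project onto the top
    # two singular vectors to get each language as a 2D point.
    centred = scores - scores.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    coords = u[:, :2] * s[:2]

    for lang, (x, y) in zip(languages, coords):
        print(f"{lang}: ({x:.2f}, {y:.2f})")

The hope is that the leading components turn out to mean something, rather than just being noise.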

A couple of notes:

  • This is not about which language is best. It’s about comparing the strengths and weaknesses of different languages.
  • I’m not making the data available yet, but I promise I will later, once I’ve gathered enough.
  • The site looks like ass. I know. I’m not a web designer. Deal with it or send me a better stylesheet.
  • If I’ve missed any languages out, let me know. Drop me a comment or something. Similarly if you have ideas for good statements you haven’t seen on the site.

Many eyes make heavy work

We were talking in the office the other day about a fun little project for Twitter: basically, just looking at which pairs of hashtags get used together. After getting an hour and a half of sleep last night, then waking up and being unable to get back to sleep, I had some time on my hands, so I thought I’d throw it together.
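
The core of it is just counting co-occurring pairs. Something like this sketch, assuming each tweet has already been reduced to its set of hashtags (the tweets below are made up; the real version also has to fetch and clean the data):

    from collections import Counter
    from itertools import combinations

    # Each tweet reduced to its set of hashtags (toy data).
    tweets = [
        {"python", "programming"},
        {"python", "data"},
        {"python", "programming", "data"},
    ]

    # Count how often each unordered pair of tags appears together.
    pair_counts = Counter()
    for tags in tweets:
        for pair in combinations(sorted(tags), 2):
            pair_counts[pair] += 1

    for (a, b), n in pair_counts.most_common():
        print(f"#{a} / #{b}: {n}")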

Getting and munging the data into a form that gave tweet similarity was easy enough. But what to do with it then? The obvious thing to do is to visualise the resulting graph.

We have our own visualisation software at Trampoline (which I did try on this data; it does fine), but I wanted something smaller and more standalone. I’d heard people saying good things about IBM’s Many Eyes (in retrospect I may have to challenge them to duels for the sheer effrontery of this claim), so I thought I’d give it a go.

Let me start by saying that there is one single feature that would change my all-encompassing burning loathing for Many Eyes into a mild dislike. It claims to let you edit data you have uploaded. Except that the button is disabled, with the helpful tooltip “Editing of data is currently disabled”.

This renders the entire fucking site useless, because it turns what should be a trivial operation (editing the data you’ve uploaded to see how it changes the visualisation) into a massive ordeal. You need to create an entirely new dataset, label it, add it, recreate the visualisation…

Fortunately, recreating the visualisation isn’t that hard. After all, Many Eyes doesn’t actually give you any functionality for customising your visualisation (maybe it does for some of the other visualisation types; it sure doesn’t for the network viewer).

So why did I need to tinker with the data? Isn’t it just a question of uploading a data set, feeding it into IBM’s magic happy wonderground of visualisation and going “Oooooh” at the pretty pictures?

Well, it sort of is.

Actually, what it is is uploading the data set, feeding it into IBM’s magic happy wonderground of visualisation and going “Aaaaaaaargh” as my browser grinds to a halt and then crashes.

It’s understandable really. I did just try to feed their crapplet an entire one point six MEGA bytes of data (omgz).

Wait, no. That’s not understandable at all. In fact it’s completely shit. That corresponds to about 12K nodes and 60K edges, which is *not* a particularly large graph (metascope happily lays it out in a few tens of seconds). This is a goddamn data visualisation tool. The whole point is that you’re supposed to be able to feed it large amounts of data. At the very least it should tell you “Sorry, I was written by monkeys, so I probably can’t handle the not-particularly-large amount of data you have just fed me”.

So, I spent some time trying to prune the data down to a size Many Eyes could handle without failing dismally, while keeping the graph large enough to still be interesting (the pruning itself is sketched after the list below). This was a hilarious process. Consider:

  • The only way to edit the data is to create an entirely new data set and recreate the visualisation.
  • The only way to determine that I’ve still got too much data is to observe that my browser crashes.
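
For what it’s worth, the pruning I kept redoing was nothing cleverer than throwing away light edges (and, with them, any nodes left isolated). A sketch, with a made-up edge list and threshold:

    # Prune a weighted edge list: drop edges below a weight threshold.
    # Nodes without any surviving edges simply disappear from the graph.
    def prune(edges, min_weight):
        """edges: dict mapping (tag_a, tag_b) -> co-occurrence count."""
        return {pair: w for pair, w in edges.items() if w >= min_weight}

    edges = {("python", "data"): 40, ("python", "cats"): 2}
    print(prune(edges, min_weight=5))  # {('python', 'data'): 40}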

After about half a dozen iterations of this I decided enough was enough and declared Many Eyes to be an unusable piece of shite that was not worth my time. Life’s too short for bad software.
