Archive for May, 2010

Databases for cooccurrences?

Friday, May 21st, 2010

This is a bit of a “Dear lazyweb” post.

I frequently find myself having the following setup, and I’ve never really come up with a solution that I find terribly satisfactory or reusable:

I have a number of events and a number of labels, and a set of tags which associate labels with events that can be added or removed. Each event-label pair is unique – you can’t tag the same thing with the same label multiple times.

I want to support the following queries:

  1. Given a label, how many events has it occurred in?
  2. Given two labels, how often did they co-occur?
  3. Given a label, give me all labels which have co-occurred with it in at least N events, and the number of events they have co-occurred in
  4. Some sort of “dump” operation which will give me all tuples “label1, label2, label1_occurrences, label2_occurrences, cooccurrences” in a file or through some sort of streaming interface

I need to support quite large numbers of tags and events (the tags are usually words, and so end up eventually using an appreciable proportion of the english language. The events are usually documents, but could easily number in the thousands). I’m ok with a time lag of even a few minutes between data coming in and being visible in the coocurrences “table”.

I’ve previously done this in an SQL database, because there was usually one to hand. It works “ok”. It doesn’t perform brilliantly, and you end up with annoying trade offs between performance, simplicity and consistency (that’s always going to be the case to a certain extent, but I felt that SQL was a particularly unsuitable medium here). Well I’ve got to do it again, and this time there’s not inherently a database built in to the app already, so I’m looking further afield. I’m half-heartedly looking into using redis, but I don’t really expect it will actually be that much help.

Any suggestions for alternatives?

The Right Data

Sunday, May 16th, 2010

As promised, I’m now making data dumps of The Right Tool available. You can download them here. The data there will be updated daily.

This contains all the statements and languages (including ones which I’ve since removed from the site), and all rankings of them with an identifier to distinguish users. It doesn’t contain any of the calculated data (’cause there’s a lot of it. I may open source some of the calculation code later, but in the meantime ask me if you want to reproduce the results).

I’m releasing all of this under a “let me know if you do anything cool with it” license. :-) Basically, it’s yours to play with as you see fit, but if anything interesting emerges out of it I really would like to hear about it!

Rank Aggregation in The Right Tool

Tuesday, May 11th, 2010

As you’ve probably noticed, over the weekend I created The Right Tool, for discovering how we describe programming languages. In response to massive amounts of data and feedback I’ve thrown together some pages for seeing the results of which statements describe which languages well.

There’s one set for which statements match a given language and one for how well each language matches a given statement.

The method I’m using is straightforward enough, but quite neat. It’s an adaptation of a method from “Rank Aggregation methods For The Web, by Dwork, Kumar, Naor and Sivakumar” which they refer to as MC4. I’ve modified it to better take into account shades of grey and to be more resilient to rare events.

It belongs to a general class of scoring mechanisms based on Markov Chains. Google’s pagerank is an instance of these. Essentially what you do is define a Markov chain on the things you want to score. You try to arrange it so that transitions to good nodes are more likely than transitions to bad nodes. You then find the stationary distribution for this markov chain (the long-term probability of being at a particular node). The idea is that if transitions to good nodes are more likely than transitions to bad nodes, the nodes which you spend more time in are more likely to be good. These probabilities then form a measure of what you are trying to rank.

So, that’s all very well, but how do we decide what markov chain to define?

In MC4 the markov chain is defined as following: When at node A, pick a node at random. If the majority of people agree that that node is better than this one, transition there. Else stay where you are.

This has a couple of problems.

The first is this: How do we decide what constitutes “the majority of people”? A flat out majority is very sensitive to small numbers of voters: If only one person ranked both of two languages and ranked it the “wrong” way, we’re still going to treat it as better. I tried various things to make this work: One was just a lower bound on the number of votes we counted, one was modelling as a Bernoulli trial and requiring N (for N = 1 or 2. 1 seemed to work slightly better) standard deviations above average. It all felt a bit arbitrary though.

The second problem was shades of grey: It makes absolutely no distinction between language A being a lot better than language B and language A being slightly better than language B. As well as creating weird biases, this results in the method being very bad at getting the tail right: You get a lot of languages which everyone thinks are the pits, and they all get assigned a score of 0, so there’s no way to distinguish between them. It happened that the approximation I was using somewhat mitigated this, but it was still a bit annoying.

So I’ve modified the chain definition as follows: First pick a language B at random. Now consider the probability of language B beating A in a comparison, and jump to B with that probability, else stay at A. This means that when there’s only a slight benefit to B over A the score of B is only slightly higher than that of A, because we’ve still got a decent chance of transitioning back (it turns out that with only two languages A and B in the chain where B beats A with probability p we assign a score of p to B and 1 – p to A).

One final refinement: Obviously we don’t know the “true” probability of B beating A, so we have to estimate that from the data. The way I’ve done this is a very simple weighted bayesian average. We estimate it as (0.5 * N + Number of times B beat A) / (N + number of times B and A were compared), where N is some number representative of how hard we are to convince of a probability (I think I currently have it set to 5). For very small samples we’re artificially forcing it closer to half, and even for larger samples we’re never allowing it to quite hit 0 or 1.

That’s about it. We now have these probabilities. The ranking is then simply defined by ordering from highest to lowest probability.

This seems to work pretty well. I don’t have good metrics in place, but the lists it’s generating “feel” better than any of the previous incarnations, and it’s conceptually pleasing. It also appears to be quite spam resistant, which is a nice touch.

The right tool for the job

Saturday, May 8th, 2010

James Iry recently wrote a post about type systems in which he ranked type systems according to two characteristics, safety and sophistication. We had an interesting discussion about it on IRC afterwards, in which one of the things discussed was whether these were really the right axes.

It occurred to me that we spend a lot of time talking about the major axes on which you compare programming languages, and that it would be interesting to gather some data on what these actually were instead of just plucking things that seem like they may sense out of the air. So I’ve created a little app to gather data for this: Check out The Right Tool, and I’d love it if you’d rank some languages on it.

Essentially the idea is this: I have a list of statements and a list of programming languages, and I want you to rank those languages in order of how well the statements apply to them.

At the moment I’m just data gathering: I don’t actually do anything with the data. The end goal is that I’m going to have a very high dimensional representation of each programming language in terms of what people think of them and why they use them which I will do some sort of dimension reduction to, hopefully giving rise to some interesting metrics with which to compare languages.

A couple notes:

  • This is not about which language is best. It’s about comparing the strengths and weaknesses of different languages
  • I’m not making the data available yet, but I promise I will later once I’ve gathered enough
  • The site looks like ass. I know. I’m not a web designer. Deal with it or send me a better stylesheet
  • If I’ve missed any languages out, let me know. Drop me a comment or something. Similarly if you have ideas for good statements you haven’t seen on the site