
Rank aggregation basics: Local Kemeny optimisation

The technical content of this post is based heavily on “Rank Aggregation Revisited” by Ravi Kumar, Moni Naor, D. Sivakumar and Cynthia Dwork; it’s purely expository on top of that.

You’ve all seen Hammer Principle now, right? If not, go check it out.

The task it performs of aggregating all the individual opinions into a single overall ranking is harder than it looks. This is a post about some of the technical details involved.

The setup is as follows: We have a bunch of items, and a bunch of votes placing a subset of those items in order.

The naive idea one starts with is as follows: if the majority of people prefer A to B, rank A higher than B. This is an obvious thing to aim for. Unfortunately it’s impossible in general, even if everyone ranks every item. Consider the following set of votes:

1: A, B, C
2: B, C, A
3: C, A, B

Then 1 and 3 think A < B, 1 and 2 think B < C, and 2 and 3 think C < A. So if we tried to order by majority opinion we'd have A < B < C < A. This is Condorcet's paradox: with more than two items, the majority preferences can form a cycle, so no ranking can agree with all of them.
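To make the cycle concrete, here’s a quick Python tally of the example (the code and names are mine, just transcribing the three votes above):

```python
from itertools import combinations

# The three votes above, best first.
votes = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]

def majority_prefers(x, y):
    """True if more voters rank x above y than y above x."""
    wins = sum(1 for v in votes if v.index(x) < v.index(y))
    return wins > len(votes) - wins

for x, y in combinations("ABC", 2):
    winner, loser = (x, y) if majority_prefers(x, y) else (y, x)
    print(f"majority: {winner} < {loser}")
# majority: A < B
# majority: C < A
# majority: B < C
```

Each individual majority is a clear 2 to 1, but the three together form a cycle.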

Nevertheless, we hope to be able to do at least reasonably well.

One natural approach is called Kemeny optimisation: we try to find a ranking which minimizes the number of pairwise disagreements between the final ranking and the individual votes. That is, for a set of items I, votes V and an aggregate ranking r, the score is

K(r) = #{ (i, j, v) : i, j in I, v in V, (r(i) < r(j)) != (v(i) < v(j)) }

If r is a minimum for this score we say it is Kemeny optimal. There needn't be a unique Kemeny optimal solution: in the above example, all three of the individual votes are Kemeny optimal rankings.
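Spelled out in code, that definition looks something like this (my transcription; for simplicity it assumes every vote ranks every item):

```python
from itertools import combinations

def kemeny_score(r, votes):
    """Count pairwise disagreements between ranking r and the votes.

    r and each vote are lists of items, best first. Assumes every
    vote ranks every item, for simplicity.
    """
    return sum(
        1
        for i, j in combinations(r, 2)
        for v in votes
        if (r.index(i) < r.index(j)) != (v.index(i) < v.index(j))
    )

votes = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
print(kemeny_score(["A", "B", "C"], votes))  # 4
```

In the Condorcet example, each of the three votes scores 4 (it breaks exactly one of the three 2-1 majorities at a cost of 2, and pays 1 on each of the other two pairs), the three reversed orders score 5, and nothing does better, which is why all three votes are Kemeny optimal.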

Kemeny optimal rankings have the following nice majority voting property: let r be Kemeny optimal. If U and V partition the set of items, and for every u in U and v in V the majority think u < v, then for every u in U and v in V we have r(u) < r(v).

Proof: Suppose we had r(u) > r(v) for some u, v. By passing to a smaller u and a larger v we can ensure that there is no z with r(u) > r(z) > r(v) (take the smallest u such that r(u) > r(v), then the largest v such that r(u) > r(v)).

But then if we were to define the ranking r’ which is exactly r except that u and v are swapped, this would decrease K: because there are no points between them, the only pairwise disagreements it changes are those on u and v, so K(r’) - K(r) = #{ t : t(u) > t(v) } - #{ t : t(u) < t(v) }. But we know the majority think u < v, so this change in score is negative, and we have K(r’) < K(r), contradicting minimality. QED

We call the conclusion of this theorem the generalised Condorcet criterion (the Condorcet criterion is that if there’s a single element which beats all the others in a majority vote then it should be the winner). It’s a nice property: it is in some sense an approximation of majority voting. In many cases satisfying it will uniquely determine a great deal about the resulting ranking; it’s only when there’s ambiguity (as in the Condorcet paradox example) that it fails to do so. This should give us confidence that the Kemeny optimal solution is the right approach.

So, we want to calculate a Kemeny optimal solution. There’s just one problem: we’ve defined the problem in terms of a minimization over all rankings of n items, of which there are n!. This search space is, to borrow a technical term, freaking huge. This makes it a hard problem. In fact, finding a Kemeny optimal solution for rankings on n items is NP hard, even with only four votes. I won’t prove this here; read the paper if you care.

So, no Kemeny optimal solutions for us. How sad. What would work well as a substitute? Well, examining our proof that Kemeny optimal solutions satisfy the generalised Condorcet condition, an interesting feature appears: we don’t actually use anywhere that r is a global minimum. We only use the fact that swapping two adjacent elements can’t decrease the score. This suggests the following relaxation of the condition:

A ranking r is locally Kemeny optimal if swapping two adjacent elements of the ranking does not decrease K.

By the above observation, locally Kemeny optimal solutions satisfy the generalised Condorcet condition.

This turns our problem from a global search into a local one: we can start from any point in the search space and search locally, swapping adjacent pairs until we hit a minimum. This turns out to be quite easy to do. We basically run insertion sort: at step n we have the first n items in a locally Kemeny optimal order. We then swap the (n+1)th item backwards while a strict majority prefers it to its predecessor. This ensures no adjacent pair is against a strict majority, so swapping one would result in a greater than or equal K.

This is of course an O(n^2) algorithm. In fact, merely finding a locally Kemeny optimal solution can be done in O(n log(n)) (for much the same reason that you can sort better than insertion sort): you just take the directed graph of majority votes and find a Hamiltonian path in it. The nice thing about the insertion sort version is that it gives you a lot of control over where you start your search.
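Here’s a minimal sketch of that insertion sort in Python (names are mine; note that an item only moves past its predecessor on a strict majority, so ties stay in their starting order):

```python
def majority_prefers(x, y, votes):
    """True iff a strict majority of the votes ranking both x and y
    put x above y."""
    above = below = 0
    for v in votes:
        if x in v and y in v:
            if v.index(x) < v.index(y):
                above += 1
            else:
                below += 1
    return above > below

def local_kemenization(start, votes):
    """Insertion sort as described above: the first n items are kept in
    a locally Kemeny optimal order, and each new item is swapped
    backwards while a strict majority prefers it to its predecessor."""
    result = []
    for item in start:
        pos = len(result)
        while pos > 0 and majority_prefers(item, result[pos - 1], votes):
            pos -= 1
        result.insert(pos, item)
    return result

votes = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
print(local_kemenization(["C", "B", "A"], votes))  # ['B', 'C', 'A']
```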

The above algorithm however produces an ordering which is consistent with its starting point in the following sense:

If we start with a ranking r and then run the above algorithm on it to get a ranking r’, then whenever r’(u) < r’(v) we must have either r(u) < r(v) or a majority vote that u < v. This is easy to see: if r(u) > r(v) and the majority think that u > v, then when moving u backwards we would have to stop at v at the latest.

In fact, for any starting point there is only one locally Kemeny optimal solution which is consistent with it in this way. We can see this inductively: if it’s true for the first n - 1 items, then where can we put the nth item? The only possible place is where the above algorithm puts it. If it were placed any later, the ranking wouldn’t be locally Kemeny optimal; if it were placed any earlier, it wouldn’t be consistent, because the item would come before some item it came after in the original ordering, without a majority preferring it.

This unique consistent locally Kemeny optimal solution is called the local Kemenization of the ranking.

So, this process gives us the core idea of the rank aggregation mechanism we’re using: we pick a starting point according to some heuristic (I’ll explain the heuristic we’re using in a later post), and from that starting point calculate its local Kemenization. This gives us a ranking which is guaranteed to satisfy the generalised Condorcet condition, and thus likely to be a good ranking, but which also takes into account whatever our heuristic thought was important – this can matter a lot in cases where there is ambiguity in the ranking data.

This entry was posted in Numbers are hard, programming.

Chakchouka, of sorts

I was reading Wild Cooking earlier, and it mentioned Chakchouka. I didn’t read it too carefully, but it sounded delicious. So, when I got back from the gym feeling energetic rather than completely destroyed (for basically the first time since I embarked on this bizarre exercise in self-flagellation), with my body going “feeed meee proteeein!!!”, it was basically inevitable what I would make.

That is to say, something which was almost, but not completely, entirely unlike chakchouka.

The ingredient lists are broadly similar though, and I’m definitely chalking this one up as a win due to it being amazing.

Here’s what it was: two bell peppers (one yellow, one green), a handful of cherry tomatoes, pierced but otherwise whole, and three sliced Linda McCartney vegetarian sausages, all fried in olive oil with salt and a dollop of chipotle in adobo sauce (this stuff is amazing. Spicy, smoky and slightly sweet. Get yourself some and thank me later). At the last minute, I added two lightly scrambled eggs and stirred till they coated everything, then served the whole thing topped with coarsely grated Parmesan.

This was a LOT of food. Way more than I actually needed.

I have no regrets.

This entry was posted in Food.

Databases for cooccurrences?

This is a bit of a “Dear lazyweb” post.

I frequently find myself having the following setup, and I’ve never really come up with a solution that I find terribly satisfactory or reusable:

I have a number of events and a number of labels, and a set of tags associating labels with events; tags can be added or removed. Each event-label pair is unique – you can’t tag the same thing with the same label multiple times.

I want to support the following queries:

  1. Given a label, how many events has it occurred in?
  2. Given two labels, how often did they co-occur?
  3. Given a label, give me all labels which have co-occurred with it in at least N events, and the number of events they have co-occurred in
  4. Some sort of “dump” operation which will give me all tuples “label1, label2, label1_occurrences, label2_occurrences, cooccurrences” in a file or through some sort of streaming interface

I need to support quite large numbers of labels and events (the labels are usually words, and so eventually end up covering an appreciable proportion of the English language; the events are usually documents, but could easily number in the thousands). I’m OK with a time lag of even a few minutes between data coming in and it becoming visible in the cooccurrences “table”.

I’ve previously done this in an SQL database, because there was usually one to hand. It works “ok”. It doesn’t perform brilliantly, and you end up with annoying trade-offs between performance, simplicity and consistency (that’s always going to be the case to a certain extent, but I felt that SQL was a particularly unsuitable medium here). Well, I’ve got to do it again, and this time there isn’t a database built into the app already, so I’m looking further afield. I’m half-heartedly looking into using redis, but I don’t really expect it will actually be that much help.
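For concreteness, this is roughly the shape of the SQL version I keep writing, sketched here with Python’s sqlite3 (the schema and names are illustrative only):

```python
import sqlite3

# Illustrative schema: one row per (event, label) tag, as described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (
        event INTEGER NOT NULL,
        label TEXT NOT NULL,
        PRIMARY KEY (event, label)
    );
    CREATE INDEX tags_by_label ON tags (label);
""")

def occurrences(label):
    # Query 1: number of events a label occurs in.
    return conn.execute(
        "SELECT COUNT(*) FROM tags WHERE label = ?", (label,)
    ).fetchone()[0]

def cooccurrences(label1, label2):
    # Query 2: events in which both labels occur, via a self-join.
    return conn.execute(
        """SELECT COUNT(*) FROM tags a JOIN tags b ON a.event = b.event
           WHERE a.label = ? AND b.label = ?""",
        (label1, label2),
    ).fetchone()[0]

def cooccurring_labels(label, n):
    # Query 3: labels co-occurring with `label` in at least n events.
    return conn.execute(
        """SELECT b.label, COUNT(*) FROM tags a JOIN tags b
           ON a.event = b.event AND b.label != a.label
           WHERE a.label = ? GROUP BY b.label HAVING COUNT(*) >= ?""",
        (label, n),
    ).fetchall()
```

Queries 2 and 3 are self-joins on the event column (and the dump in query 4 is essentially the same join over everything), and it’s those joins that start to hurt as the table grows.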

Any suggestions for alternatives?

This entry was posted in programming.

The Right Data

As promised, I’m now making data dumps of The Right Tool available. You can download them here. The data there will be updated daily.

This contains all the statements and languages (including ones which I’ve since removed from the site), and all rankings of them with an identifier to distinguish users. It doesn’t contain any of the calculated data (’cause there’s a lot of it. I may open source some of the calculation code later, but in the meantime ask me if you want to reproduce the results).

I’m releasing all of this under a “let me know if you do anything cool with it” license. :-) Basically, it’s yours to play with as you see fit, but if anything interesting emerges out of it I really would like to hear about it!

This entry was posted in Data geeking, programming.

Rank Aggregation in The Right Tool

As you’ve probably noticed, over the weekend I created The Right Tool, for discovering how we describe programming languages. In response to massive amounts of data and feedback I’ve thrown together some pages for seeing the results of which statements describe which languages well.

There’s one set for which statements match a given language and one for how well each language matches a given statement.

The method I’m using is straightforward enough, but quite neat. It’s an adaptation of a method from “Rank Aggregation Methods for the Web” by Dwork, Kumar, Naor and Sivakumar, which they refer to as MC4. I’ve modified it to better take into account shades of grey and to be more resilient to rare events.

It belongs to a general class of scoring mechanisms based on Markov chains; Google’s PageRank is an instance of these. Essentially what you do is define a Markov chain on the things you want to score, arranged so that transitions to good nodes are more likely than transitions to bad nodes. You then find the stationary distribution of this Markov chain (the long-term probability of being at a particular node). Since transitions to good nodes are more likely, the nodes you spend more time in are more likely to be good, and these probabilities form a measure of what you are trying to rank.
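As a sketch of the general recipe, assuming the transition matrix has already been built (power iteration here purely because it’s the simplest thing that works):

```python
import numpy as np

def stationary_distribution(P, iterations=1000):
    """Long-run probability of being at each node of a Markov chain.

    P is row stochastic: P[i, j] is the probability of stepping from
    node i to node j. Power iteration is the simplest approach; a real
    system would check convergence properly.
    """
    pi = np.full(len(P), 1.0 / len(P))
    for _ in range(iterations):
        pi = pi @ P
    return pi
```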

So, that’s all very well, but how do we decide what Markov chain to define?

In MC4 the Markov chain is defined as follows: when at node A, pick a node at random. If the majority of people agree that that node is better than this one, transition there; else stay where you are.

This has a couple of problems.

The first is this: how do we decide what constitutes “the majority of people”? A flat-out majority is very sensitive to small numbers of voters: if only one person ranked both of two languages and got them the “wrong” way round, we’d still treat that as a majority. I tried various things to make this work: one was just a lower bound on the number of votes counted, another was modelling the comparison as a Bernoulli trial and requiring the preference to be N standard deviations above average (for N = 1 or 2; 1 seemed to work slightly better). It all felt a bit arbitrary though.

The second problem was shades of grey: It makes absolutely no distinction between language A being a lot better than language B and language A being slightly better than language B. As well as creating weird biases, this results in the method being very bad at getting the tail right: You get a lot of languages which everyone thinks are the pits, and they all get assigned a score of 0, so there’s no way to distinguish between them. It happened that the approximation I was using somewhat mitigated this, but it was still a bit annoying.

So I’ve modified the chain definition as follows: First pick a language B at random. Now consider the probability of language B beating A in a comparison, and jump to B with that probability, else stay at A. This means that when there’s only a slight benefit to B over A the score of B is only slightly higher than that of A, because we’ve still got a decent chance of transitioning back (it turns out that with only two languages A and B in the chain where B beats A with probability p we assign a score of p to B and 1 – p to A).
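(A quick way to see that: in the stationary distribution the probability flow between the two languages must balance, so pi(A) * p = pi(B) * (1 - p); combined with pi(A) + pi(B) = 1, this gives pi(B) = p and pi(A) = 1 - p.)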

One final refinement: obviously we don’t know the “true” probability of B beating A, so we have to estimate it from the data. The way I’ve done this is a very simple weighted Bayesian average: we estimate the probability as (0.5 * N + number of times B beat A) / (N + number of times B and A were compared), where N is some number representing how hard we are to convince of a probability (I currently have it set to 5). For very small samples this artificially forces the estimate closer to a half, and even for larger samples it never quite hits 0 or 1.
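Putting those pieces together, here’s a sketch of how the whole chain might be assembled from raw comparison counts (names are mine, and beat[a][b] is assumed to hold the number of comparisons in which language b beat language a; stationary_distribution is the function sketched earlier):

```python
import numpy as np

def transition_matrix(beat, smoothing=5.0):
    """Modified MC4 chain built from pairwise comparison counts.

    beat[a][b] = number of comparisons in which b beat a. The jump
    probability is the weighted Bayesian average described above:
    (0.5 * N + wins) / (N + comparisons), with N = smoothing.
    """
    n = len(beat)
    P = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            total = beat[a][b] + beat[b][a]
            p_b_beats_a = (0.5 * smoothing + beat[a][b]) / (smoothing + total)
            # From a: pick one of the other n - 1 languages uniformly,
            # then jump with the estimated probability that it beats a.
            P[a, b] = p_b_beats_a / (n - 1)
        P[a, a] = 1.0 - P[a].sum()  # otherwise stay where we are
    return P

# scores = stationary_distribution(transition_matrix(beat))
# ranking = np.argsort(-scores)  # highest stationary probability first
```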

That’s about it. We now have the stationary probabilities, and the ranking is simply defined by ordering from highest to lowest.

This seems to work pretty well. I don’t have good metrics in place, but the lists it’s generating “feel” better than any of the previous incarnations, and it’s conceptually pleasing. It also appears to be quite spam resistant, which is a nice touch.

This entry was posted in Uncategorized.