Archive for the ‘programming’ Category

On hiring (developers)

Tuesday, August 17th, 2010

First off, we’re hiring. You should apply. Even if you don’t apply, check out the spec. This post will make more sense if you do, and I’d love it if you were to forward it to anyone you know who would fit.

We’ve yet to see if it will produce the right results, but one interesting thing has emerged from writing this job spec is that we’ve not done it as a collective. It’s largely been my doing, with editorial feedback from the rest of the team.

There are a couple consequences of this. Some good, some bad, some still in a superposition of the two.

we’re looking to hire someone who can do this job better than me

The part of it I especially liked is that the way the job requirements (both in the spec and as we’re viewing it) is that we’re looking at the person who currently is doing / would do the job and saying “You need to do the job better than him”. We’ve done this a bit less formally previously when hiring our new sysadmin, and it just sortof emerged naturally this time as a result of the fact that I was the one pushing the hardest to get this job spec done and posted. Whatever happens, I would very much want to retain this aspect of the process: Pick a person in the company who would do the job if you don’t hire for this role, make them responsible for speccing out the job and make them the primary decision maker on whether the person gets hired – other people get to veto, but they’re the one doing the initial approval by saying “Yes, I would rather have this job done by this candidate than done by (a perfect clone of) me”. I think this is very much the right way to do it.

15 years of BBQML

The fact that I wrote it brings with it a certain authorial style. This is both good and bad. If you’re reading this blog you probably are familiar with that style by now – whimsical, fairly abrasive and decidedly non-traditional. All of that comes through in the job posting.

Personally, I think that’s ok. Everyone in the company seems to have liked the posting (Well, they laughed. That’s good, right?), so enjoying the style is probably a good sign of a compatible personality. On the other hand it may turn out to be a bad thing – it may scare off some people who would actually work well for the role (I have at least one friend who is giving me a hard time because he thinks the posting is unprofessional and makes me sound like a wanker), on the other hand it may attract too many people who self-describe as rockstar ninja cyborg artist programmers.

If it turns out not to work, I will modify my style for postings in future, but I am hopeful.

we don’t want yet another [REDACTED]

The biggest feature of this posting that may or may not work in our favour is its honesty. It’s very much “This is who we are, this is what we’re looking for”. There are some tensions you can see behind it as a result. Will they scare people off? I hope not. As a company, we’re very much not perfect. No one is, and if you expect the company you’re applying to to be perfect then you’re in for a bit of a shock whomever they are. I hope that being up front about this is a good thing, and a nice departure from the whitewashed corporate-speak you get in a lot of job postings.

and…?

All in all, it’s been an interesting experiment. I look forward to seeing the results.

Let me know what you think.

The role of a CTO

Monday, July 19th, 2010

IRC conversation with Al Davidson

< DRMacIver> I must admit to only having a relatively vague idea of what a CTO job involves (and a much stronger conviction that I don’t want one any time soon).
< DRMacIver> As I’m fairly sure that craig isn’t typical of the role.
< al_> i think you’re right on that score
< DRMacIver> My impression is that it seems to be somewhere more in the realm of “You’re supposed to get to work on the fun stuff but so much of your time is taken up by boring administrative crap that you don’t get to, but it’s still your fault if it goes wrong”
< al_> bingo
< DRMacIver> I see the old chestnut of “take rosy eyed view of the world and apply cynicism until it starts to weep” still works.
< al_> david, i think you just described the human experience in a nutshell

What did you study?

Wednesday, July 7th, 2010

Most of the time when I tell people I did mathematics at university they reply with something along the lines of “Oh, I really hated that at school”.

My brother (who did philosophy) apparently often hears “So can you read my mind?”. I don’t quite get it.

Apparently musicians often hear “Oh, I played the X a while ago but stopped when I left school”. Sometimes suffixed with “Apparently I was good enough to do it professionally, but I didn’t want to”.

Beyond that, I don’t know. I’m sure most subjects have a wealth of stereotypical answers, but I’ve no idea what they are!

So… lets find out. I’ve created a site to gather all those answers which annoy you so much when you get them. It’s pretty limited so far: You just get to add answers and see other peoples’. If it goes well there are all sorts of cool things I can do with the data, but right now we’re at just the basics.

Have fun

Rank aggregation basics: Local Kemeny optimisation

Friday, June 18th, 2010

The technical content of this post is based heavily on Rank Aggregation Revisited” by Ravi Kuma, Moni Naorz, D. Sivakumarx and Cynthia Dwork. It’s purely expository in nature on top of that.

You’ve all seen Hammer Principle now, right? If not, go check it out.

The task it performs of aggregating all the individual opinions into a single overall ranking is harder than it looks. This is a post about some of the technical details involved.

The setup is as follows: We have a bunch of items, and a bunch of votes placing a subset of those items in order.

The naive idea one starts with is as follows: If the majority of people prefer A to B, rank A higher than B. This is an obvious thing to aim for. Unfortunately it’s impossible, even if everyone ranks every item. Considering the following set of votes:

1: A, B, C
2: B, C, A
3: C, A, B

Then 1 and 3 think A < B, 1 and 2 think B < C, and 2 and 3 think C < A. So if we tried to order by majority opinion we'd have A < B < C < A.

This is called Condorcet's paradox: Majority voting amongst > 2 items is impossible.

Nevertheless, we hope to be able to do at least reasonably well.

One natural approach is called Kemeny optimisation: We try to find a solution which minimizes the number of pairwise disagreements between the end rank and the individual voters. That is, for some set of items I, votes V and an aggregate ranking r, the score is

K(r) = #{ i, j in I, v in V, (r(i) < r(j)) != (v(i) < v(j)) }

If r is a minimum for this score we say it's Kemeny optimal. There needn't be a unique Kemeny optimal solution: In the above example, all three of the individual votes are Kemeny optimal rankings.

Kemeny optimal rankings have the following nice majority voting property:

Let r be Kemeny optimal. If U and V partition the set of items, and for every u in U and v in V the majority think u < v, then for every u in U and v in V r(u) < r(v).

Proof:

Suppose we had r(u) > r(v) for some u, v. By passing to a smaler u and a larger v we can ensure that there is no z with r(u) > r(z) > r(v). (we can take the smallest u such that r(u) > r(v) then the largest v such that r(u) > r(v)).

But then if we were to define the ranking r’ which is exactly r except that u and v were swapped, this would decrease K: Because there are no points between them, the only pairwise disagreements it changes are those on u and v, so K(r’) – K(r) = #{ t : t(u) < t(v) } - #{ t : t(u) > t(v)}. But we know the majority think u < v, so this change in score must be negative, so we have K(r') < K(r), contradicting minimality.

QED

We call the conclusion of this theorem the generalised Condorcet criterion (the condorcet criterion is that if there's a single element that beats all the others in a majority vote then it should be the winner). It's a nice property - it is in some sense an approximation of majority voting. In many cases satisfying it will uniquely determine a great deal about the resulting ranking - it's only when there's ambiguity (as in the Condorcet paradox example) that it fails to do so. This should give us confidence that the Kemeny optimal solution is the right approach.

So, we want to calculate a kemeny optimal solution.

There's just one problem: We've defined the problem in terms of a minimization over all rankings of n items, of which there are n!. This search space is, to borrow a technical term, freaking huge. This makes it a hard problem. In fact finding a Kemeny optimal solution for rankings on n items is NP hard, even with only four votes. I won't prove this here. Read the paper if you care.

So, no Kemeny optimal solutions for us. How sad.

What would work well as a substitute?

Well, examining our proof that Kemeny optimal solutions satisfy the generalised Condorcet condition, an interesting feature appears: We don't actually use anywhere that it is a global minimum. We only use the fact that swapping two adjacent pairs can’t decrease the score. This suggests the following relaxation of the condition:

A ranking r is locally Kemeny optimal if swapping two adjacent elements of the ranking does not decrease K.

By the above observation, locally Kemeny optimal solutions satisfy the generalized Condorcet condition.

This turns our problem from a global search to a local one: Basically we can start from any point in the search space and search locally by swapping adjacent pairs until we hit a minimum. This turns out to be quite easy to do. We basically run insertion sort: At step n we have the first n items in a locally Kemeny optimal order. Swap the n+1th item backwards until the majority think its predecessor is < it. This ensures all adjacent pairs are in the majority order, so swapping them would result in a greater than or equal K.

This is of course an O(n^2) algorithm. In fact, the problem of merely finding a locally Kemeny optimal solution can be done in O(n log(n)) (for much the same reason as you can sort better than insertion sort). You just take the directed graph of majority votes and find a Hamiltonian Path. The nice thing about the above version of the algorithm is that it gives you a lot of control over where you start your search.

The above algorithm however produces an ordering which is consistent with its starting point in the following sense:

If we start with a ranking r and then run the above algorithm on it to get a ranking r’, then if r’(u) < r'(v) we must have r(u) < r(v) or a majority vote that u < v.

This is easy to see: If r(u) > r(v) and the majority think that u > v then when moving u backwards we would have to stop at v at the latest.

In fact for any starting point there is only one kemeny optimal solution which is consistent for it in this way: We can see this inductively. If it’s true for the first n – 1 items, then where can we put the nth item? The only possible place is where the above algorithm puts it: If it were to put it after that place, it wouldn’t be Kemeny optimal. If it were to put it before it, it wouldn’t be consistent because it would be less than some item which it was greater than in the original ordering.

The unique locally Kemeny optimal solution is called the local Kemenization of the ranking.

So, this process gives us the core idea of the rank aggregation mechanism we’re using: We pick a starting point according to some heuristic (I’ll explain the heuristic we’re using in a later post), and from that starting point calculate its local Kemenisation. This gives us a ranking which is guaranteed to satisfy the generalised Condorcet condition, and thus likely to be a good ranking, but takes into account whatever our heuristic thought was important as well – this can matter a lot for cases where there is ambiguity in the ranking data.

Databases for cooccurrences?

Friday, May 21st, 2010

This is a bit of a “Dear lazyweb” post.

I frequently find myself having the following setup, and I’ve never really come up with a solution that I find terribly satisfactory or reusable:

I have a number of events and a number of labels, and a set of tags which associate labels with events that can be added or removed. Each event-label pair is unique – you can’t tag the same thing with the same label multiple times.

I want to support the following queries:

  1. Given a label, how many events has it occurred in?
  2. Given two labels, how often did they co-occur?
  3. Given a label, give me all labels which have co-occurred with it in at least N events, and the number of events they have co-occurred in
  4. Some sort of “dump” operation which will give me all tuples “label1, label2, label1_occurrences, label2_occurrences, cooccurrences” in a file or through some sort of streaming interface

I need to support quite large numbers of tags and events (the tags are usually words, and so end up eventually using an appreciable proportion of the english language. The events are usually documents, but could easily number in the thousands). I’m ok with a time lag of even a few minutes between data coming in and being visible in the coocurrences “table”.

I’ve previously done this in an SQL database, because there was usually one to hand. It works “ok”. It doesn’t perform brilliantly, and you end up with annoying trade offs between performance, simplicity and consistency (that’s always going to be the case to a certain extent, but I felt that SQL was a particularly unsuitable medium here). Well I’ve got to do it again, and this time there’s not inherently a database built in to the app already, so I’m looking further afield. I’m half-heartedly looking into using redis, but I don’t really expect it will actually be that much help.

Any suggestions for alternatives?