A possibly new unix-style utility

I’ve found a couple of times in the past that I have the following problem:

I’ve got a bunch of records of the form

key1: value1
key1: value2
key2: value3
...

And I want to transform them into the form

key1: value1, value2
key2: value3
...

so combining consecutive lines with the same initial key and joining the values together on one line.

It’s not a terribly hard problem, but doing it quickly and in a bounded amount of memory for very large volumes of data is at least non-trivial enough to require a bit of care.
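The core of the trick is a single streaming pass over the input. Here’s a minimal sketch in Python of the underlying idea – the function name, separators and interface are my own illustration, not squish’s actual options:

```python
import sys

def squish(lines, sep=": ", joiner=", ", out=sys.stdout):
    """Merge consecutive lines that share a key, streaming the output.

    Values are written as soon as they are seen, so memory use is
    bounded by the size of a single input line, not the whole input.
    """
    current_key = None
    for raw in lines:
        key, _, value = raw.rstrip("\n").partition(sep)
        if key == current_key:
            out.write(joiner + value)      # same key: extend the current line
        else:
            if current_key is not None:
                out.write("\n")            # finish the previous record
            out.write(key + sep + value)   # start a new record
            current_key = key
    if current_key is not None:
        out.write("\n")
```

Run over the records above, this emits the combined form directly, never holding more than one line’s worth of data at a time.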

In the past I generally solved this in some application-specific way, but recently I decided to do it properly and have extracted a nice little command line utility from it. It accepts a sensible range of options for configuring behaviour, and it runs in memory that will never grow past O(size of largest key). It’s also reasonably fast: I haven’t benchmarked it very extensively, but on all the data I’ve tried it on it’s about an order of magnitude faster than sorting the data (which I tend to want to do before feeding it into squish anyway), and about half the speed of uniq.

I didn’t have a good name for it, so I picked a bad one. It’s called “squish” and is available from my data-tools repository (aka “Where I put random crap”). If you have a better name for it I’m all ears.

It’s possible that this duplicates functionality of something that already exists. Anyone know if it does?

This entry was posted in Uncategorized.

Northpaw, part 1

So I’ve bought myself a Northpaw kit.

Why? Well, partly because I’m interested in this sort of thing – different modes of perception of and interaction with information – but mostly because it looked cool and I was in a gadget-buying mood.

After some grief from the Royal Mail it finally arrived on Friday, and I spent some time today at the London Hackspace putting it together.

I don’t have anything very exciting to report so far: I basically got it to work, but mangled the assembly. The electronics turned out to be relatively easy – the results aren’t elegant, as I have the manual dexterity of a blind walrus and haven’t soldered in 12 years, but they seem to work. The positioning of the motors, however, turns out to be rather a challenge.

I found the ribbon extremely hard to configure correctly, and the power connection for it very hard to get right in a way that still let me position the motors properly. I’m sure it’s doable, but I couldn’t do it. So the upshot is that I’m going to have to rebuild that part. Fortunately the motors all work and the electronics seem to talk to them correctly. My inclination is to take the ribbon mount off and wire it together with normal wires rather than ribbon.

Another update will be forthcoming when I’ve actually got the thing wearable.

This entry was posted in Uncategorized.

What did you study?

Most of the time when I tell people I did mathematics at university they reply with something along the lines of “Oh, I really hated that at school”.

My brother (who did philosophy) apparently often hears “So can you read my mind?”. I don’t quite get it.

Apparently musicians often hear “Oh, I played the X a while ago but stopped when I left school”. Sometimes suffixed with “Apparently I was good enough to do it professionally, but I didn’t want to”.

Beyond that, I don’t know. I’m sure most subjects have a wealth of stereotypical answers, but I’ve no idea what they are!

So… let’s find out. I’ve created a site to gather all those answers which annoy you so much when you get them. It’s pretty limited so far: you just get to add answers and see other people’s. If it goes well there are all sorts of cool things I can do with the data, but right now we’re at just the basics.

Have fun

This entry was posted in programming, Uncategorized.

Irrelevant alternatives aren’t

Every now and then someone discovers Arrow’s Impossibility Theorem and gets all excited. “Democracy is impossible! Let’s have a dictator!” they declare from the rooftops, or words to that effect. I certainly found this fascinating when I first discovered it.

Eventually they calm down. There are a lot of commonly mentioned reasons why Arrow’s impossibility theorem doesn’t have massive real world consequences – it’s not like anyone thought they were using a perfectly fair voting system in the first place, and the mechanism described in the theorem doesn’t actually correspond that closely to real world votes, which are mostly just trying to elect a single winner and don’t require nearly such strong consistency properties.

One reason I haven’t seen mentioned is the following: If it were possible to create a voting system which satisfied the criteria of Arrow’s impossibility theorem, it would be a bad idea. Independence of irrelevant alternatives, that the ordering of A and B doesn’t depend on the introduction of C, is an appealing condition on the face of it, but it turns out that you don’t actually want real world voting systems to have it. Consider the following set of opinions:

A > B > C
B > C > A
C > A > B
B > C > A

The numbers work out as follows:

There is a 50-50 split on A > B.
75% of people think that B > C.
75% of people think that C > A.

Therefore even though there is a tie between A and B, the only fair combined ordering is that B > C > A – any other ordering would make a lot more people unhappy. So the introduction of an apparently irrelevant alternative has taken a tie between A and B and broken it decisively.
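The tallies above are easy to check mechanically. A quick sketch, with the ballot representation being my own:

```python
# One ballot per voter, listed best-to-worst, matching the opinions above.
ballots = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
    ["B", "C", "A"],
]

def support(ballots, x, y):
    """Fraction of voters who rank x above y."""
    wins = sum(b.index(x) < b.index(y) for b in ballots)
    return wins / len(ballots)
```

This gives support of 0.5 for A over B, and 0.75 for both B over C and C over A, as claimed.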

This entry was posted in Uncategorized.

Rank Aggregation in The Right Tool

As you’ve probably noticed, over the weekend I created The Right Tool, for discovering how we describe programming languages. In response to massive amounts of data and feedback I’ve thrown together some pages for seeing the results of which statements describe which languages well.

There’s one set for which statements match a given language and one for how well each language matches a given statement.

The method I’m using is straightforward enough, but quite neat. It’s an adaptation of a method from “Rank Aggregation Methods for the Web” by Dwork, Kumar, Naor and Sivakumar, which they refer to as MC4. I’ve modified it to better take into account shades of grey and to be more resilient to rare events.

It belongs to a general class of scoring mechanisms based on Markov chains. Google’s PageRank is an instance of these. Essentially what you do is define a Markov chain on the things you want to score. You try to arrange it so that transitions to good nodes are more likely than transitions to bad nodes. You then find the stationary distribution for this Markov chain (the long-term probability of being at a particular node). The idea is that if transitions to good nodes are more likely than transitions to bad nodes, the nodes which you spend more time in are more likely to be good. These probabilities then form a measure of what you are trying to rank.
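The stationary distribution can be found by power iteration. A sketch, using a made-up two-node transition matrix in which transitions favour the second (“better”) node:

```python
def stationary(P, iters=1000):
    """Approximate the stationary distribution of a row-stochastic
    transition matrix P by repeatedly applying it to a uniform start."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Both nodes transition to node 1 more readily than to node 0,
# so the chain spends most of its time at node 1.
P = [[0.3, 0.7],
     [0.2, 0.8]]
```

For this matrix the chain settles at roughly (0.22, 0.78), so node 1 gets the higher score.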

So, that’s all very well, but how do we decide what Markov chain to define?

In MC4 the Markov chain is defined as follows: when at node A, pick a node at random. If the majority of people agree that that node is better than this one, transition there. Otherwise stay where you are.

This has a couple of problems.

The first is this: how do we decide what constitutes “the majority of people”? A flat-out majority is very sensitive to small numbers of voters: if only one person ranked both of two languages and ranked them the “wrong” way, we’re still going to treat the worse one as better. I tried various things to make this work: one was just a lower bound on the number of votes we counted; another was modelling each comparison as a Bernoulli trial and requiring the result to be N standard deviations above average (for N = 1 or 2; 1 seemed to work slightly better). It all felt a bit arbitrary though.

The second problem was shades of grey: It makes absolutely no distinction between language A being a lot better than language B and language A being slightly better than language B. As well as creating weird biases, this results in the method being very bad at getting the tail right: You get a lot of languages which everyone thinks are the pits, and they all get assigned a score of 0, so there’s no way to distinguish between them. It happened that the approximation I was using somewhat mitigated this, but it was still a bit annoying.

So I’ve modified the chain definition as follows: First pick a language B at random. Now consider the probability of language B beating A in a comparison, and jump to B with that probability, else stay at A. This means that when there’s only a slight benefit to B over A the score of B is only slightly higher than that of A, because we’ve still got a decent chance of transitioning back (it turns out that with only two languages A and B in the chain where B beats A with probability p we assign a score of p to B and 1 – p to A).
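Here’s a sketch of how that chain might be built from a table of pairwise win probabilities – the representation and helper names are mine, not the site’s actual code:

```python
def build_chain(win_prob):
    """Transition matrix for the modified chain.

    win_prob[x][y] is the probability that language x beats language y.
    From language i, pick another language j uniformly at random and
    jump there with probability win_prob[j][i]; otherwise stay at i.
    """
    n = len(win_prob)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j != i:
                P[i][j] = win_prob[j][i] / (n - 1)
        P[i][i] = 1.0 - sum(P[i])   # leftover mass: stay put
    return P

def stationary(P, iters=1000):
    """Stationary distribution by power iteration from a uniform start."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi
```

With just two languages where B beats A with probability 0.75, the stationary distribution comes out as (0.25, 0.75) – exactly the scores 1 – p and p noted above.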

One final refinement: obviously we don’t know the “true” probability of B beating A, so we have to estimate it from the data. The way I’ve done this is a very simple weighted Bayesian average. We estimate it as (0.5 * N + number of times B beat A) / (N + number of times B and A were compared), where N is some number representative of how hard we are to convince of a probability (I think I currently have it set to 5). For very small samples we’re artificially forcing the estimate closer to a half, and even for larger samples we’re never allowing it to quite hit 0 or 1.
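As a one-line sketch of that estimate (function name mine):

```python
def estimated_win_prob(wins, comparisons, n=5.0):
    """Weighted Bayesian average of B's win probability over A.

    Small samples are pulled towards 0.5, and even large samples never
    quite reach 0 or 1. n is the "how hard we are to convince" weight
    (the post mentions setting it to 5).
    """
    return (0.5 * n + wins) / (n + comparisons)
```

With no data at all this is exactly 0.5; after B wins 3 of 3 comparisons it is only 0.6875 rather than 1, and even 100 wins out of 100 leaves it just short of 1.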

That’s about it. We now have these probabilities. The ranking is then simply defined by ordering from highest to lowest probability.

This seems to work pretty well. I don’t have good metrics in place, but the lists it’s generating “feel” better than any of the previous incarnations, and it’s conceptually pleasing. It also appears to be quite spam resistant, which is a nice touch.

This entry was posted in Uncategorized.