Author Archives: david

Making my life more difficult

I’ve often been accused of subscribing to a philosophy of “Why do things the easy way when you can do them the hard way?”. Personally, I prefer to think of it as “Why do things the easy way when it’s wrong?”, but there’s no doubt that sometimes this makes my life harder than it needs to be.

One example of this is in job applications. I’m very much of the opinion that it’s not about finding a job but about finding the right one. As such, this means that I’m not willing to compromise on who I am for a company. If they wouldn’t approve of me as I am, I’m unlikely to have any interest in working for them.

Last time this resulted in this post. It still holds true, but I don’t feel inclined to make such a big deal of it this time. However, one thing I have this time which I didn’t have last time is a very non-traditional CV. I like it, but it won’t work for everyone. At least one company so far has had a distinct sense of humour failure regarding it. That’s fine – there are certainly no hard feelings on my part. My sense of humour and approach to life wouldn’t fit in everywhere, nor should it. As such, the CV style acts as a rather effective early warning sign for both sides.

This entry was posted in life.

Some useful scripts

I decided to have another go at using delicious.com to manage my bookmarks, as I’ve realised I have a serious problem: I can’t find stuff I’ve liked.

After playing around with it a bit, I was convinced to give it a serious try. There was just one problem: I already have the links I want to find. How do I get them all into delicious?

The ones I care most about are my reddit liked pages. So, how to get them out of reddit? Well, some poking around shows that although reddit doesn’t have a formal API it does expose a lot of its data in moderately convenient forms. So I wrote a reddit liked bulk exporter.
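
For the curious, the shape of the thing is roughly as follows. This is a sketch rather than the actual script: reddit will serve most listings as JSON if you stick .json on the end of the URL and pages them with an “after” token, but the exact endpoint, the cookie handling (the liked page wants your session cookie) and the output format here are all placeholder details.

  import java.net.URL
  import scala.io.Source
  import scala.util.parsing.json.JSON  // the old built-in JSON parser; any JSON library will do

  object RedditLikedExport {
    val user   = "yourname"            // hypothetical username
    val cookie = "reddit_session=..."  // session cookie copied out of your browser

    def fetch(after: Option[String]): String = {
      val url = "http://www.reddit.com/user/" + user + "/liked.json" +
        after.map("?after=" + _).getOrElse("")
      val conn = new URL(url).openConnection()
      conn.setRequestProperty("Cookie", cookie)
      Source.fromInputStream(conn.getInputStream).mkString
    }

    def main(args: Array[String]): Unit = {
      var after: Option[String] = None
      do {
        // Each page is a JSON "Listing": pull out the links and the token for the next page.
        val listing  = JSON.parseFull(fetch(after)).get.asInstanceOf[Map[String, Any]]
        val data     = listing("data").asInstanceOf[Map[String, Any]]
        val children = data("children").asInstanceOf[List[Map[String, Any]]]
        for (child <- children) {
          val link = child("data").asInstanceOf[Map[String, Any]]
          println(link("url") + "\t" + link("title") + "\t" + link("subreddit"))
        }
        after = data.get("after") match {
          case Some(s: String) => Some(s)
          case _               => None
        }
      } while (after.isDefined)
    }
  }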

Then it occurred to me that a large part of the utility of delicious was that everything was tagged, and that what was coming from reddit was very under-tagged (I could and did use the subreddits as tags, but that’s not quite enough). So I wrote a bulk title to tag converter. It’s really very basic – you’ll end up with a bunch of not so great tags as well as the good ones – but it does the job adequately.
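
The converter really is as dumb as it sounds – lowercase the title, split it into words, throw away stop words and very short tokens, and call whatever is left a tag. Something like this (the stop word list is illustrative, not the one the script actually uses):

  object TitleToTags {
    // A deliberately tiny stop word list; a real one would be longer.
    val stopWords = Set("the", "a", "an", "and", "or", "of", "to", "in", "on",
                        "for", "with", "how", "why", "what", "is", "are", "my", "your")

    def tags(title: String): Seq[String] =
      title.toLowerCase
           .split("[^a-z0-9]+")
           .toSeq
           .filter(w => w.length > 2 && !stopWords.contains(w))
           .distinct

    def main(args: Array[String]): Unit =
      // prints: make, life, more, difficult, scala, collections
      println(tags("How to make your life more difficult with Scala collections").mkString(", "))
  }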

So now came the time to get it into delicious. Delicious has an API, so it was straightforward enough. I wrote a bulk delicious importer.
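
The v1 API is just HTTP basic auth plus a posts/add call with the url, description and tags as query parameters, so the importer is not much more than a loop. Something along these lines (the tab-separated input format is simply whatever the earlier scripts spit out, and the sleep is there because delicious asks you to throttle your requests):

  import java.net.{Authenticator, PasswordAuthentication, URL, URLEncoder}
  import scala.io.Source

  object DeliciousImport {
    def main(args: Array[String]): Unit = {
      val (user, pass) = ("yourname", "yourpassword")  // delicious credentials

      // The v1 API uses HTTP basic auth; java.net.Authenticator takes care of it.
      Authenticator.setDefault(new Authenticator {
        override def getPasswordAuthentication =
          new PasswordAuthentication(user, pass.toCharArray)
      })

      def enc(s: String) = URLEncoder.encode(s, "UTF-8")

      // One bookmark per line: url <tab> title <tab> space-separated tags
      for (line <- Source.fromFile("bookmarks.tsv").getLines()) {
        val Array(url, title, tags) = line.split("\t")
        val call = "https://api.del.icio.us/v1/posts/add" +
          "?url=" + enc(url) + "&description=" + enc(title) + "&tags=" + enc(tags)
        Source.fromInputStream(new URL(call).openConnection().getInputStream).mkString
        Thread.sleep(1000)  // be polite to the API
      }
    }
  }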

All of these are super basic scripts, and they have a long way to go before they become fully featured tools in their own right, but they’re all at least mildly useful as they currently stand. I may elaborate one or more of them into a proper project, or I may not bother, but either way maybe someone else will find these useful.

Update: And the next step: StumbleUpon.

StumbleUpon does not expose anything that even resembles an API. Were I the paranoid sort I would say that in fact they go out of their way to make their site unusable for anything except a browser running the plugin. Fortunately, I *had* a browser running the plugin. After about half an hour of dicking around with curl and cookies and getting nowhere fast, I ended up automating scraping the list view of my StumbleUpon favourites with a trivial Chickenfoot script (I’ve lost the code, but it really was trivial). I then wrote this script to parse the resulting HTML and output it as JSON suitable for the delicious bulk import.
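
The parsing script is nothing special either. Roughly this shape, assuming the list view has been saved to disk and that each favourite shows up as an ordinary link in the markup – the regex and the file name are guesses you would adjust to whatever the saved page actually contains:

  import scala.io.Source

  object StumbleUponFavourites {
    val link = """<a[^>]+href="(http[^"]+)"[^>]*>([^<]+)</a>""".r

    def main(args: Array[String]): Unit = {
      val html = Source.fromFile("favourites.html").mkString
      def esc(s: String) = s.replace("\\", "\\\\").replace("\"", "\\\"")
      // One JSON object per favourite, in a top-level array.
      val entries = for (m <- link.findAllIn(html).matchData.toList) yield
        "{\"url\": \"" + esc(m.group(1)) + "\", \"title\": \"" + esc(m.group(2)) + "\"}"
      println(entries.mkString("[\n  ", ",\n  ", "\n]"))
    }
  }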

This entry was posted in programming.

Lessons learned: Are you solving the right problem?

I like to think I’m pretty good at problem solving. How much this is really the case is up for debate, but either way I’ve learned a lot of useful little lessons in the course of testing it.

A really important one is this:

The problem you are trying to solve is often harder than the problem that you need to solve.

Problems frequently come with a set of requirements. It has to play games, let me do my email and make coffee. Well… does it really? Actually it’s much simpler to solve if you make your own damn coffee.

This is a slightly spurious example because the making coffee requirement is so obviously tacked on. But the general idea is common: Of the requirements your problem comes with, one or more may actually be non-essential, and dropping them can simplify your life substantially.

Let’s take two real examples:

The first is at work, on the SONAR system. The objective is to build profiles of people’s knowledge based on their authored content. So what do we do?

  1. For each document, assign it some tags that describe it
  2. For each user, look at the tags of the documents they’ve authored and give them a score for each tag based on those documents

This is an interesting example of an implicit requirement. We’ve broken the problem up into two steps, and each of those steps has thus become a requirement for the overall problem.

One of the first things I realised when I started working on SONAR is that this implicit requirement was making our life a lot harder than it needed to be. Automatic document tagging with an unknown set of tags is hard. There are lots of papers telling you how to do it, and none of them actually do a very good job (the best of them tag in relation to some background corpus, but that’s not so good when there are, for example, a lot of company-internal project names which aren’t present in that corpus).

Given that the end result on users is what matters, this realisation frees us from having to achieve a high precision per document and allows us to focus on recall (Translation: we don’t need to make sure all the results we get are good, we just need to make sure that the good results are contained within them). One minor rewrite later, our theme quality had gone up and our implementation complexity down.
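
To illustrate the point with a toy version (this is not SONAR code, just the shape of the idea): if the tagger is allowed to be noisy, a simple per-user aggregation washes the noise out, because junk tags turn up on one or two documents while an author’s real themes recur across most of them.

  object ProfileSketch {
    case class Doc(author: String, tags: Seq[String])

    // Score each tag for each author by the fraction of their documents it appears in.
    def profiles(docs: Seq[Doc]): Map[String, Map[String, Double]] =
      docs.groupBy(_.author).map { case (author, authored) =>
        val counts = authored.flatMap(_.tags).groupBy(identity).map { case (t, ts) => t -> ts.size }
        author -> counts.map { case (tag, n) => tag -> n.toDouble / authored.size }
      }

    def main(args: Array[String]): Unit = {
      val docs = Seq(
        Doc("alice", Seq("search", "lucene", "meeting")),  // "meeting" is tagger noise
        Doc("alice", Seq("search", "ranking")),
        Doc("alice", Seq("search", "lucene")))
      // search scores 1.0, lucene roughly 0.67, and the one-off tags trail at 0.33
      println(profiles(docs)("alice").toList.sortBy(-_._2))
    }
  }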

My next example is much more controversial. Which is also a relevant point: Not everyone is going to agree on whether a given feature is essential. I still think I’m right on this one, but then I would. I’ve more or less resigned myself to the idea that the battle is lost here, so this is purely illustrative rather than an attempt to convince.

The example is the 2.8 Scala collections library. The signatures of map and flatMap are quite complicated:

  def map[B, That](f: A => B)(implicit bf: BuilderFactory[B, That, Traversable[A]]): That
  def flatMap[B, That](f: A => Traversable[B])(implicit bf: BuilderFactory[B, That, Traversable[A]]): That

In comparison, an experimental collections API design I put together a while ago called alt.collections has substantially simpler signatures (equivalent to the old signatures in scala.collection, really):

  def map[S](f : T => S) : Traversable[S]
  def flatMap[S](f : T => Traversable[S]) : Traversable[S]

Why is my version so much simpler than Martin’s? Well, as much as I’d like to claim it’s because I’m an unparalleled genius of API design compared to Martin, that’s really not the case.

The reason for the comparative simplicity is that Martin’s version does something that mine does not.

The problem we’ve both set out to solve is that the return types of map and flatMap are very inconsistent in the current API: Sometimes they’re overridden to return the right type, sometimes they’re not (sometimes they can’t be). This upsets people. Essentially there is the following requirement that isn’t being met by the current collections API:

map and flatMap on a collection should return the same type of collection

Martin responds by meeting that requirement. I respond by taking it away. The design of alt.collections was centered around the idea that map and flatMap would not attempt to return the “right” type but would instead always return a Traversable. It then set about making it easy to get the right type out at the end.
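
To make that concrete, here is roughly what the trade-off looks like in use (the conversion at the end is just the standard one, not necessarily whatever alt.collections provides for getting the type you want back):

  val xs = Set(1, 2, 3)

  // 2.8 collections: the implicit BuilderFactory picks the "right" result type for you.
  val a = xs.map(_ % 2)                            // a: Set[Int] = Set(1, 0)

  // The alt.collections approach: map on a Traversable just gives you back a
  // Traversable, and you say what you actually want at the end.
  val b = (xs: Traversable[Int]).map(_ % 2)        // b: Traversable[Int]
  val c = (xs: Traversable[Int]).map(_ % 2).toSet  // back to Set(1, 0), if that's what you wanted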

More than one person has responded to this design with “I’m sorry, but that’s horrible” or similar, so clearly a lot of people think in this case the answer is “Yes, we are”. But I still find it a rather compelling example of how much simpler you can make the solution by first simplifying the problem.

This entry was posted in life, programming.

Many eyes make heavy work

We were talking in the office the other day about a fun little project for twitter: basically just looking at what pairs of hash tags get used together. After getting one and a half hours of sleep last night, then waking up and being unable to get back to sleep, I had some time to kill, so I thought I’d throw it together.

Getting and munging the data into a form that gave tweet similarity was easy enough. But what to do with it then? The obvious thing to do is to visualise the resulting graph.
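
The munging step itself is nothing clever: pull the hash tags out of each tweet and count how often each unordered pair turns up together. Something along these lines, assuming one tweet per line in a dump file:

  import scala.io.Source
  import scala.collection.mutable

  object HashTagPairs {
    val hashTag = """#(\w+)""".r

    def main(args: Array[String]): Unit = {
      val counts = new mutable.HashMap[(String, String), Int]
      for (tweet <- Source.fromFile("tweets.txt").getLines()) {
        val tags = hashTag.findAllIn(tweet).matchData.map(_.group(1).toLowerCase).toList.distinct
        for (a <- tags; b <- tags; if a < b)  // a < b keeps each pair once, in a canonical order
          counts((a, b)) = counts.getOrElse((a, b), 0) + 1
      }
      // Edge list for the visualiser: tag, tag, weight.
      for (((a, b), n) <- counts.toList.sortBy(-_._2))
        println(a + "\t" + b + "\t" + n)
    }
  }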

We have our own visualisation software at Trampoline (which I did try on this data. It does fine), but I wanted something smaller and more standalone. I’d heard people saying good things about IBM’s Many Eyes (in retrospect I may have to challenge them to duels for the sheer effrontery of this claim), so I thought I’d give it a go.

Let me start by saying that there is one single feature that would change my all-encompassing burning loathing for Many Eyes into a mild dislike. It alleges to let you edit data you have uploaded. Except that the button is disabled with the helpful tooltip “Editing of data is currently disabled”.

This renders the entire fucking site useless, because it turns what should be a trivial operation (editing the data you’ve uploaded to see how it changes the visualisation) into a massive ordeal. You need to create an entirely new dataset, label it, add it, recreate the visualisation…

Fortunately recreating the visualisation isn’t that hard. After all, Many Eyes doesn’t actually give you any functionality with which to customise your visualisation (maybe it does for some of the others. It sure doesn’t for the network viewer).

So why did I need to tinker with the data? Isn’t it just a question of upload a data set, feed it into IBM’s magic happy wonderground of visualisation and go “Oooooh” at the pretty pictures?

Well, it sort of is.

Actually what it is is upload the data set, feed it into IBM’s magic happy wonderground of visualisation and go “Aaaaaaaargh” as my browser grinds to a halt and then crashes.

It’s understandable really. I did just try to feed their crapplet an entire one point six MEGA bytes of data (omgz).

Wait, no. That’s not understandable at all. In fact it’s completely shit. That corresponds to about 12K nodes and about 60K edges. This is *not* a particularly large number (metascope happily lays it out in a few tens of seconds). This is a goddamn data visualisation tool. The whole point of it is that you’re supposed to be able to feed it large amounts of data. At the very least it should tell you “Sorry, I was written by monkeys so probably can’t handle the not particularly large amount of data you have just fed me”.

So, I spent some time trying to prune the data down to a size Many Eyes could manage not to fail dismally at, but where the graph was still large enough to be interesting. This was a hilarious process. Consider:

  • The only way to edit the data is to create an entirely new data set and recreate the visualisation.
  • The only way to determine that I’ve got too much is to observe that my browser crashes.

After about half a dozen iterations of this I decided enough was enough and declared Many Eyes to be an unusable piece of shite that was not worth my time. Life’s too short for bad software.

This entry was posted in Data geeking, programming.

Food discovery of the day

Fried polenta goes really well in Greek salad.

This entry was posted in Food.