Archive for the 'programming' Category

Some useful scripts

Saturday, June 13th, 2009

I decided to have another go at using delicious.com to manage my bookmarks, as I’ve realised I have a serious problem of not being able to find stuff I’ve liked.

After playing around with it a bit, I was convinced to give it a serious try. There was just one problem. I already have those links I want to find. How do I get them all into delicious?

The ones I care most about are my reddit liked pages. So, how to get them out of reddit? Well, some poking around shows that although reddit doesn’t have a formal API it does expose a lot of its data in moderately convenient forms. So I wrote a reddit liked bulk exporter.

Then it occurred to me that a large part of the utility of delicious was that everything was tagged, and that that coming from reddit was very under tagged (I could and did use the subreddits as tags, but that’s not quite enough). So I wrote a bulk title to tag converter. It’s really very basic – you’ll end up with a bunch of not so great tags as well as the good ones – but it does the job adequately.

So now came the time to get it into delicious. Delicious has an API, so it was straightforward enough. I wrote a bulk delicious importer.

All of these are super basic scripts, and they need a long way before they become fully featured tools in their own right, but they’re all at least mildly useful as they currently stand. I may elaborate one or more of them into a proper project, or I may not bother, but either way maybe someone else will find these useful.

Update: And the next Step: StumbleUpon.

StumbleUpon does not expose anything that even resembles an API. Were I the paranoid sort I would say that in fact they go out of their way to make their site unusable for anything except a browser running the plugin. Fortunately, I *had* a browser running the plugin. After about half an hour of dicking around with curl and cookies and getting nowhere fast I ended up automating scraping the list view of my stumbleupon favourites with a trivial chickenfoot script (I’ve lost the code, but it really was trivial). I then wrote this script to parse the resulting html and output it as json suitable for the delicious bulk import.

Lessons learned: Are you solving the right problem?

Monday, June 8th, 2009

I like to think I’m pretty good at problem solving. How much this is really the case is up for debate, but either way I’ve learned a lot of useful little lessons in the course of testing it.

A really important one is this:

The problem you are trying to solve is often harder than the problem that you need to solve.

Problems frequently come with a set of requirements. It has to play games, let me do my email and make coffee. Well… does it really? Actually it’s much simpler to solve if you make your own damn coffee.

This is a slightly spurious example because the making coffee requirement is so obviously tacked on. But the general idea is common: Of the requirements your problem comes with, one or more may actually be non-essential and dropping it can simplify your life substantially.

Let’s take two real examples:

The first is at work, on the SONAR system. The objective is to build profiles for peoples’ knowledge based on their authored content. So what do we do?

  1. For each document, assign it some tags that describe it
  2. For each user, look at the tags of documents they’ve authored and give them a score for that tag based on that

This is an interesting example of an implicit requirement. We’ve broken the problem up into two steps, and each of those steps has thus become a requirement for the overall problem.

One of the first things I realised when I started working on SONAR is that this implicit requirement was making our life a lot harder than it needed to be. Automatic document tagging with an unknown set of tags is hard. There are lots of papers telling you how to do it, and none of them actually do a very good job (the best of them all tag in relation to some background corpus, but that’s not so good when there are e.g. a lot of internal project names to the company which aren’t present in that corpus).

Given that the end result on users is what matters, this realisation frees us from having to achieve a high precision per document and allows us to focus on recall (Translation: we don’t need to make sure all the results we get are good, we just need to make sure that the good results are contained within them). One minor rewrite later, our theme quality had gone up and our implementation complexity down.

My next example is much more controversial. Which is also a relevant point: Not everyone is going to agree on whether a given feature is essential. I still think I’m right on this one, but then I would. I’ve more or less resigned myself to the idea that the battle is lost here, so this is purely illustrative rather than an attempt to convince.

The example is the 2.8 Scala collections library. The signatures of map and flatMap are quite complicated:

  def map[B, That](f: A => B)(implicit bf: BuilderFactory[B, That, Traversable[A]]): That
  def flatMap[B, That](f: A => Traversable[B])(implicit bf: BuilderFactory[B, That, Traversable[A]]): That

In comparison, an experimental collections API design I put together a while ago called alt.collections has substantially simpler signatures (equivalent to the old signatures in scala.collection really)

  def map[S](f : T => S) : Traversable[S]
  def flatMap[S](f : T => Traversable[S]) : Traversable[S]

Why is my version so much simpler than Martin’s? Well, as much as I’d like to claim it’s because I’m an unparalleled genius of API design compared to Martin, that’s really not the case.

The reason for the comparative simplicity is that Martin’s version does something that mine does not.

The problem we’ve both set out to solve is that the return types of map and flatMap are very inconsistent in the current API: Sometimes they’re overridden to return the right type, sometimes they’re not (sometimes they can’t be). This upsets people. Essentially there is the following requirement that isn’t being met by the current collections API:

map and flatMap on a collection should return the same type of collection

Martin responds by meeting that requirement. I respond by taking it away. The design of alt.collections was centered around the idea that map and flatMap would not attempt to return the “right” type but would instead always return a Traversable. It then set about making it easy to get the right type out at the end.

More than one person has responded to this design with “I’m sorry, but that’s horrible” or similar, so clearly a lot of people think in this case the answer is “Yes, we are”. But I still find it a rather compelling example of how much simpler you can make the solution by first simplifying the problem.

Many eyes make heavy work

Thursday, May 28th, 2009

We were talking in the office the other day about a fun little project for twitter. Basically just looking at what pairs of hash tags get used together. After getting one and a half hours sleep last night, waking up and being unable to get back to sleep I had some time to kill on my hands, so thought I’d throw it together.

Getting and munging the data into a form that gave tweet similarity was easy enough. But what to do with it then? The obvious thing to do is to visualise the resulting graph.

We have our own visualisation software at trampoline (which I did try on this data. It does fine), but I wanted something smaller and more stand alone. I’d heard people saying good things about IBM’s many eyes (in retrospect I may have to challenge them to duels for the sheer affrontery of this claim), so I thought I’d give it a go.

Let me start by saying that there is one single feature that would change my all encompassing burning loathing for Many Eyes into a mild dislike. It alleges to have the ability to let you edit data you have uploaded. Except that the button is disabled with the helpful tooltip “Editing of data is currently disabled”.

This renders the entire fucking site useless, because it takes what should be a trivial operation (editing the data you’ve uploaded to see how it changes the visualisation) into a massive ordeal. You need to create an entirely new dataset, label it, add it, recreate the visualisation…

Fortunately recreating the visualisation isn’t that hard. After all, Many Eyes doesn’t actually give you any functionality with which to customise your visualisation (maybe it does for some of the others. It sure doesn’t for the network viewer).

So why did I need to tinker with the data? Isn’t it just a question of upload a data set, feed it into IBM’s magic happy wonderground of visualisation and go “Oooooh” at the pretty pictures?

Well, it sortof is.

Actually what it is is upload the data set, feed it into IBM’s magic happy wonderground of visualisation and go “Aaaaaaaargh” as my browser grinds to a halt and then crashes.

It’s understandable really. I did just try to feed their crapplet an entire one point six MEGA bytes of data (omgz).

Wait, no. That’s not understandable at all. In fact it’s completely shit. That corresponds to about 12K nodes and about 60K edges. This is *not* a particularly large number (metascope happily lays it out in a few tens of seconds). This is a goddamn data visualisation tool. The whole point of it is that you’re supposed to be able to feed it large amounts of data. At the very least it should tell you “Sorry, I was written by monkeys so probably can’t handle the not particularly large amount of data you have just fed me”.

So, I spent some time trying to prune the data down to a size many eyes could manage to not fail dismally at but where the graph was still large enough to be interesting. This was a hilarious process. Consider:

  • The only way to edit the data is to create an entirely new data set and recreate the visualisation.
  • The only way to determine that I’ve got too much is to observe that my browser crashes.

After about half a dozen iterations of this I decided enough was enough and declared many eyes to be an unusable piece of shite that was not worth my time. Life’s too short for bad software.

Interest bandwidth

Friday, May 15th, 2009

Highlight from the #scala IRC channel:

13:21 < DRMacIver> I have a somewhat unfortunate theory. I suspect the amount one cares about
                   programming issues is inversely proportional to how inherently interesting one's
                   work is.
13:22 < DRMacIver> Or at least negatively correlated
13:22 < ijuma> DRMacIver: is this based on personal experience? ;)
13:22 < DRMacIver> Yes
13:22  * DRMacIver finds it much harder to get worked up about language issues these days
13:23 < DRMacIver> Which is in large part because I'm doing a lot more interesting borderline computer
                   sciencey work
13:24 < dgreensp> yeah, I do a lot of interesting work and only rarely stop to think about languages
13:24 < ijuma> yeah, I noticed. I think it makes sense. People have limited bandwidth and if work is
               interesting, it's likely to take quite a bit of it
13:25 < DRMacIver> I think the other issue is that people want to find what they do interesting. And
                   so if *what* they do isn't interesting they have to become interested in *how* they
                   do it.
13:25 < DRMacIver> But yes, the bandwidth thing is also a big part of it
13:26 < dgreensp> bandwidth minimum, bandwidth maximum :)
13:26 < DRMacIver> ?
13:27 < ijuma> DRMacIver: agreed
13:27 < dgreensp> er, the two points seemed related -- people need to be interested in something, but
                  they can't be interested in too many things at once
13:28 < DRMacIver> Oh, right.
13:28 < DRMacIver> Yes. That's a good way of looking at it.

A problem of language

Friday, May 15th, 2009

I try to stay out of language wars these days. I find the whole endeavour incredibly tedious. I don’t really feel like arguing whether OCaml is better than Ruby is better than Scala is better than brainfuck is better than C. I like them all (ok maybe not brainfuck), and there are valid arguments for and against each of them.

But one thing I have a lot of trouble with is bad arguments. Not “arguments I disagree with”, but arguments which are simply outright bad.

Robert Fischer has a post on his blog: Scala is not a functional language. It’s not the first time this idea has come up. It’s not an idea entirely without merit. Scala’s functional features are certainly not as seamless to use as one might hope.

The problem is that while it is not an idea without merit, its presentation is certainly one without content. As per usual, every time this comes up, no one is actually willing to say what on earth they mean by “functional programming language”. This problem is ubiquitous. Consider the following wikipedia entry:

In computer science, functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast to the imperative programming style, which emphasizes changes in state. Functional programming has its roots in the lambda calculus, a formal system developed in the 1930s to investigate function definition, function application, and recursion. Many functional programming languages can be viewed as embellishments to the lambda calculus.

Notice the bait and switch there? It defines “functional programming” and then talks about “functional programming languages“. Everyone does this. The only unambiguous definition you find which includes the term “functional programming language” is that of Purely functional language. This is much less ambiguous, but it’s also the case that the vast majority of people arguing about whether Scala (or other language of choice) is functional are comparing it to a decidedly impure language. Certainly neither Clojure nor OCaml are purely functional.

On the other hand, if you choose “functional programming language” to simply mean “supports functional programming” then suddenly you have to acknowledge various languages like Ruby as functional and for some reason this makes people uncomfortable. So instead we end up with all sorts of hand waving and mumbling and no one is able to have a useful discussion because everyone is too busy making assertions which can’t be argued with because they don’t mean anything.

So no one defines the term, but everyone seems to “know what it means”. And what it inevitably means is “shares features with this language which I use and like and consider to be a functional language”.

For a particular extreme example, consider the following conversation on reddit from the last time this subject came up:

vagif:
Well, here’s my understanding of this disagreement.
I think many people (me included) do not feel that Scala is functional enough. It is not because of the mutable variables. It all boils down to simple syntax. That’s why completely imperative and old fashioned language CL (i use sbcl) feels to me much more functional than Scala will ever be with its modern functional features like pattern matching, type inference and list comprehensions.
“Functional” for me means simple small core of orthogonal features.
“Functional” for me means – List processing.
“Functional” for me means no premature optimization (deciding to use arrays instead of lists because they are faster etc.)
Obviously, trying to accommodate java OO model, makes Scala way to complicated, arcane, baroque in its syntax, to feel functional enough.

DRMacIver:
I enjoy the fact that the word “function” does not appear in your definition of functional…

vagif:
And for a good reason. Functions are part of any imperative language. That does not make them more functional. It is all that infrastructure, that allows for easy juggling those functions, that makes a language functional. Passing them as parameters and return values, anonymous functions, closures, polymorphic functions (be it a result of type inference or dynamic nature of language). It is this small core of orthogonal features, all geared up for list processing, that truly creates a functional feeling in a programmer.

This sort of woolly thinking is endemic in these arguments. Please try to avoid it. If you’re not willing to define “functional programming language” in a way that is at least moderately unambiguous and doesn’t involve arguments about “feelings” or simply feature comparisons with some language which you state as an example of functional programming, don’t bother making arguments about whether a language is functional or not.

Of course, once you start defining the term people will start arguing about the definitions. This is pretty tedious, I know. But as tedious as arguing about definitions is, it can’t hold a candle to arguing without definitions.