Category Archives: programming

reddilicious: Automatically import your links from other sites into delicious

I appear to have done something highly out of character and created a tool which is simply useful, without any real theoretical interest to it.

I followed up on the useful scripts I posted a while ago and decided to turn them into something slightly more complete and robust. The result is reddilicious, a tool for automatically importing from various sites into delicious. (Note: it’s written in Ruby, mostly due to the rather excellent HTTParty and Mechanize libraries. If you’re coming from Planet Scala expecting to see my awesome Scala code, sorry: instead you get some rather grim Ruby code.)

It currently handles:

  • Reddit: Pulls in any pages you’ve voted up. Tags them with the subreddit and via:reddit.
  • Stumbleupon: Similarly pulls in any pages you’ve thumbed up. If you’ve written a blog entry about a page it comes in as an extended comment, and any tags you’ve given it come in as tags. Tags everything via:stumbleupon.
  • Twitter: Pulls in any links mentioned in your friends timeline. Tags them via:twitter, from:user, to:users (anyone mentioned with @ in the tweet), and with any hashtags in the tweet (see the sketch below).
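
For illustration, the tag derivation for a tweet is essentially just pattern matching on its text. A minimal sketch of the idea (this helper is hypothetical, not reddilicious’s actual internals):

  # Sketch: derive delicious tags from a single tweet.
  def tags_for_tweet(author, text)
    tags = ["via:twitter", "from:#{author}"]
    tags += text.scan(/@(\w+)/).flatten.map { |user| "to:#{user}" }
    tags += text.scan(/#(\w+)/).flatten
    tags
  end

  tags_for_tweet("alice", "Nice writeup @bob #scala #fp http://example.com/post")
  # => ["via:twitter", "from:alice", "to:bob", "scala", "fp"]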

It correctly handles historical data for all of them (for Twitter it only goes back to the limit of your friends timeline, not to everything ever mentioned by a friend of yours), with timestamps set appropriately (on reddit the timestamp is the post date rather than the date you voted it up).

It’s all very rough around the edges at the moment, but it works rather well if you’re prepared to put up with its quirks. The basic mechanism is as follows: there’s a script, reddilicious, in the distribution which handles all its operations (note: it currently gets unhappy if you refer to it via a symlink, due to path handling issues; fixing that is on my todo list). A reddilicious instance corresponds to a directory where it stores all its data (including passwords, in plaintext. Sorry). You create an instance with:

reddilicious create somedir

add accounts to it with e.g.

reddilicious twitter somedir

and update it with

reddilicious update somedir

update will pull in new items and post them to delicious. It should be pretty well behaved about not stomping on existing bookmarks when there are duplicates (though it will add additional tags to them, add an extended description if there isn’t one already, and so on).

Its logging output is very chatty (an artifact of my spending way too much time debugging it), and it currently logs to stdout. I have it running from cron with the output redirected to a file.
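
For reference, the cron setup is just a single line along these lines (the paths here are examples, and per the symlink issue above it should point at the real script rather than a link to it):

  # Run the import every half hour, appending output to a log file.
  */30 * * * * /home/me/reddilicious/bin/reddilicious update /home/me/reddilicious-data >> /home/me/reddilicious.log 2>&1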


Some useful scripts

I decided to have another go at using delicious.com to manage my bookmarks, as I’ve realised I have a serious problem of not being able to find stuff I’ve liked.

After playing around with it a bit, I was convinced to give it a serious try. There was just one problem. I already have those links I want to find. How do I get them all into delicious?

The ones I care most about are my reddit liked pages. So, how to get them out of reddit? Well, some poking around shows that although reddit doesn’t have a formal API, it does expose a lot of its data in moderately convenient forms. So I wrote a reddit liked bulk exporter.
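
The convenient form in question is that most reddit listing pages will hand you JSON if you append .json to the URL, your liked page included (it needs your session cookie, and it’s paginated via an “after” token). A rough sketch of the core loop, with login handling elided:

  require 'httparty'

  # Sketch: page through a reddit user's liked listing as JSON.
  # Assumes a captured logged-in session cookie; field names are as
  # reddit returned them at the time of writing.
  def liked_links(user, cookie)
    links, after = [], nil
    loop do
      page = HTTParty.get("http://www.reddit.com/user/#{user}/liked.json",
                          :query   => after ? { :after => after } : {},
                          :headers => { 'Cookie' => cookie })
      links += page['data']['children'].map { |child| child['data'] }
      after  = page['data']['after']
      break unless after
    end
    links
  end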

Then it occurred to me that a large part of the utility of delicious was that everything was tagged, and that what was coming in from reddit was very under-tagged (I could and did use the subreddits as tags, but that’s not quite enough). So I wrote a bulk title to tag converter. It’s really very basic – you’ll end up with a bunch of not so great tags as well as the good ones – but it does the job adequately.
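
The idea is barely more than splitting the title into words and throwing away the obviously useless ones. A sketch (the stopword list obviously wants to be much longer):

  # Sketch: naive title-to-tags conversion. Split on non-word
  # characters, drop short words and common stopwords, keep the rest.
  STOPWORDS = %w[the a an and or of to in for on with is are your how why what]

  def title_to_tags(title)
    title.downcase.split(/\W+/).reject { |w| w.length < 3 || STOPWORDS.include?(w) }.uniq
  end

  title_to_tags("How to Write a Parser in Haskell")
  # => ["write", "parser", "haskell"]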

So now came the time to get it into delicious. Delicious has an API, so it was straightforward enough. I wrote a bulk delicious importer.
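
The delicious side is one authenticated GET per bookmark against the v1 API’s posts/add endpoint (that’s the real endpoint; error handling is omitted here, and delicious asks clients to throttle to about one request per second):

  require 'httparty'

  # Sketch: add a single bookmark via the delicious v1 API.
  # `link` is a hash built by the exporter; replace=no avoids
  # stomping on bookmarks that already exist.
  def add_bookmark(user, password, link)
    HTTParty.get('https://api.del.icio.us/v1/posts/add',
                 :basic_auth => { :username => user, :password => password },
                 :query => { :url         => link[:url],
                             :description => link[:title],
                             :extended    => link[:extended] || '',
                             :tags        => link[:tags].join(' '),
                             :dt          => link[:time],       # ISO8601 timestamp
                             :replace     => 'no' })
    sleep 1
  end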

All of these are super basic scripts, and they have a long way to go before they become fully featured tools in their own right, but they’re all at least mildly useful as they currently stand. I may elaborate one or more of them into a proper project, or I may not bother, but either way maybe someone else will find these useful.

Update: And the next step: StumbleUpon.

StumbleUpon does not expose anything that even resembles an API. Were I the paranoid sort I would say that they in fact go out of their way to make their site unusable for anything except a browser running their plugin. Fortunately, I *had* a browser running the plugin. After about half an hour of dicking around with curl and cookies and getting nowhere fast, I ended up scraping the list view of my StumbleUpon favourites with a trivial Chickenfoot script (I’ve lost the code, but it really was trivial). I then wrote this script to parse the resulting HTML and output it as JSON suitable for the delicious bulk import.
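
The shape of that parsing script is roughly the following. The CSS selectors here are made up – they’d have to be adapted to whatever StumbleUpon’s list-view markup actually looks like on the day:

  require 'nokogiri'
  require 'json'

  # Sketch: turn a saved StumbleUpon favourites page into JSON for
  # the delicious bulk importer. Selectors are hypothetical.
  favourites = Nokogiri::HTML(File.read('favourites.html')).css('li.favourite').map do |item|
    link = item.at_css('a.url')
    { :url   => link['href'],
      :title => link.text.strip,
      :tags  => item.css('a.tag').map(&:text) }
  end

  puts JSON.pretty_generate(favourites)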


Lessons learned: Are you solving the right problem?

I like to think I’m pretty good at problem solving. How much this is really the case is up for debate, but either way I’ve learned a lot of useful little lessons in the course of putting it to the test.

A really important one is this:

The problem you are trying to solve is often harder than the problem that you need to solve.

Problems frequently come with a set of requirements. It has to play games, let me do my email and make coffee. Well… does it really? Actually it’s much simpler to solve if you make your own damn coffee.

This is a slightly spurious example, because the coffee-making requirement is so obviously tacked on. But the general idea is common: of the requirements your problem comes with, one or more may actually be non-essential, and dropping them can simplify your life substantially.

Let’s take two real examples:

The first is at work, on the SONAR system. The objective is to build profiles of people’s knowledge based on their authored content. So what do we do?

  1. For each document, assign it some tags that describe it
  2. For each user, look at the tags of the documents they’ve authored and give them a score for each tag based on those documents

This is an interesting example of an implicit requirement. We’ve broken the problem up into two steps, and each of those steps has thus become a requirement for the overall problem.

One of the first things I realised when I started working on SONAR is that this implicit requirement was making our life a lot harder than it needed to be. Automatic document tagging with an unknown set of tags is hard. There are lots of papers telling you how to do it, and none of them actually do a very good job (the best of them tag in relation to some background corpus, but that’s not so good when there are, for example, a lot of company-internal project names which aren’t present in that corpus).

Given that the end result on users is what matters, this realisation frees us from having to achieve high precision per document and allows us to focus on recall (translation: we don’t need to make sure all the results we get are good, we just need to make sure that the good results are contained within them). One minor rewrite later, our theme quality had gone up and our implementation complexity had gone down.
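
Concretely, the shape of this is something like the following toy sketch (nothing like the actual SONAR code): tag each document generously, then let aggregation across a user’s documents separate signal from noise.

  # Toy sketch: each document gets a generous (recall-oriented) set of
  # tags; aggregating per user lets the consistently recurring tags
  # dominate while one-off bad tags stay in the noise floor.
  def profile(tagged_documents)
    scores = Hash.new(0)
    tagged_documents.each do |tags|
      tags.each { |tag| scores[tag] += 1 }
    end
    scores.sort_by { |_tag, score| -score }
  end

  docs = [%w[scala collections junk1], %w[scala types junk2], %w[scala collections junk3]]
  profile(docs).first(2)
  # => [["scala", 3], ["collections", 2]]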

My next example is much more controversial. Which is also a relevant point: Not everyone is going to agree on whether a given feature is essential. I still think I’m right on this one, but then I would. I’ve more or less resigned myself to the idea that the battle is lost here, so this is purely illustrative rather than an attempt to convince.

The example is the 2.8 Scala collections library. The signatures of map and flatMap are quite complicated:

  def map[B, That](f: A => B)(implicit bf: BuilderFactory[B, That, Traversable[A]]): That
  def flatMap[B, That](f: A => Traversable[B])(implicit bf: BuilderFactory[B, That, Traversable[A]]): That

In comparison, an experimental collections API design I put together a while ago, called alt.collections, has substantially simpler signatures (equivalent to the old signatures in scala.collection, really):

  def map[S](f : T => S) : Traversable[S]
  def flatMap[S](f : T => Traversable[S]) : Traversable[S]

Why is my version so much simpler than Martin’s? Well, as much as I’d like to claim it’s because I’m an unparalleled genius of API design compared to Martin, that’s really not the case.

The reason for the comparative simplicity is that Martin’s version does something that mine does not.

The problem we’ve both set out to solve is that the return types of map and flatMap are very inconsistent in the current API: Sometimes they’re overridden to return the right type, sometimes they’re not (sometimes they can’t be). This upsets people. Essentially there is the following requirement that isn’t being met by the current collections API:

map and flatMap on a collection should return the same type of collection

Martin responds by meeting that requirement. I respond by taking it away. The design of alt.collections was centered around the idea that map and flatMap would not attempt to return the “right” type but would instead always return a Traversable. It then set about making it easy to get the right type out at the end.

More than one person has responded to this design with “I’m sorry, but that’s horrible” or similar, so clearly a lot of people think that in this case the answer to the title question is “yes, we are”. But I still find it a rather compelling example of how much simpler you can make the solution by first simplifying the problem.


Many eyes make heavy work

We were talking in the office the other day about a fun little project for twitter: basically just looking at what pairs of hashtags get used together. After getting one and a half hours’ sleep last night, then waking up and being unable to get back to sleep, I had some time to kill, so I thought I’d throw it together.

Getting and munging the data into a form that gave tweet similarity was easy enough. But what to do with it then? The obvious thing to do is to visualise the resulting graph.
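
The munging step really is easy; counting co-occurring hashtag pairs is only a few lines. A sketch over tweets-as-strings:

  # Sketch: count how often each pair of hashtags appears together
  # in a single tweet.
  def hashtag_pairs(tweets)
    counts = Hash.new(0)
    tweets.each do |tweet|
      tags = tweet.scan(/#(\w+)/).flatten.uniq.sort
      tags.combination(2) { |pair| counts[pair] += 1 }
    end
    counts
  end

  hashtag_pairs(["I love #scala and #fp", "#fp #haskell ftw", "#scala #fp again"])
  # => {["fp", "scala"]=>2, ["fp", "haskell"]=>1}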

We have our own visualisation software at Trampoline (which I did try on this data; it does fine), but I wanted something smaller and more standalone. I’d heard people saying good things about IBM’s Many Eyes (in retrospect I may have to challenge them to duels for the sheer effrontery of this claim), so I thought I’d give it a go.

Let me start by saying that there is one single feature that would change my all-encompassing burning loathing for Many Eyes into a mild dislike. It alleges to have the ability to let you edit data you have uploaded. Except that the button is disabled, with the helpful tooltip “Editing of data is currently disabled”.

This renders the entire fucking site useless, because it turns what should be a trivial operation (editing the data you’ve uploaded to see how it changes the visualisation) into a massive ordeal. You need to create an entirely new dataset, label it, add it, recreate the visualisation…

Fortunately, recreating the visualisation isn’t that hard. After all, Many Eyes doesn’t actually give you any functionality with which to customise your visualisation (maybe it does for some of the other visualisation types; it sure doesn’t for the network viewer).

So why did I need to tinker with the data? Isn’t it just a question of upload a data set, feed it into IBM’s magic happy wonderground of visualisation and go “Oooooh” at the pretty pictures?

Well, it sort of is.

Actually what it is is upload the data set, feed it into IBM’s magic happy wonderground of visualisation and go “Aaaaaaaargh” as my browser grinds to a halt and then crashes.

It’s understandable really. I did just try to feed their crapplet an entire one point six MEGA bytes of data (omgz).

Wait, no. That’s not understandable at all. In fact it’s completely shit. That corresponds to about 12K nodes and about 60K edges. This is *not* a particularly large number (metascope happily lays it out in a few tens of seconds). This is a goddamn data visualisation tool. The whole point of it is that you’re supposed to be able to feed it large amounts of data. At the very least it should tell you “Sorry, I was written by monkeys so probably can’t handle the not particularly large amount of data you have just fed me”.

So, I spent some time trying to prune the data down to a size Many Eyes could manage without failing dismally, but where the graph was still large enough to be interesting. This was a hilarious process. Consider:

  • The only way to edit the data is to create an entirely new data set and recreate the visualisation.
  • The only way to determine that I’ve still got too much data is to observe that my browser crashes.

After about half a dozen iterations of this I decided enough was enough and declared Many Eyes to be an unusable piece of shite that was not worth my time. Life’s too short for bad software.


Interest bandwidth

Highlight from the #scala IRC channel:

13:21 < DRMacIver> I have a somewhat unfortunate theory. I suspect the amount one cares about programming issues is inversely proportional to how inherently interesting one's work is.
13:22 < DRMacIver> Or at least negatively correlated
13:22 < ijuma> DRMacIver: is this based on personal experience? ;)
13:22 < DRMacIver> Yes
13:22  * DRMacIver finds it much harder to get worked up about language issues these days
13:23 < DRMacIver> Which is in large part because I'm doing a lot more interesting borderline computer sciencey work
13:24 < dgreensp> yeah, I do a lot of interesting work and only rarely stop to think about languages
13:24 < ijuma> yeah, I noticed. I think it makes sense. People have limited bandwidth and if work is interesting, it's likely to take quite a bit of it
13:25 < DRMacIver> I think the other issue is that people want to find what they do interesting. And so if *what* they do isn't interesting they have to become interested in *how* they do it.
13:25 < DRMacIver> But yes, the bandwidth thing is also a big part of it
13:26 < dgreensp> bandwidth minimum, bandwidth maximum :)
13:26 < DRMacIver> ?
13:27 < ijuma> DRMacIver: agreed
13:27 < dgreensp> er, the two points seemed related -- people need to be interested in something, but they can't be interested in too many things at once
13:28 < DRMacIver> Oh, right.
13:28 < DRMacIver> Yes. That's a good way of looking at it.