Author Archives: david

When you have a hammer: Squish usage example

So I’m playing around with frequent item set mining, and the approach I’m trying (not that I expect it’s at all novel. I just want to get a feel for the problems, and maybe build a useful tool if I can’t find any that match my needs) needs a “getting started” step which works out the event list for all frequent one and two item sets.

So basically what I want is a map of tags to maps of tags to events. So my_map[t1] gives a map of tags to which events they cooccur with t1 in (this should include t1, so it also includes all events t1 occurs in as a special case).

Further I want to do this in a way that can be expected to process huge amounts of data (certainly gigabytes) without falling over.

I was thinking about how to do the problem in a sensible way, and then I realised “Ah ha! I can use squish for this!”

In fact the solution ends up using squish twice. Here’s the code:

awk '{
  line_count++;
  for(i = 1; i <= NF; i++) 
    for(j = 1; j <= NF; j++)
      print $(i) " " $(j) ":" line_count
}' |  sort | squish -d ":" -s ", " | sed 's/ /:/' | squish -d ":" -s "; " | sed 's/:/: /'

It takes input like this:

foo 
bar
foo bar
foo bar baz
baz
foo bar baz bif bing bloop blorp
blorp

And outputs data like this:

bar: bar:2, 3, 4, 6; baz:4, 6; bif:6; bing:6; bloop:6; blorp:6; foo:3, 4, 6
baz: bar:4, 6; baz:4, 5, 6; bif:6; bing:6; bloop:6; blorp:6; foo:4, 6
bif: bar:6; baz:6; bif:6; bing:6; bloop:6; blorp:6; foo:6
bing: bar:6; baz:6; bif:6; bing:6; bloop:6; blorp:6; foo:6
bloop: bar:6; baz:6; bif:6; bing:6; bloop:6; blorp:6; foo:6
blorp: bar:6; baz:6; bif:6; bing:6; bloop:6; blorp:6, 7; foo:6
foo: bar:3, 4, 6; baz:4, 6; bif:6; bing:6; bloop:6; blorp:6; foo:1, 3, 4, 6

How does it work?

Well, we first use awk to transform it into data that looks like this:

foo foo:1
bar bar:2
foo foo:3
foo bar:3
bar foo:3

giving us cooccurring pairs followed by an event they cooccur in.

We then sort it to get all cooccurrences next to eachother, and pass it through squish using : as the delimiter and , as the separator, which gives us something that looks like this:

bar bar:2, 3, 4, 6
bar baz:4, 6
bar bif:6
bar bing:6
bar bloop:6

A bit of sed (not really necessary, but helps for formatting) changes the character after the first tag, and now we can apply squish again with the new delimiter and use it to join lines together with the separator as ;. We then reformat slightly to get the desired end result.

This wouldn't be terribly complicated to do without squish, but squish made it basically trivial, and because sort scales reasonable well to large data and squish processes only in a pipeline, this too should handle as much data as you want to throw at it.

This entry was posted in Uncategorized on by .

An interesting idea for authentication

So one of my pet peeves is authentication. I absolutely hate having to remember a pile of different passwords. I would love everyone to use openid, but clearly it just isn’t happening. I can live with people using twitter and facebook and similar for auth, although I am rather disinclined to give you the ability to tweet as me so I can use your social cat tagging application (yes, I know about read only. No one is bloody using it), but even this doesn’t seem to get much uptake.

Mike and I were talking earlier, and we had an interesting idea. After talking it over with a few friends (Andy Bennett and Alaric Snell-Pym) we refined the idea a bit.

The idea is basically as follows:

  1. Login is done by dedicated personalised login links. It would be something like http(s)://mysite.com/login/some-big-random-string.
  2. When you sign up you’re given a page which says “this is your login link. Please bookmark it”
  3. You also provide your email address, and you can at any point say “I need a new login link. Please email it to me”. It will email you the new link, and the old one will be invalidated as soon as you click on it (it can’t be invalidated before that, because otherwise anyone can invalidate your link).

It should be simple to use – you just bookmark links for your site that are specific to you, which is no worse than storing passwords locally, and it can easily be invalidated by email equivalently to “I forgot my password” links people are familiar with. The security is not really any worse than normal password based authentication (there’s the potential to see it in the URL bar, but you just redirect away from it quickly and make it a long random string, which renders this basically not an issue).

This seem so painfully simple it’s astonishing no one’s doing it if there’s not a major flaw I’m missing.

What do you think?

This entry was posted in Uncategorized on by .

A possibly new unix-style utility

I’ve found a couple times in the past that I have the following problem:

I’ve got a bunch of records of the form

key1: value1
key1: value2
key2: value3
...

And I want to transform them into the form

key1: value1, value2
key2: value3
...

so combining consecutive lines with the same initial key and joining the values together on one line.

It’s not a terribly hard problem, but doing it quickly and in bounded amount of memory for very large volumes of data is at least non-trivial enough to require a bit of care.

In the past I generally solved this in some application specific way, but recently I decided to do it properly and have extracted a nice little command line utility from it. It accepts a sensible range of options for configuring behaviour, is reasonably fast (I haven’t benchmarked it very extensively, but on all data I’ve tried it on it’s about an order of magnitude faster than sorting the data, which I tend to want to do before feeding it into squish anyway, and about half the speed of uniq) and runs in memory that will never grow past O(size of largest key).

I didn’t have a good name for it, so I picked a bad one. It’s called “squish” and is available from my data-tools repository (aka “Where I put random crap”). If you have a better name for it I’m all ears.

It’s possible that this duplicates functionality of something that already exists. Anyone know if it does?

This entry was posted in Uncategorized on by .

On hiring (developers)

First off, we’re hiring. You should apply. Even if you don’t apply, check out the spec. This post will make more sense if you do, and I’d love it if you were to forward it to anyone you know who would fit.

We’ve yet to see if it will produce the right results, but one interesting thing has emerged from writing this job spec is that we’ve not done it as a collective. It’s largely been my doing, with editorial feedback from the rest of the team.

There are a couple consequences of this. Some good, some bad, some still in a superposition of the two.

we’re looking to hire someone who can do this job better than me

The part of it I especially liked is that the way the job requirements (both in the spec and as we’re viewing it) is that we’re looking at the person who currently is doing / would do the job and saying “You need to do the job better than him”. We’ve done this a bit less formally previously when hiring our new sysadmin, and it just sortof emerged naturally this time as a result of the fact that I was the one pushing the hardest to get this job spec done and posted. Whatever happens, I would very much want to retain this aspect of the process: Pick a person in the company who would do the job if you don’t hire for this role, make them responsible for speccing out the job and make them the primary decision maker on whether the person gets hired – other people get to veto, but they’re the one doing the initial approval by saying “Yes, I would rather have this job done by this candidate than done by (a perfect clone of) me”. I think this is very much the right way to do it.

15 years of BBQML

The fact that I wrote it brings with it a certain authorial style. This is both good and bad. If you’re reading this blog you probably are familiar with that style by now – whimsical, fairly abrasive and decidedly non-traditional. All of that comes through in the job posting.

Personally, I think that’s ok. Everyone in the company seems to have liked the posting (Well, they laughed. That’s good, right?), so enjoying the style is probably a good sign of a compatible personality. On the other hand it may turn out to be a bad thing – it may scare off some people who would actually work well for the role (I have at least one friend who is giving me a hard time because he thinks the posting is unprofessional and makes me sound like a wanker), on the other hand it may attract too many people who self-describe as rockstar ninja cyborg artist programmers.

If it turns out not to work, I will modify my style for postings in future, but I am hopeful.

we don’t want yet another [REDACTED]

The biggest feature of this posting that may or may not work in our favour is its honesty. It’s very much “This is who we are, this is what we’re looking for”. There are some tensions you can see behind it as a result. Will they scare people off? I hope not. As a company, we’re very much not perfect. No one is, and if you expect the company you’re applying to to be perfect then you’re in for a bit of a shock whomever they are. I hope that being up front about this is a good thing, and a nice departure from the whitewashed corporate-speak you get in a lot of job postings.

and…?

All in all, it’s been an interesting experiment. I look forward to seeing the results.

Let me know what you think.

This entry was posted in life, programming, Writing on by .

Northpaw, part 1

So I’ve bought myself a Northpaw kit.

Why? Well, partly because I’m interested in this sort of thing – different modes of perception of and interaction with information – but mostly because it looked cool and I was in a gadget buying mood.

After some grief from the royal mail it finally arrived on friday, and I spent some time today at the london hack space putting it together.

I don’t have anything very exciting to report so far: I basically got it to work, but mangled the assembly. The electronics turned out to be relatively easy – the results aren’t elegant, as I have the manual dexterity of a blind walrus and haven’t soldered in 12 years, but they seem to work. The positioning of the motors however turns out to be rather a challenge.

I found the ribbon extremely hard to configure correctly, and the power connection for it very hard to get right in a way that gave me any capability to position the motors correctly. I’m sure it’s doable, but I couldn’t do it. So the upshot is that I’m going to have to rebuild that part. Fortunately the motors all work and the electronics seems to talk to them correctly. My inclination is to take the ribbon mount off and wire it together with normal wires rather than ribbon.

Another update will be forthcoming when I’ve actually got the thing wearable.

This entry was posted in Uncategorized on by .