Category Archives: programming

Personal Code Writing Month (PeCoWriMo)

I was thinking about doing NaNoWriMo this year. Had a story plotted out and everything. I never quite got around to it though – I started writing, but didn’t really feel motivated to continue.

But it occurred to me that what I’m currently much more bothered about is not novel writing; it’s code writing. As you’ve probably noticed from the change in (and absence of) subject matter on this blog, I haven’t really been writing much code in my own time recently. Further, work is currently full of large piles of integration and way more ruby than I’d ever care to see again.

So I’ve decided to declare November to be my personal code writing month. I’m going to do something completely different. It might not be useful to me (although I actually did start from a practical problem when deciding to do this), but it’s going to be interesting and computer sciencey and at least possibly useful to someone.

Here’s the goal. I’m writing a library for neighbour searching (both “nearest” and “within epsilon”) in Haskell. The objective is that by the end of November it will be in a state that I am prepared to bless as a release. These are the requirements:

  1. It has to have an API I’m reasonably happy with blessing as “good”
  2. It has to be well enough tested that I can say with a reasonable degree of confidence that it has no obvious bugs
  3. It has to have at least a basic benchmark suite
  4. It has to perform well enough to be able to be used on lowish (say <= 10) dimensional data sets of a few hundred thousand points (I'd like it to do a lot better than that of course)
  5. It has to have a package up on Hackage (I might change my mind on this one and relax it to “it has to be ready to upload to Hackage”)

i.e. it has to be in a state where if someone goes “I want to do nearest neighbour search in Haskell!” (for some reason) I wouldn’t feel embarrassed to point them to this library.

How’s it doing so far? Well, the API is ok. Needs a bit of work, but it’d probably do for a first release if it had to.

The testing is actually pretty good. It needs to be improved to cover more interesting classes of metric spaces, but that’s a fairly simple matter which will take me all of half an hour to do when I get around to it (and then however long it takes to fix any bugs that uncovers).

The performance? Ha ha. It is to laugh. I started with the hyper optimised data structure of “keep every point in the index in a linked list” and using scanning and sorting respectively to do epsilon and nearest neighbour queries. I’m fully aware of how stupid this is of course, but the idea was that by first writing the API and tests I would then have the freedom to tinker.

I have now tinkered a little bit, and the result is an awesome system which has the properties of being both slow to build an index and slow to query the index. That’s fine. I knew it would be when I was writing it. It’s just to flush out a few of the details.
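For concreteness, the naive baseline is roughly this shape — a minimal sketch of my own in Haskell, not the library’s actual code:

```haskell
-- A deliberately naive neighbour-search baseline: the "index" is just
-- the list of points plus the metric. This is a sketch of the idea,
-- not the library's actual code.
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Points within distance eps of the query: a linear scan.
withinEpsilon :: (a -> a -> Double) -> Double -> a -> [a] -> [a]
withinEpsilon d eps q = filter (\p -> d q p <= eps)

-- Nearest neighbour: minimise distance over the whole list.
nearest :: (a -> a -> Double) -> a -> [a] -> Maybe a
nearest _ _ [] = Nothing
nearest d q ps = Just (minimumBy (comparing (d q)) ps)
```

The point of starting here is that the API and tests stay fixed while the data structure underneath changes; a real index would precompute something smarter so queries don’t have to touch every point.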

Anyway, the next step is to build the benchmark suite, and then I can start doing Computer Science to it.

This entry was posted in life, programming.

Generating random spheres

So when I was in the shower earlier I wondered to myself “Hm. How would I generate random points on an N-sphere?” (true story).

I thought about it for a bit and generated the two obvious solutions. Unfortunately the two obvious solutions are, respectively, wrong and massively inefficient (actually the inefficient one isn’t too bad for low dimensions, but I was interested in the high dimensional case where it rapidly becomes massively inefficient). Once out of the shower I looked the answer up, and found the whole thing interesting enough that I figured someone else might too, so this is a post about that.

Obvious but wrong solution

Generate N random uniform numbers between -1 and 1. Divide the resulting vector by its norm.

Why is this wrong? Well, the initial N random numbers lie uniformly in the N-cube, so the probability density at a given point on the sphere is proportional to the Nth power of the length of the line that goes from the origin, through that point, and out to the boundary of the N-cube. This length is not uniform, for the very simple reason that cubes aren’t spheres: a corner direction is √N times longer than a face direction, so points near the corner directions are heavily over-represented.

Obvious but slow solution

Generate N uniform random numbers between -1 and 1. Retry until the resulting norm is <= 1. Divide the resulting vector by its norm.

Why does this work? Because we’re using a rejection method to make sure we get points inside the unit ball. The result is uniformly distributed inside the unit ball, and in particular is spherically symmetrical. By dividing by the norm we’re then projecting that spherically symmetrical distribution onto the sphere and thus getting a uniform distribution on it.

Why is it slow? Because of the curse of dimensionality. When we do the rejection we’re effectively generating a geometric distribution whose probability of success is the proportion of the N-cube that the N-ball occupies. This means we expect to perform the generation 1/p times, where p is that proportion, and this tends towards infinity quite rapidly as N gets large.
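You can quantify the blow-up: the volume of the unit N-ball satisfies the recurrence V(0) = 1, V(1) = 2, V(n) = (2π/n) V(n-2), so the acceptance probability is V(n)/2^n and the expected number of attempts is its reciprocal. A quick sketch (mine, not from the post):

```haskell
-- Volume of the unit n-ball via the standard recurrence
-- V(0) = 1, V(1) = 2, V(n) = (2 * pi / n) * V(n - 2).
ballVolume :: Int -> Double
ballVolume 0 = 1
ballVolume 1 = 2
ballVolume n = (2 * pi / fromIntegral n) * ballVolume (n - 2)

-- Expected number of rejection attempts: the cube [-1, 1]^n has
-- volume 2^n, so the acceptance probability is ballVolume n / 2^n.
expectedTries :: Int -> Double
expectedTries n = 2 ^ n / ballVolume n
```

In 3 dimensions that’s about 1.9 attempts per point; by 10 dimensions it’s already around 400, and by 20 it’s over 40 million.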

Obvious in retrospect solution

At this point I looked up the answer, though I should have been able to figure it out.

What we want is a spherically symmetric distribution a la the previous solution but which is more efficient to calculate.

It turns out that a good choice for this is an N-dimensional normal distribution, which can be obtained by simply generating N Normal(0, 1) random variables. Because the joint density of those variables depends only on the norm of the vector, the distribution is spherically symmetric, so you can divide the resulting vector by its norm and get a uniform distribution on the unit sphere.

This is apparently not the best solution (generating normal random numbers is a little expensive), but I couldn’t find a reference to a version which works in N dimensions for any N.
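To make it concrete, here’s a minimal Haskell sketch of the method (my own illustration). The toy LCG is only there to keep the example dependency-free — in real code you’d use a proper generator:

```haskell
-- Uniform point on the unit sphere in n dimensions: draw n standard
-- normals and normalise the vector. A sketch of the method; the toy
-- LCG below stands in for a real RNG so this needs only base.

type Seed = Int

-- Knuth's MMIX LCG constants; Int arithmetic wraps, which is fine here.
uniform :: Seed -> (Double, Seed)
uniform s =
  let s' = 6364136223846793005 * s + 1442695040888963407
      u  = fromIntegral (s' `mod` 1000000007) / 1000000007
  in (u, s')

-- One standard normal from two uniforms (Box-Muller transform).
normal :: Seed -> (Double, Seed)
normal s0 =
  let (u1, s1) = uniform s0
      (u2, s2) = uniform s1
  in (sqrt (-2 * log (max 1e-12 u1)) * cos (2 * pi * u2), s2)

normals :: Int -> Seed -> ([Double], Seed)
normals 0 s = ([], s)
normals n s =
  let (x, s')   = normal s
      (xs, s'') = normals (n - 1) s'
  in (x : xs, s'')

-- n independent normals give a spherically symmetric distribution,
-- so dividing by the norm lands uniformly on the sphere.
onSphere :: Int -> Seed -> ([Double], Seed)
onSphere n s =
  let (xs, s') = normals n s
      r = sqrt (sum (map (^ 2) xs))
  in (map (/ r) xs, s')
```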

Here’s some C code implementing this: https://gist.github.com/1354700.

This entry was posted in Numbers are hard, programming.

(Runtime) Cost-free abstractions in clay

My own-time programming has been way down recently, which is a bit sad. Much of what I have been doing has been in clay though – partly a project of my own, partly contributing to the standard library (there’s been a cycle of “own project; ooh, this bit could possibly be in the standard library; let’s flesh it out…”, resulting in probably more time spent on the latter than on my own).

One of my recent contributions has been to custom sorting and ordering. My starting point was simple: I wanted to be able to sort by something other than the natural order (in my particular use case I had a scoring function and wanted to sort things from “best” to “worst”).

I started out in the obvious way, and just let it take an argument that defined a custom < for the type – essentially just a comparator function argument. This was fine, and works perfectly naturally in clay (yes, clay is a systems language, but it’s got a perfectly reasonable lambda system. The capturing rules are a little odd, but for things where you don’t have to capture state it Just Works). I then took a leaf out of Haskell’s book and defined a function comparing. In Haskell speak, the version I defined has the signature comparing :: (Ord b) => (a -> b) -> (a -> a -> Bool). (Haskell’s own Data.Ord.comparing returns an Ordering rather than a Bool, but this is the equivalent for a less-than-style comparator.)

This all worked fine. But I found it a little annoyingly specific. What if I wanted to use <= in the sort? Ok, sure, you can do x <= y as not(x > y), but that’s a little awkward. And then what if I wanted to do ==? Again, you can define it in terms of <=, but that’s a bit rubbish.

Thus was born the comparators API. It’s very straightforward. Essentially the idea is that all of the standard functions that correspond to the comparison operators get overloaded with a three-argument version where the first argument is a “Comparator” type (in clay this currently just means that Comparator?(T) returns true. There’s no true type-class system, only compile-time predicates which test types).

Having the comparator types, it was natural to define combinators for them. So far we’ve only got the basic ones (comparing, natural, and reversed), but it would be highly reasonable to define more.
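In Haskell terms the combinators look something like this – a sketch of the idea only (the real clay API overloads the comparison operators on comparator values rather than returning predicates, but the combinators compose the same way):

```haskell
-- A comparator sketched as a "less than" predicate. The clay versions
-- are types with overloaded operators; this is an illustration only.
type Comparator a = a -> a -> Bool

-- The type's own ordering.
natural :: Ord a => Comparator a
natural = (<)

-- Compare by the image of a projection (e.g. a scoring function).
comparing :: Ord b => (a -> b) -> Comparator a
comparing f x y = f x < f y

-- Flip the sense of an existing comparator.
reversed :: Comparator a -> Comparator a
reversed cmp x y = cmp y x
```

So reversed (comparing score) sorts from best score to worst, which was the original use case.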

But, I asked, you really want these operations to be fast – a sort lives or dies on how fast its comparisons are, for example. How much overhead does using the comparators API introduce?

Err, none, it turns out. At least if you use the standard comparators with procedures for arguments (and most of the time when you use lambdas too). I’m afraid I’ve lost the examples, but I did some experimenting, and it turns out that with things like reversed(comparing(size)) and similar, clay actually just gives this a unique type which requires no data to represent, and type specialisation and inlining take over and the comparator literally completely disappears at runtime: I compared assembly output from using the comparator and writing the same code by hand, and they were literally byte identical.

In most languages (yes guys, I know you can do this in C++, or with a whole-program optimising compiler like MLton), introducing an abstraction like this comes with a performance hit. Not a lot in many cases, but maybe you’re passing a dictionary at runtime, or introducing boxing overhead, or even just having to pass a token around which makes function calls slightly more expensive. Here you can introduce the abstraction, which can clean your code up a lot, without any runtime difference from the specialised code you would have written by hand. That’s pretty cool.

This entry was posted in programming.

On hiring (developers)

First off, we’re hiring. You should apply. Even if you don’t apply, check out the spec. This post will make more sense if you do, and I’d love it if you were to forward it to anyone you know who would fit.

We’ve yet to see if it will produce the right results, but one interesting thing that has emerged from writing this job spec is that we didn’t do it as a collective. It’s largely been my doing, with editorial feedback from the rest of the team.

There are a couple of consequences of this. Some good, some bad, some still in a superposition of the two.

we’re looking to hire someone who can do this job better than me

The part of it I especially liked is the way the job requirements are framed (both in the spec and in how we’re viewing it): we look at the person who currently is doing / would do the job and say “You need to do the job better than him”. We did this a bit less formally when hiring our new sysadmin, and it just sort of emerged naturally this time because I was the one pushing the hardest to get this job spec done and posted. Whatever happens, I would very much want to retain this aspect of the process: pick a person in the company who would do the job if you don’t hire for this role, make them responsible for speccing out the job, and make them the primary decision maker on whether a candidate gets hired. Other people get to veto, but they’re the one giving the initial approval by saying “Yes, I would rather have this job done by this candidate than done by (a perfect clone of) me”. I think this is very much the right way to do it.

15 years of BBQML

The fact that I wrote it brings with it a certain authorial style. This is both good and bad. If you’re reading this blog you probably are familiar with that style by now – whimsical, fairly abrasive and decidedly non-traditional. All of that comes through in the job posting.

Personally, I think that’s ok. Everyone in the company seems to have liked the posting (well, they laughed. That’s good, right?), so enjoying the style is probably a good sign of a compatible personality. On the other hand it may turn out to be a bad thing: it may scare off some people who would actually work well in the role (I have at least one friend who is giving me a hard time because he thinks the posting is unprofessional and makes me sound like a wanker), or it may attract too many people who self-describe as rockstar ninja cyborg artist programmers.

If it turns out not to work, I will modify my style for postings in future, but I am hopeful.

we don’t want yet another [REDACTED]

The biggest feature of this posting that may or may not work in our favour is its honesty. It’s very much “This is who we are, this is what we’re looking for”. There are some tensions you can see behind it as a result. Will they scare people off? I hope not. As a company, we’re very much not perfect. No one is, and if you expect the company you’re applying to to be perfect then you’re in for a bit of a shock whoever they are. I hope that being up front about this is a good thing, and a nice departure from the whitewashed corporate-speak you get in a lot of job postings.

and…?

All in all, it’s been an interesting experiment. I look forward to seeing the results.

Let me know what you think.

This entry was posted in life, programming, Writing.

The role of a CTO

IRC conversation with Al Davidson

< DRMacIver> I must admit to only having a relatively vague idea of what a CTO job involves (and a much stronger conviction that I don’t want one any time soon).
< DRMacIver> As I’m fairly sure that craig isn’t typical of the role.
< al_> i think you’re right on that score
< DRMacIver> My impression is that it seems to be somewhere more in the realm of “You’re supposed to get to work on the fun stuff but so much of your time is taken up by boring administrative crap that you don’t get to, but it’s still your fault if it goes wrong”
< al_> bingo
< DRMacIver> I see the old chestnut of “take rosy eyed view of the world and apply cynicism until it starts to weep” still works.
< al_> david, i think you just described the human experience in a nutshell

This entry was posted in life, programming.