I was thinking about doing NaNoWriMo this year. Had a story plotted out and everything. I never quite got around to it though – I started writing, but didn’t really feel motivated to continue.
But what occurred to me is that what I’m currently actually much more bothered about is not novel writing, it’s code writing. As you’ve probably noticed from the change in and absence of subject matter on this blog I haven’t really been writing much code on my own time recently. Further, work is currently full of large piles of integration and way more ruby than I’d ever care to see again.
So I’ve decided to declare November to be my personal code writing month. I’m going to do something completely different. It might not be useful to me (although I actually did start from a practical problem when deciding to do this), but it’s going to be interesting and computer sciencey and at least possibly useful to someone.
Here’s the goal. I’m writing a library for neighbour searching (both “nearest” and “within epsilon”) in Haskell. The objective is that by the end of November it will be in a state that I am prepared to bless as a release. These are the requirements:
- It has to have an API I’m reasonably happy with blessing as “good”
- It has to be well enough tested that I can say with a reasonable degree of confidence that it has no obvious bugs
- It has to have at least a basic benchmark suite
- It has to perform well enough to be able to be used on lowish (say <= 10) dimensional data sets of a few hundred thousand points (I'd like it to do a lot better than that of course)
- It has to have a package up on Hackage (I might change my mind on this one and relax it to “it has to be ready to upload to hackage”)
i.e. it has to be in a state where if someone goes “I want to do nearest neighbour search in Haskell!” (for some reason) I wouldn’t feel embarrassed to point them to this library.
How’s it doing so far? Well, the API is ok. Needs a bit of work, but it’d probably do for a first release if it had to.
The testing is actually pretty good. It needs to be improved to cover more interesting classes of metric spaces, but that’s a fairly simple matter which will take me all of half an hour to do when I get around to it (and then however long it takes to fix any bugs that uncovers).
The performance? Ha ha. It is to laugh. I started with the hyper optimised data structure of “keep every point in the index in a linked list” and using scanning and sorting respectively to do epsilon and nearest neighbour queries. I’m fully aware of how stupid this is of course, but the idea was that by first writing the API and tests I would then have the freedom to tinker.
I have now tinkered a little bit, and the result is an awesome system which has the properties of being both slow to build an index and slow to query the index. That’s fine. I knew it would be when I was writing it. It’s just to flush out a few of the details.
Anyway, the next step is to build the benchmark suite, and then I can start doing Computer Science to it.