
A Vantage Point Tree implementation in (shudder) Java

My friend Dave Stark has an ongoing project to write an OCR engine in Java called Longan. It’s still very much a work in progress, but I’ve been enjoying hearing about that progress and it sounds like it’s going well.

He’s currently experimenting with a new backend for the recognition based on a metric search approach instead of the neural networks he’s been using so far and mentioned that it was currently rather slow. So I was all “Ooh! Vantage Point Trees! You should use Vantage Point Trees! They’re fun and awesome. Here, I will write you one!”

Err, so I did. It took me about two hours all told, plus some tinkering afterwards to fix the bits I was ashamed of. It’s a reasonably complete implementation: it does sensible center selection when building the tree, it buckets leaves into small numbers of points rather than making you do distance comparisons all the way down to the bottom, and it tracks the radius of subtrees to let you quickly eliminate subtrees which aren’t going to hit at all.

Currently supported API:

  • Given a metric and a collection of points, build a vantage tree
  • Given a vantage tree and a point p find all points in the tree within epsilon of p
  • Given a vantage tree and a point p find the n nearest points to p in the tree

In particular there are no mutation operations on the tree to speak of, mostly because I couldn’t be bothered.
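To give a flavour of how the search part works, here’s a rough sketch of the structure and the epsilon query – in Haskell rather than Java, since that’s the language code on this blog tends to be in, and with names and details that are mine rather than the library’s:

   type Metric a = a -> a -> Double

   data VPTree a
      = Leaf [a]         -- leaves are small buckets of points
      | Node a           -- the vantage point
             Double      -- median distance: splits near children from far
             Double      -- radius: max distance from the vantage point
                         -- to anything in this subtree
             (VPTree a)  -- near subtree
             (VPTree a)  -- far subtree

   -- All points within e of x. The radius check is what lets us throw
   -- away entire subtrees without looking inside them.
   withinEpsilon :: Metric a -> VPTree a -> a -> Double -> [a]
   withinEpsilon d (Leaf ps) x e = filter (\p -> d x p < e) ps
   withinEpsilon d (Node c mu r near far) x e
      | dc >= r + e = []  -- by the triangle inequality, nothing here can hit
      | otherwise =
           [c | dc < e]
              ++ (if dc < mu + e then withinEpsilon d near x e else [])
              ++ (if dc > mu - e then withinEpsilon d far x e else [])
      where dc = d x c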

It’s currently what you might call "lightly tested", in the sense that there isn’t even the pretence of a test suite. I’ve tried it and it seems to work (and my trying it was thorough enough that I found and fixed various bugs), but I haven’t got around to writing proper tests for it. The performance is surprisingly good – on the order of a few ms per query against a million points (reasonably well distributed ones – I haven’t tried it on less favourable cases).

Although it works, it should probably be considered more of a tech demo than a real library. I’m happy to accept patches and I may fix reported bugs, but I’m not likely to otherwise maintain it and I’m certainly not going to sort out packaging for it. That said, if you would find such a thing useful, be my guest: take it, run with it and otherwise do whatever you want with it.


A cute lower bound for weighted FAS tournaments

As mentioned, for my random hack project du jour I am currently working on an optimiser for weighted feedback arc sets.

This problem can largely be summarised as follows: I have a non-negative weight matrix \(A\) and I want to find a permutation \(\pi\) that maximizes the score

$$w(\pi) = \sum_{\pi(i) < \pi(j)} A_{ij}$$

(It's more normal to phrase this in terms of minimizing the sum going the other way, but the two problems are equivalent.)

The thing I was wondering about recently: what sort of scores am I guaranteed to be able to achieve? There's the trivial lower bound

$$w(\pi) \geq \sum_{i < j} \min(A_{ij}, A_{ji})$$

but, equally trivially, whenever \(A\) is not a constant matrix this is easy to exceed. So my question is: what values are we guaranteed to be able to exceed? I realized earlier that there's a really nice answer to this.

Let \(\chi\) be a random variable that picks a permutation uniformly at random. Then we can write

$$w(\chi) = \sum_{i < j} S_{ij}$$

where \(S_{ij}\) is a random variable that takes the value \(A_{ij}\) if \(\chi(i) < \chi(j)\) and the value \(A_{ji}\) otherwise. It's easy to see that \(S_{ij}\) takes its two values with equal probability, so

$$E(S_{ij}) = \frac{1}{2}(A_{ij} + A_{ji})$$

But then we have

$$E(w(\chi)) = E\left(\sum_{i < j} S_{ij}\right) = \sum_{i < j} E(S_{ij}) = \frac{1}{2} \sum_{i \neq j} A_{ij}$$

Given that the expected weight of a random permutation is this value, there is certainly some permutation with weight at least this value.

This is nice and all, but it turns out to be relatively easy to do better. Consider \(S_{ij}\) and \(S_{kl}\). You can prove by inspection of cases that these are independent of each other (note that the full collection of the \(S_{ij}\) is not mutually independent - even subsets of three of them typically aren't; this only works for pairs). This is enough that the variances add, so

$$Var(w(\chi)) = \sum_{i < j} Var(S_{ij})$$

It's a simple calculation to show that

$$Var(S_{ij}) = \frac{1}{4} (A_{ij} - A_{ji})^2$$

and therefore

$$Var(w(\chi)) = \frac{1}{4} \sum_{i < j} (A_{ij} - A_{ji})^2$$

Why is this important? Well, any random variable must take some value at least one standard deviation away from its mean (if it never did, its variance would be smaller than it is), and the distribution of \(w(\chi)\) is symmetric about its mean (you can see this by considering what happens under \(\pi \to \tau \cdot \pi\) where \(\tau\) is the order-reversing permutation). Therefore it must take some value which is at least the mean plus the standard deviation, i.e. we can always find a permutation \(\pi\) with

$$w(\pi) \geq \frac{1}{2}\left(\sum_{i \neq j} A_{ij} + \sqrt{\sum_{i < j} (A_{ij} - A_{ji})^2} \right)$$

(There's a small code sketch of this bound after the closing notes.)

Some closing notes:

  1. Yay for mathjax. This is my first post using it
  2. My optimizer pretty consistently exceeds this bound anyway (it would be sad if it were doing worse than moderately determined random sampling), but I’ve now added a first pass which shuffles until it does.
  3. I’m nearly certain this bound is not tight except for in the trivial case of the constant matrix. When plotting graphs of the distribution of \(w(\chi)\) the result has a distinctly bell curved nature (although obviously it’s actually discrete) which drops off smoothly rather than cutting off at the SD. However calculating more about the distribution of random sampling is hard, and I suspect may not be that useful, so for now I’m leaving it at this.
  4. I have no idea if this is a known result, but it’s pretty trivial to calculate once you have the basic idea, so it’s likely it’s one of those things that has been discovered several times but wasn’t regarded as interesting enough to publish. Yay for blogs
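And here's that sketch: the bound above as a quick Haskell snippet (the function name and the list-of-rows matrix representation are my choices, not anything from the optimiser):

   -- Computes (sum of A_ij + sqrt (sum over i<j of (A_ij - A_ji)^2)) / 2,
   -- i.e. the mean score of a uniformly random permutation plus one
   -- standard deviation.
   guaranteedScore :: [[Double]] -> Double
   guaranteedScore a = (offDiagonal + sqrt squaredSkew) / 2
      where n           = length a
            offDiagonal = sum [a !! i !! j | i <- [0 .. n - 1], j <- [0 .. n - 1], i /= j]
            squaredSkew = sum [(a !! i !! j - a !! j !! i) ^ 2
                              | i <- [0 .. n - 1], j <- [i + 1 .. n - 1]]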

PeCoWriMo derail

Well, I got slightly distracted from my PeCoWriMo project. No particularly good reason – I was just feeling a bit disenchanted with the subject matter, and Haskell was annoying me slightly (as usual). Neither was enough to actually derail me on its own, but I got to thinking about a problem I'd worked on before and decided to do some more work on it.

And in keeping with the mild irritation with Haskell and the “oh for the love of god anything but more ruby” I decided to do it in another nicer, friendlier language.

Err, that is to say, C.

The problem this is working on is that of finding a minimal feedback arc set in a weighted graph (although I’ve actually posed it as a maximization problem). This boils down to the following problem:

Given \(N\) points and a weight function \(w(x, y)\), find an enumeration \(x_1, \ldots, x_N\) that maximizes $$\sum_{i < j} w(x_i, x_j)$$

The main application I care about is voting, but I'm sure it has other uses. Code is available on github
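To spell the objective out in code, here's the score function as a quick Haskell sketch (the project itself is in C, and the name is mine):

   import Data.List (tails)

   -- Sum w x y over all pairs where x appears before y in the enumeration.
   score :: (a -> a -> Double) -> [a] -> Double
   score w xs = sum [w x y | (x : rest) <- tails xs, y <- rest]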


Notes on design of a Haskell API

As previously mentioned, I’m currently writing a little Haskell library. Here are some notes on the API’s many iterations.

Background

The primary feature of this API is building a search index for things in a metric space.

What’s a metric space? It’s essentially points together with a “distance function” d, satisfying the following rules:

  1. d(x, y) == 0 iff x == y
  2. d(x, y) == d(y, x)
  3. d(x, z) <= d(x, y) + d(y, z) [this is called the triangle inequality]

And the main operations we want the search index to be able to support are:

  1. Given a metric and a set of points, build an index
  2. Given an index I, a point x and a distance e, give me all points y in I such that d(x, y) < e

The question I will be looking at here is what the right design for the metric API is.

Attempt 1: Passing around a distance function

This was pretty straightforward. A metric is represented as a function with the following type signature:

type Metric a = a -> a -> Double

and Indexes have the signature

data MetricSearchIndex a

You pass such a function in when creating the search index. Done.
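For concreteness, use looks something like this (euclidean is my own example, and buildIndex is a made-up name standing in for whatever the real constructor is):

   -- a metric is just a function:
   euclidean :: Metric (Double, Double)
   euclidean (x1, y1) (x2, y2) = sqrt ((x1 - x2) ^ 2 + (y1 - y2) ^ 2)

   -- which you hand over at construction time:
   --    buildIndex :: Metric a -> [a] -> MetricSearchIndex a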

Unfortunately, this turns out not to be adequate for my goals.

Why?

Well, one of the things I want to be able to do (I've not done it yet) is write a command line interface to this program. In order to do that, I really need to be able to save model files, and for that I'd like to use Data.Binary. It's pleasant to use and produces very clean binary formats which, if you had to, would be easy to write a manual parser for in whatever language you wanted – unlike many other serialization libraries.

But unfortunately it can’t serialize functions. And really that’s not surprising – they’re far too opaque to do anything about.

So, the explicit function passing was out.

Metric type class, take one

The next obvious thing to do is to make Metric a type class with the following signature:

   class Metric a where
      distance :: a -> a -> Double

This worked reasonably well. At some point along the way I realised that there was actually another really useful function on metric spaces, so I added it and the class became the following:

   class Metric a where
      distance :: a -> a -> Double
      bound :: a -> Double -> Double -> Double
      bound _ = (+)

What's the bound method for? Well, remember the triangle inequality? It's pretty useful. In particular it lets us deal with the following case:

Suppose we have a point c and a distance r, and write B(c, r) for the set of points y with d(c, y) < r. Suppose our query is for everything in the ball B(x, e). Then we can completely eliminate everything in B(c, r) if d(x, c) >= r + e.

Why? Because if there were some point y in both balls then we would have d(x, c) <= d(x, y) + d(c, y) < r + e by the triangle inequality. So the triangle inequality lets us aggressively prune entire branches of the tree without looking at them. That's the basic idea behind the sort of metric search we're doing.

But it turns out that there's an interesting class of metrics which satisfy a much stronger bound: d(x, z) <= max(d(x, y), d(y, z)). These are called ultrametrics. When performing searches on things we know to be ultrametrics we can prune the tree even more aggressively. Rather than having an "isAnUltrametric" boolean function, capturing this property as the bound method seemed more sensible. The rule to be satisfied is then:

   distance x z <= bound x (distance x y) (distance y z)

What's the x parameter there for? It's a dummy. You need something of type a to dispatch the function on, but its value shouldn't be used. A bit ugly I know...
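As an illustration (my own example, not one from the library): the longest-common-prefix distance on strings, where d(x, y) = 2^(-length of the common prefix) for x /= y, is a classic ultrametric, and its instance overrides bound while duly ignoring the dummy point:

   newtype Name = Name String

   instance Metric Name where
      distance (Name x) (Name y)
         | x == y    = 0
         | otherwise = 2 ^^ negate (length (takeWhile id (zipWith (==) x y)))
      -- the stronger ultrametric bound
      bound _ = max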

So, anyway, this type class now largely works. What's wrong with it?

Going back to the command line interface: how I wanted it to work was to index finite dimensional vectors in an l^p space, with p as a runtime parameter. That means p can't be part of the type, so you want a single LP point type covering all values of p. But then you have to attach the value of p to every value of that type, two vectors need to check whether they're being compared with different values of p, and so on. The result is slower, more irritating to work with and has a higher memory footprint. Premature optimisation? Maybe. But when what you're designing is a library it's important to get the performance decisions right. And let's definitely not discount the "more irritating to work with" part.

Metric class, now with multiple parameters

So we're going back to something like the first version, where we passed an explicit metric around. The two constraints that version didn't satisfy are:

  1. We need (in principle) to be able to serialize metrics
  2. We need to include the bound function in it

Metrics need to be able to encode arbitrary functions, so in order to serialize them we're effectively going to have to defunctionalize: different metrics get encoded as different types. So this is how we do it with a multiparameter type class:

   class Metric m a where
      distance :: m -> a -> a -> Double
      bound :: m -> Double -> Double -> Double
      bound _ = (+)

This changes the metric search tree to encode the metric type in its type parameter:

data MetricSearchTree m a

What's the problem with this?

Well, it doesn't work. The problem is that because the type "a" does not appear anywhere in the signature of bound, the compiler has no way to tell which instance of the class a call to bound should use. The signature needs to be

   class Metric m a where
      distance :: m -> a -> a -> Double
      bound :: m -> a -> Double -> Double -> Double
      bound _ _ = (+)

This was really irritating to use as well as being aesthetically displeasing, so I dismissed it.
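To see the ambiguity concretely, here's a hypothetical pair of instances of my own devising. Nothing stops two instances from sharing the same metric type, and then a call to the dummy-free bound has no way to choose between them:

   {-# LANGUAGE MultiParamTypeClasses #-}

   data Euclidean = Euclidean

   instance Metric Euclidean [Double] where
      distance _ xs ys = sqrt (sum [(x - y) ^ 2 | (x, y) <- zip xs ys])

   instance Metric Euclidean (Double, Double) where
      distance _ (x1, y1) (x2, y2) = sqrt ((x1 - x2) ^ 2 + (y1 - y2) ^ 2)

   -- With bound :: m -> Double -> Double -> Double, the call
   --    bound Euclidean 1 2
   -- could belong to either instance, so GHC rejects it as ambiguous.
   -- The extra dummy argument is what resolves it.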

At this point there is a correct fix, which if you're familiar with Haskell you may well have already spotted, but I went with a wrong one at first. Which leads us to

Metric class with an associated type synonym

This uses associated type synonyms to express what the points of the metric space are.

   class Metric m where
      type Point m
      distance :: m -> Point m -> Point m -> Double
      bound :: m -> Double -> Double -> Double
      bound _ = (+)

The metric search trees become parameterized by the metric alone.

data MetricSearchTree m
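An instance under this scheme looks something like this (Euclidean2D is my own illustrative type):

   {-# LANGUAGE TypeFamilies #-}

   data Euclidean2D = Euclidean2D

   instance Metric Euclidean2D where
      type Point Euclidean2D = (Double, Double)
      distance _ (x1, y1) (x2, y2) = sqrt ((x1 - x2) ^ 2 + (y1 - y2) ^ 2)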

This was relatively pleasant to use. What was the problem?

Maybe this was me doing it wrong, but I found it almost impossible to write generic code with this.

For context: I have a test suite which is a bunch of quickcheck tests parameterized by the metric and the type of point in the metric space. I need to be able to generate arbitrary lists of points in that metric space, as well as have them implement Eq and Show for testing.

I simply couldn't make this work, near as I can tell. Writing a type constraint as (Metric m, Arbitrary (Point m)) => fails with "no such type constructor Point". Letting GHC infer the type results in it being unable to infer the relevant constraints. I struggled for a good 40 minutes or so, with the help of a few people in #haskell (though no one who would admit to knowing the associated types feature very well), before giving up. I think I've since figured out how to do it (you can introduce a type variable and add a constraint that the type variable is equal to the associated type synonym - maybe; I've not tried it), but the eventual solution was nicer anyway.
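For reference, the trick I have in mind would look something like this (untested, as I said, so treat it as a sketch):

   {-# LANGUAGE TypeFamilies #-}

   -- Introduce a fresh type variable p and pin it to the associated type
   -- synonym with an equality constraint; p can then appear in ordinary
   -- constraints like Arbitrary p or Eq p.
   prop_symmetric :: (Metric m, p ~ Point m) => m -> p -> p -> Bool
   prop_symmetric m x y = distance m x y == distance m y x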

So, where to now?

Metric class with a functional dependency

It turns out that there's a feature that's exactly for the sort of problem I ran into earlier. It's not one I'd used before, and I'd got the impression that it was generally recommended that you use associated types instead (I don't know if this is correct), but it turned out to be easy to understand and straightforward to use.

   class Metric m a | m -> a where
      distance :: m -> a -> a -> Double
      bound :: m -> Double -> Double -> Double
      bound _ = (+)

What does this do?

It says "For each type m, only one type a is allowed to have an instance of this". This means that you don't need the a in the signature of bound to deduce the full instance declaration because once you know m there's only one valid instance. It's very much like the associated type but implemented in a way that means you can easily constrain the type.
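As an illustration, here's roughly what the l^p metric that motivated all this looks like under this version (LP is my own naming for a hypothetical type, not necessarily what's in the library):

   {-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}

   -- p is carried as a value, not a type, which is exactly what the
   -- command line interface needs.
   newtype LP = LP Double

   instance Metric LP [Double] where
      distance (LP p) xs ys =
         sum [abs (x - y) ** p | (x, y) <- zip xs ys] ** (1 / p)

Because the dependency says that LP alone determines the point type, a call like bound (LP 2) 1 2 is unambiguous without any dummy point argument.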

I've now rewritten all the code to use this version. I'm relatively happy with it: It meets all the requirements I've had so far and isn't especially cumbersome to use. The process of getting here was pretty annoying though.


Personal Code Writing Month (PeCoWriMo)

I was thinking about doing NaNoWriMo this year. Had a story plotted out and everything. I never quite got around to it though – I started writing, but didn’t really feel motivated to continue.

But what occurred to me is that what I’m currently much more bothered about is not novel writing, it’s code writing. As you’ve probably noticed from the change in (and absence of) subject matter on this blog, I haven’t really been writing much code on my own time recently. Further, work is currently full of large piles of integration and way more ruby than I’d ever care to see again.

So I’ve decided to declare November to be my personal code writing month. I’m going to do something completely different. It might not be useful to me (although I actually did start from a practical problem when deciding to do this), but it’s going to be interesting and computer sciencey and at least possibly useful to someone.

Here’s the goal. I’m writing a library for neighbour searching (both “nearest” and “within epsilon”) in Haskell. The objective is that by the end of November it will be in a state that I am prepared to bless as a release. These are the requirements:

  1. It has to have an API I’m reasonably happy with blessing as “good”
  2. It has to be well enough tested that I can say with a reasonable degree of confidence that it has no obvious bugs
  3. It has to have at least a basic benchmark suite
  4. It has to perform well enough to be able to be used on lowish (say <= 10) dimensional data sets of a few hundred thousand points (I'd like it to do a lot better than that of course)
  5. It has to have a package up on Hackage (I might change my mind on this one and relax it to “it has to be ready to upload to hackage”)

i.e. it has to be in a state where if someone goes “I want to do nearest neighbour search in Haskell!” (for some reason) I wouldn’t feel embarrassed to point them to this library.

How’s it doing so far? Well, the API is ok. Needs a bit of work, but it’d probably do for a first release if it had to.

The testing is actually pretty good. It needs to be improved to cover more interesting classes of metric spaces, but that’s a fairly simple matter which will take me all of half an hour to do when I get around to it (and then however long it takes to fix any bugs that uncovers).

The performance? Ha ha. It is to laugh. I started with the hyper-optimised data structure of "keep every point in the index in a linked list", using scanning and sorting respectively to do epsilon and nearest neighbour queries. I’m fully aware of how stupid this is of course, but the idea was that by first writing the API and tests I would then have the freedom to tinker.
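For the record, that first version amounts to something like this (the names are mine):

   import Data.List (sortBy)
   import Data.Ord (comparing)

   type Metric a = a -> a -> Double

   -- epsilon query: scan every point
   withinEpsilon :: Metric a -> [a] -> a -> Double -> [a]
   withinEpsilon d points x e = filter (\y -> d x y < e) points

   -- n nearest neighbours: sort every point by distance
   nearestN :: Metric a -> [a] -> a -> Int -> [a]
   nearestN d points x n = take n (sortBy (comparing (d x)) points)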

I have now tinkered a little bit, and the result is an awesome system which has the properties of being both slow to build an index and slow to query the index. That’s fine. I knew it would be when I was writing it. It’s just to flush out a few of the details.

Anyway, the next step is to build the benchmark suite, and then I can start doing Computer Science to it.
