The primary feature of this API is building a search index for things in a metric space.
What’s a metric space? It’s essentially points together with a “distance function” d, satisfying the following rules:
- d(x, y) == 0 iff x == y
- d(x, y) == d(y, x)
- d(x, z) <= d(x, y) + d(y, z) [this is called the triangle inequality]
And the main operations we want the search index to be able to support are:
- Given a metric and a set of points, build an index
- Given an index I, a point x and a distance e, give me all points y in I such that d(x, y) < e
The question I will be looking at here is what the right design for the metric API is.
Attempt 1: Passing around a distance function
This was pretty straightforward. Metrics are represented as a function with the following type signature:
type Metric a = a -> a -> Double
and Indexes have the signature
data MetricSearchIndex a
You pass such a function in when creating the search index. Done.
Unfortunately, this turns out not to be adequate for my goals.
Well, one of the things I want to be able to do (I’ve not done it yet) is write a command line interface to this program. In order to do that, I really need to be able to save model files. In order to do that I’d like to use Data.Binary. It’s pleasant to use and produces very clean binary formats that if you had to would be easy to write a manual parser for in whatever language you wanted unlike many other serialization libraries.
But unfortunately it can’t serialize functions. And really that’s not surprising – they’re far too opaque to do anything about.
So, the explicit function passing was out.
Metric type class, take one
The next obvious thing to do is to make Metric a type class with the following signature:
class Metric a where distance :: a -> a -> Double
This worked reasonably well. At some point during it I realised that actually there was another function on metric spaces that was really useful, so I added that and the class became the following:
class Metric a where distance :: a -> a -> Double bound :: a -> Double -> Double -> Double bound _ = (+)
What's the bound parameter for? Well, remember the triangle inequality? It's pretty useful. In particular it lets us deal with the following case:
Suppose we have a point c and a distance r. We denote the set of points y with distance c y < r as B(c, r). We're then looking for everything in the ball B(x, e). We can completely eliminate anything in B(c, r) if d(x, c) >= r + e.
Why? Because if there were some point y which was in both balls then we would have d(x, c) <= d(x, y) + d(c, y) < r + e by the triangle inequality.
So the result of this is that the triangle inequality lets us aggressively prune entire branches of the tree without looking at them. That's the basic idea behind the sort of metric search we're doing.
But it turns out that there's an interesting class of metrics which satisfy a much stronger bound: d(x, z) <= max(d(x, y), d(y, z)). These are called ultrametrics. When performing searches on things we know to be ultrametrics we can prune the tree even more aggressively.
Rather than having a "isAnUltrametric" boolean function, capturing this property as the bound method seemed more sensible. The rule to be satisfied is then:
distance x z <= bound x (distance x y) (distance y z)
What's the x parameter there for? It's a dummy. You need something of type a to dispatch the function on, but its value shouldn't be used. A bit ugly I know...
So, anyway, this type class now largely works. What's wrong with it?
Going back to the command line interface: How I wanted it to work was that it's indexing finite dimensional vectors in an l^p space. I wanted p to be a parameter. This means that it can't be part of the type, so you want one LP type. This means that you then have to attach the value of p to every instance of the type, two vectors need to check whether they're being compared with different values of p, etc. The result is slower, more irritating to work with and has a higher memory footprint. Premature optimisation? Maybe. But when what you're designing is a library it's important to get the performance decisions right. And definitely let's not discount the "more irritating to work with" part.
Metric class, now with multiple parameters
So we're going back to the first version where we passed an explicit metric around. Our two constraints which that didn't satisfy are:
- We need (in principle) to be able to serialize metrics
- We need to include the bounds function in it
Metrics need to be able to encode arbitrary functions, so in order to serialize them we're going to effectively have to explicitly defunctionalize and make different metrics be encoded by different types. So this is how we do it with a multiparameter type class:
class Metric m a where distance :: m -> a -> a -> Double bound :: m -> Double -> Double -> Double bound _ = (+)
This changes the metric search tree to encode the metric type in its type parameter:
data MetricSearchTree m a
What's the problem with this?
Well, it doesn't work. The problem is that because the type "a" does not appear in the bound function it is unable to tell which instance of this type class you are using to get bound. The signature needs to be
class Metric m a where distance :: m -> a -> a -> Double bound :: m -> a -> Double -> Double -> Double bound _ _ = (+)
This was really irritating to use as well as being aesthetically displeasing, so I dismissed it.
At this point there is a correct fix, which if you're familiar with Haskell you may well have already spotted, but I went with a wrong one at first. Which leads us to
Metric class with an associated type synonym
This uses associated type synonyms to express what the points of the metric space are.
class Metric m where type Point m distance :: Point m -> Point m -> Double bound :: m -> Double -> Double -> Double bound _ = (+)
The metric search trees become parameterized by the metric alone.
data MetricSearchTree m
This was relatively pleasant to use. What was the problem?
Maybe this was me doing it wrong, but I found it almost impossible to write generic code with this.
For context: I have a test suite which is a bunch of quickcheck tests parameterized by the metric and the type of point in the metric space. I need to be able to generate arbitrary lists of points in that metric space, as well as have them implement Eq and Show for testing.
I simply couldn't make this work. Near as I can tell. Writing a type constraint as (Metric m, Arbitrary (Point m)) => fails with "no such type constuctor Point". Letting GHC infer the type results in it being unable to infer the relevant constraints. I struggled for a good 40 minutes or so, with the help of a few people in #haskell (though no one who would admit to knowing the associated types thing very well) before giving up. I think I've since figured out how to do it (you can introduce a type variable and add a constraint that the type variable is equal to the associated type synonym. Maybe. I've not tried it), but the eventual solution was nicer anyway.
So, where to now?
Metric class with a functional dependency
It turns out that there's a feature that's exactly for the sort of problem I ran into earlier. It's not one I'd used before, and I'd got the impression that it was generally recommended that you use associated types instead (I don't know if this is correct), but it turned out to be easy to understand and straightforward to use.
class Metric m a | m -> a where distance :: m -> a -> a -> Double bound :: m -> Double -> Double -> Double bound _ = (+)
What does this do?
It says "For each type m, only one type a is allowed to have an instance of this". This means that you don't need the a in the signature of bound to deduce the full instance declaration because once you know m there's only one valid instance. It's very much like the associated type but implemented in a way that means you can easily constrain the type.
I've now rewritten all the code to use this version. I'm relatively happy with it: It meets all the requirements I've had so far and isn't especially cumbersome to use. The process of getting here was pretty annoying though.