Archive for the ‘Uncategorized’ Category

Integer sets with O(1) range allocation and O(log(n)) deletion

Monday, January 30th, 2012

This is another post brought to you by the school of the bloody obvious. Or at least it should be, but despite having implemented a trivial variation on this data structure before it still took me far too long to remember it existed.

My previous use case was the original version of IntAllocator in the rabbitmq Java client API (it’s an internal class used for allocating channel numbers). In looking it up I’ve noticed that it no longer does, for reasons that do make sense, but those reasons don’t apply for my current use case (basically the new version uses an internal bitset, trading space for time. RabbitMQ only needs to allocate one of these, and the range isn’t that large, so trading space for time is completely acceptable).

My current use case was for building separation graphs for metric spaces. Part of an algorithm I’m playing with requires keeping a candidate list for each element: That is, a list of elements which might be > t away but we’re not sure. This needs to not take up O(n) space when the answer is “the whole set” and we need to be able to remove elements from it efficiently. Hence the constraints in the title.

The resulting structure is basically a range compressed bitset built out of a balanced binary tree. It works as follows:

A node in the tree may be either empty, an inclusive range represented by its endpoints or a split around a pivot of m such that the left hand side of the tree contains elements <= m and the right contains elements > m. We enforce the invariants that neither child of a split may be empty, and that ranges must have endpoints which admit at least one element.

Allocating an entire range is then just a matter of creating a range node holding two integers, i.e. O(1).

Deleting an element is harder, but not much.

  • Deleting an element from an empty node does nothing.
  • Deleting an element from a range has three distinct cases:
    1. If the element is not contained within the range, do nothing
    2. If the element is one of the endpoints, decrement or increment the relevant endpoint. If this results in the range being empty, replace this with an empty node
    3. If the element is internal to the range, split this range around its midpoint and delete from the split instead
  • Deleting an element from a split is even more straightforward: You determine which child the element should belong in and delete from that. If this results in that child being empty, replace this node with the other child.
  • Notes

    • I’ve put together a Haskell implementation to demonstrate the algorithm
    • There are at least two sensible ways to split a range. You can either do it around the element being deleted (which guarantees you’ll not have to recurse – you just allocate a split and two new ranges) or you can do it around the midpoint of the range. The former is cheaper but can result in the tree becoming unbalanced, the latter guarantees you’ll never need more than log(n) levels of recursion but also means most of the time you’ll use log(n) levels of recursion. You could also implement a hybrid where you split around the deleted point unless you’re below a certain depth in the tree, but I’m not sure it’s worth the added implementation cost
    • It is useful for split nodes to keep track of the smallest interval containing them as you can then shortcut if the element to be deleted lies outside that interval
    • You can also implement deleteRange efficiently (I think in O(log(n), though there’s a detail I haven’t checked there). This is left as an exercise for the interested reader
    • When I implemented IntAllocator I added a cache for this: A bounded size stack which you pushed elements to delete onto. When the stack grew full you sorted it into ranges and deleted the ranges all at once. There was a specific reason for this that doesn’t apply (sometimes you wanted to say “pop an element, any element” and that came from the stack if it was non-empty), but this might still be a useful thing to do for imperative versions of this code (it’s a bad idea for pure ones)

Butternut squash, coconut and chickpea curry

Sunday, January 29th, 2012

We had our housewarming party last night, and I’d made this as food. A friend asked me for the recipe, and I was surprised it wasn’t already on here due to it being one of my standards, so I’m now fixing that.

I should warn you that this recipe is largely a lie, in that the details change every time I make it. This is the version I made last night though (well, this is a halved version of that).

Ingredients

  • 1 white onion
  • A smallish knob of fresh ginger
  • Fresh chilli to taste (depends on type of chilli and spiciness you want. Do use fresh rather than dried though)
  • One medium butternut squash
  • One can chickpeas
  • One small can of coconut cream
  • A bit of sugar
  • Quite a lot of salt
  • Quite a lot of garam massala
  • Plenty of sunflower oil

I can’t really give precise measurements on the salt and garam massala because I was adding them continuously throughout until I got the taste right. My guess is several tea spoons of salt and several table spoons of garam massala.

Cooking

  1. Dice the ginger, onion and chilli finely.
  2. Peel the squash and cut it into roughly 1cm cubes
  3. Fry the ginger, onion and chilli with salt and sugar until the onion starts to go transparent
  4. Add the squash and garam massala. Mix it all up thoroughly and fry for a little bit longer
  5. Add boiling water until the squash is nearly covered. Allow to simmer for a while, stirring occasionally
  6. When the squash is starting to go soft, add the coconut cream. Continue simmering and stirring
  7. Eventually the squash should be breaking up, causing the sauce to thicken. Once it’s starting to taste nearly cooked, add the chickpeas and continue simmering

At this point it’s basically ready to serve whenever, but it improves significantly by leaving it to cook on a low heat for longer. Also, keep tasting it whilst it’s cooking and add more salt and garam massala to taste.

PeCoWriMo derail

Monday, November 28th, 2011

Well I got slightly distracted from my PeCoWriMo project. No particularly good reason – I was just feeling a bit disenchanted with the subject matter and Haskell was annoying me slightly (as usual). Nothing enough to actually derail me on it’s own, but I got to thinking about a problem I’d worked on before and decided to do some more work on it.

And in keeping with the mild irritation with Haskell and the “oh for the love of god anything but more ruby” I decided to do it in another nicer, friendlier language.

Err, that is to say, C.

The problem this is working on is that of finding a minimal feedback arc set in a weighted graph (although I’ve actually posed it as a maximization problem). This boils down to the following problem:

Given N points and a weight function w(x, y) find an enumeration x_1, …, x_N that maximizes sum_{i < j} w(x_i, x_j)

The main application I care about is voting, but I'm sure it has other uses.

Code is available on github

Notes on design of a Haskell API

Monday, November 21st, 2011

As previously mentioned, I’m currently writing a little Haskell library. Here are some notes on the API’s many iterations.

Background

The primary feature of this API is building a search index for things in a metric space.

What’s a metric space? It’s essentially points together with a “distance function” d, satisfying the following rules:

  1. d(x, y) == 0 iff x == y
  2. d(x, y) == d(y, x)
  3. d(x, z) <= d(x, y) + d(y, z) [this is called the triangle inequality]

And the main operations we want the search index to be able to support are:

  1. Given a metric and a set of points, build an index
  2. Given an index I, a point x and a distance e, give me all points y in I such that d(x, y) < e

The question I will be looking at here is what the right design for the metric API is.

Attempt 1: Passing around a distance function

This was pretty straightforward. Metrics are represented as a function with the following type signature:

type Metric a = a -> a -> Double

and Indexes have the signature

data MetricSearchIndex a

You pass such a function in when creating the search index. Done.

Unfortunately, this turns out not to be adequate for my goals.

Why?

Well, one of the things I want to be able to do (I’ve not done it yet) is write a command line interface to this program. In order to do that, I really need to be able to save model files. In order to do that I’d like to use Data.Binary. It’s pleasant to use and produces very clean binary formats that if you had to would be easy to write a manual parser for in whatever language you wanted unlike many other serialization libraries.

But unfortunately it can’t serialize functions. And really that’s not surprising – they’re far too opaque to do anything about.

So, the explicit function passing was out.

Metric type class, take one

The next obvious thing to do is to make Metric a type class with the following signature:

   class Metric a where
      distance :: a -> a -> Double

This worked reasonably well. At some point during it I realised that actually there was another function on metric spaces that was really useful, so I added that and the class became the following:

   class Metric a where
      distance :: a -> a -> Double
      bound :: a -> Double -> Double -> Double
      bound _ = (+)

What's the bound parameter for? Well, remember the triangle inequality? It's pretty useful. In particular it lets us deal with the following case:

Suppose we have a point c and a distance r. We denote the set of points y with distance c y < r as B(c, r). We're then looking for everything in the ball B(x, e). We can completely eliminate anything in B(c, r) if d(x, c) >= r + e.

Why? Because if there were some point y which was in both balls then we would have d(x, c) <= d(x, y) + d(c, y) < r + e by the triangle inequality.

So the result of this is that the triangle inequality lets us aggressively prune entire branches of the tree without looking at them. That's the basic idea behind the sort of metric search we're doing.

But it turns out that there's an interesting class of metrics which satisfy a much stronger bound: d(x, z) <= max(d(x, y), d(y, z)). These are called ultrametrics. When performing searches on things we know to be ultrametrics we can prune the tree even more aggressively.

Rather than having a "isAnUltrametric" boolean function, capturing this property as the bound method seemed more sensible. The rule to be satisfied is then:

distance x z <= bound x (distance x y) (distance y z)

What's the x parameter there for? It's a dummy. You need something of type a to dispatch the function on, but its value shouldn't be used. A bit ugly I know...

So, anyway, this type class now largely works. What's wrong with it?

Going back to the command line interface: How I wanted it to work was that it's indexing finite dimensional vectors in an l^p space. I wanted p to be a parameter. This means that it can't be part of the type, so you want one LP type. This means that you then have to attach the value of p to every instance of the type, two vectors need to check whether they're being compared with different values of p, etc. The result is slower, more irritating to work with and has a higher memory footprint. Premature optimisation? Maybe. But when what you're designing is a library it's important to get the performance decisions right. And definitely let's not discount the "more irritating to work with" part.

Metric class, now with multiple parameters

So we're going back to the first version where we passed an explicit metric around. Our two constraints which that didn't satisfy are:

  1. We need (in principle) to be able to serialize metrics
  2. We need to include the bounds function in it

Metrics need to be able to encode arbitrary functions, so in order to serialize them we're going to effectively have to explicitly defunctionalize and make different metrics be encoded by different types. So this is how we do it with a multiparameter type class:

   class Metric m a where
      distance :: m -> a -> a -> Double
      bound :: m -> Double -> Double -> Double
      bound _ = (+)

This changes the metric search tree to encode the metric type in its type parameter:

data MetricSearchTree m a

What's the problem with this?

Well, it doesn't work. The problem is that because the type "a" does not appear in the bound function it is unable to tell which instance of this type class you are using to get bound. The signature needs to be

   class Metric m a where
      distance :: m -> a -> a -> Double
      bound :: m -> a -> Double -> Double -> Double
      bound _ _ = (+)

This was really irritating to use as well as being aesthetically displeasing, so I dismissed it.

At this point there is a correct fix, which if you're familiar with Haskell you may well have already spotted, but I went with a wrong one at first. Which leads us to

Metric class with an associated type synonym

This uses associated type synonyms to express what the points of the metric space are.

   class Metric m where
      type Point m
      distance :: Point m -> Point m -> Double
      bound :: m -> Double -> Double -> Double
      bound _ = (+)

The metric search trees become parameterized by the metric alone.

data MetricSearchTree m

This was relatively pleasant to use. What was the problem?

Maybe this was me doing it wrong, but I found it almost impossible to write generic code with this.

For context: I have a test suite which is a bunch of quickcheck tests parameterized by the metric and the type of point in the metric space. I need to be able to generate arbitrary lists of points in that metric space, as well as have them implement Eq and Show for testing.

I simply couldn't make this work. Near as I can tell. Writing a type constraint as (Metric m, Arbitrary (Point m)) => fails with "no such type constuctor Point". Letting GHC infer the type results in it being unable to infer the relevant constraints. I struggled for a good 40 minutes or so, with the help of a few people in #haskell (though no one who would admit to knowing the associated types thing very well) before giving up. I think I've since figured out how to do it (you can introduce a type variable and add a constraint that the type variable is equal to the associated type synonym. Maybe. I've not tried it), but the eventual solution was nicer anyway.

So, where to now?

Metric class with a functional dependency

It turns out that there's a feature that's exactly for the sort of problem I ran into earlier. It's not one I'd used before, and I'd got the impression that it was generally recommended that you use associated types instead (I don't know if this is correct), but it turned out to be easy to understand and straightforward to use.

   class Metric m a | m -> a where
      distance :: m -> a -> a -> Double
      bound :: m -> Double -> Double -> Double
      bound _ = (+)

What does this do?

It says "For each type m, only one type a is allowed to have an instance of this". This means that you don't need the a in the signature of bound to deduce the full instance declaration because once you know m there's only one valid instance. It's very much like the associated type but implemented in a way that means you can easily constrain the type.

I've now rewritten all the code to use this version. I'm relatively happy with it: It meets all the requirements I've had so far and isn't especially cumbersome to use. The process of getting here was pretty annoying though.

New Blog: Imbibliotech

Friday, October 21st, 2011

This is just a quick note to say I have a new blog called Imbibliotech.

This is not a replacement for the current blog – it’s a subject specific blog on the topic of drinking (primarily cocktails and spirits, but also other things including coffee), as I’m doing a bunch of learning and experimenting on that subject which I felt would be better posted elsewhere.