Archive for the ‘Code’ Category

An algorithm for incrementally building separation graphs

Monday, January 30th, 2012

You know that annoying thing when you’re reading a paper and it’s fairly clear they’ve not actually tried that algorithm in practice? Don’t you hate that?

Anyway, here’s an algorithm I’ve not tried in practice. I haven’t even written code for it yet. This post is as much to get the ideas clear in my head as anything else. This algorithm may turn out to be worse than doing it the stupid way in practice.

This is the problem we’re trying to solve:
x
We have n points, \(X = \{x_1, \ldots, x_n\}\), and a metric function d. We are trying to build the unordered graph with edges \(E = \{ (i, j) : d(x_i, x_j) > t \} \).

This is motivated by trying to find small diameter partitions of X, which we will be doing by trying to either prove (X, E) is not bipartite or finding a bipartition of it (see this previous post)

The question is how to do this efficiently? The obvious brute force approach is \(O(n^2)\), and indeed any solution must have \(O(n^2)\) worst-case. We’d like to do well in practice by applying the following two key ideas:

  1. By using the triangle inequality, we may often be able to escape having to do distance calculations because we can infer the distance will be too large / too small
  2. We are likely not to need the entire graph – a proof that a graph is not bipartite may involve only a very small subset of the nodes, so if we build the graph incrementally we may be able to stop early

The way we do this will be as follows:

For each node x, we keep track of two sets, Definite(x) and Candidates(x). We will also index distances as we find them by tracking Distances(x), which will be track all points we’ve already calculated the distance from x to.

Roles and implementation requirements

Definite(x) is a set containing all nodes we know are > t away from x and starts empty. We require efficient addition of new elements to it without introducing duplicates and the creation of an iterator which is guaranteed to iterate over all elements in the set even if new ones are added during iteration, even after it has previously claimed there are no elements left. One way to do this would be to have it consist of both an array and a hash set.

Candidates(x) contains all points which we don’t yet know for sure whether or not they are > t from x and starts equal to X. As a result we want a set implementation which is cheap to allocate an instance containing the set and cheap to delete from. This is where the previous post about integer sets comes in (we represent a node by its index).

Distances(x) is some sort of ordered map from distances to lists of points. It needs to support efficient range queries (i.e. give me everything nearer than s or further than s).

The Algorithm

Our two basic operations for manipulating the data we’re tracking are Close(x, y) and Far(x, y). Close(x, y) removes x from Candidates(y) and y from Candidates(x). Far does the same thing but also adds x to Definite(y) and y to Definite(x).

Our iterator protocol is that an iterator has an operation Next which returns either None or an Element. We will construct Neighbours(x), which creates an iterator that incrementally returns all points > t away from the x and uses information discovered whilst doing so to flesh out the information we know about other nodes too.

Here is the algorithm, written in some bizarre pseudocode language that looks like nothing I actually write (no, I don’t know why I’m using it either):

  Neighbours(X) = NI(x, Iterator(Definite(x)))

  Next(NI(x, found))

    # If we
    while ((result = Next(found)) == None && not IsEmpty(Candidates)) do
      candidate = Pop(Candidates(x))
      s = d(x, candidate)

      if s > t then Far(x, candidate) else Near(x, candidate)

      # Because d(y, candidate) <= d(y, x) + d(candidate, x) <= t - s + s = t
      for(y in Distances(x) where d(y, x) <= t - s) Near(candidate, y)

      # Because if d(y, candidate) <= t then d(y, x) <= d(candidate, x) + t <= s + t
      for(y in Distances(x) where d(y, x) > t + s) Far(candidate, y)

      # Because if d(candidate, y) <= t then d(candidate, x) <= d(candidate, x) + d(y, x) <= t - s + s = t
      if s > t then for(y in Distances(x) where d(y, x) <= t - s) Far(candidate, y)

      # If we have no more candidates left then Distances(x) will never be used again and is wasted space
      if IsEmpty(Candidates(x)) then delete Distances(x) else Put(Distances(x), s, y))
    done

    return result

The idea being that as we're searching for edges for a given node we're also using this to fill out the edges for the other nodes. This should work particularly well for doing depth first search, because in particular it often means that the nodes we're going to transition to will have some of their neighbours filled in already, and we may in fact be able to spend most of our time just chasing through Definite sets rather than having to do new distance calculations. Even where we haven't been able to do that, hopefully we've managed to prune the lists of elements to consider significantly before then.

Notes

  • It should be fairly evident that this is not only \(O(n^2)\) worst case, it's probably \(O(n^2)\) expected case for exploring the whole graph, even in reasonable metric spaces - it's likely that one or the other of the two sets of points we're looking at during each iteration step will be O(n) - so the only possible win over the brute force algorithm is if it does significantly fewer expensive distance calculations
  • I've not yet written any real code for this, so I'm sure some of the details are wrong or awkward
  • Even if it's not wrong or awkward it may well turn out to have sufficiently bad constant factors, or save few enough distance invocations, that it's worthless in practice

A cute algorithm for a problem you’ve never cared to solve

Saturday, January 28th, 2012

As part of my reason playing with metric search structures, I was wondering about the following problem: Given a metric space X I want to partition X into two sets A and B such that each of A and B has a small diameter. Equivalently, I’m trying to find A that minimizes max(diam(A), diam(\(A^c\))).

The brute force solution for this would be fairly awful: There are \(2^n\) partitions, calculating the score of each is \(O(n^2)\), so the brute force solution takes \(O(2^n n^2)\). So let’s not do that.

I was expecting that I’d have to find an approximate solution, as this looked like a classic hairy NP-hard optimization problem. I’d started contemplating local optimizations, simulated annealing, etc. Then after doing a bit more searching I happened across this paper, which in fact has a polynomial time exact solution to the problem: Diameter Partitioning by David Avis.

The algorithm is incredibly cute – it uses absolutely no advanced ideas, just two clever observations which reduce the entire problem to elementary algorithms. It’s the sort of thing which as soon as you see it you just kick yourself for not having thought of it. So I thought I’d write about it to share how much I enjoyed it, even if no one else is actually likely to care about solving this problem.

The first observation is this:

We can find a partition A, B which has max(diam(A), diam(B))) \(\leq\) t if and only if we can find two sets A, B such that if \(d(x, y) > t\) they are in different halves of the partition (this is restating the definition).

That is, if we draw an edge between x and y if and only if \(d(x, y) > t\), then a partition with max(diam(A), diam(B))) \(\leq\) t is precisely a bipartite matching for this graph.

But bipartite matchings are easy to find! Or at least, \(O(n^2)\) to find, which is no worse than calculating the diameter of a set in the first place. You just do a graph search on each of the components, marking nodes alternatively black and white (i.e. if you’re coming from a white node you mark this node black, if you’re coming from a black you mark this node white) as you find them, and yell “Impossible!” if you ever try to paint a node a different colour than you’ve already painted it. Done.

So for any given t we have an \(O(n^2)\) algorithm that either finds us a partition at least that good or tells us that none exists. Now how do we find a best possible partition?

The obvious answer is binary search, but binary searching on floats is never going to give us the exact answer, only a series of successively better approximations. This then motivates the second observation: Our distances are discrete.

If we work out the distance between every pair (we already have to do that for building the graph) then you can put these pairs + their distances in an array and sort that array. Then it’s just a simple matter of binary searching that array, and additionally you can use it to build the graph without calculating any more distances – just take the pairs to the right of your current point in the array.

Building the array is \(O(n^2 log(n))\), and we will do \(log(n)\) binary searches, each performing an \(O(n^2)\) operation, so in total the cost of this algorithm is \(O(n^2 log(n))\). A lot better than brute force.

Notes

  • It apparently is NP-hard to do this if you want a partition of more than two elements
  • Any algorithm which guarantees within a factor of two of the right answer has to examine every pair of points and thus must be \(O(n^2)\). You can see this by constructing a metric space of n points, picking a distinguished pair x, y and setting \(d(x, y) = 2\) whilst setting all other distances for 1. Any partition which separates x and y will have a score of 1, any other will have a score of 2, and by randomly picking x and y you can see that there’s no way to know this except examining every pair
  • The above example also demonstrates that it’s \(O(n^2)\) to even calculate a diameter, which means that this is also a lower bound on even knowing for sure how good your partition is
  • If you don’t care about solving this exactly and only care that the result is “good” in some fuzzy sense I think you can do a lot better in practice. Tere are good approximate diameter algorithms, and we didn’t take advantage of the triangle equality at all in our algorithm for building the separation graph, so I think it’s possible to adjust the partition testing code in a way that is much better than \(O(n^2)\) in many nice cases, even if it necessarily degenerates to \(O(n^2)\) in the worst cases. I’ll write more about this once I’ve worked out the details.

I don’t write Scala

Wednesday, January 11th, 2012

A lot of people follow me on Twitter. I don’t mean Stephen Fry level a lot, but it’s about 8 or 9 times as many people as I follow back.

Based on a (purely visual) random sampling of this, a significant proportion of these are people interested in Scala. That’s fine. I know many such people. I could also be misrepresenting them and they’re just people interested in programming and following me because I’m a programmer (and occasionally even tweet programming related things). I’m not sure, and can’t really be sure without doing a lot more research that I’m actually interested in doing.

So I’m going to assume it’s a Scala thing, because this fits into a general perception issue I’ve noticed people have about me.

You see, I don’t write Scala. I haven’t since late 2009. I didn’t make a big deal about it, because that would have been childish, I just informed a few people in the Scala community I thought should know that I was leaving, along with a few of my reasons why, and then quietly did so. Most of them took it very graciously.

Weirdly, two and some years on, most people seem not to have noticed the complete absence of Scala related content from me. I suspect it’s because I’ve mostly dropped off their radar for one reason or another, so there’s just a vague general impression of me as a Scala person. Hopefully this post should help remove some of that.

To be clear: This is not a normative statement. Just because I don’t write Scala, doesn’t mean you shouldn’t. Scala is pretty neat. I have my reasons to not use it, but don’t wish to explain as I would find the resulting language flamewar extremely tedious. I would greatly appreciate it if you don’t use this post to start one anyway.

This is also not a statement that I hate Scala and will never use it again. I’ve no immediate plans to, but never is a long time. I expect Scala will do quite well, and if it does I expect I will at some point find myself using it again.

To forestall the inevitable question: I am currently mostly writing Ruby at work, and a whole smattering of unrelated things at home (recents include Haskell, Java, Clay, C, a little C++, some Lua…) as the whim takes me. This list is not intended to be prescriptive, and I’m not really interested in the inevitable suggestions for what languages to try next.

TLDR: I write a bunch of languages, mostly Ruby for reasons of circumstance rather than design, but Scala is not numbered amongst them. This is a statement of fact, not a piece of advice.

Sleep as an Droid export data

Monday, May 23rd, 2011

This is possibly the single most niche piece of software I have ever written…

I use a sleep tracker called “Sleep as an Droid”. I wish to be able to get data out of it. Fortunately, it has a data export feature. Unfortunately, its data export format is a bit bizarre (it’s CSV, Jim, but not as we know it). So I decided to convert it to a tolerable-structured JSON output before using it, and have put it on the web for anyone to use. source is also available on github if you care.

Playing with a new language: Clay

Sunday, January 9th, 2011

There’s been a distinct lack of interesting language related posts around here recently.

Why? Well because I’ve been mostly writing Ruby and C. Neither of which languages fills me with an urge to post HEY GUYS, I FOUND THIS COOL THING. Ruby because I vaguely dislike it, so I resent even the kindof cool things I find as I use it, C because my use of C tends towards the extremely straightforward so C posts tend to be about the result rather than the method (arguably that’s what programming posts should be like anyway, but that’s a separate issue).

But anyway, I’ve been window shopping around for a new language to learn. My list of features I wanted in said language were:

  1. Good performance
  2. Higher level than C
  3. Preferably statically typed
  4. Generic collections
  5. Didn’t run on the JVM

The closest to this on the list of languages I already knew was Haskell, and I’ve been using it a little bit, but it and I don’t really get on for a variety of reasons that are beyond the scope of this blog post.

I was pretty much thinking it was going to be O’Caml (which is another language in the “vaguely dislike” category. I want to like it, I really do, but somehow I can never manage) and C++ (oh dear god no).

So, anyway, my ears perked up when clay came up on reddit the other day. I had already been vaguely aware of it, but somehow it dropped off my radar before I did anything about it.

So I’ve been spending today playing with it, and the results are two little projects: An interval tree implementation and a very rudimentary start on some client bindings to PostgreSQL. Neither are written with any particular purpose in mind, ‘though both are something I might want to use at some point, it’s really just a way to get a feel for the language.

I’m pleasantly impressed. Is it going to be the Next Big Language? Who knows. Certainly not me. But despite a lot of stupid questions on my part it was a pleasant experience, and I’m definitely going to continue playing with it.

What follows are some initial impressions. Bear in mind it’s the result of only a day of playing with it, so they may be incomplete or outright wrong:

The Good

Here are some of the things I liked:

Good support for generic programming

Like it says on the label really. This is one of the things the language bills itself as being good at, so if it weren’t it would be dead on launch.

I’ve had a reasonable play around with it, and it works well. The entire system is basically a giant (mostly) statically resolved multimethod system. Given my fondness for multimethods, that’s a plus as far as I’m concerned. Runtime polymorphism comes in through an extensible variant system: You can declare variants in one module and add new ones in another.

Anyway, the result is that there are generic collections and they Just Work, exceptions make use of the extensible variant system and seem well behaved.

There’s no object orientation support and no subtyping. As far as I’m concerned, that’s a good thing.

Decent FP primitives

So I haven’t actually had much of a play with these yet believe it or not, but they’re there and seem to work. lambdas don’t implement non-local return (‘though I suppose you could do this with an exception), but do capture their environment. The main place I’ve used this is in the test system.

clay bind-gen

clay bind-gen takes a C header file and gives you a clay library. It’s a bit finicky, and obviously the code is far from idiomatic, but it does work and works well. It took me about 10 minutes to get this working on the libpq header and port a C example of using it to clay. Hopefully this means that it’s rather easy to build on the available C ecosystem.

Decentish standard library

All quite basic at the moment, but it already annoys me less than Scala’s used to when it was at a much later stage in the game than this. Granted that’s because Scala could get away with it by falling back on the JVM ecosystem and clay can’t, but it’s a good start. It’s pleasantly reminiscent of the sort you’d expect with a high level language, but seems well tuned for clay’s suitability as a systems language.

Valgrind works flawlessly

As per title. Clay is an unsafe language – it has manual memory management (which, thanks to things like stack allocation and deterministic destructors, is much less of a pain than in C, but is still definitely not hard to get wrong), pointer arithmetic, no bounds checking on array access.

The only reason I trust myself to write C is because valgrind exists. I would have had a very hard time trusting myself to write clay if it didn’t work on it. Fortunately, no problems.

Friendly and intelligent IRC channel

I’m a big fan of IRC as a learning environment. I’ve been on IRC on a fairly regular basis in various channels for 10-15 years (‘though rarely any given one for more than 4 or 5. I have community attention span issues), and so the first thing I do when checking out a new tech or language is long onto their IRC channel. Clay’s does not disappoint. It’s small, but there are a bunch of smart people in there, and they were very patient and helpful with my stupid questions.

The bad

Confusing semantics

Generally there’s a good reason for it, and things are still in flux so I think many of these will improve, but there have been a bunch of cases where I’ve gone “Wait, what?”. Some of these are just the result of my never having used a language with non-trivial value semantics before (tree = tree.child generates an invalid read), some of them are gotchas which you’ll probably only be caught by once but are still pretty ugly (it’s very easy to think you’ve redefined a constructor when you haven’t because the type arguments are part of the constructor’s name), and a few others that slip my mind at the moment. The semantics of how copying, moving and assignment work are not so much confusing as very easy to get wrong. I’m still not 100% sure I understand how variants interact with their members.

I think I’m largely on top of it now, and I expect it will become clearer as I write more of it, but I definitely spent a good chunk of today quite bewildered.

Basically zero documentation

Well, what do you expect? The language is young and still massively in flux. I’m not surprised there’s no documentation and, as mentioned, the community is very helpful in making up for its lack, but it definitely makes the learning process much harder.

The language is young and still massively in flux

As mentioned above. This is definitely an experimental language – the compiler seems to currently be undergoing a rewrite which will, amongst other things, be changing the backend target away from LLVM and to C. The syntax is expected to change. The semantics are expected to change. The language designers will, and should, feel free to break backwards compatibility.

Error messages

As well as the usual problems with new languages and quality of error messages, clay’s error messages, both at runtime and compile time, are very short on line numbers. This can make it very hard to track down what’s going wrong. It wasn’t drastically difficult with what I was doing today, but they were both rather small projects. I don’t know how hard it will be on larger projects, but I’m hoping that some of this (particularly compile time error messages) will be fixed by the time I have to worry about that.

Late discovery of errors

It’s not quite as bad as C++ template errors are reputed to be, but generally you don’t discover an error in a function until you try to use that function. This is somewhat deliberate, as it ties in with how the genericity works, but when combined with the line numbers problem can be quite frustrating. Additionally the support for giving the compiler hints to get better error messages is a bit rudimentary: You can define compile time functions which constrain arguments, but that’s about it.

The ugly

Keywords vs operators

Clay doesn’t have much C compatibility in terms of its operators: the bitwise operators (in particular <<, >>, |, & and ^) are replaced by functions. The shortcutting boolean operators are replaces with keywords and and or. It’s not a major problem, but it annoyed me vaguely when porting from C.

Keywords

Clay has a fair few keywords: ref, forward, lvalue, rvalue, lambda, and, or, overload, procedure, variant, record, callbyname… probably more.

It’s not a massive problem, but I tend to regard keyword surplus as a sign of insufficiently well factored features. I’ve yet to form an opinion on whether or not this is the case here.

Everything is by reference

I don’t yet know if I like this or not, but Clay’s calling convention passes everything as a mutable reference. So assigning to a function argument assigns in the calling context. Really. This can be quite useful, particularly given that copying a clay value can be a very expensive operation, but I find it deeply disconcerting. I feel like it may grow on me. We’ll see.

A few minor syntax weirdnesses

e.g. one that bugged me earlier is the syntax for return type declaration on a function: foo(a : Blargh) ReturnType. No colon on the return type, but colon on the arguments.

Conclusion

Overall, definitely more good than bad. I intend to keep playing with it and see how I feel after a while doing so.