Category Archives: Code

Calvin and Hobbes, gocomics and dotjs

Ok. So, I like comics. It’s not something I write about, but at any given time I probably have a dozen webcomics that I’m actively subscribed to (by the way, if you’re not already, you should be reading Schlock Mercenary and Girl Genius. Not what this post is about, just a heads up).

One thing I really like is discovering large archives of comics I’ve not read, or comics I read a long time ago but had forgotten about or hadn’t realised were available online.

So I was pretty excited when a friend pointed out that Calvin and Hobbes was available on gocomics (or if you don’t want to hunt for the first comic, here it is), especially when I realised to my surprise that I actually didn’t recognise a lot of the most recent ones. I’m pretty sure there aren’t any new Calvin and Hobbes strips being made, but apparently I’ve read less of it than I thought. So as far as I’m concerned: Yay, new Calvin and Hobbes!

There’s just one problem. The gocomics interface is terrible for reading comic archives. And there’s a lot of archive here. I don’t think I’d be able to make it through this without developing acute RSI.

But! I am a programmer. Solving irritating problems with software is basically my job description. So I shopped around for a way to solve this and came across dotjs (the original is OSX specific, but there’s a fork that works on linux too which is what I’m using). It’s just a nice easy way of injecting site specific javascript.

So, I installed this, created ~/.js/gocomics.com.js and added the following code to it:

// dotjs injects jQuery into the page, so $ is available here
function followLinkLike(matcher){
  var link = $(matcher)[0];
  if (link) document.location = link.href;
}

document.onkeydown = function(e){
  switch(e.keyCode){
    case 74: followLinkLike("a.prev"); break; // 74 = 'j': previous comic
    case 75: followLinkLike("a.next"); break; // 75 = 'k': next comic
  }
};

Problem solved. Gocomics now has keyboard navigation. j will take you backwards, k will take you forwards (I’m a vim user ok. So sue me).

dotjs is a really nice solution to this sort of problem. I’m definitely going to be on the lookout for broader applications of it.

This entry was posted in Code.

A green build on my weighted FAS solver

A momentous occasion has just, err, occurred. I got a green build on my weighted feedback arc set solver. This has never happened before.

This might seem like a confusing concept. How can it never have passed its own test suite?

Well, the test suite has always been more aspirational than reality-based. The way it effectively works is this: it has a lot of test data, and for each file it tracks the best solution found along with its score. The quality test is that the score of the solution found is within 5% of the best known solution (side-note: part of why the tests are now green is that I did raise this threshold a bit, it was originally at 1%. Most tests still hit that better threshold). A side-effect of this is that improvements to the solver will often make the tests harder to pass, because they find better solutions (and, surprisingly, this is still happening – I’ve spent enough CPU cycles and algorithmic tweaks that I keep thinking I’ve found all the optimal solutions, only to be surprised by an improvement. The improvements are rarely large, but they still keep coming).
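
In outline, each quality test amounts to something like the following sketch (the names and the dictionary-based bookkeeping here are illustrative inventions, not the actual suite):

TOLERANCE = 1.05  # pass if within 5% of the best known score

def check_case(name, tournament, solve, score, best_known):
    # best_known maps case name -> best score seen so far; finding a
    # better solution tightens the record, making future runs harder
    found = score(tournament, solve(tournament))
    best = min(best_known.get(name, found), found)
    best_known[name] = best
    assert found <= TOLERANCE * best, (
        "%s: %s is more than 5%% off best known %s" % (name, found, best))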

The result was that I would often make improvements to the FAS solver and then back them out because they were too slow, but leaving the test data behind, and the test would be failing as the FAS solver dreamed of the long lost days when it was cleverer.

Additionally, all the tests are capped at the fairly arbitrary maximum acceptable runtime of one minute.

At this point I should mention that finding even an approximation (I forget to within what bound – I think it’s something like half) of the optimal score for weighted FAS is an NP-hard problem, and that even calculating the score of a given solution is \(O(n^2)\). So this isn’t an easy problem to solve.
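
For concreteness, here’s what that scoring step looks like in Python, assuming the tournament is represented as a dense weight matrix (an assumption for illustration – the point is just the quadratic pairwise sum):

def fas_score(w, ordering):
    # Total weight of "back edges": edges u -> v whose target v is placed
    # before u in the ordering. This is what the solver minimises, and
    # merely evaluating it costs O(n^2).
    n = len(ordering)
    return sum(
        w[ordering[j]][ordering[i]]  # edge from a later element back to an earlier one
        for i in range(n)
        for j in range(i + 1, n)
    )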

I’d largely stalled on development on it, but today I decided to go back to it and after a day of tinkering I’ve managed to make a surprising amount of progress. At the start of the day it was painfully slow on several test cases (several minutes), non-deterministic and badly failing several of its quality goals. Now it is deterministic, runs each test case in under a minute and gets within 2.5% of the best known result on all of them. For added pleasantry, the latest version has even provided the best known result for several of them, and the code is shorter and clearer than it was at the start of the day.

How does it work?

It’s pretty simple. In the end it’s ended up looking hilariously similar to the algorithm I threw together in an hour from a random paper for The Right Tool (now Hammer Principle) a few years ago.

The process is basically a mix of the following techniques:

  1. Initial scoring. This is basically a Markov chain scoring system as before. The details are slightly different, but not really in an interesting way – the idea is still to transition to better nodes and use an approximation of the stationary distribution as a score
  2. What I refer to in the code as “forcing connectivity” (there’s a sketch of this pass just after the list). This is today’s clever idea, and it works really well, but it probably shouldn’t and the fact that it appears to might be a test data artifact. If you’re feeling generous it could be described as a heuristic approach, but a fairer description would perhaps be an unprincipled hack. It’s based on the observation that in sparse data sets you often end up with long chunks of ties, and these almost completely defeat local searches and make it hard to improve things. The force runs through the array one element at a time and, if it’s unconnected to the next element, pulls forward the next element to which it is connected (it doesn’t concern itself with ordering beyond that – the element might be strictly greater or less than it, it doesn’t care)
  3. Local sort, as before. Basically an insertion sort on the hopelessly inconsistent ordering the tournament induces. This had largely disappeared from the code – it was implicitly happening anyway, but there was no explicit local sort stage – and bringing it back in was a surprisingly big win
  4. Window optimisation. Basically, you can brute force the problem for small arrays. Window optimisation is brute forcing every subrange of some length. This is applied at various lengths in order to get mixing
  5. Single move optimisation. This is a really expensive process, but wins big enough that it’s worth it. Basically we try every possible move of a single element from one point to another and apply any that improve the score
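
Here’s a rough Python sketch of the “forcing connectivity” pass, assuming a dense weight matrix where w[a][b] > 0 means an edge from a to b (the representation is my assumption, not necessarily the solver’s actual one):

def force_connectivity(w, ordering):
    # Walk the array; whenever the current element has no edge either way
    # to its successor, find the next element it *is* connected to and
    # pull that element forward to sit immediately after it.
    i = 0
    while i < len(ordering) - 1:
        a, b = ordering[i], ordering[i + 1]
        if w[a][b] == 0 and w[b][a] == 0:  # tied with successor: no information
            for j in range(i + 2, len(ordering)):
                c = ordering[j]
                if w[a][c] or w[c][a]:  # first later element connected to a
                    ordering.insert(i + 1, ordering.pop(j))
                    break
        i += 1
    return ordering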

It’s surprising how well this combination works – I’ve tried a whole bunch of more clever things, between various random search approaches, simulated annealing, block sorting (basically: build a new tournament with chunks of the existing one and recurse) and a variety of local search techniques, but they generally seemed to be significantly slower and/or less effective.

Update: The test suite is now failing again. This is partly because I’ve tightened the requirements (down to within 3% rather than 5% of best known result), partly because after some tinkering with code I’ve managed to improve the best known quality of several test cases. This is actually quite pleasant, as it shows that there are some nicely tunable parameters in here.

This entry was posted in Code, programming.

An algorithm for incrementally building separation graphs

You know that annoying thing when you’re reading a paper and it’s fairly clear they’ve not actually tried that algorithm in practice? Don’t you hate that?

Anyway, here’s an algorithm I’ve not tried in practice. I haven’t even written code for it yet. This post is as much to get the ideas clear in my head as anything else. This algorithm may turn out to be worse than doing it the stupid way in practice.

This is the problem we’re trying to solve:
We have n points, \(X = \{x_1, \ldots, x_n\}\), and a metric function d. We are trying to build the unordered graph with edges \(E = \{ (i, j) : d(x_i, x_j) > t \} \).

This is motivated by trying to find small diameter partitions of X, which we will be doing by trying either to prove (X, E) is not bipartite or to find a bipartition of it (see this previous post).

The question is how to do this efficiently. The obvious brute force approach is \(O(n^2)\), and indeed any solution must be \(O(n^2)\) in the worst case. We’d like to do well in practice by applying the following two key ideas:

  1. By using the triangle inequality, we may often be able to escape having to do distance calculations because we can infer the distance will be too large / too small
  2. We are likely not to need the entire graph – a proof that a graph is not bipartite may involve only a very small subset of the nodes, so if we build the graph incrementally we may be able to stop early

The way we do this will be as follows:

For each node x, we keep track of two sets, Definite(x) and Candidates(x). We will also index distances as we find them in Distances(x), which records all the points we’ve already calculated the distance from x to.

Roles and implementation requirements

Definite(x) is a set containing all nodes we know are > t away from x, and starts empty. We require efficient addition of new elements without introducing duplicates, and an iterator which is guaranteed to cover all elements in the set even if new ones are added during iteration – even after it has previously claimed there are no elements left. One way to do this would be to have it consist of both an array and a hash set.

Candidates(x) contains all points which we don’t yet know for sure whether or not they are > t from x and starts equal to X. As a result we want a set implementation which is cheap to allocate an instance containing the set and cheap to delete from. This is where the previous post about integer sets comes in (we represent a node by its index).

Distances(x) is some sort of ordered map from distances to lists of points. It needs to support efficient range queries (i.e. give me everything nearer than s or further than s).

The Algorithm

Our two basic operations for manipulating the data we’re tracking are Close(x, y) and Far(x, y). Close(x, y) removes x from Candidates(y) and y from Candidates(x). Far does the same thing but also adds x to Definite(y) and y to Definite(x).

Our iterator protocol is that an iterator has an operation Next which returns either None or an Element. We will construct Neighbours(x), which creates an iterator that incrementally returns all points > t away from x, and uses information discovered whilst doing so to flesh out what we know about other nodes too.

Here is the algorithm, written in some bizarre pseudocode language that looks like nothing I actually write (no, I don’t know why I’m using it either):

 
  Neighbours(x) = NI(x, Iterator(Definite(x)))

  Next(NI(x, found))

    # If we have no known-far nodes left to return, examine candidates
    # until we find one or run out
    while ((result = Next(found)) == None && not IsEmpty(Candidates(x))) do
      candidate = Pop(Candidates(x))
      s = d(x, candidate)

      if s > t then Far(x, candidate) else Close(x, candidate)

      # Because d(y, candidate) <= d(y, x) + d(x, candidate) <= (t - s) + s = t
      for(y in Distances(x) where d(y, x) <= t - s) Close(candidate, y)

      # Because d(y, candidate) >= d(y, x) - d(x, candidate) > (t + s) - s = t
      for(y in Distances(x) where d(y, x) > t + s) Far(candidate, y)

      # Because if d(y, candidate) <= t then s = d(x, candidate) <= d(y, candidate) + d(y, x) < t + (s - t) = s, a contradiction
      if s > t then for(y in Distances(x) where d(y, x) < s - t) Far(candidate, y)

      # If we have no more candidates left then Distances(x) will never be used again and is wasted space
      if IsEmpty(Candidates(x)) then delete Distances(x) else Put(Distances(x), s, candidate)
    done

    return result

The idea being that as we’re searching for edges for a given node we’re also using this to fill out the edges for the other nodes. This should work particularly well for doing depth first search, because in particular it often means that the nodes we’re going to transition to will have some of their neighbours filled in already, and we may in fact be able to spend most of our time just chasing through Definite sets rather than having to do new distance calculations. Even where we haven’t been able to do that, hopefully we’ve managed to prune the lists of elements to consider significantly before then.
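
Since I haven’t written real code for this yet, here’s an untested Python translation to make the pseudocode concrete. The bisect-based sorted list standing in for Distances(x), and the list-plus-set representation of Definite(x), are implementation choices of mine rather than part of the algorithm:

import bisect

class SeparationGraph:
    def __init__(self, points, d, t):
        self.points, self.d, self.t = points, d, t
        n = len(points)
        self.definite = [[] for _ in range(n)]         # known far nodes, in discovery order
        self.definite_set = [set() for _ in range(n)]  # same contents, for duplicate checks
        self.candidates = [set(range(n)) - {i} for i in range(n)]
        self.distances = [[] for _ in range(n)]        # sorted (distance, node) pairs

    def _close(self, x, y):
        self.candidates[x].discard(y)
        self.candidates[y].discard(x)

    def _far(self, x, y):
        self._close(x, y)
        for a, b in ((x, y), (y, x)):
            if b not in self.definite_set[a]:
                self.definite_set[a].add(b)
                self.definite[a].append(b)

    def _nearer(self, x, bound):  # recorded nodes y with d(x, y) <= bound
        pairs = self.distances[x]
        return [y for _, y in pairs[:bisect.bisect_right(pairs, (bound, float('inf')))]]

    def _strictly_nearer(self, x, bound):  # recorded nodes y with d(x, y) < bound
        pairs = self.distances[x]
        return [y for _, y in pairs[:bisect.bisect_left(pairs, (bound, -1))]]

    def _further(self, x, bound):  # recorded nodes y with d(x, y) > bound
        pairs = self.distances[x]
        return [y for _, y in pairs[bisect.bisect_right(pairs, (bound, float('inf'))):]]

    def neighbours(self, x):
        # Lazily yield every y with d(points[x], points[y]) > t, using each
        # distance computed along the way to shrink other nodes' candidate
        # sets via the triangle inequality.
        emitted = 0
        while True:
            while emitted < len(self.definite[x]):
                yield self.definite[x][emitted]
                emitted += 1
            if not self.candidates[x]:
                return
            y = self.candidates[x].pop()
            self.candidates[y].discard(x)
            s = self.d(self.points[x], self.points[y])
            if s > self.t:
                self._far(x, y)
                # d(z, x) < s - t and d(z, y) <= t would force d(x, y) < s
                for z in self._strictly_nearer(x, s - self.t):
                    self._far(y, z)
            else:
                # d(z, x) <= t - s implies d(z, y) <= d(z, x) + s <= t
                for z in self._nearer(x, self.t - s):
                    self._close(y, z)
            # d(z, x) > t + s implies d(z, y) >= d(z, x) - s > t
            for z in self._further(x, self.t + s):
                self._far(y, z)
            if self.candidates[x]:
                bisect.insort(self.distances[x], (s, y))
            else:
                self.distances[x] = []  # never consulted again; free the space

Usage would be something like g = SeparationGraph(points, d, t), then iterating g.neighbours(i) from inside the graph search.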

Notes

  • It should be fairly evident that this is not only \(O(n^2)\) worst case, it’s probably \(O(n^2)\) expected case for exploring the whole graph, even in reasonable metric spaces – it’s likely that one or the other of the two sets of points we’re looking at during each iteration step will be O(n) – so the only possible win over the brute force algorithm is if it does significantly fewer expensive distance calculations
  • I’ve not yet written any real code for this, so I’m sure some of the details are wrong or awkward
  • Even if it’s not wrong or awkward it may well turn out to have sufficiently bad constant factors, or save few enough distance invocations, that it’s worthless in practice

This entry was posted in Code, programming.

A cute algorithm for a problem you’ve never cared to solve

As part of my recent playing with metric search structures, I was wondering about the following problem: Given a metric space X I want to partition X into two sets A and B such that each of A and B has a small diameter. Equivalently, I’m trying to find A that minimizes max(diam(A), diam(\(A^c\))).

The brute force solution for this would be fairly awful: There are \(2^n\) partitions, calculating the score of each is \(O(n^2)\), so the brute force solution takes \(O(2^n n^2)\). So let’s not do that.

I was expecting that I’d have to find an approximate solution, as this looked like a classic hairy NP-hard optimization problem. I’d started contemplating local optimizations, simulated annealing, etc. Then after doing a bit more searching I happened across this paper, which in fact has a polynomial time exact solution to the problem: Diameter Partitioning by David Avis.

The algorithm is incredibly cute – it uses absolutely no advanced ideas, just two clever observations which reduce the entire problem to elementary algorithms. It’s the sort of thing which as soon as you see it you just kick yourself for not having thought of it. So I thought I’d write about it to share how much I enjoyed it, even if no one else is actually likely to care about solving this problem.

The first observation is this:

We can find a partition A, B which has max(diam(A), diam(B)) \(\leq\) t if and only if we can find two sets A, B such that whenever \(d(x, y) > t\), x and y are in different halves of the partition (this is just restating the definition).

That is, if we draw an edge between x and y if and only if \(d(x, y) > t\), then a partition with max(diam(A), diam(B)) \(\leq\) t is precisely a bipartition (a proper 2-colouring) of this graph.

But bipartitions are easy to find! Or at least, \(O(n^2)\) to find, which is no worse than calculating the diameter of a set in the first place. You just do a graph search on each of the components, marking nodes alternately black and white (i.e. if you’re coming from a white node you mark this node black, if you’re coming from a black you mark this node white) as you find them, and yell “Impossible!” if you ever try to paint a node a different colour than you’ve already painted it. Done.
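
In Python, that check looks something like this (a sketch – edges is whatever list of pairs with \(d(x, y) > t\) you’ve built):

from collections import deque

def two_colour(n, edges):
    # Try to 2-colour the graph on nodes 0..n-1. Returns a list of 0/1
    # colours (the two halves of the partition), or None if some odd
    # cycle makes it impossible.
    adjacency = [[] for _ in range(n)]
    for x, y in edges:
        adjacency[x].append(y)
        adjacency[y].append(x)
    colour = [None] * n
    for start in range(n):
        if colour[start] is not None:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            x = queue.popleft()
            for y in adjacency[x]:
                if colour[y] is None:
                    colour[y] = 1 - colour[x]  # opposite colour to x
                    queue.append(y)
                elif colour[y] == colour[x]:
                    return None                # “Impossible!”
    return colour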

So for any given t we have an \(O(n^2)\) algorithm that either finds us a partition at least that good or tells us that none exists. Now how do we find a best possible partition?

The obvious answer is binary search, but binary searching on floats is never going to give us the exact answer, only a series of successively better approximations. This then motivates the second observation: Our distances are discrete.

If we work out the distance between every pair (we already have to do that to build the graph) then we can put these pairs and their distances in an array and sort it. Then it’s just a simple matter of binary searching that array, and additionally we can use it to build the graph without calculating any more distances – just take the pairs to the right of the current point in the array.

Building the array is \(O(n^2 log(n))\), and we will do \(log(n)\) binary searches, each performing an \(O(n^2)\) operation, so in total the cost of this algorithm is \(O(n^2 log(n))\). A lot better than brute force.
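
Putting the two observations together, the whole algorithm is a binary search over suffixes of the sorted pair array (this sketch reuses two_colour from above):

def best_partition(points, d):
    # Find the smallest threshold index k such that the graph whose edges
    # are the pairs to the right of k is still 2-colourable.
    n = len(points)
    pairs = sorted(
        (d(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    def colouring_for(k):  # edges are exactly the pairs from index k onwards
        return two_colour(n, [(i, j) for _, i, j in pairs[k:]])
    lo, hi = 0, len(pairs)  # with no edges at all the graph is trivially 2-colourable
    best = colouring_for(hi)
    while lo < hi:
        mid = (lo + hi) // 2
        attempt = colouring_for(mid)
        if attempt is None:
            lo = mid + 1        # threshold too ambitious: graph not bipartite
        else:
            best, hi = attempt, mid
    return best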

Notes

  • It apparently is NP-hard to do this if you want to partition into more than two parts
  • Any algorithm which guarantees better than a factor of two of the right answer has to examine every pair of points and thus must be \(O(n^2)\). You can see this by constructing a metric space of n points, picking a distinguished pair x, y and setting \(d(x, y) = 2\) whilst setting all other distances to 1. Any partition which separates x and y will have a score of 1, any other will have a score of 2, and by randomly picking x and y you can see that there’s no way to know this without examining every pair
  • The above example also demonstrates that it’s \(O(n^2)\) to even calculate a diameter, which means that this is also a lower bound on even knowing for sure how good your partition is
  • If you don’t care about solving this exactly and only care that the result is “good” in some fuzzy sense I think you can do a lot better in practice. There are good approximate diameter algorithms, and we didn’t take advantage of the triangle inequality at all in our algorithm for building the separation graph, so I think it’s possible to adjust the partition testing code in a way that is much better than \(O(n^2)\) in many nice cases, even if it necessarily degenerates to \(O(n^2)\) in the worst cases. I’ll write more about this once I’ve worked out the details.

This entry was posted in Code, programming.

I don’t write Scala

A lot of people follow me on Twitter. I don’t mean Stephen Fry level a lot, but it’s about 8 or 9 times as many people as I follow back.

Based on a (purely visual) random sampling of this, a significant proportion of these are people interested in Scala. That’s fine. I know many such people. I could also be misrepresenting them and they’re just people interested in programming, following me because I’m a programmer (and occasionally even tweet programming related things). I’m not sure, and can’t really be sure without doing a lot more research than I’m actually interested in doing.

So I’m going to assume it’s a Scala thing, because this fits into a general perception issue I’ve noticed people have about me.

You see, I don’t write Scala. I haven’t since late 2009. I didn’t make a big deal about it, because that would have been childish, I just informed a few people in the Scala community I thought should know that I was leaving, along with a few of my reasons why, and then quietly did so. Most of them took it very graciously.

Weirdly, two and some years on, most people seem not to have noticed the complete absence of Scala related content from me. I suspect it’s because I’ve mostly dropped off their radar for one reason or another, so there’s just a vague general impression of me as a Scala person. Hopefully this post should help remove some of that.

To be clear: This is not a normative statement. Just because I don’t write Scala, doesn’t mean you shouldn’t. Scala is pretty neat. I have my reasons to not use it, but don’t wish to explain as I would find the resulting language flamewar extremely tedious. I would greatly appreciate it if you don’t use this post to start one anyway.

This is also not a statement that I hate Scala and will never use it again. I’ve no immediate plans to, but never is a long time. I expect Scala will do quite well, and if it does I expect I will at some point find myself using it again.

To forestall the inevitable question: I am currently mostly writing Ruby at work, and a whole smattering of unrelated things at home (recents include Haskell, Java, Clay, C, a little C++, some Lua…) as the whim takes me. This list is not intended to be prescriptive, and I’m not really interested in the inevitable suggestions for what languages to try next.

TLDR: I write a bunch of languages, mostly Ruby for reasons of circumstance rather than design, but Scala is not numbered amongst them. This is a statement of fact, not a piece of advice.

This entry was posted in Code, life.