# Computing connected graph components via SQL

Hi, I don’t post here much. I’m one of the devs working on SONAR, focusing mostly on theme extraction.

As with many applications, SONAR’s data crunching is basically relational-database driven. We keep thinking about experimenting with graph-DB based approaches, but never quite manage to find a compelling enough reason – there’s no way we’d give up the relational approach entirely, so it would need to be a really big win to be worth the annoyance of keeping two different types of database in sync with each other.

Unfortunately this sometimes leaves us in the unenviable position of having to do graph algorithms in SQL. This is about as much fun as you might imagine it to be. Most recent challenge: Computing the connected components of a graph in SQL.

There’s always the option of loading the graph into memory and doing it there, of course. But the graph in question is rather large. With our little demo data sets that would be fine, but any larger (e.g. on a real live SONAR deployment) and it starts to sound like a really bad idea.

It turns out this is surprisingly simple to do once you have the key insight. I couldn’t find anything on the web explaining it, though, so I thought I’d write a post about it in case anyone else needs to do the same. It’s not rocket science, but hopefully this will save someone some time.

Consider the following setup:

```sql
create table if not exists items (
  id int primary key,
  component_id int
);

create table if not exists links (
  first int references items (id),
  second int references items (id)
);
```


We consider entries in links as undirected edges in a graph, and we want to update items so that all items in the same component have the same component_id and distinct components have distinct component_ids.

We’ll do this incrementally, merging components as we go. In order to do this we need a new table which we’ll use as scratch space (this should really be a temporary table, but MySQL won’t let you refer to a temporary table more than once in the same query, which breaks the symmetry-inserting statement below):

```sql
create table if not exists components_to_merge (
  component1 int,
  component2 int
);
```


(Side note: All SQL here is tested only on MySQL. It shouldn’t be hard to make it work on any other database though).

The idea is that at each step we’ll merge components, using components_to_merge to map components to the component they’ll be merged with.

So we start with a set of candidate components. That’s simple enough: we take each node as a potential starting component (update items set component_id = id).

Now at each stage we merge components by finding links between them: for every candidate component we look at all the other candidate components it’s connected to via some link, and insert every such component-component pair into components_to_merge. This is straightforward enough:

```sql
insert into components_to_merge
select distinct t1.component_id c1, t2.component_id c2
from links
join items t1 on links.first = t1.id
join items t2 on links.second = t2.id
where t1.component_id != t2.component_id;

insert into components_to_merge
select component2, component1 from components_to_merge; -- ensure symmetry
```


So, now we have a list of component pairs to merge in this table. If the table is empty then we’re done – all the groups are maximal (and, because of the way we constructed them, connected: at each point they were built by joining together two connected sets which were themselves connected to each other), so the component_ids currently in items describe the actual components. If not, we reassign components:

```sql
update items
join (
  select component1 source, min(component2) target
  from components_to_merge
  group by source
) new_components on new_components.source = component_id
set items.component_id = least(items.component_id, target);
```

(Don’t forget to empty components_to_merge before the next iteration, otherwise stale pairs will keep the loop from ever seeing an empty table.)

This step merges each component with another one (although “merging” conveys a slightly inaccurate sense of what happens: consider the path graph 1-2-3. The first step results in 1 and 2 being assigned component_id 1, and 3 being assigned component_id 2 – so the {3} component took the place of the {2} component).
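To make the whole loop concrete, here’s a runnable sketch in Python. It’s translated to an in-memory SQLite database for self-containedness (an assumption on my part – SQLite has no multi-table update … join and no least(), so a correlated subquery and the two-argument min() stand in for them; the table names match the post, and the sample graph is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    create table items (id int primary key, component_id int);
    create table links (first int references items (id),
                        second int references items (id));
    create table components_to_merge (component1 int, component2 int);
""")

# A path 1-2-3 plus a separate edge 4-5: two components.
cur.executemany("insert into items values (?, null)", [(i,) for i in range(1, 6)])
cur.executemany("insert into links values (?, ?)", [(1, 2), (2, 3), (4, 5)])

# Start with every node as its own candidate component.
cur.execute("update items set component_id = id")

while True:
    cur.execute("delete from components_to_merge")  # clear the scratch table
    cur.execute("""
        insert into components_to_merge
        select distinct t1.component_id, t2.component_id
        from links
        join items t1 on links.first = t1.id
        join items t2 on links.second = t2.id
        where t1.component_id != t2.component_id
    """)
    cur.execute("""
        insert into components_to_merge
        select component2, component1 from components_to_merge
    """)
    if cur.execute("select count(*) from components_to_merge").fetchone()[0] == 0:
        break  # no links between distinct components left: we're done

    # SQLite lacks UPDATE ... JOIN and least(); a correlated subquery
    # plus the two-argument min() scalar function do the same job.
    cur.execute("""
        update items
        set component_id = min(component_id,
            (select min(component2) from components_to_merge
             where component1 = items.component_id))
        where component_id in (select component1 from components_to_merge)
    """)

print(cur.execute("select id, component_id from items order by id").fetchall())
# -> [(1, 1), (2, 1), (3, 1), (4, 4), (5, 4)]
```

Nodes 1–3 end up sharing one component id and 4–5 another, as expected.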

What’s the complexity of this code? Well, it’s not amazing, but it’s not terrible either. The cost of each step in the loop is probably somewhere around O(n log(n)), depending on exactly what the database does with it. The worst-case number of iterations is the size of the largest component: it’s obviously an upper bound, as the number of groups decreases by at least one each time. To see that it’s achieved, consider an extension of the 1-2-3 example to a path 1-2-…-n. At each stage we go from components [1], [2], …, [n] to [1, 2], [3], …, [n] to [1, 2, 3], [4], …, [n], and so on, taking n steps to terminate. On the other hand, a complete graph terminates in one step.

I suspect that the expected number of iterations is O(log(n)), with components approximately doubling in size each time, but I confess to not actually having bothered to do the maths. For our particular use case this is fine – it turns out the graph we’re considering is fairly sparse and tends to have small components, so for the moment this is more than fast enough. On the other hand, it would be nice to have a better guaranteed running time, so if anyone has a smarter approach I’d love to hear it.

Anyway, here’s some sample code that ties all of this together: http://code.trampolinesystems.com/components.rb

---

Surgeon General’s Warning: This post contains an excess of hyperlinks. There is anecdotal evidence that excessive linking may be hazardous to your health.

I’m still here. :-) I’ve just been busy with my new job at Trampoline Systems and non-code things.

On the computer front, I’ve been learning (a bit) about spatial data structures and graph algorithms, mostly for work related reasons although also for personal interest. Tinkering with Haskell continues apace – I’ve been finding that it’s a very good language for thinking in, even if I don’t write anything big in it.

The Lazy Strings project may look like it died, but fear not! It continues. Ok, you probably didn’t care either way, but still it continues.

I’ve decided that a) life is too short to write it in Java and b) I should just pick an implementation and stick to it. Consequently I’ve moved to Scala, and have the basics of an implementation based on a finger tree, rather than the traditional balanced binary tree used in a rope.

Why a finger tree? Well, umm. Because. :-) Finger trees have nice performance characteristics, are easy to implement, and seem well suited to the task. The version I’m using is very heavily specialised to the task at hand, and measures a number of things up front (currently just length and hash code) to improve performance and allow for various nice optimisations.
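The post doesn’t show the Scala code, so here’s a minimal sketch of the “measure things up front” idea (in Python, over a plain binary concatenation tree rather than a finger tree, and with an invented polynomial hash – the actual implementation’s choices may well differ). Each internal node combines its children’s cached length and hash in O(1), so neither measurement ever requires walking the whole string:

```python
MOD = (1 << 61) - 1   # large prime modulus (illustrative choice)
BASE = 257            # polynomial hash base (illustrative choice)

class Leaf:
    """A flat chunk of text, with its measurements computed once."""
    def __init__(self, text):
        self.text = text
        self.length = len(text)
        h = 0
        for ch in text:
            h = (h * BASE + ord(ch)) % MOD
        self.hash = h

    def flatten(self):
        return self.text

class Concat:
    """An internal node: measurements combine in O(1) from the children."""
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.length = left.length + right.length
        # For a polynomial hash, hash(l ++ r) == hash(l) * BASE^len(r) + hash(r),
        # so the combined hash agrees with hashing the flattened string.
        self.hash = (left.hash * pow(BASE, right.length, MOD) + right.hash) % MOD

    def flatten(self):
        return self.left.flatten() + self.right.flatten()

s = Concat(Leaf("hello, "), Leaf("world"))
assert s.length == len("hello, world")
assert s.hash == Leaf("hello, world").hash
```

The same pattern extends to any measurement that forms a monoid, which is exactly what finger trees exploit.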

The main thing to note is that I’ve been using ScalaCheck to test properties of the string. It’s been a great help. I’ve not found it that useful for tracking down the specifics of bugs – its error reporting isn’t that great – but it’s been very useful for showing that they exist and giving enough of a general area that I can track them down myself. The fact that Scala has a REPL has been very useful for doing enough experimentation to pin things down.

The utility of these will come as no surprise to the functional programmers reading my blog. :-) ScalaCheck isn’t quite as nice as QuickCheck and the Scala REPL isn’t as nice as most of the ML ones (it’s better than ghci, though), but they’re both good enough, and it’s nice having them in Scala.

Scala itself I continue to have mixed feelings about. It’s a little too Java-like. I very much like what it’s done with the object system (first-class modules – yay!), and about half of what it’s done with the type system, but the whole effect still feels kludgy to me. It’s definitely infinitely better than Java, though, and doesn’t fall far short of being as pleasant as an ML (it’s better in some ways, worse in others).