Computing connected graph components via SQL

Hi, I don’t post to here much. I’m one of the devs working on SONAR, focusing on mostly theme extraction.

As with many applications, SONAR’s data crunching is basically relational database driven. We keep thinking about experimenting with graph DB based approaches, but never manage to find quite a compelling enough reason – there’s no way we’d give up the relational approach entirely, so it needs to be a really big win to be worth the annoyance of having to maintain two different types of database in synch with eachother.

Unfortunately this sometimes leaves us in the unenviable position of having to do graph algorithms in SQL. This is about as much fun as you might imagine it to be. Most recent challenge: Computing the connected components of a graph in SQL.

There’s always the option of loading it into memory and doing it there of course. But the graph in question is rather large. With our little demo data sets it would be fine to do that, but any larger (e.g. on a real live sonar deployment) and this starts to sound like a really bad idea.

It turns out this is surprisingly simple to do once you have the key insight. I couldn’t find anything on the web explaining this though, so thought I’d write a post about it in case anyone else needs to do the same. It’s not rocket science, but hopefully this will save someone some time.

Consider the following setup:

create table if not exists items( id int primary key, component_id int
); create table if not exists links( first int references items (id), second int references items (id)

We consider entries in links as undirected edges in a graph and we want to update items so that all items in the same component have the same component_id and distinct components have distinct component ids.

We’ll do this incrementally and merge. In order to do this we need a new table which we’ll use as scratch space (this should be a temporary table, but MySQL has irritating restrictions on temporary tables which make this not work):

create table if not exists components_to_merge( component1 int, component2 int);

(Side note: All SQL here is tested only on MySQL. It shouldn’t be hard to make it work on any other database though).

The idea is that at each step we’ll merge components, using components_to_merge to map components to the component they’ll be merged with.

So we start with a set of candidate components. That’s simple enough: We take each node as a potential starting component.

Now at each stage we merge components by finding links between them. For every potential component we look at all other potential components it’s connected to via some link. We insert all component-component links into components to merge. This is straightforward enough:

insert into components_to_merge select distinct t1.component_id c1, t2.component_id c2 from links join items t1 on links.first = join items t2 on links.second = where t1.component_id != t2.component_id insert into components_to_merge
select component2, component1 from components_to_merge; -- ensure symmetricity

So, now we have a list of groups to merge in this table. If the table is empty then we’re done – all the groups are maximal (and because of the way we constructed them they’re connected – at each point they were built by joining together two connected sets which were the connected to eachother), so the component_ids currently in items describe the actual components. If not, we now reassign components:

 update items join ( select component1 source, min(component2) target from components_to_merge group by source ) new_components on new_components.source = component_id set items.component_id = least(items.component_id, target)

This step merges each component with another one (although “merging” conveys a slightly inaccurate sense of what happens. Consider a graph 1-2-3. The first step would result in 1 and 2 being assigned component_id 1, and 3 being assigned component_id 2. So the {3} component took the place of the {2} component).

What’s the complexity of this code? Well, it’s not amazing, but it’s not terrible either. The complexity of each step in the loop is probably somewhere around O(n log(n)) depending on exactly what the database does to it. The worst case number of queries is the size of the largest component: It’s obviously an upper bound as the number groups decreases by at least one each time; In order to see that it’s achieved, consider an extension of the 1-2-3 example where we have a graph 1-2-…-n. Then at each stage what happens is that we end up with components [1], [2], …, [n] -> [1, 2], [3], …, [n] -> [1, 2, 3], [4], …, [n], etc, taking n steps to terminate. On the other hand, a complete graph terminates in one step.

I suspect that the expected run time is O(log(n)), with each part of the component chosen approximately doubling in size each time, but I confess to not actually having bothered to run the maths: For our particular use case at the moment this is fine – it turns out the graph we’re considering is fairly sparse and tends to have small components, so for the moment this is more than fast enough. On the other hand, it would be nice to have a better guaranteed time, so if anyone has a smarter approach I’d love to hear it.

Anyway, here’s some sample code that ties all of this together:

This entry was posted in Uncategorized and tagged on by .