David R. MacIver's Blog: Pearsons in the database, part 2

Pearsons in the database, part 2

9 December 2008

Before I explain what this is about, the following tweets provide useful context for how I feel about this:

http://twitter.com/DRMacIver/status/1047320819

http://twitter.com/DRMacIver/status/1047321174

We’ve been discovering that the SQL query I wrote for calculating pearsons, while it works fine for small datasets (say a few hundred thousand ratings), once you get to around the million rating mark starts being unusably slow, even for a nightly job. Basically if you look at the query plan it ends up doing a filesort. This is not nice, as it involves an awful lot of data paging to and from disk. Having tried to optimise it directly and failed, I’ve spent the last two weeks writing nasty hacks to try to make it fast and was going to follow up to the blog post with my final solution (which involves a temporary table and a stored procedure. It’s pretty grim).

But today I was doing something else which involved a similar sort of query and got really pissed off that I was having exactly the same problems. I boiled it down to a very simple example which illustrated it and logged onto freenode’s #mysql to see if anyone could help me figure out what’s going on. Last time I tried this I was not successful, but one lives in hope.

Well, as it turns out, someone could. Here’s a snippet of conversation:

16:20 < DRMacIver> I'm finding this pattern comes up a lot in what I'm doing at the moment, and I simply can't figure out a sensible way to optimise it. Any suggestions? http://pastebin.com/m5be98ace
16:21 < DRMacIver> The self join on the same column followed by a group by seems to produce a filesort no matter what I do. As per example, basically all indices that could exist on the table do.
16:22 < jbalint> DRMacIver: i think you can order by null
16:22 * DRMacIver boggles
16:22 < DRMacIver> That works. Thank you.

So, by updating one line in the SQL I’ve posted previously the query becomes dramatically faster. I’ve also updated it to use an update join (which is MySQL specific) instead of the dependent subquery in the set (which is not, but which MySQL runs appallingly slowly) in order to get the standard deviation calculations to run in reasonable time.

Why is this happening? Well, http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html contains the answer. It turns out that if you have a group by then it implicitly orders by that group. Even if the group clause has no index on it (which it can’t in this case because it’s a join across two tables), and ordering a large dataset by a non-indexed key will cause a filesort. If you add the clause “order by null” then it will not order by the group by clause, and the filesort goes away and the query becomes much faster.

We are not amused.

As another general tip, we’ve also found that this rapidly starts to generate unacceptably large amount of data, but that if you set some sensible lower bounds (such as only generating the statistics for things which cooccur more than once and have a pearsons > 0.1) it can easily be reduced to sensible levels.