Calculating Pearson’s correlation coefficient in SQL

Apparently I’m crazy.

Well, it’s not the first time this accusation has been levelled. But it’s one of the more amusing ones.

My colleague Jan Berkel was at Ruby Manor, where one of the talks was on acts_as_recommendable. Here’s what he had to say on the subject:

a plugin for rails to generate recommendations. lets you do things like user.similar_users / books.similar_books etc. it’s based on pearson correlation. the guy mentioned he ran into memory problems using large datasets for doing the computation (load all objects , calculate, store objects). i asked whether they tried performing the calculation on SQL level to avoid runtrips to the db (the way david implemented it). he first didn’t understand what i was saying, then someone from the audience said “you’d have to be crazy to do that”… i guess this elegant solution is not the ruby way.. anyway.

This entertained me, as one of the things I introduced recently to our theme extraction process is using pearson’s coefficient to rate similarity of themes. It never even occurred to me that loading it all to the client side doing it in Ruby and then saving it back might be considered an option.

Pearsons is very far from being the most powerful metric in the world, but remarkably useful for all that: It gives a decent scaled measure of similarity and is cheap and easy to calculate.

Ok. Well. Easy to calculate. And cheap to calculate if you don’t do clever things like loading the entire data set into memory. Anyway…

All joking aside, it wasn’t entirely trivial to do, and clearly this is something which is of interest to more people than us, so this post is an explanation of how it works.

The Pearson product-moment corellation coefficient of two distributions is a measurement of their similarity. Given random variables X, Y we define the covariance of the two to be:

Cov(X, Y) = E(XY) – E(X) E(Y)

The reason for this slightly strange looking expression is that random variables have the nice property that for two independent random variables E(XY) = E(X) E(Y). So if X and Y are independent then Cov(X, Y) = E(XY) – E(X) E(Y) = E(X) E(Y) – E(X) E(Y) = 0. The converse isn’t generally true (it’s true under some circumstances), but nevertheless we can use Cov(X, Y) as a cheap measure of degree of independents.

It has some other nice properties: Suppose Y = aX. Then Cov(X, Y) = a Var(X). So if Y is a positive multiple of X then the covariance is positive, if Y is a negative multiple of X the covariance is negative.

This also shows us that there’s no inherent significance to the size of the covariance. It depends on the variance of both X and Y – in fact, scaling either will increase the covariance proportionately.

However it does have the following nice property:

|Cov(X, Y)| <= StdDev(X) StdDev(Y)

(When one is a multiple of the other, equality is achieved).

This allows us to define the pearson’s coefficient (normally written rho):

rho(X, Y) = Cov(X, Y) / (StdDev(X) * StdDev(Y))

Which by the above property takes values between -1 and 1 (it will take value 1 when one is a positive multiple of the other and value -1 when one is a negative multiple of the other).

That’s all there is to it. It’s a nice scaled measure of how similar the two distributions are.

So, why do we care and how do we use this?

Well, the details of how we actually use this are cunning and proprietary. So I can’t really tell you those (I’m hoping that we may be able to disclose some of them at some point, but not just yet). So let’s pretend we’re trying to win the netflix challenge (I think some solutions actually do use this, but this example has the nice property of having absolutely nothing to do with how we use it in SONAR). We have movies, we have people, we have ratings of movies by people. Given two movies we want to determine how similar they are, based on the people who liked them.

Sound interesting now?

We can do this by considering the ratings of each movie as a random variable. A person is a sample of this rating space. So by considering the set of all people as a sample we can calculate the pearsons corellation over this sample (I know I only explained how it works for random variables, but it carries over in the obvious way for samples).

So we’ve got a setup: People, movies, ratings. We’re going to calculate the correlation between movies and store the results in the database (I’d like to use views for this, but MySQL does horrible bad things to view join performance).

Before I proceed, there are a few subtle points:

  • Not everyone has rated every movie. I’m going to treat movies which have not been rated as implicitly having a rating of 0. This is sortof justified: Not watching a movie is often a stance about what sort of movie you like. Mostly it just makes the maths convenient. :-)
  • I’m going to ignore pairs of movies which have no rating in common. These have a corellation which is easy to work out (rho(x, y) = – mean(x) * mean(y) / (std_dev(x), std_dev(y))), and there tend to be a lot of them. This simplifies calculations and reduces storage overhead. Heavily anticorrelated movies also tend to be more because of holes in the data than interesting facts.

So let’s imagine our schema is set up as follows:

create table people( id int primary key
) create table movies( id int primary key
) create table ratings( person int references people (id), movie int references movies (id), rating float not null, unique key (person, movie)
)

We start out by computing mean and standard deviations for movies. Normally we’d use the built in functions for this, but unfortunately because we’ve got all those implicit 0 ratings floating around we have to do it ourselves.

Here’s the table in which we’ll store per movie stats:

 create table movie_statistics( movie int primary key references movies(id), mean float, stddev float )

We’ll only actually include stats for movies which have been rated at some point.

First we’ll calculate the mean:

 insert into movie_statistics (movie, mean) select movie, sum(rating) / (select count(*) from people) mean from ratings group by movie

Then the standard deviation:

 update movie_statistics set stddev = ( select sqrt( sum(rating * rating) / (select count(*) from people) - mean * mean ) stddev from ratings where ratings.movie = movie_statistics.movie group by ratings.movie)

All perfectly straightforward, if a little annoyingly wordy.

Now we get to the actual interesting bit: Calculating correlations. First we create a table in which to store the results:

create table movie_correlations(

movie1 int references movies(id),

movie2 int references movies(id),
rho float,
unique key (movie1, movie2),
key (movie1),
key (movie2)
)
Now we need to populate it. This will proceed in essentially three parts. First, we need to calculate E(XY), so for every pair of co-occurring movies we’ll get a sum of their products over all people who have rated them. That’s Sum(XY). We then divide by the number of people to get the average value of XY (implicit 0s again). Then we’ll use our already calculated statistics to give us the covariance (E(XY) – E(X) * E(Y)), and then the correlation (divide by standard deviations).Or, to put it altogether in one massive great big query:

 insert into movie_correlations ( movie1, movie2, rho ) select sf.m1, sf.m2, (sf.sum / (select count(*) from people) - stats1.mean * stats2.mean ) -- covariance / (stats1.stddev * stats2.stddev) from ( select r1.movie m1, r2.movie m2, sum(r1.rating * r2.rating) sum from ratings r1 join ratings r2 on r1.person = r2.person group by m1, m2 ) sf join movie_statistics stats1 on stats1.movie = sf.m1 join movie_statistics stats2 on stats2.movie = sf.m2

There, that wasn’t so bad was it? The inner query does the summing: By joining the ratings table to itself on the people we get all pairs of movies where we have a rating for both. We group by pairs of movies, thus summing over all the people who rated them. It’s then a simple matter to calculate the covariance and thus the correlation.

Or maybe I am crazy to look at a 23 line SQL query and say “that wasn’t so bad, was it?”. But at least I’m a crazy person whose code doesn’t run out of memory nearly as soon. :-)

Here’s some code that ties all of this together: http://code.trampolinesystems.com/pearsons.rb

This entry was posted in Uncategorized and tagged on by .