Archive for the ‘Uncategorized’ Category

A workaround for misbehaving X citizens

Wednesday, April 1st, 2009

Some programs (most notably amongst those I use firefox and pidgin) don’t work properly with shift-insert for pasting from the primary X Selection. If you’re like me and virtually live in the command line, this is a real pain in the ass. It became more of a pain in the ass recently because I’ve started using a graphics tablet instead of a mouse, so no longer have a middle click. I could no longer paste from IRC into firefox.

After some hacking around, I settled on the following solution. “xsel -o -p | xsel -i -b” dumps the primary X selection into the clipboard (xsel is a nice little application. You should probably have it installed). So I’ve bound this to a shortcut in xmonad and now can at the press of a magic key combination take the current selection and make it pasteable. Hurray.

Left folds with early termination

Wednesday, March 18th, 2009

Consider what should be the basis of a general purpose collections API. In Scala it is Iterable, which uses an Iterator to define a foreach method. Most other things are defined in terms of this (note: This statement is untrue. Many of them use the iterator directly. But it is morally true).

This is all icky and imperative. Yuck.

As every good functional programmer knows, the true notion behind this silly “iteration” thing is actually folding.

The JVM is mean to us and doesn’t support lazy evaluation or variable sized stacks, so foldRights tend to be an exercise in pain: They stack overflow distressingly soon. So let’s base our collections API on foldLeft: Most designs of it are tail recursive (or iterative. Sigh). As well as dealing with JVM limitations this lets us develop a very high performance interface, and we can do pleasant things like dealing with very large lazy collections in constant space[1]

So:

trait Foldable[T]{
 def foldLeft[S](s : S)(f : (S, T) => S) : S;
}

There. That’s a nice basic design.

What methods can we add to this?

Well the obvious thing we can do to placate the Java programmers:

def foreach(f : T => Unit) = foldLeft[Unit](())((s, t) => f(t))

Now let’s define some stuff we care about. One handy method on Iterable is find. So let’s add that.

def find(f : T => Boolean) : Option[T] =
   foldLeft[Option[T]](None)((s, t) =>
     s match {
       case Some(s) => Some(s);
       case None if f(t) => Some(t);
       _ => None
  })

Nice, isn’t it?

So what happens if we do Stream.from(1).find(_ > 10).

It loops forever. :-(

Unfortunately our nice efficient foldLeft based implementation always eats the entire collection. This is sad: We can’t deal with infinite collections.It’s also really inefficient for large collections where we don’t need the whole thing.

So let’s add a way of stopping when we’re done:

def foldLeft[S](s : S)(f : (S, T) => Option[S]) : S;

The idea is that the Option is used to signal a stop condition. We return a None when we’re done consuming the collection, and the foldLeft stops.

Rewriting find in terms of this:

def find(f : T => Boolean) : Option[T] =
   foldLeft[Option[T]](None)((s, t) =>
     s match {
       case Some(s) => None;
       case None if f(t) => Some(Some(t));
       _ => Some(None)
  })

Ok. That’s a bit gross. But that’s purely a function of the fact that we were already trying to return an Option[T]. I’m sure for normal code it will look ok.

What happens is that once we’ve already found something, we now know we can stop, so we return a None to indicate stopping.

Hmm. It would be nice if we could incorporate some of this functionality into find itself. Let’s consider the following example:

(1 to 100000000 by 3).find(_ == 11)

We know this collection is ordered, so once we’ve gotten to 13 we know we’re never going to find 11. But find keeps on trucking till it hits the end of the collection. :-(

So, let’s modify find to support early termination as well. First we change its signature to f => Option[Boolean], and then…

No, sorry. My sense of good taste just revolted.

The problem is we now have to modify all signatures that might possibly want to support early termination of the collection. And they then have to explicitly deal with passing that on to foldLeft in the right way. All just to allow callers to pretend that the collection is smaller than it really is. It would be nice if we could write this functionality once and then have all the higher order functions based on foldLeft inherit it for free.

Agreed?

Well, there fortunately *is* a way to do that. It involves some loss of type safety, but as long as foldLeft is always properly implemented this should be acceptable.

case object Stop extends Throwable;
def stop : Nothing = throw Stop;

Implementations of foldLeft then have to catch Stop and treat that point as the end of the collection.[2]

Now, let’s implement find as follows:

def find(f : T => Boolean) : Option[T] =
   foldLeft[Option[T]](None)((s, t) =>
     s match {
       case Some(s) => stop;
       case None if f(t) => Some(t);
       _ => None
  })

Only a simple modification to our original implementation, but now it correctly consumes only a minimal amount of the collection (actually it consumes one element too many, but that’s because I’ve modified the implementation from the real one we’d use for pedagogical reasons).

And, for free, we can use stop in the argument to find as well:

(1 to 100000000 by 3).find{x => val c = x compare 11; if(c > 0) stop else (c == 0)}

And we no longer continue past the point where we need to, even though we don’t find our element. Yay!

So, hopefully I have persuaded you that left folds with an early termination forms a nice basis for a collection library: Clean, functional, and efficient. What more could you hope for?

Oh, before I finish. One more little bone to throw to our Java friends to make them feel more comfortable with this design:

def break = stop;

[1] See “Towards the best collections API” for a discussion of building an API around this design (note that their solution does involve a bit more than what we talk about here, some of which is linguistically difficult in Scala without continuations).

[2] You may be making squawking noises about performance now. Don’t. It’s fine. But now is not the time for discussion of this.

Planet Scala gets its own domain

Sunday, March 15th, 2009

It can now be found at http://planetscala.com/

The old location and feeds will continue to work for the foreseeable future, but I’d appreciate it if people were to start using the new URL instead.

Pearsons in the database, part 2

Tuesday, December 9th, 2008

Before I explain what this is about, the following tweets provide useful context for how I feel about this:

http://twitter.com/DRMacIver/status/1047320819

http://twitter.com/DRMacIver/status/1047321174

We’ve been discovering that the SQL query I wrote for calculating pearsons, while it works fine for small datasets (say a few hundred thousand ratings), once you get to around the million rating mark starts being unusably slow, even for a nightly job. Basically if you look at the query plan it ends up doing a filesort. This is not nice, as it involves an awful lot of data paging to and from disk. Having tried to optimise it directly and failed, I’ve spent the last two weeks writing nasty hacks to try to make it fast and was going to follow up to the blog post with my final solution (which involves a temporary table and a stored procedure. It’s pretty grim).

But today I was doing something else which involved a similar sort of query and got really pissed off that I was having exactly the same problems. I boiled it down to a very simple example which illustrated it and logged onto freenode’s #mysql to see if anyone could help me figure out what’s going on. Last time I tried this I was not successful, but one lives in hope.

Well, as it turns out, someone could. Here’s a snippet of conversation:

16:20 < DRMacIver> I'm finding this pattern comes up a lot in what I'm doing at the moment, and I simply can't figure out a sensible way to optimise it. Any suggestions? http://pastebin.com/m5be98ace
16:21 < DRMacIver> The self join on the same column followed by a group by seems to produce a filesort no matter what I do. As per example, basically all indices that could exist on the table do.
16:22 < jbalint> DRMacIver: i think you can order by null
16:22 * DRMacIver boggles
16:22 < DRMacIver> That works. Thank you.

So, by updating one line in the SQL I’ve posted previously the query becomes dramatically faster. I’ve also updated it to use an update join (which is MySQL specific) instead of the dependent subquery in the set (which is not, but which MySQL runs appallingly slowly) in order to get the standard deviation calculations to run in reasonable time.

Why is this happening? Well, http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html contains the answer. It turns out that if you have a group by then it implicitly orders by that group. Even if the group clause has no index on it (which it can’t in this case because it’s a join across two tables), and ordering a large dataset by a non-indexed key will cause a filesort. If you add the clause “order by null” then it will not order by the group by clause, and the filesort goes away and the query becomes much faster.

We are not amused.

As another general tip, we’ve also found that this rapidly starts to generate unacceptably large amount of data, but that if you set some sensible lower bounds (such as only generating the statistics for things which cooccur more than once and have a pearsons > 0.1) it can easily be reduced to sensible levels.

Calculating Pearson’s correlation coefficient in SQL

Sunday, November 23rd, 2008

Apparently I’m crazy.

Well, it’s not the first time this accusation has been levelled. But it’s one of the more amusing ones.

My colleague Jan Berkel was at Ruby Manor, where one of the talks was on acts_as_recommendable. Here’s what he had to say on the subject:

a plugin for rails to generate recommendations. lets you do things like user.similar_users / books.similar_books etc. it’s based on pearson correlation. the guy mentioned he ran into memory problems using large datasets for doing the computation (load all objects , calculate, store objects). i asked whether they tried performing the calculation on SQL level to avoid runtrips to the db (the way david implemented it). he first didn’t understand what i was saying, then someone from the audience said “you’d have to be crazy to do that”… i guess this elegant solution is not the ruby way.. anyway.

This entertained me, as one of the things I introduced recently to our theme extraction process is using pearson’s coefficient to rate similarity of themes. It never even occurred to me that loading it all to the client side doing it in Ruby and then saving it back might be considered an option.

Pearsons is very far from being the most powerful metric in the world, but remarkably useful for all that: It gives a decent scaled measure of similarity and is cheap and easy to calculate.

Ok. Well. Easy to calculate. And cheap to calculate if you don’t do clever things like loading the entire data set into memory. Anyway…

All joking aside, it wasn’t entirely trivial to do, and clearly this is something which is of interest to more people than us, so this post is an explanation of how it works.

The Pearson product-moment corellation coefficient of two distributions is a measurement of their similarity. Given random variables X, Y we define the covariance of the two to be:

Cov(X, Y) = E(XY) - E(X) E(Y)

The reason for this slightly strange looking expression is that random variables have the nice property that for two independent random variables E(XY) = E(X) E(Y). So if X and Y are independent then Cov(X, Y) = E(XY) - E(X) E(Y) = E(X) E(Y) - E(X) E(Y) = 0. The converse isn’t generally true (it’s true under some circumstances), but nevertheless we can use Cov(X, Y) as a cheap measure of degree of independents.

It has some other nice properties: Suppose Y = aX. Then Cov(X, Y) = a Var(X). So if Y is a positive multiple of X then the covariance is positive, if Y is a negative multiple of X the covariance is negative.

This also shows us that there’s no inherent significance to the size of the covariance. It depends on the variance of both X and Y - in fact, scaling either will increase the covariance proportionately.

However it does have the following nice property:

|Cov(X, Y)| <= StdDev(X) StdDev(Y)

(When one is a multiple of the other, equality is achieved).

This allows us to define the pearson’s coefficient (normally written rho):

rho(X, Y) = Cov(X, Y) / (StdDev(X) * StdDev(Y))

Which by the above property takes values between -1 and 1 (it will take value 1 when one is a positive multiple of the other and value -1 when one is a negative multiple of the other).

That’s all there is to it. It’s a nice scaled measure of how similar the two distributions are.

So, why do we care and how do we use this?

Well, the details of how we actually use this are cunning and proprietary. So I can’t really tell you those (I’m hoping that we may be able to disclose some of them at some point, but not just yet). So let’s pretend we’re trying to win the netflix challenge (I think some solutions actually do use this, but this example has the nice property of having absolutely nothing to do with how we use it in SONAR). We have movies, we have people, we have ratings of movies by people. Given two movies we want to determine how similar they are, based on the people who liked them.

Sound interesting now?

We can do this by considering the ratings of each movie as a random variable. A person is a sample of this rating space. So by considering the set of all people as a sample we can calculate the pearsons corellation over this sample (I know I only explained how it works for random variables, but it carries over in the obvious way for samples).

So we’ve got a setup: People, movies, ratings. We’re going to calculate the correlation between movies and store the results in the database (I’d like to use views for this, but MySQL does horrible bad things to view join performance).

Before I proceed, there are a few subtle points:

  • Not everyone has rated every movie. I’m going to treat movies which have not been rated as implicitly having a rating of 0. This is sortof justified: Not watching a movie is often a stance about what sort of movie you like. Mostly it just makes the maths convenient. :-)
  • I’m going to ignore pairs of movies which have no rating in common. These have a corellation which is easy to work out (rho(x, y) = - mean(x) * mean(y) / (std_dev(x), std_dev(y))), and there tend to be a lot of them. This simplifies calculations and reduces storage overhead. Heavily anticorrelated movies also tend to be more because of holes in the data than interesting facts.

So let’s imagine our schema is set up as follows:

create table people( id int primary key
) create table movies( id int primary key
) create table ratings( person int references people (id), movie int references movies (id), rating float not null, unique key (person, movie)
)

We start out by computing mean and standard deviations for movies. Normally we’d use the built in functions for this, but unfortunately because we’ve got all those implicit 0 ratings floating around we have to do it ourselves.

Here’s the table in which we’ll store per movie stats:

 create table movie_statistics( movie int primary key references movies(id), mean float, stddev float )

We’ll only actually include stats for movies which have been rated at some point.

First we’ll calculate the mean:

 insert into movie_statistics (movie, mean) select movie, sum(rating) / (select count(*) from people) mean from ratings group by movie

Then the standard deviation:

 update movie_statistics set stddev = ( select sqrt( sum(rating * rating) / (select count(*) from people) - mean * mean ) stddev from ratings where ratings.movie = movie_statistics.movie group by ratings.movie)

All perfectly straightforward, if a little annoyingly wordy.

Now we get to the actual interesting bit: Calculating correlations. First we create a table in which to store the results:

create table movie_correlations(

movie1 int references movies(id),

movie2 int references movies(id),
rho float,
unique key (movie1, movie2),
key (movie1),
key (movie2)
)
Now we need to populate it. This will proceed in essentially three parts. First, we need to calculate E(XY), so for every pair of co-occurring movies we’ll get a sum of their products over all people who have rated them. That’s Sum(XY). We then divide by the number of people to get the average value of XY (implicit 0s again). Then we’ll use our already calculated statistics to give us the covariance (E(XY) - E(X) * E(Y)), and then the correlation (divide by standard deviations).Or, to put it altogether in one massive great big query:

 insert into movie_correlations ( movie1, movie2, rho ) select sf.m1, sf.m2, (sf.sum / (select count(*) from people) - stats1.mean * stats2.mean ) -- covariance / (stats1.stddev * stats2.stddev) from ( select r1.movie m1, r2.movie m2, sum(r1.rating * r2.rating) sum from ratings r1 join ratings r2 on r1.person = r2.person group by m1, m2 ) sf join movie_statistics stats1 on stats1.movie = sf.m1 join movie_statistics stats2 on stats2.movie = sf.m2

There, that wasn’t so bad was it? The inner query does the summing: By joining the ratings table to itself on the people we get all pairs of movies where we have a rating for both. We group by pairs of movies, thus summing over all the people who rated them. It’s then a simple matter to calculate the covariance and thus the correlation.

Or maybe I am crazy to look at a 23 line SQL query and say “that wasn’t so bad, was it?”. But at least I’m a crazy person whose code doesn’t run out of memory nearly as soon. :-)

Here’s some code that ties all of this together: http://code.trampolinesystems.com/pearsons.rb