Open source term extraction

This is just a quick announcement to let people know that we’ve open sourced our JRuby library for term extraction. You can get the code from my GitHub page.

Unlike a lot of term extraction libraries, this doesn’t take any stance as to the “significance” of the terms it extracts. It’s purely about looking at the syntax and determining where good boundaries for terms are. There are a couple reasons for this, but basically we’ve found that it’s more effective to separate the two steps and makes it easier to tinker around with them independently. The criteria for “interestingness” of terms seem to be largely distinct from those for terms which simply make sense linguistically. So we have a two stage pipeline, one which extracts semantically meaningful terms and one which determines what terms are actually interesting in the context of the document. The second step is much more complicated, and we’re not open sourcing that (yet? probably not any time soon, if ever. Even if we wanted to, it relies on a lot more global information across the document corpus and so is very tied in with how SONAR operates, making it much harder to isolate).

So, how does it work? Black magic and voodoo!

Actually, no. It’s pretty straightforward. It builds on top of the excellent OpenNLP library, using its tools for part of speech tagging, sentence splitting (a much harder problem than you’d imagine) and phrase chunking. On top of that it’s currently a rules based system, because while you’re still figuring things out it makes much more sense to stick with something that is easy to fine tune. Our expectation is that we’ll gradually replace bits of it with machine learning based techniques as we hit the limitations of a rules based system, but for now it’s working pretty well.

Let’s have an example. If we feed the second paragraph of this post into the term extractor, we get the following terms back:

term extraction libraries
stance
terms
syntax
good boundaries
couple reasons
two steps
steps
criteria
interestingness
sense
two stage pipeline
stage pipeline
semantically meaningful terms
context
context of the document
document
second step
open sourcing
time
document corpus
SONAR
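
As a rough illustration of what a rule over part of speech tags can look like, here is a minimal Python sketch. It is not the library’s actual rule set: the tokens are tagged by hand with Penn Treebank style tag names, and the single rule (keep runs of adjectives, numbers and nouns, trimmed so they end on a noun) is an assumption made purely for the example.

```python
# Tags allowed inside a candidate term, and tags that may end one.
# Both sets are assumptions for this sketch, using Penn Treebank tag names.
INSIDE = {"JJ", "CD", "NN", "NNS", "NNP", "NNPS"}
NOUNS = {"NN", "NNS", "NNP", "NNPS"}


def extract_terms(tagged_tokens):
    """Return maximal runs of INSIDE-tagged words, trimmed to end on a noun."""
    terms, run = [], []
    # The sentinel token flushes a run that reaches the end of the input.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag in INSIDE:
            run.append((word, tag))
            continue
        # Run ended: drop trailing non-nouns, then keep whatever is left.
        while run and run[-1][1] not in NOUNS:
            run.pop()
        if run:
            terms.append(" ".join(w for w, _ in run))
        run = []
    return terms


# Hand-tagged fragment of the paragraph used in the example above.
sentence = [
    ("So", "RB"), ("we", "PRP"), ("have", "VBP"), ("a", "DT"),
    ("two", "CD"), ("stage", "NN"), ("pipeline", "NN"), (",", ","),
]
terms = extract_terms(sentence)
```

This toy rule only finds the longest run, "two stage pipeline". The real extractor clearly does more, since the list above also contains nested terms such as "stage pipeline".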

Hope you find this useful. Let us know if you build anything cool with it!

Open sourcing Pearson’s Correlation calculations

As you might recall, I did some articles on calculating Pearson’s in SQL.

It turns out that this is a hilariously bad idea. The performance you get for it is terrible when the numbers get large. Switching to PostgreSQL seemed to help a bit here, but even then the numbers are not great (and we still aren’t planning on a port to PostgreSQL anyway). So we needed to find a better solution. Doing it in memory would be fast, but it would just fall over on a large dataset.

Anyway, after some tinkering around I came up with a slightly unholy solution. It’s a mix of bash, awk, standard unix tools and Java (the Java parts may be rewritten in something else later). The design offloads much of the heavy lifting to sort, which sorts externally (spilling to disk instead of loading the whole dataset into memory), and everything else processes its input a line at a time. This lets it get by with very reasonable memory usage and, in my fairly informal tests, perform about 50 times faster than the SQL version.
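
To show why a sorted, line oriented stream keeps memory flat, here is a minimal Python sketch. It is not the released code (which is bash, awk and Java). It assumes a hypothetical input format of tab-separated `key`, `x`, `y` lines that have already been sorted by key, and computes Pearson’s coefficient per key from running sums, so only one group’s totals are held at any time.

```python
import math
from itertools import groupby


def pearson_from_sums(n, sx, sy, sxx, syy, sxy):
    """Pearson's r from running sums; None when it is undefined."""
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    if n < 2 or var_x <= 0 or var_y <= 0:
        return None
    return (n * sxy - sx * sy) / math.sqrt(var_x * var_y)


def stream_pearson(lines):
    """Yield (key, r) for each run of consecutive lines sharing a key.

    Assumes (hypothetically) tab-separated "key<TAB>x<TAB>y" lines that
    are already sorted by key, e.g. by the unix sort tool.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for key, group in groupby(rows, key=lambda row: row[0]):
        n = sx = sy = sxx = syy = sxy = 0
        for _, x, y in group:
            x, y = float(x), float(y)
            n += 1
            sx += x
            sy += y
            sxx += x * x
            syy += y * y
            sxy += x * y
        yield key, pearson_from_sums(n, sx, sy, sxx, syy, sxy)
```

Because `stream_pearson` accepts any iterable of lines, it can read from a file or from standard input without ever materialising the whole dataset.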

We’re releasing the code under a BSD license and making it available on GitHub. It’s in a bit of a rough state at the moment, but it is usable as is.

Porting Pearsons to Postgres. Performance?

I’ve uploaded a version of the Pearson’s Coefficient code which runs on PostgreSQL. You can download it here. I wrote this as an experiment to see if Postgres could help us with some of our MySQL performance woes.

Some brief experimentation suggests that once you fix PostgreSQL’s ridiculous default configuration the performance story is relatively happy. At small sizes MySQL is moderately faster, but as the sizes get large PostgreSQL seems to take the lead. I don’t have any sort of formal benchmark yet: this needs much more testing before I can definitively claim that either is faster than the other, but for now the signs in favour of PostgreSQL are promising.

Yet another MySQL Fail

mysql> create table stuff (name varchar(32));
Query OK, 0 rows affected (0.24 sec)

mysql> insert into stuff values ('foo'), ('1'), ('0');
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select * from stuff;
+------+
| name |
+------+
| foo  |
| 1    |
| 0    |
+------+
3 rows in set (0.00 sec)

mysql> delete from stuff where name = 0;
Query OK, 2 rows affected (0.09 sec)

mysql> select * from stuff;
+------+
| name |
+------+
| 1    |
+------+
1 row in set (0.00 sec)

mysql> WTF????
    -> ;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'WTF????' at line 1

So, what’s going on here? I said to delete everything where the name was 0, but it also deleted the row ‘foo’.

The following might help:

mysql> create table more_stuff(id int);
Query OK, 0 rows affected (0.19 sec)

mysql> insert into more_stuff values('foo');
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> select * from more_stuff;
+------+
| id   |
+------+
|    0 |
+------+
1 row in set (0.00 sec)

When you use a string where MySQL expects a number, it turns non-numeric strings into zero. So when you test name = 0, it converts each name into a number, and ‘foo’ becomes 0. Consequently, any string which can’t be parsed as a number makes this test true.
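
To make the rule concrete, here is a rough Python model of that comparison. It is a simplified illustration, not MySQL’s actual implementation: the function names are made up for the sketch, and it assumes the coercion takes the longest leading numeric prefix of the string and treats anything without one as 0.

```python
import re

# Leading numeric prefix: optional sign, digits, optional fraction/exponent.
_NUMERIC_PREFIX = re.compile(r"\s*([+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?)")


def mysql_like_number(value):
    """Approximate how a string is coerced when used in a numeric context."""
    match = _NUMERIC_PREFIX.match(value)
    return float(match.group(1)) if match else 0.0


def matches_zero(name):
    """Model of the test `name = 0` against a string column."""
    return mysql_like_number(name) == 0


rows = ["foo", "1", "0"]
# Rows that would be left after `delete from stuff where name = 0`.
survivors = [name for name in rows if not matches_zero(name)]
```

In this model only ‘1’ survives, which matches the session above. Comparing against the string '0' instead (where name = '0') keeps the comparison between strings, so only the row ‘0’ would match.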

At this point I would rant about how mindbogglingly stupid this behaviour is, but I don’t think I can really be bothered.

Trampoline Systems and Scala

So, as I’ve mentioned a few times, my company Trampoline Systems have been using Scala at work. This is, somewhat unfortunately, about to change.

It’s not really due to any problems with Scala. I’m certainly still planning to continue using it myself. There have been a few hitches that have meant we’ve not been able to take advantage of it as well as I’d like, but this is mainly a strategic rather than a technical decision. The majority of our code is in Ruby (even more so than it was at the start of this project), and most of our expertise is in Ruby, so it was starting to look increasingly silly that we had just this one project in Scala. Consequently we’ve decided to move the stuff we were previously doing in Scala to JRuby.

Oh well. It was nice to be a professional Scala developer for a bit. Now I get to be a professional Ruby developer instead. Life’s all about dealing with changes. :-)