Archive for April, 2009

Command line tools for NLP and Machine Learning

Thursday, April 30th, 2009

I’m a huge fan of command line tools.

I may be 40 years late to the party, but over the last couple of months I’ve been increasingly finding that The Unix Way (described by a friend of mine as “‘loosely coupled’, at least, in the sense that all IPC took the form of text files with ad-hoc formats piped through shell scripts”) is a marvelous way to work.

NLP, machine learning and related tasks map very well onto this, I find. They’re very often directly concerned with the manipulation of text and, where they aren’t, their data can usually be expressed in quite simple formats (of course, you’ll often need just a big binary blob for model files and the like, but that’s ok). So I’d like to see more command line tools available for them. Here are some I’m familiar with:

dbacl: a command line text classifier. It uses bigrams for features and, as far as I can tell (I’ve only skimmed the source), builds a maximum entropy model for classification. I’ve only played with it a little bit, but my impressions so far are that it’s easy to use, fast and produces high quality results.

MCL: a fast unsupervised clustering algorithm for weighted graphs. This is a command line tool produced by its originator. It appears to be a very solid tool, and the results are always interesting (the larger clusters it produces are often a bit strange, but there’s a lot of interesting info at the small to medium range). I’ve cheerfully fed several hundred MB of graph data through it and had it produce a good result (it took a few minutes to do so, but it worked).

Hunpos: a part-of-speech tagger. We’ve used it successfully in production (though the latest versions of SONAR no longer use it, having switched to OpenNLP’s tagger) and found it to be pretty fast and to produce decent results.

binary-pearsons: my only contribution to the list so far. It reads a sequence of labelled events (one per line) and outputs the Pearson’s correlation between the labels as a measure of their similarity. I’ve not yet got it to a point where I want to release a version, but I’ve already found it very useful (we’re using it in SONAR, which is where it comes from, to dramatically speed up calculations compared with our previous version).

SRILM: the SRI Language Modelling toolkit seems to be primarily a library for language modelling, but exposes a lot of its functionality through a collection of command line tools. I’ve not used it, but it seems to offer a bunch of potentially quite useful functionality. (Thanks to Aaron Harnly for the recommendation.)

OpenFST: a C++ class library for creating and using finite state transducers, which also exposes all its functionality as a collection of shell tools. (Thanks to cypherx on reddit for the mention.)

That’s all I can think of at the moment, though I swear I’ve encountered a couple more which I’ve found useful in the past. What do you use?

Determining logical project structure from commit logs

Tuesday, April 28th, 2009

In a bored 5 minutes at work I threw the following together: Logical source file groupings in the Scala repo

The largest cluster is clearly noisy and random. I more or less expected that. But the small and medium ones often make a lot of sense.

The basic technique is straightforward: we use a trivial script to scrape the SVN logs for the list of files that change in each commit. From these observations we calculate the binary Pearson’s correlation between each pair of files as a measure of their similarity (a number between -1 and 1, though we throw away anything <= 0). We then use Markov clustering to cluster the results into distinct groupings.
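As a rough illustration of the correlation step, here is a minimal Python sketch. It is not the script used above: the function name `binary_pearson` and the input shape (one set of file paths per commit) are assumptions made for the example. For 0/1 observations Pearson’s correlation reduces to the phi coefficient, and a pair of files that never change together can’t score above 0, so only pairs that co-occur in some commit need counting.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def binary_pearson(commits):
    """Return {(file_a, file_b): r} for every pair of files with r > 0.

    `commits` is a list of sets of file paths, one set per commit
    (a hypothetical input shape chosen for this sketch).
    """
    n = len(commits)
    single = Counter()  # number of commits touching each file
    pair = Counter()    # number of commits touching both files of a pair

    for files in commits:
        files = sorted(set(files))
        single.update(files)
        pair.update(combinations(files, 2))

    result = {}
    for (a, b), n_ab in pair.items():
        n_a, n_b = single[a], single[b]
        # Phi coefficient: Pearson's correlation of two 0/1 variables.
        denom = sqrt(n_a * (n - n_a) * n_b * (n - n_b))
        if denom == 0:
            continue  # a file changed in every commit carries no signal
        r = (n * n_ab - n_a * n_b) / denom
        if r > 0:  # discard anything <= 0, as described above
            result[(a, b)] = r
    return result


if __name__ == "__main__":
    example = [{"a", "b"}, {"a", "b"}, {"c"}, {"c", "d"}]
    for (a, b), r in sorted(binary_pearson(example).items()):
        # One weighted edge per line, ready for a graph clustering tool.
        print(f"{a}\t{b}\t{r:.3f}")
```

Each surviving pair becomes a weighted edge, so the output is a weighted graph of the kind a clustering tool such as MCL works on.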

The results are obviously far from perfect. But equally obviously there’s a lot of interesting information in them, and the technique could certainly be refined (e.g. by using the size of the diff on each file rather than a simple 0/1 “changed” flag, or by experimenting with other clustering algorithms). Maybe something worth pursuing?

A reminder: Planet Scala move

Tuesday, April 28th, 2009

Just in case you’ve forgotten (and the number of hits I’m getting on the old location says you have), drmaciver.com/planetscala will cease to be a valid place to point your feed reader in just a few days. Please point it at planetscala.com.

Open sourcing Pearson’s Correlation calculations

Wednesday, April 22nd, 2009

As you might recall, I did some articles on calculating Pearson’s in SQL.

It turns out that this is a hilariously bad idea. The performance you get for it is terrible when the numbers get large. Switching to PostgreSQL seemed to help a bit here, but even then the numbers are not great (and we still aren’t planning on a port to PostgreSQL anyway). So we needed to find a better solution. Doing it in memory would be fast, but it would just fall over on a large dataset.

Anyway, after some tinkering around I came up with a slightly unholy solution. It’s a mix of bash, awk, standard unix tools and Java (the Java parts may be rewritten in something else later). The design offloads much of the heavy lifting to sort, which works offline and so doesn’t need to load the whole dataset into memory; the rest processes its input in a line oriented manner. This lets it get by with very reasonable memory usage and, in my fairly informal tests, perform about 50 times faster than the SQL version.
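To give a flavour of why sorting first helps, here is a small Python sketch of a line oriented counting stage. It is not the released code (which is bash, awk and Java); the function name `count_sorted_runs` is made up for the example. Once the input has been sorted, identical lines are adjacent, so counting them only ever needs one run in flight at a time.

```python
from itertools import groupby


def count_sorted_runs(lines):
    """Yield (line, count) for each run of identical adjacent lines.

    Assumes `lines` is already sorted (for example by the unix `sort`
    tool), so equal lines sit next to each other. Memory use does not
    grow with the size of the input.
    """
    stripped = (line.rstrip("\n") for line in lines)
    for key, run in groupby(stripped):
        yield key, sum(1 for _ in run)
```

Fed the output of sort, this does the same job as `sort | uniq -c`: the expensive part happens on disk inside sort, and the counting stage just streams over the result.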

We’re releasing the code under a BSD license and making it available on github. It’s in a bit of a rough state at the moment, but is usable as is.

Oops

Wednesday, April 15th, 2009

The light from the explosion will probably be reaching us soon.

I hear the lesser magellanic clouds are pretty.