Command line tools for NLP and Machine Learning

I’m a huge fan of command line tools.

I may be 40 years late to the party, but over the last couple of months I’ve been increasingly finding that The Unix Way (described by a friend of mine as “‘loosely coupled’, at least, in the sense that all IPC took the form of text files with ad-hoc formats piped through shell scripts”) is a marvelous way to work.

I find that NLP, machine learning and related tasks map very well onto this way of working. They're very often directly concerned with the manipulation of text and, where they're not, their inputs and outputs can usually be expressed in quite simple formats (of course, you'll often need just a big binary blob for model files and the like, but that's OK). So I'd like to see more tools available in this style. Here are some I'm familiar with (some hedged example invocations follow the list):

dbacl: a command line text classifier. It uses bigrams for features and, as far as I can tell (I've only skimmed the source), builds a maximum entropy model for classification. I've only played with it a little bit, but my impression so far is that it's easy to use, fast and produces high quality results.
MCL: a command line implementation of the Markov Cluster algorithm, a fast unsupervised clustering algorithm for weighted graphs, produced by the algorithm's originator. It appears to be a very solid tool, and the results are always interesting (the larger clusters it produces are often a bit strange, but there's a lot of interesting information at the small to medium range). I've cheerfully fed several hundred MB of graph data through it and had it produce a good result (it took a few minutes to do so, but it worked).
Hunpos: a part of speech tagger. We've used it successfully in production (though the latest versions of SONAR no longer use it, having switched to OpenNLP's tagger) and found it to be pretty fast and to produce decent results.
binary-pearsons: my only contribution to the list so far. It reads a sequence of labelled events (one per line) and outputs the Pearson's correlation between the labels as a measure of their similarity. I've not yet got it to a point where I want to release a version, but I've already found it very useful (we're using it in SONAR, which is where it comes from, to dramatically speed up calculations relative to our previous approach).
SRILM: the SRI Language Modelling toolkit. It seems to be primarily a library for language modelling, but it exposes a lot of its functionality through a collection of command line tools. I've not used it, but it seems to offer a bunch of potentially quite useful functionality. (Thanks to Aaron Harnly for the recommendation.)
OpenFST: a C++ class library for creating and using finite state transducers which also exposes all its functionality as a collection of shell tools. (Thanks to cypherx on reddit for the mention.)
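To give a flavour of how these fit the pipes-and-filters style, here are some example invocations. I'm writing these from memory rather than copying them out of a terminal, so treat the exact flags as assumptions and check each tool's documentation; all the file names are placeholders.

    # dbacl: learn two categories from sample text, then classify a document.
    dbacl -l sport sport/*.txt
    dbacl -l politics politics/*.txt
    dbacl -c sport -c politics -v unknown.txt

    # mcl: cluster a weighted edge list ("node1 node2 weight" per line).
    mcl edges.abc --abc -I 2.0 -o clusters.out

    # hunpos: tag tokens read one per line, blank lines between sentences.
    hunpos-tag english.model < tokens.txt > tagged.txt

    # SRILM: build a trigram language model, then score a test set with it.
    ngram-count -text corpus.txt -order 3 -lm corpus.lm
    ngram -lm corpus.lm -ppl test.txt

    # OpenFST: compile a text-format transducer, then compose two FSTs.
    fstcompile --isymbols=in.syms --osymbols=out.syms text.fst binary.fst
    fstcompose a.fst b.fst > ab.fst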

That’s all I can think of at the moment, though I swear I’ve encountered a couple more which I’ve found useful in the past. What do you use?

Determining logical project structure from commit logs

In a bored 5 minutes at work I threw the following together: Logical source file groupings in the Scala repo

The largest cluster is clearly noisy and random. I more or less expected that. But the small and medium ones often make a lot of sense.

The basic technique is straightforward: we use a trivial script to scrape the SVN logs to get the list of files that change in each commit. We use this to calculate the binary Pearson's correlation of these observations as a measure of the similarity between two files (a number between -1 and 1, though we throw away anything <= 0). We then use Markov clustering to cluster the results into distinct groupings (a sketch of the pipeline is below).
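Here's a minimal sketch of the scraping step, assuming the stock plain-text `svn log -v` output format (a revision line like "r123 | …" followed by a "Changed paths:" block). The last two lines are hypothetical, since binary-pearsons doesn't have a stable public interface yet; I'm assuming it reads "event label" lines and emits "label label weight" lines.

    # Sketch only: turn "svn log -v" output into "revision path" pairs.
    svn log -v "$REPO_URL" |
    awk '
      /^r[0-9]+ \|/   { rev = $1 }        # remember the current revision
      /^   [ADMR] \// { print rev, $2 }   # one line per changed path
    ' > changes.txt

    # Hypothetical interfaces for the remaining steps:
    binary-pearsons < changes.txt > edges.abc
    mcl edges.abc --abc -I 2.0 -o clusters.out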

The results are obviously far from perfect. But equally obviously there's a lot of interesting information in them, and the technique could certainly be refined (e.g. by looking at the sizes of the diffs to each file and using those rather than a simple 0/1 changed flag, or by experimenting with other clustering algorithms). Maybe something worth pursuing?

A reminder: Planet Scala move

Just in case you’ve forgotten (and the number of hits I’m getting on the old location says you have), drmaciver.com/planetscala will cease to be a valid place to point your feed reader in just a few days. Please point it at planetscala.com.

Open sourcing Pearson’s Correlation calculations

As you might recall, I wrote some articles on calculating Pearson's correlation in SQL.

It turns out that this is a hilariously bad idea: the performance is terrible once the numbers get large. Switching to PostgreSQL seemed to help a bit, but even then the numbers were not great (and we aren't planning a port to PostgreSQL anyway). So we needed a better solution. Doing it in memory would be fast, but it would simply fall over on a large dataset.

Anyway, after some tinkering around I came up with a slightly unholy solution. It's a mix of bash, awk, standard unix tools and Java (the Java parts may be rewritten in something else later). The design offloads much of the heavy lifting to sort, which sorts externally (so it doesn't need to load the whole dataset into memory) and processes everything in a line-oriented manner. This lets it get by with very reasonable memory usage and, in my fairly informal tests, perform about 50 times faster than the SQL version.
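The released code is the authoritative version, but the shape of the idea fits in a short sketch. Everything below is illustrative rather than a copy of our code: the input format (one "event_id label" pair per line on stdin) is an assumption, and the maths relies on the fact that for 0/1 indicator variables Pearson's r reduces to the phi coefficient, which needs only three counts per label pair plus the total number of events.

    #!/usr/bin/env bash
    # Illustrative sketch of the sort-based approach, not the released code.
    # Input on stdin: one "event_id label" pair per line.
    set -euo pipefail
    export LC_ALL=C   # byte-wise sorting keeps an event's lines contiguous

    in=$(mktemp); marg=$(mktemp); trap 'rm -f "$in" "$marg"' EXIT

    sort -u > "$in"                                 # each (event, label) once
    n=$(cut -d' ' -f1 "$in" | sort -u | wc -l)      # total number of events
    cut -d' ' -f2 "$in" | sort | uniq -c > "$marg"  # events per label

    # Emit every unordered pair of labels sharing an event, count the pairs,
    # then join against the marginals and compute phi:
    #   r = (n*c - a*b) / sqrt(a*(n-a)*b*(n-b))
    # where a and b are the two labels' counts and c their co-occurrence count.
    awk '
      function flush(   i, j) {
        for (i = 1; i <= k; i++)
          for (j = i + 1; j <= k; j++)
            if (lab[i] < lab[j]) { print lab[i], lab[j] } else { print lab[j], lab[i] }
      }
      $1 != ev { flush(); ev = $1; k = 0 }
      { lab[++k] = $2 }
      END { flush() }
    ' "$in" | sort | uniq -c |
    awk -v n="$n" '
      NR == FNR { m[$2] = $1; next }                # first file: marginals
      {
        a = m[$2]; b = m[$3]; c = $1
        d = sqrt(a * (n - a) * b * (n - b))
        if (d > 0) {
          r = (n * c - a * b) / d
          if (r > 0) print $2, $3, r                # keep only r > 0
        }
      }
    ' "$marg" -

Run as something like ./pearsons-sketch < events.txt > similarities.txt. The point of the design is that sort(1) spills to temporary files when its input doesn't fit in memory, so the only things held in RAM at once are a single event's label set and the table of marginal counts.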

We’re releasing the code under a BSD license and making it available on github. It’s in a bit of a rough state at the moment, but is usable as is.

Gates

I spent some time as a superintelligence once. It was weird. Eventually I got bored of it, trimmed myself down and stuffed myself back into a body.

I don’t remember a great deal about it – it’s hard to remember what it was like to be smarter than you are now – but every now and then I get flashes where I remember some fact or event.

For example, once I remembered how the travel gates work.

It turns out this was not a good thing. I ended up terrified of using them and couldn't bring myself to leave the planet I was on at the time. I spent most of the next century drunk out of my mind and, when I finally sobered up, I resolved to do something about it, built myself a slowboat (you wouldn't believe how much effort it takes to bootstrap a society from hunter-gatherer to interstellar) and took a thousand year trip to find someone I trusted to help me edit my memories.

Anyway, mission accomplished, I got the knowledge expunged from my mind and happily returned to the life of a modern interstellar traveller, gating all around the galaxy. What a lark.

Thing is, there's a problem with memory editing. You tend to edit out the reason you got your memory edited in the first place. And then you start burning up with curiosity. After a good few hundred years I finally couldn't take it any more and just had to find out. And I did.

Want another drink? I think I’m going to be here a while.
