I’m a huge fan of command line tools.
I may be 40 years late to the party, but over the last couple of months I’ve been increasingly finding that The Unix Way (described by a friend of mine as “‘loosely coupled’, at least, in the sense that all IPC took the form of text files with ad-hoc formats piped through shell scripts”) is a marvelous way to work.
NLP, machine learning and related tasks map very well onto this, I find. They’re very often directly concerned with the manipulation of text and, where they’re not, their data can usually be expressed in quite simple formats (of course, you’ll often need just a big binary blob for model files and the like, but that’s ok). So I’d like to see more tools available for such tasks. Here are some I’m familiar with:
|dbacl||dbacl is a command line text classifier. It uses bigrams for features and, as far as I can tell (I’ve only skimmed the source), builds a maximum entropy model for classification. I’ve only played with it a little bit, but my impressions so far are that it’s easy to use, fast and produces high quality results.|
|MCL||MCL is a fast unsupervised clustering algorithm for weighted graphs. This is a command line tool produced by its originator. It appears to be a very solid tool, and the results are always interesting (the larger clusters it produces are often a bit strange, but there’s a lot of interesting info at the small to medium range). I’ve cheerfully fed several hundred MB of graph data through it and had it produce a good result (it took a few minutes to do so, but it worked).|
|Hunpos||Hunpos is a part-of-speech tagger. We’ve used it successfully in production (though the latest versions of SONAR no longer use it, having switched to OpenNLP’s version) and found it to be pretty fast and to produce decent results.|
|binary-pearsons||My only contribution to the list so far. It reads a sequence of labelled events (one per line) and outputs the Pearson correlation between the labels as a measure of their similarity. I’ve not yet got it to a point where I want to release a version, but I’ve already found it very useful (it comes from SONAR, where we’re using it to dramatically speed up calculations compared with our previous version).|
|SRILM||The SRI Language Modelling toolkit seems to be primarily a library for language modelling, but exposes a lot of its functionality through a collection of command line tools. I’ve not used it, but it seems to offer a bunch of potentially quite useful functionality. (Thanks to Aaron Harnly for the recommendation)|
|OpenFST||OpenFST is a C++ class library for creating and using finite state transducers which also exposes all its functionality as a collection of shell tools. (Thanks to cypherx on reddit for the mention)|
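To make the binary-pearsons entry a bit more concrete, here is a minimal Python sketch of that kind of filter. It is not the actual tool: the tab-separated `event<TAB>label` input format, the function name `binary_pearson` and the output layout are all assumptions made for illustration. Each label is treated as a 0/1 indicator over the set of events, and the Pearson correlation of two such indicators (also known as the phi coefficient) is computed from simple counts.

```python
import math
import sys
from collections import defaultdict
from itertools import combinations


def binary_pearson(lines):
    """Return {(label_a, label_b): correlation} for every pair of labels.

    Assumed input: one "event<TAB>label" pair per line (hypothetical format).
    """
    events_by_label = defaultdict(set)
    all_events = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event, label = line.split("\t", 1)
        events_by_label[label].add(event)
        all_events.add(event)

    n = len(all_events)
    result = {}
    for a, b in combinations(sorted(events_by_label), 2):
        na = len(events_by_label[a])
        nb = len(events_by_label[b])
        nab = len(events_by_label[a] & events_by_label[b])
        # Pearson correlation of two 0/1 indicator vectors (phi coefficient).
        denom = math.sqrt(na * (n - na) * nb * (n - nb))
        if denom == 0:
            continue  # a label attached to every event has zero variance
        result[(a, b)] = (n * nab - na * nb) / denom
    return result


if __name__ == "__main__":
    # Read events from stdin, write "label_a<TAB>label_b<TAB>r" to stdout.
    for (a, b), r in binary_pearson(sys.stdin).items():
        print(f"{a}\t{b}\t{r:.4f}")
```

Saved under a hypothetical name such as `binary_pearson.py`, it would sit in a pipeline like any other filter, for example `cut -f1,2 events.tsv | python binary_pearson.py | sort -t$'\t' -k3,3nr`, which is much of the appeal of keeping such tools on the command line.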
That’s all I can think of at the moment, though I swear I’ve encountered a couple more which I’ve found useful in the past. What do you use?