Category Archives: programming

A problem of language

I try to stay out of language wars these days. I find the whole endeavour incredibly tedious. I don’t really feel like arguing whether OCaml is better than Ruby is better than Scala is better than brainfuck is better than C. I like them all (ok maybe not brainfuck), and there are valid arguments for and against each of them.

But one thing I have a lot of trouble with is bad arguments. Not “arguments I disagree with”, but arguments which are simply outright bad.

Robert Fischer has a post on his blog: Scala is not a functional language. It’s not the first time this idea has come up. It’s not an idea entirely without merit. Scala’s functional features are certainly not as seamless to use as one might hope.

The problem is that while it is not an idea without merit, its presentation is certainly one without content. As usual, every time this comes up, no one is actually willing to say what on earth they mean by “functional programming language”. This problem is ubiquitous. Consider the following Wikipedia entry:

In computer science, functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast to the imperative programming style, which emphasizes changes in state. Functional programming has its roots in the lambda calculus, a formal system developed in the 1930s to investigate function definition, function application, and recursion. Many functional programming languages can be viewed as embellishments to the lambda calculus.

Notice the bait and switch there? It defines “functional programming” and then talks about “functional programming languages”. Everyone does this. The only unambiguous definition you’ll find which includes the term “functional programming language” is that of a purely functional language. That pins things down nicely, but the vast majority of people arguing about whether Scala (or another language of choice) is functional are comparing it to a decidedly impure language. Certainly neither Clojure nor OCaml is purely functional.

On the other hand, if you take “functional programming language” to simply mean “supports functional programming”, then suddenly you have to acknowledge various languages like Ruby as functional, and for some reason this makes people uncomfortable. So instead we end up with all sorts of hand-waving and mumbling, and no one is able to have a useful discussion, because everyone is too busy making assertions which can’t be argued with because they don’t mean anything.
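
To make the “supports functional programming” reading concrete, here is a minimal sketch in Scala (my own illustration, nothing authoritative) of the features people usually have in mind: first-class functions, anonymous functions and closures. By this criterion an awful lot of languages qualify, Ruby included, which is rather the point.

    object FunctionalSupport {
      // A higher-order function: it takes a function and returns a new one.
      def twice(f: Int => Int): Int => Int = x => f(f(x))

      def main(args: Array[String]): Unit = {
        val increment = (n: Int) => n + 1      // an anonymous function
        val addFour = twice(twice(increment))  // functions passed and returned as values
        println(addFour(0))                    // prints 4

        val base = 10
        val fromBase = (n: Int) => base + n    // a closure capturing base
        println(fromBase(5))                   // prints 15
      }
    }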

So no one defines the term, but everyone seems to “know what it means”. And what it inevitably means is “shares features with this language which I use and like and consider to be a functional language”.

For a particular extreme example, consider the following conversation on reddit from the last time this subject came up:

vagif:
Well, here’s my understanding of this disagreement.
I think many people (me included) do not feel that Scala is functional enough. It is not because of the mutable variables. It all boils down to simple syntax. That’s why completely imperative and old fashioned language CL (i use sbcl) feels to me much more functional than Scala will ever be with its modern functional features like pattern matching, type inference and list comprehensions.
“Functional” for me means simple small core of orthogonal features.
“Functional” for me means – List processing.
“Functional” for me means no premature optimization (deciding to use arrays instead of lists because they are faster etc.)
Obviously, trying to accommodate java OO model, makes Scala way to complicated, arcane, baroque in its syntax, to feel functional enough.

DRMacIver:
I enjoy the fact that the word “function” does not appear in your definition of functional…

vagif:
And for a good reason. Functions are part of any imperative language. That does not make them more functional. It is all that infrastructure, that allows for easy juggling those functions, that makes a language functional. Passing them as parameters and return values, anonymous functions, closures, polymorphic functions (be it a result of type inference or dynamic nature of language). It is this small core of orthogonal features, all geared up for list processing, that truly creates a functional feeling in a programmer.

This sort of woolly thinking is endemic in these arguments. Please try to avoid it. If you’re not willing to define “functional programming language” in a way that is at least moderately unambiguous, one that doesn’t rest on “feelings” or on feature comparisons with some language you hold up as an example of functional programming, then don’t bother making arguments about whether a language is functional or not.

Of course, once you start defining the term people will start arguing about the definitions. This is pretty tedious, I know. But as tedious as arguing about definitions is, it can’t hold a candle to arguing without definitions.

Command line tools for NLP and Machine Learning

I’m a huge fan of command line tools.

I may be 40 years late to the party, but over the last couple of months I’ve been increasingly finding that The Unix Way (described by a friend of mine as “‘loosely coupled’, at least, in the sense that all IPC took the form of text files with ad-hoc formats piped through shell scripts”) is a marvelous way to work.

I find that NLP, machine learning and related tasks map very well onto this way of working. They’re very often directly concerned with the manipulation of text and, where they’re not, their inputs and outputs can usually be expressed in quite simple formats (of course, you’ll often need a big binary blob for model files and the like, but that’s ok). So I’d like to see more tools built this way. Here are some I’m familiar with:

dbacl dbacl is a command line text classifier. It uses bigrams for features and, as far as I can tell (I’ve only skimmed the source), builds a maximum entropy model for classification. I’ve only played with it a little bit, but my impressions so far are that it’s easy to use, fast, and produces high quality results.
MCL MCL is a fast unsupervised clustering algorithm for weighted graphs; the command line tool is produced by the algorithm’s originator. It appears to be a very solid tool, and the results are always interesting (the larger clusters it produces are often a bit strange, but there’s a lot of interesting information at the small to medium range). I’ve cheerfully fed several hundred MB of graph data through it and had it produce a good result (it took a few minutes to do so, but it worked).
Hunpos Hunpos is a part of speech tagger. We’ve used it successfully in production (though the latest versions of SONAR no longer use it, having switched to OpenNLP’s tagger) and found it to be pretty fast and to produce decent results.
binary-pearsons My only contribution to the list so far. It reads a sequence of labelled events (one per line) and outputs the Pearson’s correlation between the labels as a measure of their similarity; see the sketch after this list. I’ve not yet got it to a point where I want to release a version, but I’ve already found it very useful (it comes out of SONAR, where we’re using it to dramatically speed up calculations relative to our previous version).
SRILM The SRI Language Modelling toolkit seems to be primarily a library for language modelling, but exposes a lot of its functionality through a collection of command line tools. I’ve not used it, but it seems to offer a bunch of potentially quite useful functionality. (Thanks to Aaron Harnly for the recommendation)
OpenFST OpenFST is a C++ class library for creating and using finite state transducers which also exposes all its functionality as a collection of shell tools. (Thanks to cypherx on reddit for the mention)
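
Since binary-pearsons isn’t released yet, here is a rough sketch of the calculation it performs, as described above (this is my own illustration, not the tool’s source). For two binary variables, Pearson’s correlation reduces to the phi coefficient, computable from a 2x2 contingency table of co-occurrence counts:

    import scala.math.sqrt

    object BinaryPearson {
      // Pearson's correlation of two binary variables (the phi coefficient).
      // n11 = events where both labels occur, n10/n01 = exactly one occurs,
      // n00 = neither occurs.
      def phi(n11: Long, n10: Long, n01: Long, n00: Long): Double = {
        val denom = sqrt((n11 + n10).toDouble * (n01 + n00) * (n11 + n01) * (n10 + n00))
        if (denom == 0.0) 0.0
        else (n11.toDouble * n00 - n10.toDouble * n01) / denom
      }

      def main(args: Array[String]): Unit = {
        // Two labels, each present in 10 of 20 events, co-occurring in 8.
        println(phi(8, 2, 2, 8)) // prints 0.6
      }
    }

A score near 1 means the labels almost always occur together; near -1, that they almost never do.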

That’s all I can think of at the moment, though I swear I’ve encountered a couple more which I’ve found useful in the past. What do you use?

Determining logical project structure from commit logs

In a bored 5 minutes at work I threw the following together: Logical source file groupings in the Scala repo

The largest cluster is clearly noisy and random. I more or less expected that. But the small and medium ones often make a lot of sense.

The basic technique is straightforward: we use a trivial script to scrape the SVN logs for the list of files changed in each commit. From those observations we calculate the binary Pearson’s correlation between each pair of files as a measure of their similarity (a number between -1 and 1, though we throw away anything <= 0). We then use Markov clustering to group the results into distinct clusters.
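
For the curious, here is a hypothetical Scala sketch of that pipeline (names and code are illustrative only; the real thing is shell scripts plus binary-pearsons and MCL). It turns a list of commits into a weighted edge list of positively correlated file pairs, the kind of input a graph clusterer can consume. It’s naively quadratic in the number of files, which is fine for a sketch:

    import scala.math.sqrt

    object CommitSimilarity {
      type Commit = Set[String] // the files touched by one commit

      // Phi coefficient: Pearson's correlation for two binary variables.
      def phi(n11: Long, n10: Long, n01: Long, n00: Long): Double = {
        val d = sqrt((n11 + n10).toDouble * (n01 + n00) * (n11 + n01) * (n10 + n00))
        if (d == 0.0) 0.0 else (n11.toDouble * n00 - n10.toDouble * n01) / d
      }

      // Emit (fileA, fileB, similarity) for every positively correlated pair.
      def edges(commits: Seq[Commit]): Seq[(String, String, Double)] = {
        val n = commits.size.toLong
        val perFile = commits.flatten.groupBy(identity).map { case (f, occ) => (f, occ.size.toLong) }
        val files = perFile.keys.toSeq.sorted
        for {
          (a, i) <- files.zipWithIndex
          b <- files.drop(i + 1)
          n11 = commits.count(c => c(a) && c(b)).toLong
          score = phi(n11, perFile(a) - n11, perFile(b) - n11, n - perFile(a) - perFile(b) + n11)
          if score > 0.0 // throw away anything <= 0, as above
        } yield (a, b, score)
      }

      def main(args: Array[String]): Unit = {
        val commits = Seq(
          Set("Parser.scala", "Lexer.scala"),
          Set("Parser.scala", "Lexer.scala", "README"),
          Set("README"))
        edges(commits).foreach { case (a, b, w) => println(s"$a\t$b\t$w") }
      }
    }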

The results are obviously far from perfect. But equally obviously there’s a lot of interesting information in them, and the technique could certainly be refined (e.g. by weighting each change by the size of the file’s diff rather than a simple 0/1 changed flag, or by experimenting with other clustering algorithms). Maybe something worth pursuing?

A reminder: Planet Scala move

Just in case you’ve forgotten (and the number of hits I’m getting on the old location says you have), drmaciver.com/planetscala will cease to be a valid place to point your feed reader in just a few days. Please point it at planetscala.com.

Open sourcing Pearson’s Correlation calculations

As you might recall, I wrote some articles on calculating Pearson’s correlation in SQL.

It turns out that this is a hilariously bad idea: the performance is terrible once the numbers get large. Switching to PostgreSQL seemed to help a bit, but even then the numbers were not great (and we still aren’t planning a port to PostgreSQL anyway). So we needed a better solution. Doing it in memory would be fast, but it would simply fall over on a large dataset.

Anyway, after some tinkering around I came up with a slightly unholy solution. It’s a mix of bash, awk, standard Unix tools and Java (the Java parts may be rewritten in something else later). The design offloads much of the heavy lifting onto sort, which works on disk (an external merge sort) and so never needs the whole dataset in memory; everything else is processed in a line-oriented manner. This lets it get by with very reasonable memory usage and, in my fairly informal tests, perform about 50 times faster than the SQL version.
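
To give a flavour of the line-oriented style (a sketch of the general approach, not the code we’re releasing): once the paired observations arrive one per line, Pearson’s r only needs a handful of running sums, so memory usage stays constant however big the input gets. Any grouping or joining can be pushed upstream into sort:

    import scala.io.Source

    object StreamingPearson {
      def main(args: Array[String]): Unit = {
        // Read tab-separated (x, y) pairs from stdin one line at a time,
        // keeping only the sufficient statistics for Pearson's r.
        var n = 0L
        var sx, sy, sxx, syy, sxy = 0.0
        for (line <- Source.stdin.getLines()) {
          val Array(xs, ys) = line.split('\t')
          val x = xs.toDouble
          val y = ys.toDouble
          n += 1; sx += x; sy += y
          sxx += x * x; syy += y * y; sxy += x * y
        }
        // r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
        val num = n * sxy - sx * sy
        val den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
        println(if (den == 0.0) 0.0 else num / den)
      }
    }

At the shell you’d then drive it with something like paste xs.txt ys.txt piped into the compiled program.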

We’re releasing the code under a BSD license and making it available on github. It’s in a bit of a rough state at the moment, but is usable as is.
