Tag Archives: natural language processing

Computational linguistics and Me

Apparently I’m a computational linguistics blogger. This is sort of news to me. The closest I’ve come to blogging about computational linguistics is in writing a borderline rant about academia.

That being said, I do work in computational linguistics: SONAR is basically a great big NLP system.

This fact, however, is almost totally unrepresented in my blogging.

Actually, that’s part of why I’ve been blogging so much less recently. Since moving onto SONAR my brain has been afire with newly acquired knowledge and trying to figure out how best to apply it to work problems. This has left relatively little time for most of the other stuff I think about that normally generates blogging.

Of course the obvious solution is that I should be blogging about computational linguistics. But that has some obstacles. Primarily:


All the computational linguistics stuff I do is for work. I tinker around with it at home, but haven’t really done anything useful. This makes it difficult to know what I can blog about: I certainly can’t go “HEY GUYS. I FIGURED OUT THIS AWESOME ALGORITHM WHICH WE’RE USING IN SONAR” for everything. We rather rely on some of that magic to make us money. :-)

That being said, there’s definitely stuff I can blog about. For example, there’s nothing particularly confidential in how we extract likely candidate phrases from a document, and it’s at least mildly interesting (probably more to non-linguists, but who knows?). In fact, we’re all actually encouraged to blog more about what we do, but never find the time. So, really, work isn’t that much of an obstacle to blogging about this. It just requires a bit of careful thought.


I’m very new to computational linguistics. As such, I’ve a much less clear idea of what’s bloggable in it. If we look at my blogging history, I started blogging about programming in February 2007. That’s just shy of a year after I started working as a programmer (which, effectively, is just shy of a year after I started programming anything in earnest). And I think it took another six months of blogging before I actually wrote anything worth reading. In comparison, I’ve not even worked in computational linguistics for six months (I think I started work on SONAR in September and had no exposure to it before that). So I’m very much still sort of fumbling along, trying to figure out the best way to do things.

From a work point of view that’s fine. Actually some of my best work is done when I don’t know what I’m doing: I’m more able to ask stupid questions and get useful answers and I come at things from a sufficiently different angle to normal that sometimes I produce unexpected results.

But from a blogging point of view it’s pretty likely that what I end up writing about will range from the trivial to the wrong, until I find my feet. Some of it might be of interest to non-linguists but too basic to be of interest to linguists. Some of it might be so esoteric that it would only be of interest to linguists, at least it would if they weren’t so easily able to point out why it’s wrong. Some of it might be of interest only to me.

But actually this is a really piss poor excuse to not blog about it. Because, frankly, I do not write to amuse you. Writing for other people is, to me, a waste of time. I write about what is of interest to me. With any luck other people will find it interesting too, but that isn’t the primary point.


In conclusion, my two main reasons for not blogging more about computational linguistics, natural language processing, etc. suck. So expect to see more about it here in the future. This probably means you’ll see more Ruby as well, as that’s what we use at work and I don’t expect I’ll bother translating into Scala except when I have a specific reason to do so.

Living on the edge of academia

I am not an academic.

In fact, as far as computer science goes I have never been an academic. My background is in mathematics.

Nevertheless, I try to read a lot of papers. I’m interested (albeit currently only in an armchair way) in programming language implementation and theory. I occasionally feel the need to implement a cool data structure for fun or profit. And, most importantly, I need to for my job.

I work on SONAR. As such, a great deal of what I do is natural language processing and data mining. It’s a fairly heavily studied field in academia and it would be madness for me not to draw on that fact. Eventually I’d like to feed back into it too – we do a bunch of interesting things in SONAR and I’m reasonably confident that we can isolate enough of it into things we can share without giving the entire game away (we need to sell the product after all, so giving away the entire workings of it is non-ideal).

I’ve got to say, it’s a pretty frustrating process.

Part of the problem is what I’ve come to refer to as the “Academic firewall of doom”. Everything lives behind paid subscriptions. Moreover, it lives behind exorbitantly priced paid subscriptions. Even worse: none of the ludicrous quantities of money I’d be paying for these subscriptions, if I chose to pay for them, would go to the authors of the papers. The problems of academic publishing are well known. So assuming I don’t want to pay the cripplingly high fees for multiple journals, I have to either rely on the kindness of authors in publishing their papers on their own websites (a practice I’m not certain of the legal status of but am exceedingly grateful for) or, if they have not done so, pay close to $100 for a copy of the paper. Before I know if the paper actually contains the information I need.

Another, unrelated, problem (currently being discussed on reddit in response to an older article by Bill Mill) is that very few of the papers I depend on for ideas come with reference implementations of the algorithms they describe. I understand the reasons for this. But nevertheless, it’s very problematic for me. The thing is: most of these papers don’t work nearly as universally as one might like to believe. They frequently contain algorithms which have been… let’s say, empirically derived. Take for example the punkt algorithm, which I’m currently implementing (it has an implementation in NLTK, which at least gives me confidence that it works. I’m trying not to look at it too much though, due to its GPLed nature). It starts out with a completely legitimate and theoretically justified statistic to do with collocations. That’s fine. Except it turns out that it doesn’t work that well. So they introduce a bunch of additional weightings to fuzz the numbers into something that does work, and then through experimentation arrive at a test statistic which does a pretty good job (or so they claim).

That’s fine. It’s how I often work too. I’m certainly not judging them for that. But it does mean that I’m potentially in for a lot of work to reproduce the results of the paper (hopefully only a couple of days of work. But that’s a lot compared to firing up the reference implementation) for very uncertain benefits. And additionally it means that if it doesn’t work I don’t know whose fault it is – mine or the algorithm’s. So then if it doesn’t work I have to make a decision whether to spend time debugging or to cut my losses and run.
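To give a flavour of what a “theoretically justified statistic to do with collocations” can look like, here is a minimal sketch of a Dunning-style log-likelihood ratio for a pair of adjacent tokens. This is an assumption about the general family of statistic, not the paper’s exact formula and not anything from SONAR; the function names and the count parameters (`c1`, `c2`, `c12`, `n`) are made up for illustration, and none of the extra empirical weightings are included.

```python
import math


def _log_likelihood(k, n, p):
    """Log of p**k * (1 - p)**(n - k), treating 0 * log(0) as 0."""
    result = 0.0
    if k > 0:
        result += k * math.log(p)
    if n - k > 0:
        result += (n - k) * math.log(1.0 - p)
    return result


def collocation_llr(c1, c2, c12, n):
    """Dunning-style -2 log(lambda) for the adjacent pair (w1, w2).

    c1  -- occurrences of w1
    c2  -- occurrences of w2
    c12 -- occurrences of w1 immediately followed by w2
    n   -- total number of tokens
    (All four names are hypothetical, chosen for this sketch.)
    """
    if not (0 < c1 < n and 0 < c2 < n):
        raise ValueError("c1 and c2 must be strictly between 0 and n")
    if not (0 <= c12 <= min(c1, c2)) or n - c1 - c2 + c12 < 0:
        raise ValueError("inconsistent counts")

    p = c2 / n                  # null hypothesis: w2 is equally likely anywhere
    p1 = c12 / c1               # alternative: P(w2 | previous token is w1)
    p2 = (c2 - c12) / (n - c1)  # alternative: P(w2 | previous token is not w1)

    log_null = _log_likelihood(c12, c1, p) + _log_likelihood(c2 - c12, n - c1, p)
    log_alt = _log_likelihood(c12, c1, p1) + _log_likelihood(c2 - c12, n - c1, p2)
    return -2.0 * (log_null - log_alt)
```

In a sentence-boundary setting one would presumably take `w2` to be the full stop, so that `c12` counts how often a candidate word is followed by one; a high score then suggests the word and the full stop belong together (as in an abbreviation). The score is near zero when `w2` follows `w1` exactly as often as chance predicts, and grows as the two become more strongly associated.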

It should also be noted that negative results are, for whatever reason, the overwhelming majority. I’ve written enough similar things that actually work (at least some of which are based on papers) that I choose to believe that at least not all of these are due to my incompetence.

So, to recap, I’m in a situation where in order to find out if any given paper is useful to me I need to a) spend close to $100 if I’m unlucky, b) in the event that the paper actually does contain what I want, take several days of work to implement its results, and c) in the common case get negative results, and even then not be certain about their validity, so potentially spend more time debugging.

So the common case is designed in such a way as to cost me a significant amount of money and effort and is unlikely to yield positive results. You can understand why I might find this setup a bit problematic.
