
Am I being boring?

I had an enlightening conversation with Mark Wotton on Twitter recently. It started when I gave the following advice:

Writing advice: You should have a constant mental process going asking “Is this bit I’m writing boring?”. If it is, delete or rewrite it.

I’d meant it to apply to prose. I was reading an article which took an important and exciting piece of information and made it deathly dull. I don’t want to link to the article, but I’m sure you can think of examples yourself.

Mark replied:

I thought you were talking about code for a second there – it actually works pretty well there, too.

This wasn’t a context in which I’d thought about it before.

I won’t quote the entire conversation here (you can check it out at the source if you care), but the conclusions from it were interesting.

A lot of code is boring, and sometimes this is ok. Most code does boring things – display a web page, convert data from this format to that format, write out an error message, etc. Boring tasks are ubiquitous and necessary to get things done.

But, as with writing, there’s a difference between writing about things which are boring and writing about things in boring ways.

What does boring writing look like? Well, it could contain a lot of repetition, it could take forever to get to the point (making it non-obvious what it’s about), it could include a lot of irrelevancies…

Hmm. None of those sound like very good things to do when coding, do they?

As people are fond of saying, code should really be first about telling other people what it should do and secondarily about getting the computer to execute it. Coding is a form of writing, and as such it needs to keep the reader interested or their boredom will get in the way of their understanding (side note: this is not the same as making the reader work hard to understand it – not being boring is not the same as being overly clever).

So, it’s ok to write code that is boring because what it is trying to do is inherently boring, but you shouldn’t add unnecessary boredom to the code.

Hmm. That sounds familiar.

So, to borrow the terminology from Fred Brooks: There are two types of boredom. Essential boredom, which is inherent to the problem being solved, and accidental boredom, which is introduced by the programmer. Seek to minimize the latter.
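To make that a bit more concrete, here’s a small made-up sketch in Python. It isn’t from the conversation, and the form fields and function names are purely hypothetical. Both functions do the same inherently dull job of checking required fields, but the first piles repetition on top of it, while the second keeps only the boredom the problem actually demands.

```python
# Hypothetical example: both functions validate the same made-up signup form.

def validate_boring(form):
    """Repetitive version: the same check written out once per field."""
    errors = []
    if not form.get("name"):
        errors.append("name is required")
    if not form.get("email"):
        errors.append("email is required")
    if not form.get("password"):
        errors.append("password is required")
    return errors


REQUIRED_FIELDS = ("name", "email", "password")


def validate(form):
    """Same behaviour, with the repetition factored out into data."""
    return [f"{field} is required" for field in REQUIRED_FIELDS if not form.get(field)]
```

Neither version is clever. The second just gets to the point faster, and adding a fourth field means changing one line instead of copying three.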

This entry was posted in Code, Writing.

Open source term extraction

This is just a quick announcement to let people know that we’ve open sourced our JRuby library for term extraction. You can get the code from my GitHub page.

Unlike a lot of term extraction libraries, this doesn’t take any stance as to the “significance” of the terms it extracts. It’s purely about looking at the syntax and determining where good boundaries for terms are. There are a couple reasons for this, but basically we’ve found that it’s more effective to separate the two steps, and that doing so makes it easier to tinker around with them independently. The criteria for “interestingness” of terms seem to be largely distinct from those for terms which simply make sense linguistically. So we have a two stage pipeline, one which extracts semantically meaningful terms and one which determines what terms are actually interesting in the context of the document. The second step is much more complicated, and we’re not open sourcing that, at least not any time soon, if ever. Even if we wanted to, it relies on a lot more global information across the document corpus and so is very tied in with how SONAR operates, making it much harder to isolate.

So, how does it work? Black magic and voodoo!

Actually, no. It’s pretty straightforward. It builds on top of the excellent OpenNLP library, using its tools for part-of-speech tagging, sentence splitting (a much harder problem than you’d imagine) and phrase chunking. On top of that it’s currently a rule-based system: while you’re still figuring things out, it makes much more sense to stick with something that’s so easy to fine-tune. Our expectation is that we’ll gradually start replacing bits of it with machine learning based techniques as we hit the limitations of a rule-based system, but for now it’s working pretty well.
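To give a flavour of what a rule on top of part-of-speech tags can look like, here’s a minimal sketch in Python. It is not the library’s code and these are not its actual rules: the function name, the tag sets and the single rule (“a run of modifiers and nouns that ends in a noun is a term”) are all assumptions made up for illustration. The input is hand-tagged with Penn Treebank style tags, standing in for what a tagger would produce.

```python
# Illustrative only: a single hypothetical rule, not the real library's rule set.
# Tags follow the Penn Treebank convention (NN* = noun, JJ = adjective,
# CD = cardinal number, RB = adverb).
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
MODIFIER_TAGS = {"JJ", "CD", "RB"}


def extract_terms(tagged_tokens):
    """Return maximal runs of modifiers/nouns that end in a noun."""
    terms = []
    run = []  # the run of (word, tag) pairs currently being built

    def flush():
        # Drop trailing modifiers so that every term ends in a noun.
        while run and run[-1][1] not in NOUN_TAGS:
            run.pop()
        if run:
            terms.append(" ".join(word for word, _ in run))
        run.clear()

    for word, tag in tagged_tokens:
        if tag in NOUN_TAGS or tag in MODIFIER_TAGS:
            run.append((word, tag))
        else:
            flush()  # any other tag ends the current run
    flush()
    return terms


# Hand-tagged input (an assumption, not real tagger output).
sentence = [
    ("we", "PRP"), ("have", "VBP"), ("a", "DT"),
    ("two", "CD"), ("stage", "NN"), ("pipeline", "NN"),
    ("which", "WDT"), ("extracts", "VBZ"),
    ("semantically", "RB"), ("meaningful", "JJ"), ("terms", "NNS"),
]
print(extract_terms(sentence))
```

The appeal of this style is that each rule is a few lines you can read and tweak directly, which is exactly what you want while the definition of a “good boundary” is still moving.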

Let’s have an example. If we feed the second paragraph of this post into the term extractor, we get the following terms back:

term extraction libraries
stance
terms
syntax
good boundaries
couple reasons
two steps
steps
criteria
interestingness
sense
two stage pipeline
stage pipeline
semantically meaningful terms
context
context of the document
document
second step
open sourcing
time
document corpus
SONAR

Hope you find this useful. Let us know if you build anything cool with it!

This entry was posted in Code, programming.

Open sourcing Pearson’s Correlation calculations

As you might recall, I did some articles on calculating Pearson’s correlation in SQL.

It turns out that this is a hilariously bad idea. The performance you get is terrible once the dataset gets large. Switching to PostgreSQL seemed to help a bit here, but even then the timings are not great (and we still aren’t planning on a port to PostgreSQL anyway). So we needed to find a better solution. Doing it in memory would be fast, but it would just fall over on a large dataset.

Anyway, after some tinkering around I came up with a slightly unholy solution. It’s a mix of bash, awk, standard Unix tools and Java (the Java parts may be rewritten in something else later). The design offloads much of the heavy lifting to sort, which works on disk and so doesn’t need to load the whole dataset into memory, and the rest of the pipeline processes its input one line at a time. This lets it get by with very reasonable memory usage and, in my fairly informal tests, perform about 50 times faster than the SQL version.
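The reason sorting helps is that once all the lines for a given pair are adjacent, the correlation can be computed from a handful of running sums, one line at a time. Here’s a minimal sketch of that idea in Python rather than the bash/awk/Java mix of the real thing. The input layout (tab-separated key, x, y, already sorted by key) and the function name are assumptions for illustration, not the released code’s actual format.

```python
import math
from itertools import groupby


def pearson_from_sorted_lines(lines):
    """Yield (key, r) per key. Lines are 'key<TAB>x<TAB>y', pre-sorted by key."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    # groupby only looks at adjacent rows, which is why the input must be sorted.
    for key, rows in groupby(parsed, key=lambda fields: fields[0]):
        n = sx = sy = sxx = syy = sxy = 0.0
        for _, x_str, y_str in rows:
            x, y = float(x_str), float(y_str)
            n += 1
            sx += x
            sy += y
            sxx += x * x
            syy += y * y
            sxy += x * y
        var_x = n * sxx - sx * sx
        var_y = n * syy - sy * sy
        if var_x <= 0 or var_y <= 0:
            # Correlation is undefined when either variable is constant.
            yield key, None
        else:
            yield key, (n * sxy - sx * sy) / math.sqrt(var_x * var_y)


# Made-up sample data, already sorted by key.
sample = [
    "a\t1\t2\n", "a\t2\t4\n", "a\t3\t6\n",
    "b\t1\t3\n", "b\t2\t2\n", "b\t3\t1\n",
]
for key, r in pearson_from_sorted_lines(sample):
    print(key, r)
```

Memory use here depends only on the six running sums, not on the size of the input, so the only step that ever has to deal with the whole dataset is the sort. One caveat: this sum-of-products form can lose precision when the values are large relative to their spread, so treat it as a sketch of the streaming shape rather than a numerically careful implementation.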

We’re releasing the code under a BSD license and making it available on GitHub. It’s in a bit of a rough state at the moment, but is usable as is.

This entry was posted in Code, programming.