Writing things right

OO has contributed many big and important innovations to programming. Among these, the foremost is that you write functions after rather than before their argument.

No, really.

It’s not just OO languages of course. Concatenative languages do the same thing. There’s a long history of mathematicians doing it as well (though we don’t like to talk about them. The cool mathematicians all write their functions on the left).

It’s funny how attached people get to this fact though.

Consider the following piece of Scala code:

object StringUtils {
  /**
   * Trims whitespace from the end of s.
   */
  def rtrim(s: String) = ...
}

We can invoke this as StringUtils.rtrim(myString), or, after import StringUtils._, just rtrim(myString).

People get very upset if you ask them to do so though, and they go to all sorts of lengths to avoid it.
Consider the following three examples from different languages:

Scala:

object StringUtils {
  implicit def string2RTrim(s: String) = new { def rtrim = ... }
}

Ruby:

class String
  def rtrim
    ...
  end
end

C#:

static class StringUtils {
  public static String rtrim(this String s) {
    ...
  }
}

What do these achieve over the previous version? Simple: you can write myString.rtrim instead of rtrim(myString). That’s it. (Actually, the Ruby and Scala versions *can* do more than that; it’s just that here, and in 90% of use cases, they aren’t used for anything else. The C# version literally doesn’t do anything else.)

The thing is, while I’m making fun of this to a certain degree, it’s actually a perfectly reasonable thing to want to do. Designing things in noun-verb order is a good principle of UI design, and it works for programming as well. Things chain better: when you want to add a new function to a pipeline, you add it at the point your cursor is naturally at, and it matches well with thinking of the code as a pipeline of “take this thing, do this to it, do that to it, do this other thing to it, get this value out”. Also, you write far fewer brackets. :-) (Compare Haskell’s foo . bar . baz $ thing idiom for a similar bracket-avoidance tool.)
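
To make the chaining point concrete, here’s a small Scala sketch contrasting the two orderings (the helper functions evens, doubled and total are made up purely for illustration):

object NounVerbDemo {
  // Plain utility functions, used verb-noun style (hypothetical helpers).
  def evens(xs: List[Int]): List[Int] = xs.filter(_ % 2 == 0)
  def doubled(xs: List[Int]): List[Int] = xs.map(_ * 2)
  def total(xs: List[Int]): Int = xs.sum

  def main(args: Array[String]): Unit = {
    val xs = List(1, 2, 3, 4, 5)

    // Verb-noun: reads inside-out, and adding a step means editing the middle.
    println(total(doubled(evens(xs))))             // 12

    // Noun-verb: the same pipeline read left to right; a new step is just
    // appended where the cursor already is, and there are fewer brackets.
    println(xs.filter(_ % 2 == 0).map(_ * 2).sum)  // 12
  }
}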

Of the three solutions above, I’d say the Ruby one is the most obvious (it just uses the fact that classes are open to add a new method to String), but it comes with the possibility of amusingly non-obvious runtime errors when someone else defines a conflicting method. The C# solution seems the best to me – it adds relatively little overhead over writing the utility method as you would otherwise, and it leaves you the option of invoking it either as myString.rtrim or StringUtils.rtrim(myString), so when namespacing conflicts inevitably occur you have an easy fallback. But of course it uses a language feature specifically added to do this, while the other two fall out of more general language features. The Scala solution is, to my mind, decidedly the worst of the three: it’s syntactically noisy and comes with a significant additional runtime overhead.

But honestly I’m not particularly happy with any of these solutions. The Scala and Ruby solutions come with costs disproportionate to the benefit they give, and the C# solution requires an additional language feature. Moreover, each of them requires effort at every definition site in order to make something available that you always want at the use site. Wouldn’t it be better if, for every utility function, you automatically had the option to write it on the right?

Let’s take a digression. What language is the following (rather pointless) code written in?

[1, 2, 3].sort.length

Ruby, right?

Actually, no. It’s Haskell.

Wait, what?

Well, it’s Haskell if you do something slightly evil and redefine the (.) operator (which normally means composition):

Prelude Data.List> let (.) x f = f x
Prelude Data.List> [1, 2, 3].sort.length
3

I saw this trick a while ago (the author was amusingly apologetic for it). It’s evil Haskell code because of the way it redefines an operator that normally means something else (this is totally typesafe of course – existing code will continue to use the old operator definition). But it’s a perfectly valid operator definition, and a rather nice one.

It works well with additional arguments to functions too:

Prelude Data.List> [1, 2, 3].sortBy(compare).length
3

The reason this works is that sortBy takes the list argument curried as its last argument, so sortBy(compare) gives something of type [Int] -> [Int] which we can then apply as above (Haskell’s precedence rules make this work).

So this is a nice trick, but how is it useful to you? Well, it’s probably not. I can’t think of any low noise way of making it work in any of the other languages mentioned so far (the best I can come up with is an evil evil hack in Ruby that would make god go on a kitten killing spree and a mildly nasty hack with operators and implicit conversions in Scala that’s much too noisy to really use), and using it in Haskell will make other Haskell programmers very unhappy with you. But it’s an interesting trick, and I’ll be sure to bear it in mind if I ever get around to creating DRMacIverLang.
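
Something along these lines would be one way of getting the noun-verb order in Scala; this is only a rough sketch of the operators-and-implicit-conversions idea, not necessarily the exact hack alluded to above, and it shows both the extra syntactic noise and the per-step wrapper allocation that make it unappealing:

object Pipe {
  import scala.language.implicitConversions

  // x |> f  means  f(x): right-application, the same trick as the redefined (.) above.
  class Piped[A](x: A) {
    def |>[B](f: A => B): B = f(x)
  }

  // Implicit conversion so any value can start a pipeline.
  // Note the wrapper allocated at every step: part of the runtime overhead.
  implicit def toPiped[A](x: A): Piped[A] = new Piped(x)
}

object PipeDemo {
  import Pipe._

  def main(args: Array[String]): Unit = {
    // The [1, 2, 3].sort.length example, Scala-style:
    println(List(1, 2, 3) |> (xs => xs.sorted) |> (xs => xs.length)) // prints 3
  }
}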


Planet Just Scala

After a little bit of hacking around with Yahoo Pipes I’ve created a filtered version of the Planet Scala feed. I couldn’t figure out a way to make Yahoo Pipes give me a pretty URL (which is stupid, as it gives a pretty web URL just fine), so here’s a decent-urled version of it: http://pipes.yahoo.decenturl.com/planet-just-scala
I’ll probably replace this with a quick custom script at some point when I can be bothered, but this works for now. :-)

Planet Scala: By Scala programmers, usually about Scala

Planet Scala’s selection of feeds has always been a bit haphazard – in some cases we take the whole feed, in others a Scala-specific one.

As I mentioned I was thinking of doing at the end of last year, I’m experimenting with changing this. Except for the feeds with the highest non-Scala to Scala volume ratios, I’m switching the feeds over to the full feed. My impression so far is that 90% of the additional content acquired this way should be of general interest to Scala programmers, and that the volume is not that high.

You’ll probably notice an initial spike in non-Scala content as the backlog of non-Scala posts becomes available, but this should settle down fairly rapidly.

I’ll also look into an easy way of providing a Scala-filtered RSS feed on top of this for people who don’t want the extra content.


Living on the edge of academia

I am not an academic.

In fact, as far as computer science goes I have never been an academic. My background is in mathematics.

Nevertheless, I try to read a lot of papers. I’m interested (albeit currently only in an armchair way) in programming language implementation and theory. I occasionally feel the need to implement a cool data structure for fun or profit. And, most importantly, I need to for my job.

I work on SONAR. As such, a great deal of what I do is natural language processing and data mining. It’s a fairly heavily studied field in academia, and it would be madness for me not to draw on that fact. Eventually I’d like to feed back into it too: we do a bunch of interesting things in SONAR, and I’m reasonably confident that we can isolate enough of it into things we can share without giving the entire game away (we need to sell the product after all, so giving away its entire workings is not ideal).

I’ve got to say. It’s a pretty frustrating process.

Part of the problem is what I’ve come to refer to as the “Academic firewall of doom”. Everything lives behind paid subscriptions. Moreover, it lives behind exorbitantly priced paid subscriptions. Even worse: none of the ludicrous quantities of money I’d be paying for these subscriptions, if I chose to pay for them, would go to the authors of the papers. The problems of academic publishing are well known. So, assuming I don’t want to pay the cripplingly high fees for multiple journals, I have to either rely on the kindness of authors in publishing their papers on their own websites (a practice whose legal status I’m not certain of, but for which I’m exceedingly grateful), or, if they have not done so, pay close to $100 for a copy of the paper, before I even know whether it actually contains the information I need.

Another unrelated problem, which is currently being discussed on reddit in response to an older article by Bill Mill, is that very few of the papers I depend on for ideas come with reference implementations of the algorithms they describe. I understand the reasons for this, but nevertheless it’s very problematic for me. The thing is: most of these papers don’t work nearly as universally as one might like to believe. They frequently contain algorithms which have been… let’s say, empirically derived. Take for example the punkt algorithm which I’m currently implementing (it has an implementation in NLTK, which at least gives me confidence that it works, though I’m trying not to look at it too much due to its GPLed nature). It starts out with a completely legitimate and theoretically justified statistic to do with collocations. That’s fine. Except it turns out that this doesn’t work that well. So the authors introduce a bunch of additional weightings to fuzz the numbers into something that does work, and then through experimentation arrive at a test statistic which does a pretty good job (or so they claim).

That’s fine. It’s how I often work too, and I’m certainly not judging them for it. But it does mean that I’m potentially in for a lot of work to reproduce the results of the paper (hopefully only a couple of days, but that’s a lot compared to firing up a reference implementation) for very uncertain benefits. It also means that if it doesn’t work, I don’t know whose fault it is: mine or the algorithm’s. So then I have to decide whether to spend time debugging or to cut my losses and run.

It should also be noted that negative results are, for whatever reason, the overwhelming majority. I’ve written enough similar things that actually work (at least some of which are based on papers) that I choose to believe that at least not all of these are due to my incompetence.

So, to recap: in order to find out whether any given paper is useful to me, I need to a) spend close to $100 if I’m unlucky, b) in the event that the paper actually does contain what I want, spend several days of work implementing its results, and c) in the common case, get negative results and even then not be certain of their validity, so potentially spend more time debugging.

So the common case is designed in such a way as to cost me a significant amount of money and effort and is unlikely to yield positive results. You can understand why I might find this setup a bit problematic.


Pearsons in the database, part 2

Before I explain what this is about, the following tweets provide useful context for how I feel about this:

We’ve been discovering that the SQL query I wrote for calculating Pearsons, while it works fine for small datasets (say a few hundred thousand ratings), starts being unusably slow once you get to around the million-rating mark, even for a nightly job. Basically, if you look at the query plan, it ends up doing a filesort, which is not nice, as it involves an awful lot of data paging to and from disk. Having tried to optimise it directly and failed, I’ve spent the last two weeks writing nasty hacks to try to make it fast, and was going to follow up the blog post with my final solution (which involves a temporary table and a stored procedure; it’s pretty grim).

But today I was doing something else which involved a similar sort of query and got really pissed off that I was having exactly the same problems. I boiled it down to a very simple example which illustrated it and logged onto freenode’s #mysql to see if anyone could help me figure out what’s going on. Last time I tried this I was not successful, but one lives in hope.

Well, as it turns out, someone could. Here’s a snippet of conversation:

16:20 < DRMacIver> I'm finding this pattern comes up a lot in what I'm doing at the moment, and I simply can't figure out a sensible way to optimise it. Any suggestions? http://pastebin.com/m5be98ace
16:21 < DRMacIver> The self join on the same column followed by a group by seems to produce a filesort no matter what I do. As per example, basically all indices that could exist on the table do.
16:22 < jbalint> DRMacIver: i think you can order by null
16:22 * DRMacIver boggles
16:22 < DRMacIver> That works. Thank you.

So, by updating one line in the SQL I’ve posted previously, the query becomes dramatically faster. I’ve also updated it to use an update join (which is MySQL specific) instead of the dependent subquery in the SET clause (which is not, but which MySQL runs appallingly slowly), in order to get the standard deviation calculations to run in reasonable time.

Why is this happening? Well, http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html contains the answer. It turns out that if you have a GROUP BY, MySQL implicitly orders the results by the grouped columns, even if there is no index on them (and there can’t be one here, because the grouping is across a join of two tables), and ordering a large dataset by a non-indexed key causes a filesort. If you add the clause “order by null”, it no longer orders by the GROUP BY columns, the filesort goes away, and the query becomes much faster.

We are not amused.

As another general tip, we’ve also found that this rapidly starts to generate unacceptably large amounts of data, but that if you set some sensible lower bounds (such as only generating statistics for things which co-occur more than once and have a Pearsons > 0.1) it can easily be reduced to sensible levels.
