Category Archives: programming

Porting Pearsons to Postgres. Performance?

I’ve uploaded a version of the Pearson’s Coefficient code which runs on postgresql. You can download it here.I wrote this as an experiment to see if Postgres could help us with some of our MySQL performance woes.

Some brief experimentation suggests that once you fix PostgreSQL’s ridiculous default configuration the performance story is relatively happy. At small sizes MySQL is moderately faster, but as the sizes get large PostgreSQL seems to take the lead. I don’t have any sort of formal benchmark yet: This needs much more testing before I can definitively claim either is faster than the other, but for now the signs in favour of PostgreSQL are promising.

This entry was posted in programming and tagged on by .

Cleaning up a set of tags, part 1

In order to demonstrate some stuff I wanted to have a set of tagged data to play with. Delicious, flickr, that sort of thing. After some digging around on places like theinfo.org I found out that CiteULike (like delicious but targetted at academic papers) makes a dump of their data available. Unfortunately, it’s a bit messy. Not the data format itself, which is a simple pipe separated text file, but the quality of the tags themself. This is more or less to be expected for user reported tagging. It would be nice to have something a bit cleaner though.

I thought it might be illustrative to clean it up in public. It’s a completely hacky process, and not a particularly smart one, but it might be interesting or helpful to someone.

So, first things first. Get the data: http://static.citeulike.org/data/current.bz2. It’s zipped, so not too large, but will be about 300M unzipped.

It looks roughly like this:

42|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:05.373798+00|ecoli
42|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:05.373798+00|metabolism
42|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:05.373798+00|barabasi
42|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:05.373798+00|networks
43|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:51.839281+00|control
43|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:51.839281+00|engineering
43|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:51.839281+00|robustness
44|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:26:33.156319+00|networks
44|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:26:33.156319+00|strogatz
44|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:26:33.156319+00|survey
44|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:26:33.156319+00|review
45|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:27:38.983179+00|pleiotropy
45|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:27:38.983179+00|barabasi
45|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:27:38.983179+00|notsmall

A couple things to note:

As mentioned, it’s pipe separated. We have a document id, an anonymized user id, a date and a tag. There can be multiple tags for the same (document, user) pair.

Another thing to note is that sometimes it contains concatenatedwords. e.g. “notsmall” (a tag which seems to appear only once, for “Wrestling with pleiotropy: genomic and topological analysis of the yeast gene expression network” for some reason. I think it’s a mistag and was meant to go on “The metabolic world of Escherichia coli is not small”).

Let’s get a sense of what the tags for this look like. We’ll plot a distribution graph like so:

 cat Desktop/current | ruby -ne 'puts $_.split("|")[-1]' | sort | uniq -c | sort -g -r > citeulike_tags

What’s this doing? Not much. We’re catting the data to standard out, feeding it through ruby to split off the last column and sorting the results, giving us a big list of tags with repetitions. We then pipe this through uniq -c, which collates consecutive unique lines with a count prepended (that’s what the -c does). We then sort again in generalised numerical order, reversed, and save the output to a file.

Unix is fun.

Here’s what the results look like:

[email protected]:~$ head citeulike_tags
 212611 bibtex-import
 156901 no-tag
  27926 elegans
  27886 celegans
  27825 c_elegans
  27795 nematode
  27738 wormbase
  27736 caenorhabditis_elegans
  18897 review
  15280 all-articles
[email protected]:~$ tail citeulike_tags
      1 00301512
      1 0025
      1 00208
      1 001287275ep1114608epa00128727epb00128727
      1 0010521342
      1 0009811908
      1 0009390946
      1 000
      1 -------------
      1 ___

So there’s clearly a bunch of random noise there. The top two are unimportant – they’re some sort of autogenerated thing – and the bottom lot are garbage. So, we’ll throw away the top two and everything with only 1 occurrence.

[email protected]:~$ vim citeulike_tags

Let’s look at the data again.

27926 elegans
27886 celegans
27825 c_elegans
27795 nematode
27738 wormbase
27736 caenorhabditis_elegans
18897 review
15280 all-articles
14597 evolution
13694 meeting_abstract
[email protected]:~$ tail citeulike_tags
2 035
2 0325
2 0315
2 021
2 01012
2 004692
2 003
2 0022
2 —-
2 __
The top is looking better. I’m a bit skeptical of “all-articles”. Looking at a few examples it seems to be something generated along with bibtex-import. But we’ll leave that for now.

We’ve got a bunch of duplication there. “celegans”, “c_elegans”, “caenorhabditis_elegans”. Citeulike seems to have a nematode obsession. Not much we can do about that right now though.

The bottom half is the more serious issue. It consists entirely of numbers, which is rubbish. So let’s filter out anything that doesn’t contain some text:

[email protected]:~$ mv citeulike_tags citeulike_tags_old
[email protected]:~$ cat citeulike_tags_old | ruby -ne ‘puts $_ if !($_ =~ /^[^a-z]+$/)’ > citeulike_tags
[email protected]:~$ tail citeulike_tags
2 0graphs
2 0ds-flicker
2 0-doek
2 0compas
2 0cath
2 05-wise-00-01
2 05matheco30_theoretic
2 05matheco25_theoretic
2 04-sose-00
2 041500040u1

Hm. Those are still pretty shitty. At this point I’m tempted to filter out everything which appears only twice. But after a quick trawl through with less I shall resist the temptation, as one finds things like this:

2 accessibility-technology-for-the-deaf
2 accessibilitystandards
2 accesory
2 acceptorphysiology
2 acceptormetabolismphysiology
2 acceptorgeneticsmetabolism
2 acceptability-judgments
2 accents
2 accent_learning
2 accentedness
2 accelerator-physics
2 acceleration-measurement
2 accelerationadverse
2 accelerating-admixture
2 accelerated-combinatorial-synthesis

Which it would be a shame to lose out on if we can avoid it.

At this point I noticed something annoying: There’s absolutely no consistency in how people space things in these tags. Even ignoring the people who concatenatewords, some people use _, some use -, some even have things like “academic-libraries—-collection-development”, which is just aggravating.

Let’s try to get some consistency out of this.

At this point I break out of the console and move to irb for some more interactive hacking. Unfortunately trying to record this turned out to be a pain due to IRB’s tendency to print the whole giant hash. So here’s it is as a ruby script. This looks for all tags which differ only in terms of the presence and type of _s and -s and conflates them all. In each case it chooses the most commonly occuring one, takes that as canonical and gives it the sum of all occurrences of equivalent tags. It then normalises the separator to an underscore.

I ran this as

[email protected]:~$ cat citeulike_tags | ruby group_duplicates.rb > normalized_tags

Now, we’ve still got quite a few tags left:

[email protected]:~$ wc -l normalized_tags
118785 normalized_tags

Let’s see if we can figure out some good ways of reducing this (or at least cutting out noise).

I dug around in it for a bit and noticed that there were a lot of tags of the form “file-import-something”. Not that many (291) but it’s a start. We’ll probably continue blacklisting things as we fine them.

Here’s an example of where we’ve got redundant tags: We’ve got genomic and genomics. statistic and statistics. population and populations. i.e. plurals. Let’s fix that.

For this step we’ll use a stemmer. Snowball has a decent binding for ruby, so we’ll use that. We’ll consider two tags to be equivalent if their parts stem to the same words.

We’re going to repurpose the spacing script above. Software reuse by cut and paste, yay. :-) I’ll need to do this all properly later, so I’ll clean it up for then, but now we’re just experimenting with data. So here’s a rewrite to identify things by stemming.

At this point we’re down to 106229 tags, from 118494 prior to stemming and 272919 originally. Not doing badly, given that most of what we’ve thrown away is junk or redundant information. If we threw out the tags which only appear twice we’d lose another 30352 tags (cat normalized_tags_2 | grep ”    2 ” | wc -l). I was resisting doing that because some of them are quite good quality, but really I don’t think we have enough information to clean the remainder up.

We’re nearly at the point where we’ve run out of what we can do with frequency and word based information – there’s still plenty more we could do in principle, but we’ll hit the point of diminishing returns pretty rapidly from here on out. One thing I have noticed though is that there are a bunch of tags like “of” and “and”. At this point I shall put out the favourite hacky linguistic hammer: The stopword list.

The one I tend to use was compiled for the SMART system by Chris Buckley and Gerard Salton at Cornell University. You can download a copy here. All we’ll use this for is to filter out any tags which happen to be stopwords. Here’s an obvious script to use it to remove stopwords.

This loses us 46 tags. Doesn’t sound like many, but they were mostly fairly high frequency ones, so it’s a nice win.

At this point I declare this part of the work done. We’ve compiled a good list of tags and, although I’ve not actually written the code to do so, because each tag was arrived at by grouping a bunch of existing tags, it’s easy to see how you could figure out which documents are tagged with which of the new tags (if you can’t see, don’t worry, I’ll be tidying this up and using it to generate a tag list next time). In order to go further, we’ll need to actually look at sets of tagged documents. And, to be honest, I don’t know how much further it will take us. It may be that there’s not much more to add. But I’m hopeful.

If you want to have a play with the data, here’s the end result.

This entry was posted in computational linguistics, programming on by .

Yet another MySQL Fail

mysql> create table stuff (name varchar(32));
Query OK, 0 rows affected (0.24 sec)

mysql> insert into stuff values (’foo’), (’1′), (’0′);
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0

mysql> select * from stuff;
+——+
| name |
+——+
| foo |
| 1 |
| 0 |
+——+
3 rows in set (0.00 sec)

mysql> delete from stuff where name = 0;
Query OK, 2 rows affected (0.09 sec)

mysql> select * from stuff;
+——+
| name |
+——+
| 1 |
+——+
1 row in set (0.00 sec)

mysql> create table stuff (name varchar(32));
Query OK, 0 rows affected (0.24 sec)

mysql> insert into stuff values (’foo’), (’1′), (’0′);
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0

mysql> select * from stuff;
+——+
| name |
+——+
| foo |
| 1 |
| 0 |
+——+
3 rows in set (0.00 sec)

mysql> delete from stuff where name = 0;
Query OK, 2 rows affected (0.09 sec)

mysql> select * from stuff;
+——+
| name |
+——+
| 1 |
+——+
1 row in set (0.00 sec)

mysql> WTF????
-> ;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘WTF????’ at line 1

So, what’s going on here? I said to delete everything where the name was 0, but it deleted the row ‘foo’.

The following might help:

mysql> create table more_stuff(id int);
Query OK, 0 rows affected (0.19 sec)

mysql> insert into more_stuff values(’foo’);
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> select * from more_stuff;
+——+
| id |
+——+
| 0 |
+——+
1 row in set (0.00 sec)

When you try to use a string as an integer in MySQL, it takes non numeric strings and turns them into zero. So when you test name = 0, it converts name into an integer and turns that into 0. Consequently strings which can’t be parsed as an integer result in true for this test.

At this point I would rant about how mindbogglingly stupid this behaviour is, but I don’t think I can really be bothered.

This entry was posted in programming, SQL and tagged on by .

Computational linguistics and Me

Apparently I’m a computational linguistics blogger. This is sortof news to me. The closest I’ve come to blogging about computational linguistics is in writing a borderline rant about academia.

That being said, I do work in computational linguistics: SONAR is basically a great big NLP system.

This fact, however, is almost totally unrepresented in my blogging.

Actually, that’s part of why I’ve been blogging so much less recently. Since moving onto SONAR my brain has been afire with newly acquired knowledge and trying to figure out how best to apply it to work problems. This has left relatively little time for most of the other stuff I think about that normally generates blogging.

Of course the obvious solution is that I should be blogging about computational linguistics. But that has some obstacles. Primarily:

Confidentiality

All the computational linguistics stuff I do is for work. I tinker around with it at home, but haven’t really done anything useful. This makes it difficult to know what I can blog about: I certainly can’t go “HEY GUYS. I FIGURED OUT THIS AWESOME ALGORITHM WHICH WE’RE USING IN SONAR” for everything. We rather rely on some of that magic to make us money. :-)

That being said, there’s definitely stuff I can blog about. e.g. there’s nothing particularly confidential in how we extract likely candidate phrases from a document, and it’s at least mildly interesting (probably more to non-linguists, but who knows?). In fact, we’re actually all encouraged to blog more about what we do but never find the time. So, really, work isn’t that much of an obstacle to blogging about this. It just requires a bit of careful thought.

Experience

I’m very new to computational lingusitics. As such, I’ve a much less clear idea what’s bloggable about in it. If we look at my blogging history, I started blogging about programming in february 2007. That’s just shy of a year after I started working as a programmer (which, effectively, is just shy of a year after I started programming anything in earnest). And I think it took another six months of blogging before I actually wrote anything worth reading. In comparison, I’ve not even worked in computational linguistics for 6 months (I think I started work on SONAR in september and had no exposure to it before that). So I’m very much still sortof fumbling along, trying to figure out the best way to do things.

From a work point of view that’s fine. Actually some of my best work is done when I don’t know what I’m doing: I’m more able to ask stupid questions and get useful answers and I come at things from a sufficiently different angle to normal that sometimes I produce unexpected results.

But from a blogging point of view it’s pretty likely that what I end up writing about will range from the trivial to the wrong, until I find my feet. Some of it might be of interest to non-linguists but too basic to be of interest to linguists. Some of it might be so esoteric that it would only be of interest to linguists, at least it would if they weren’t so easily able to point out why it’s wrong. Some of it might be of interest only to me.

But actually this is a really piss poor excuse to not blog about it. Because, frankly, I do not write to amuse you. Writing for other people is, to me, a waste of time. I write about what is of interest to me. With any luck other people will find it interesting too, but that isn’t the primary point.

So…

In conclusion, my two main reasons for not blogging more about comptuational linguistics, natural language processing, etc. suck. So expect to see more about it here in the future. This probably means you’ll see more Ruby as well, as that’s what we use at work and I don’t expect I’ll bother translating into Scala except when I have a specific reason to do so.

Writing things right

OO has contributed many big and important innovations to programming. Among these, the foremost is that you write functions after rather than before their argument.

No, really.

It’s not just OO languages of course. Concatenative languages do the same thing. There’s a long history of mathematicians doing it as well (though we don’t like to talk about them. The cool mathematicians all write their functions on the left).

It’s funny how attached people get to this fact though.

Consider the following piece of Scala code:

object StringUtils{
  /** 
   * Trims whitespace from the end of s.
   */
  def rtrim(s : String) = ...
}

We can invoke this as StringUtils.rtrim(myString). Or if we import StringUtils, just rtrim(myString);

People get very upset if you ask them to do so though, and they go to all sorts of lengths to avoid it.
Consider the following three examples from different languages:

Scala:

object StringUtils{
   implicit def string2RTrim(s : String) = new { def rtrim = ...; }   
}

Ruby:

class String
  def rtrim
  ...
  end
end

C#:

class StringUtils{
   public static String rtrim(this String s) {
     ...
   }
}

What do these achieve over the previous version? Simple: You can write myString.rtrim instead of rtrim(myString). That’s it. (Actually the Ruby and Scala versions both *can* allow you to do different things than that. It’s just that here and in 90% of the use cases they aren’t used for anything else. The C# version literally doesn’t do anything else).

The thing is, while I’m making fun of this to a certain degree, it’s actually a perfectly reasonable thing to want to do. Designing things in noun-verb order is a good principle of UI design, and it works for programming as well. Things chain better – when you want to add new functions to a pipeline you add them at the point your cursor is naturally at and it matches well with thinking of it as a pipeline of “take this thing, do this to it, do that to it, do this other thing to it, get this value out”. Also you write far fewer brackets. :-) (compare Haskell’s foo . bar . baz $ thing idiom for a similar bracket avoidance tool).

Of these, I’d say that the Ruby solution is the most obvious (it just uses the fact that classes are open to add a new method to String), but it comes with the possibility of amusingly non-obvious runtime errors when someone else defines a conflicting method. The C# solution seems the best to me – it’s relatively little overhead over writing the utility method as you would otherwise and comes with the option to invoke it either as myString.rtrim or StringUtils.rtrim(myString), so when namespacing conflicts inevitably occur you have an easy fallback. But of course it uses a language feature specifically added to do this, while the other two are functions of more general language features. The Scala solution is, to my mind, decidedly the worst of the three.It’s syntactically noisy and comes with a significant additional runtime overhead.

But honestly I’m not particularly happy with any of these solutions. The Scala and Ruby solutions come with disproportionate costs to the benefit they give and the C# solution requires an additional language feature. Moreoever, each of these solutions requires effort at each definition site in order to make something available that you always want at the use site. Wouldn’t it be better if for every utility function you automatically had the option to write it on the right?

Let’s take a digression. What language is the following (rather pointless) code written in?

[1, 2, 3].sort.length

Ruby, right?

Actually, no. It’s Haskell.

Wait, what?

Well, it’s Haskell if you do something slightly evil and redefine the (.) operator (which normally means composition):

Prelude Data.List> let (.) x f = f x
Prelude Data.List> [1, 2, 3].sort.length
3

I saw this trick a while ago (the author was amusingly apologetic for it). It’s evil Haskell code because of the way it redefines an operator that normally means something else (this is totally typesafe of course – existing code will continue to use the old operator definition). But it’s a perfectly valid operator definition, and a rather nice one.

It works well with additional arguments to functions too:

Prelude Data.List> [1, 2, 3].sortBy(compare).length
3

The reason this works is that sortBy takes the list argument curried as its last argument, so sortBy(compare) gives something of type [Int] -> [Int] which we can then apply as above (Haskell’s precedence rules make this work).

So this is a nice trick, but how is it useful to you? Well, it’s probably not. I can’t think of any low noise way of making it work in any of the other languages mentioned so far (the best I can come up with is an evil evil hack in Ruby that would make god go on a kitten killing spree and a mildly nasty hack with operators and implicit conversions in Scala that’s much too noisy to really use), and using it in Haskell will make other Haskell programmers very unhappy with you. But it’s an interesting trick, and I’ll be sure to bear it in mind if I ever get around to creating DRMacIverLang.

This entry was posted in programming and tagged , , , , on by .