Archive for January, 2009

Criticizing programming languages

Saturday, January 31st, 2009

Here are snippets from two conversations I had recently:

20:03 < psnively> DRMacIver: I do think of you as taking a “Scala is one of the few languages worth criticizing” stance.

llimllib @DRMacIver So by my count, you’ve been in public arguments over the quality of Scala, Java, Ruby, Lua, Haskell, and OCaml? Missing any?

psnively’s comment is of course patently untrue. I criticize all sorts of languages. It just happens that Scala is one of those I use the most (at least in my free time), so I tend to criticize it more often than others. It’s true that as I rather like it I hold it to higher standards, but I feel no problem with criticizing other languages. :-)

llimllib’s comment is much closer to home. I get into arguments about languages all over the place. The problem is that:

  • All languages have problems with them
  • There are very few languages which don’t have something good about them as well

So the first point gets me into arguments with the people who love the language, the second gets me into arguments with people who hate it. I really can’t win. :-)

So, just to piss everyone off, I figured I’d write a post about the different programming languages I have opinions on. The structure of this will be simple: For every language I feel like I’ve used enough (or at least know enough about and have used some), I will write four items: Two saying what’s good about it, two saying what’s bad about it. These points are not intended to be the best or worst things about the language, or even necessarily representative. They’re just something that has struck me as good or bad. Also, the format practically guarantees that I’m probably not saying anything anyone else hasn’t said a few dozen times already. :-)

Scala

Good:

  • Very interesting modularity features
  • Implicit arguments are great.

Bad:

  • The standard library is pretty embarrassing
  • It’s very easy for it to look like Java with weird syntax, particularly when using Java libraries.

Haskell

Good:

  • Extremely powerful and interesting type system
  • Purity enforced at the language level often makes it much easier to write correct code

Bad:

  • Really weak support for modularity
  • Aspects of the type system make it difficult to reuse nearly identical code across different contexts (consider sort vs. sortBy and map vs. mapM).

Ruby

Good:

  • Flexible syntax and semantics make it easy to write some very terse code
  • Makes general purpose scripting tasks very easy to hack together

Bad:

  • Namespacing issues abound, between open classes and a tendency to stomp all over the global scope.
  • The implementations.

Java

Good:

  • High quality implementation, with very good JIT and garbage collector.
  • Very rich ecosystem with lots of libraries.

Bad:

  • Bindings to native libraries tend to be rather low quality or non-existent (JNI is a pain)
  • Their closest Java equivalents typically are too

C

Good:

  • Gives you a lot of fine grained control over details that higher level languages hide
  • High performance native code compilers for just about every platform ever

Bad:

  • Almost no capability for abstracting over type.
  • Language level support for modularity basically doesn’t exist – namespacing by prefixing your function names, yay.

OCaml

Good:

  • Provides a lot of shiny new features over Standard ML – lazy evaluation, row polymorphism, polymorphic variants, etc
  • Very powerful module system (which I’ll admit to not entirely understanding)

Bad:

  • If you ignore the shiny new features, the core language is basically a worse Standard ML.
  • Although it has a justified reputation for writing high performance code, this seems to be only true if you write it as if it were C.

Lua

Good:

  • Borrows many more features from functional languages (tail calls, lexical scoping, etc) than most of its peers.
  • Very embeddable, rendering it very easy to use in the context of applications where the core is written in other languages.

Bad:

  • Weirdly deficient standard library (I know this is related to ease of embedding, as it seems to be the result of a desire to keep Lua small)
  • The language can be very verbose in places – consider the use of “local” to mean “Hey, I like not stomping all over the global scope”.

Python

This one is a bit weak, as I’ve only really used Python in a few contexts, and haven’t really formed any strong negative opinions about it aside from a generic mild dislike. :-)

Good:

  • Has a lot of really cool libraries / projects built on top of it. (NLTK, NumPy, Sympy, pygame, etc).
  • The batteries included philosophy of the standard library is very appreciated.

Bad:

  • I find the “there is only one way to do it” attitude of the community very dogmatic.
  • I’m not a big fan of the syntax. (I know, I know).

SQL

Good:

  • Forms a nice back end for a data processing app
  • Extremely good for constructing ad hoc queries against your data model

Bad:

  • Dear god can it be verbose.
  • Total lack of standardization is a pain in the ass.

Javascript

Good:

  • Pretty decent support for functional programming
  • The prototype based OO is interesting and useful.

Bad:

  • Really weird scoping issues in places (particularly behaviour of globals and “this”).
  • The implementations are weak and inconsistent

And there we have it. :-) Some of these I could probably say a lot more good and bad about, some of them I struggled a bit on one side or the other and probably couldn’t, but those tend to be the languages I’ve used the least (in particular I’ve used Lua, Python and OCaml dramatically less than I have the others on the list).

Porting Pearsons to Postgres. Performance?

Thursday, January 29th, 2009

I’ve uploaded a version of the Pearson’s Coefficient code which runs on PostgreSQL. You can download it here. I wrote this as an experiment to see if Postgres could help us with some of our MySQL performance woes.

Some brief experimentation suggests that once you fix PostgreSQL’s ridiculous default configuration the performance story is relatively happy. At small sizes MySQL is moderately faster, but as the sizes get large PostgreSQL seems to take the lead. I don’t have any sort of formal benchmark yet: This needs much more testing before I can definitively claim either is faster than the other, but for now the signs in favour of PostgreSQL are promising.

Cleaning up a set of tags, part 1

Tuesday, January 27th, 2009

In order to demonstrate some stuff I wanted to have a set of tagged data to play with. Delicious, flickr, that sort of thing. After some digging around on places like theinfo.org I found out that CiteULike (like delicious but targeted at academic papers) makes a dump of their data available. Unfortunately, it’s a bit messy. Not the data format itself, which is a simple pipe separated text file, but the quality of the tags themselves. This is more or less to be expected for user reported tagging. It would be nice to have something a bit cleaner though.

I thought it might be illustrative to clean it up in public. It’s a completely hacky process, and not a particularly smart one, but it might be interesting or helpful to someone.

So, first things first. Get the data: http://static.citeulike.org/data/current.bz2. It’s zipped, so not too large, but will be about 300M unzipped.

It looks roughly like this:

42|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:05.373798+00|ecoli
42|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:05.373798+00|metabolism
42|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:05.373798+00|barabasi
42|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:05.373798+00|networks
43|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:51.839281+00|control
43|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:51.839281+00|engineering
43|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:51.839281+00|robustness
44|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:26:33.156319+00|networks
44|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:26:33.156319+00|strogatz
44|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:26:33.156319+00|survey
44|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:26:33.156319+00|review
45|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:27:38.983179+00|pleiotropy
45|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:27:38.983179+00|barabasi
45|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:27:38.983179+00|notsmall

A couple things to note:

As mentioned, it’s pipe separated. We have a document id, an anonymized user id, a date and a tag. There can be multiple tags for the same (document, user) pair.
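To make the record layout concrete, here is a minimal Ruby sketch that splits one line into its four fields. The TagRecord struct and the parse_line helper are names invented for illustration; they are not part of the CiteULike dump or of any script used in this post.

```ruby
# Hypothetical names: TagRecord and parse_line exist only for this illustration.
TagRecord = Struct.new(:document_id, :user_id, :timestamp, :tag)

# Split one pipe-separated line into its four fields.
# Assumption: a limit of 4 is used so that a stray "|" inside a tag stays in the tag.
def parse_line(line)
  document_id, user_id, timestamp, tag = line.chomp.split("|", 4)
  TagRecord.new(document_id, user_id, timestamp, tag)
end

record = parse_line("42|61baaeba8de136d9c1aa9c18ec3860e8|2004-11-04 02:25:05.373798+00|ecoli\n")
puts record.document_id
puts record.tag
```

Keeping the fields in a struct rather than a bare array makes it harder to mix up the column order later on.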

Another thing to note is that sometimes it contains concatenatedwords. e.g. “notsmall” (a tag which seems to appear only once, for “Wrestling with pleiotropy: genomic and topological analysis of the yeast gene expression network” for some reason. I think it’s a mistag and was meant to go on “The metabolic world of Escherichia coli is not small”).

Let’s get a sense of what the tags for this look like. We’ll count how often each tag occurs, like so:

 cat Desktop/current | ruby -ne 'puts $_.split("|")[-1]' | sort | uniq -c | sort -g -r > citeulike_tags

What’s this doing? Not much. We’re catting the data to standard out, feeding it through ruby to split off the last column and sorting the results, giving us a big list of tags with repetitions. We then pipe this through uniq -c, which collapses each run of identical consecutive lines into a single line with a count prepended (that’s what the -c does). We then sort again in generalised numerical order, reversed, and save the output to a file.
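For comparison, the same count can be done in one Ruby pass. This is only a sketch of the idea, not something that was run against the dump; tag_counts is a made-up name, and the tie-breaking order is an assumption that need not match what sort -g -r produces.

```ruby
# Count how often each tag occurs, most frequent first.
# `lines` is any enumerable of pipe-separated records (for example File.foreach on the dump).
def tag_counts(lines)
  counts = Hash.new(0)
  lines.each do |line|
    tag = line.chomp.split("|")[-1]
    counts[tag] += 1 if tag # blank lines yield no tag and are skipped
  end
  # Descending by count; ties are broken alphabetically (an assumption).
  counts.sort_by { |tag, count| [-count, tag] }
end

# Toy input with shortened user ids and dates, purely illustrative.
sample = [
  "42|u1|2004-11-04|networks",
  "44|u1|2004-11-04|networks",
  "44|u1|2004-11-04|review",
]
tag_counts(sample).each { |tag, count| puts format("%7d %s", count, tag) }
```

The hash does the job of the sort and uniq -c pair, at the cost of holding every distinct tag in memory at once.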

Unix is fun.

Here’s what the results look like:

david@percy:~$ head citeulike_tags
 212611 bibtex-import
 156901 no-tag
  27926 elegans
  27886 celegans
  27825 c_elegans
  27795 nematode
  27738 wormbase
  27736 caenorhabditis_elegans
  18897 review
  15280 all-articles
david@percy:~$ tail citeulike_tags
      1 00301512
      1 0025
      1 00208
      1 001287275ep1114608epa00128727epb00128727
      1 0010521342
      1 0009811908
      1 0009390946
      1 000
      1 -------------
      1 ___

So there’s clearly a bunch of random noise there. The top two are unimportant – they’re some sort of autogenerated thing – and the bottom lot are garbage. So, we’ll throw away the top two and everything with only 1 occurrence.

david@percy:~$ vim citeulike_tags
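The edit above was done by hand. A scripted equivalent might look like the following sketch. The AUTOGENERATED list and the drop_noise name are assumptions made for illustration, and the input is assumed to be in the "count tag" format that uniq -c produces.

```ruby
# Tags treated as autogenerated noise. Assumption: just the top two from the listing.
AUTOGENERATED = ["bibtex-import", "no-tag"].freeze

# Keep only "count tag" lines whose tag is not autogenerated and occurs more than once.
def drop_noise(lines)
  lines.select do |line|
    count, tag = line.strip.split(/\s+/, 2)
    tag && count.to_i > 1 && !AUTOGENERATED.include?(tag)
  end
end

sample = [" 212611 bibtex-import", "  27926 elegans", "      1 000"]
puts drop_noise(sample)
```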

Let’s look at the data again.

david@percy:~$ head citeulike_tags
27926 elegans
27886 celegans
27825 c_elegans
27795 nematode
27738 wormbase
27736 caenorhabditis_elegans
18897 review
15280 all-articles
14597 evolution
13694 meeting_abstract
david@percy:~$ tail citeulike_tags
2 035
2 0325
2 0315
2 021
2 01012
2 004692
2 003
2 0022
2 ----
2 __

The top is looking better. I’m a bit skeptical of “all-articles”. Looking at a few examples it seems to be something generated along with bibtex-import. But we’ll leave that for now.

We’ve got a bunch of duplication there. “celegans”, “c_elegans”, “caenorhabditis_elegans”. Citeulike seems to have a nematode obsession. Not much we can do about that right now though.

The bottom half is the more serious issue. It consists entirely of numbers, which is rubbish. So let’s filter out anything that doesn’t contain some text:

david@percy:~$ mv citeulike_tags citeulike_tags_old
david@percy:~$ cat citeulike_tags_old | ruby -ne 'puts $_ if !($_ =~ /^[^a-z]+$/)' > citeulike_tags
david@percy:~$ tail citeulike_tags
2 0graphs
2 0ds-flicker
2 0-doek
2 0compas
2 0cath
2 05-wise-00-01
2 05matheco30_theoretic
2 05matheco25_theoretic
2 04-sose-00
2 041500040u1

Hm. Those are still pretty shitty. At this point I’m tempted to filter out everything which appears only twice. But after a quick trawl through with less I shall resist the temptation, as one finds things like this:

2 accessibility-technology-for-the-deaf
2 accessibilitystandards
2 accesory
2 acceptorphysiology
2 acceptormetabolismphysiology
2 acceptorgeneticsmetabolism
2 acceptability-judgments
2 accents
2 accent_learning
2 accentedness
2 accelerator-physics
2 acceleration-measurement
2 accelerationadverse
2 accelerating-admixture
2 accelerated-combinatorial-synthesis

Which it would be a shame to lose out on if we can avoid it.

At this point I noticed something annoying: There’s absolutely no consistency in how people space things in these tags. Even ignoring the people who concatenatewords, some people use _, some use -, some even have things like “academic-libraries----collection-development”, which is just aggravating.

Let’s try to get some consistency out of this.

At this point I break out of the console and move to irb for some more interactive hacking. Unfortunately trying to record this turned out to be a pain due to IRB’s tendency to print the whole giant hash. So here it is as a Ruby script. This looks for all tags which differ only in terms of the presence and type of _s and -s and conflates them all. In each case it chooses the most commonly occurring one, takes that as canonical and gives it the sum of all occurrences of equivalent tags. It then normalises the separator to an underscore.
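The grouping just described can be sketched roughly as follows. This is a reconstruction from the description, not the original group_duplicates.rb: the function name, the [tag, count] input shape, and the choice to collapse runs of separators into a single underscore are all assumptions.

```ruby
# Sketch only: reconstructed from the prose, not the original script.
# Input: [tag, count] pairs. Output: [canonical_tag, total_count] pairs.
def group_duplicates(pairs)
  # Two tags are equivalent when they match after removing every "_" and "-".
  groups = pairs.group_by { |tag, _count| tag.gsub(/[-_]/, "") }
  merged = groups.values.map do |members|
    # The most frequent spelling becomes the canonical one.
    canonical, _count = members.max_by { |_tag, count| count }
    total = members.sum { |_tag, count| count }
    # Assumption: a run of separators collapses to one underscore.
    [canonical.gsub(/[-_]+/, "_"), total]
  end
  merged.sort_by { |tag, total| [-total, tag] }
end

# Made-up counts, purely for illustration.
sample = [["machine-learning", 5], ["machine_learning", 3], ["machinelearning", 1], ["review", 4]]
group_duplicates(sample).each { |tag, total| puts format("%7d %s", total, tag) }
```

Keying on the separator-free spelling is what lets "machinelearning" fall into the same group as the hyphenated and underscored forms.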

I ran this as

david@percy:~$ cat citeulike_tags | ruby group_duplicates.rb > normalized_tags

Now, we’ve still got quite a few tags left:

david@percy:~$ wc -l normalized_tags
118785 normalized_tags

Let’s see if we can figure out some good ways of reducing this (or at least cutting out noise).

I dug around in it for a bit and noticed that there were a lot of tags of the form “file-import-something”. Not that many (291) but it’s a start. We’ll probably continue blacklisting things as we find them.

Here’s an example of where we’ve got redundant tags: We’ve got genomic and genomics. statistic and statistics. population and populations. i.e. plurals. Let’s fix that.

For this step we’ll use a stemmer. Snowball has a decent binding for ruby, so we’ll use that. We’ll consider two tags to be equivalent if their parts stem to the same words.

We’re going to repurpose the spacing script above. Software reuse by cut and paste, yay. :-) I’ll need to do this all properly later, so I’ll clean it up for then, but now we’re just experimenting with data. So here’s a rewrite to identify things by stemming.
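A rough sketch of grouping by stems is below. To keep it self-contained it does not use Snowball: NAIVE_STEM is a deliberately crude stand-in that only strips a trailing "s", and the function names are invented for illustration. A real stemmer would be passed in as a different callable.

```ruby
# Crude stand-in for a real stemmer: strips one trailing "s" from words longer than three letters.
# This is NOT the Snowball algorithm; it exists only so the sketch runs without extra gems.
NAIVE_STEM = ->(word) { word.length > 3 ? word.sub(/s\z/, "") : word }

# Key a tag by the stems of its underscore-separated parts.
def stem_key(tag, stemmer = NAIVE_STEM)
  tag.split("_").map { |part| stemmer.call(part) }.join("_")
end

# Merge [tag, count] pairs whose parts stem to the same words,
# keeping the most frequent spelling and summing the counts.
def group_by_stem(pairs, stemmer = NAIVE_STEM)
  pairs.group_by { |tag, _count| stem_key(tag, stemmer) }.values.map do |members|
    canonical, _count = members.max_by { |_tag, count| count }
    [canonical, members.sum { |_tag, count| count }]
  end
end

# Made-up counts, purely for illustration.
sample = [["genomics", 10], ["genomic", 4], ["population", 7], ["populations", 2]]
group_by_stem(sample).each { |tag, total| puts "#{total} #{tag}" }
```

Stemming each part separately means a multi-word tag only merges with another when every one of its parts agrees.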

At this point we’re down to 106229 tags, from 118494 prior to stemming and 272919 originally. Not doing badly, given that most of what we’ve thrown away is junk or redundant information. If we threw out the tags which only appear twice we’d lose another 30352 tags (cat normalized_tags_2 | grep "    2 " | wc -l). I was resisting doing that because some of them are quite good quality, but really I don’t think we have enough information to clean the remainder up.

We’re nearly at the point where we’ve run out of what we can do with frequency and word based information – there’s still plenty more we could do in principle, but we’ll hit the point of diminishing returns pretty rapidly from here on out. One thing I have noticed though is that there are a bunch of tags like “of” and “and”. At this point I shall pull out the favourite hacky linguistic hammer: The stopword list.

The one I tend to use was compiled for the SMART system by Chris Buckley and Gerard Salton at Cornell University. You can download a copy here. All we’ll use this for is to filter out any tags which happen to be stopwords. Here’s an obvious script to use it to remove stopwords.
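The filtering step amounts to something like the sketch below. The remove_stopwords name and the [tag, count] input shape are assumptions, and the three-word list is a tiny stand-in rather than the SMART list itself.

```ruby
require "set"

# Drop [tag, count] pairs whose tag is a stopword.
# `stopwords` is any collection of words, for example the lines of a stopword file.
def remove_stopwords(pairs, stopwords)
  stop = Set.new(stopwords.map { |word| word.strip.downcase })
  pairs.reject { |tag, _count| stop.include?(tag.downcase) }
end

stopwords = ["of", "and", "the"] # tiny illustrative list, not the SMART list
sample = [["of", 120], ["networks", 90], ["and", 75]] # made-up counts
remove_stopwords(sample, stopwords).each { |tag, count| puts "#{count} #{tag}" }
```

Only whole tags are compared, so a tag that merely contains a stopword as one of its parts is left alone.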

This loses us 46 tags. Doesn’t sound like many, but they were mostly fairly high frequency ones, so it’s a nice win.

At this point I declare this part of the work done. We’ve compiled a good list of tags and, although I’ve not actually written the code to do so, because each tag was arrived at by grouping a bunch of existing tags, it’s easy to see how you could figure out which documents are tagged with which of the new tags (if you can’t see, don’t worry, I’ll be tidying this up and using it to generate a tag list next time). In order to go further, we’ll need to actually look at sets of tagged documents. And, to be honest, I don’t know how much further it will take us. It may be that there’s not much more to add. But I’m hopeful.

If you want to have a play with the data, here’s the end result.

Yet another MySQL Fail

Monday, January 26th, 2009

mysql> create table stuff (name varchar(32));
Query OK, 0 rows affected (0.24 sec)

mysql> insert into stuff values ('foo'), ('1'), ('0');
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0

mysql> select * from stuff;
+------+
| name |
+------+
| foo  |
| 1    |
| 0    |
+------+
3 rows in set (0.00 sec)

mysql> delete from stuff where name = 0;
Query OK, 2 rows affected (0.09 sec)

mysql> select * from stuff;
+------+
| name |
+------+
| 1    |
+------+
1 row in set (0.00 sec)

mysql> WTF????
-> ;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'WTF????' at line 1

So, what’s going on here? I said to delete everything where the name was 0, but it deleted the row ‘foo’ as well as the row ‘0’.

The following might help:

mysql> create table more_stuff(id int);
Query OK, 0 rows affected (0.19 sec)

mysql> insert into more_stuff values('foo');
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> select * from more_stuff;
+------+
| id   |
+------+
|    0 |
+------+
1 row in set (0.00 sec)

When you try to use a string as an integer in MySQL, it takes non-numeric strings and turns them into zero. So when you test name = 0, it converts each name into a number, and ‘foo’ becomes 0. Consequently every string which can’t be parsed as a number makes this test true.
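The effect is easy to model outside the database. The Ruby sketch below is only an analogy: Ruby’s String#to_i happens to return 0 for strings with no leading digits, which mirrors the coercion described above for these three values, but it is not MySQL’s actual conversion code. The matches_numeric and matches_string helpers are hypothetical names.

```ruby
rows = ["foo", "1", "0"]

# Models `name = 0`: the string is coerced to a number before comparing.
def matches_numeric(name, number)
  name.to_i == number
end

# Models `name = '0'`: both sides are strings, so no coercion happens.
def matches_string(name, literal)
  name == literal
end

puts rows.select { |name| matches_numeric(name, 0) }.inspect   # prints ["foo", "0"]
puts rows.select { |name| matches_string(name, "0") }.inspect  # prints ["0"]
```

Quoting the literal, as in delete from stuff where name = '0', keeps the comparison between strings and so only matches the row ‘0’.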

At this point I would rant about how mindbogglingly stupid this behaviour is, but I don’t think I can really be bothered.

Importing work posts

Monday, January 26th, 2009

I’m automatically importing a feed of my posts from our work blog into here now. I haven’t yet figured out a way to automatically add a comment saying they were originally from there, so until I do some of them might be a little confusing. e.g. one says “Hi. I don’t post here often”. :-)