Living on the edge of academia

I am not an academic.

In fact, as far as computer science goes I have never been an academic. My background is in mathematics.

Nevertheless, I try to read a lot of papers. I’m interested (albeit currently only in an armchair way) in programming language implementation and theory. I occasionally feel the need to implement a cool data structure for fun or profit. And, most importantly, I need to for my job.

I work on SONAR. As such, a great deal of what I do is natural language processing and data mining. It’s a fairly heavily studied field in academia and it would be madness for me to not draw on that fact. Eventually I’d like to feed back into it too – we do a bunch of interesting things in SONAR and I’m reasonably confident that we can isolate enough of it into things we can share without giving the entire game away (we need to sell the product after all, so giving away the entire workings of it is non-ideal).

I’ve got to say, it’s a pretty frustrating process.

Part of the problem is what I’ve come to refer to as the “Academic firewall of doom”. Everything lives behind paid subscriptions. Moreover, it lives behind exorbitantly priced paid subscriptions. Even worse: none of the ludicrous quantities of money I’d be paying for these subscriptions if I chose to pay for them would go to the authors of the papers. The problems of academic publishing are well known. So assuming I don’t want to pay the cripplingly high fees for multiple journals, I have to either rely on the kindness of authors in publishing their papers on their own websites (a practice I’m not certain of the legal status of but am exceedingly grateful for) or, if they have not done so, pay close to $100 for a copy of the paper, before I know whether the paper actually contains the information I need.

Another, unrelated problem, which is currently being discussed on reddit in response to an older article by Bill Mill, is that very few of the papers I depend on for ideas come with reference implementations of the algorithms they describe. I understand the reasons for this. But nevertheless, it’s very problematic for me. The thing is: most of these papers don’t work nearly as universally as one might like to believe. They frequently contain algorithms which have been… let’s say, empirically derived. Take for example the punkt algorithm which I’m currently implementing (which has an implementation in NLTK, which at least gives me confidence that it works. I’m trying not to look at it too much though due to its GPLed nature). It starts out with a completely legitimate and theoretically justified statistic to do with collocations. That’s fine. Except it turns out that it doesn’t work that well. So they introduce a bunch of additional weightings to fuzz the numbers into something that does work and then through experimentation arrive at a test statistic which does a pretty good job (or so they claim).

That’s fine. It’s how I often work too. I’m certainly not judging them for that. But it does mean that I’m potentially in for a lot of work to reproduce the results of the paper (hopefully only a couple of days of work. But that’s a lot compared to firing up the reference implementation) for very uncertain benefits. And additionally it means that if it doesn’t work I don’t know whose fault it is – mine or the algorithm’s. So then if it doesn’t work I have to make a decision whether to spend time debugging or to cut my losses and run.

It should also be noted that negative results are, for whatever reason, the overwhelming majority. I’ve written enough similar things that actually work (at least some of which are based on papers) that I choose to believe that at least not all of these are due to my incompetence.

So, to recap, I’m in a situation where in order to find out if any given paper is useful to me I need to a) spend close to $100 if I’m unlucky, b) in the event that the paper actually does contain what I want, take several days of work to implement its results, and c) in the common case get negative results, and even then not be certain of their validity, so potentially spend more time debugging.

So the common case is designed in such a way as to cost me a significant amount of money and effort and is unlikely to yield positive results. You can understand why I might find this setup a bit problematic.

This entry was posted in programming, Science.

13 thoughts on “Living on the edge of academia”

  1. BCox

    I see that your company, Trampoline, is located in London. I’m not familiar with the policies of UK university libraries but I know that most public and many private university libraries here in the States allow visitor access for the public. If it’s the same there you should ask your boss for a day outside the office to photocopy as many papers as you can. I have many friends who have spent a few days outside of the office reading journals at a university 20 miles away.

    As for the implementation issues:
    I work in a different field but I’ve found the overstatement of an algorithm’s effectiveness to be a universal attribute of most scholarly works. It’s incredibly disappointing to spend a few days implementing a silver bullet only to find that it’s a lump of coal when applied to real world data. This is just the way it is when the work you’re doing is pushing the boundaries of common knowledge.

  2. david Post author

    Yeah, I overstated the point on access slightly: If it ever comes down to a point where I simply *have* to have access to one particular paper I can just go to the British Library or something and photocopy it instead of paying the $100. But having a not-that-easy-to-access physical copy is still massively inconvenient compared to an electronic copy. So mostly I just take the path of least resistance and don’t bother with the papers I don’t have immediate electronic access to.

    On implementation: I understand that overstatement of effectiveness is a general problem. It would just be nice to find that out up front rather than after I’ve put in all the work. :-)

    I try to be charitable about it – in some cases it’s probably just that it only works in a certain subset of cases. I’m sure many of the things we’re doing in SONAR are similarly ineffective when applied to things outside our domain (e.g. our techniques are optimised for lots of short documents and probably don’t do nearly so well on fewer long ones). And sometimes I just can’t be charitable. A lot of algorithms published are actually complete nonsense and don’t do what they claim to do but get away with it because of massaging of data and other pre/postprocessing.

  3. Jason Adams

    This is something I have encountered and been bothered by quite a bit. It’s one of the reasons I advocate using git for research. Ted Pedersen wrote about this very issue as well in a recent issue of Computational Linguistics, so it is encouraging to know that it is at least being acknowledged more widely. One of his points is that even though creating distributable software requires more effort and time, it pays off by having people use your software, increasing your reputation.

    It’s not always enough to release the code, though. You can find plenty of bits of software out there. Sometimes it’s in a state so bad, it’s almost worse having the code than just reimplementing it yourself (at least, that’s what you tell yourself).

  4. diN0bot

    spot on! most (computer science) research intends to be open content and open source, but in practice fails.

    the culture is changing, so i’m not too worried. keep spreading the word and encouraging academics to use GitHub or whatever makes sharing knowledge most convenient and effective.

  5. suman

    You can become an ACM or IEEE member to access the papers. The membership fee is cheap compared to the price for a single paper. Also, if you have access to a university library, you can download the papers for free there since the university would already have access.

  6. Xianhang Zhang

    Hi David,

    I’ve generally had very good luck just emailing the authors. I don’t think I’ve ever been rejected with a polite request for a paper and I’ve certainly always been willing to send my paper to anyone who asks me.

    As for not publishing a reference implementation, I’m willing to bet a huge part of that is also that most academics are kind of embarrassed about the quality of their code. It’s not always the most elegant of software engineering and the effort to make it publication worthy is often neglected with so much other stuff to do. Again, I’ve had good luck emailing them and asking if they’d be willing to send the source code. I don’t think I’ve ever had a reply back that didn’t include some sort of apology for the quality of the code.

  7. david Post author

    Thanks all. Glad to know it’s not just me. :-)

    Jason: Thanks for the link. It was an interesting read.

    On the subject of bad code: I genuinely would rather have bad code than no code. Even if I can’t actually get the damn thing to work it still gives me a source for figuring out the hidden details. To take the punkt example – as far as I can tell, nowhere in the paper does it specify its exact tokenizing scheme. It’s easy to guess an appropriate one, and the one I’ve guessed seems to work, but if it didn’t work it would be really nice to be able to go back to the source and figure out exactly what it considers a token.

    Good point about asking people for the papers and code. For some reason I never think to do that. I shall try to be better about it in future.

  8. Hadley Wickham

    Have you thought about joining a university library? It’s usually possible to gain access to the electronic resources of a university library just by paying a fairly reasonable access fee (I just looked at the City University of London and it’s only 100 pounds / year)

  9. Eugene

    I’ve returned to academia after 30 years of professional experience. Having spent time on both sides of the great divide, my suggestion is that you identify those academics who are active in your area of interest and invite them to actively consult and/or perform sponsored research. That way you’ll end up with the advances that you seek and the academic community ends up with practical application and case study. Such a win-win may even have R&D concessions and other government support. (In Australia, your company would access a 125% tax concession plus assistance via additional grants.) All universities that I’m acquainted with have commercial research/consulting arms.

  10. Adrian Kuhn

    David, I can only agree with you. Even though I am in academia, I suffer from the same problems. In particular the lack of reference implementations and thus reproducible results(!!!) is very annoying. This is, among others, one of the reasons why we required that any submission for the WASDeTT journal issue must consist of both papers and tool with sources! (The issue is still in the reviewing process, for the moment refer to

    My personal experience wrt asking for papers and code is as follows: Papers good, code bad. For papers beyond my research area (ie LNCS Springer) I am without free access either. My solution: ask other researchers for an ssh account within their LAN and download the papers via VPN tunnel. Regarding reference implementations, I have had very bad experiences asking for non-open-sourced code. If researchers don’t open-source their code, they still give it to you but will become very possessive about everything you do afterwards, however unrelated it might be.

  11. Pingback: David R. MacIver » Blog Archive » Computational linguistics and Me

  12. Robert Daland

    I second the comment about an electronic subscription to a university library.
    Through Northwestern University’s library I can get electronic access to almost everything that has been published since 1996 which is worth reading. A subscription costs something on the order of 120 USD/year.

    Also, about publishing code, consider the incentive structure.
    The citation distribution follows a power law, meaning that a few papers get many citations but most papers get few citations or none at all. In other words, the probability is high that no one will care about your code. Since academics are often ashamed to release an inferior product, they feel they must make a cleaned-up version for public release. And the expected return on this is negative, since the probability is so low that anyone actually cares.
    Academics would publish their code much more freely if they were rewarded for it, and those rewards were clear and tangible. For example, if released code factored into tenure decisions, you would see a quantal change in how much open-sourcing goes on.
