
Rube Goldberg 2.0

I just voted up a link on reddit. The link doesn’t matter, but it started me thinking about the chain of events that followed it.

First, a bit of JavaScript submits some information back to reddit’s server. The Python code on the receiving end stores it.

At some point later, some Ruby code running on my Slicehost box requests a JSON file from reddit. The Python serves this up; the Ruby fetches it, parses it and hands it off to delicious.com, where a mix of PHP and C++ (I think) stores it in their backend.

At some point later a PHP script running on my blog fetches that data from delicious. It dumps it into a MySQL database.

At some point later still, someone comes along to my website. They see a link to the thing I voted up on reddit, served up by PHP.

Alternative titles: “Polyglotism now!”, “…except for too much indirection”, “He Knows The Unix Way”.

This entry was posted in programming.

Open source term extraction

This is just a quick announcement to let people know that we’ve open sourced our JRuby library for term extraction. You can get the code from my github page.

Unlike a lot of term extraction libraries, this doesn’t take any stance as to the “significance” of the terms it extracts. It’s purely about looking at the syntax and determining where good boundaries for terms are. There are a couple reasons for this, but basically we’ve found that it’s more effective to separate the two steps and makes it easier to tinker around with them independently. The criteria for “interestingness” of terms seem to be largely distinct from those for terms which simply make sense linguistically. So we have a two stage pipeline, one which extracts semantically meaningful terms and one which determines what terms are actually interesting in the context of the document. The second step is much more complicated, and we’re not open sourcing that (yet? probably not any time soon, if ever. Even if we wanted to, it relies on a lot more global information across the document corpus and so is very tied in with how SONAR operates, making it much harder to isolate).

So, how does it work? Black magic and voodoo!

Actually, no. It’s pretty straightforward. It builds on top of the excellent OpenNLP library, using its tools for part of speech tagging, sentence splitting (a much harder problem than you’d imagine) and phrase chunking. It’s currently a rules based system on top of that, because while you’re still figuring things out it makes much more sense to stick with something so easily fine tunable. Our expectation is that we’ll gradually start replacing bits of it with machine learning based techniques as we start to hit the limitations of a rules based system, but for now it’s working pretty well.
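
To give a flavour of what a rules based layer over a part of speech tagger can look like, here is a toy sketch in Scala. It is not the library’s code: the Tagged type, the TermSketch object and the single rule (take maximal runs of adjectives, numbers and nouns, trimmed so that they end in a noun) are all assumptions made for illustration, and the input is tagged by hand with Penn Treebank style tags rather than by OpenNLP.

```scala
object TermSketch {
  // A token paired with its part of speech tag (Penn Treebank style, e.g. NN, JJ).
  // Hypothetical type: real taggers return richer structures than this.
  final case class Tagged(word: String, tag: String)

  private def isNoun(t: Tagged): Boolean = t.tag.startsWith("NN")
  private def isModifier(t: Tagged): Boolean = t.tag.startsWith("JJ") || t.tag == "CD"
  private def inTerm(t: Tagged): Boolean = isNoun(t) || isModifier(t)

  // Toy rule: a term is a maximal run of modifiers and nouns that ends in a noun.
  def extractTerms(sentence: List[Tagged]): List[String] = {
    @annotation.tailrec
    def loop(rest: List[Tagged], found: List[String]): List[String] = {
      // Skip everything that cannot be part of a term (verbs, determiners, ...).
      val candidates = rest.dropWhile(t => !inTerm(t))
      if (candidates.isEmpty) found.reverse
      else {
        val (run, tail) = candidates.span(inTerm)
        // Drop trailing modifiers so that the term ends in a noun.
        val trimmed = run.reverse.dropWhile(t => !isNoun(t)).reverse
        val next =
          if (trimmed.isEmpty) found
          else trimmed.map(_.word).mkString(" ") :: found
        loop(tail, next)
      }
    }
    loop(sentence, Nil)
  }
}
```

Fed the hand-tagged fragment “term extraction libraries take any stance”, this gives back the two terms “term extraction libraries” and “stance”. A real system needs many more rules than this, which is exactly why something easy to tweak is attractive early on.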

Let’s have an example. If we feed the second paragraph of this post into the term extractor, we get the following terms back:

term extraction libraries
stance
terms
syntax
good boundaries
couple reasons
two steps
steps
criteria
interestingness
sense
two stage pipeline
stage pipeline
semantically meaningful terms
context
context of the document
document
second step
open sourcing
time
document corpus
SONAR

Hope you find this useful. Let us know if you build anything cool with it!

This entry was posted in Code, programming.

Open source nostalgia

I write a lot of open source code.

This is not, however, the same as saying I write a lot of useful open source code. I suspect 90% of open source code I’ve written has never been productively used by anyone except me (and probably half of that 90% hasn’t been used productively by me either and was just an amusing hack). That’s ok though. It’s not really meant to be – when I create a project I actively intend for other people to use I put a bit more effort into it, but for the most part it’s really a case of “I wrote this. I don’t care about it that much. Maybe someone else will find it useful”. Most of the time they don’t, but far be it from me to judge what would and wouldn’t be useful to other people among my random hackings.

There was a nice example of this the other day, when I had the following exchange on twitter with Josh Reich (transcribed in IRC format because I don’t have a better twitter transcript convention):

<@i2pi> I'm stealing some java code found on the internets written by one @DRMacIver - thanks!
<@DRMacIver> @i2pi You're welcome. :-) Which code?
<@i2pi> @DRMacIver FlatteningIterator + its recursive cousin
<@DRMacIver> @i2pi Oh,wow. Code from the dawn of time. :-)
<@i2pi> @DRMacIver It's funny - years ago I used your FlatteningIterator, but never realized it was 
             yours until I went to grab the code again today.
<@DRMacIver> @i2pi Glad it was useful! I'm never sure how much, if any, of the random code I put 
             out there is.

That’s all. Nothing earth shattering – but it was nice to get a thank you for a little bit of code I wrote once upon a time, and to know that that code had helped someone out. It made me smile on what was otherwise turning out to be a fairly meh day, so was extra appreciated for that.

The code in question is here. It’s nothing special – it’s an iterator that recursively descends into other iterators in a depth first traversal. It’s one of those things that just about anyone could write, but if someone else has already written it you’d probably want to reuse their version rather than write your own owing to the presence of a few fiddly edge cases.

To be honest I’d forgotten all about it. I actually wrote this code about 8 or so months after I first really learned to program. It’s not bad code given that: I can’t see anything obvious in it that makes me go “Oh my god, I did that?”. The design of the overall API is a bit icky (you have iterators which return a mix of values, other iterators and collections, and it descends into every iterator or collection it finds and returns every value), but given the specified behaviour the code is ok. Except for the bug I just noticed, where the code for supporting arrays is completely wrong. The commenting style is a bit too “I should write comments as much as possible”, so it’s a little over-obvious. Still, it was nice to know I wasn’t a complete idiot back then (the array bug is an instance of something I still seem to be guilty of today – not testing enough – so I’m not counting that).
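
For anyone who hasn’t met the idea, the behaviour described above can be sketched roughly as follows. This is a fresh Scala sketch written for illustration, not the original Java code: the FlattenSketch wrapper and the internals are made up here, and arrays are left out entirely, so they are treated as plain values.

```scala
object FlattenSketch {
  // Depth first iterator: descends into any Iterator or Iterable it meets
  // and yields everything else as a value.
  final class FlatteningIterator(root: Iterator[Any]) extends Iterator[Any] {
    // Iterators currently being traversed; the head is the innermost one.
    private var stack: List[Iterator[Any]] = List(root)
    // The next value to hand out, if one has already been found.
    private var lookahead: Option[Any] = None

    // Move forward until a value is buffered or everything is exhausted.
    private def advance(): Unit =
      while (lookahead.isEmpty && stack.nonEmpty) {
        val current = stack.head
        if (!current.hasNext) stack = stack.tail
        else current.next() match {
          case it: Iterator[_] => stack = it :: stack
          case xs: Iterable[_] => stack = xs.iterator :: stack
          case value           => lookahead = Some(value)
        }
      }

    def hasNext: Boolean = {
      advance()
      lookahead.isDefined
    }

    def next(): Any = {
      advance()
      val value = lookahead.getOrElse(
        throw new NoSuchElementException("next on empty iterator"))
      lookahead = None
      value
    }
  }
}
```

The fiddly edge cases are mostly the empty ones: an empty collection nested in the middle, or an iterator that runs out while hasNext is being asked about, which is why the sketch buffers one value ahead rather than trusting the innermost iterator’s own hasNext.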

Anyway, I enjoyed being reminded of this code, enjoyed the fact that it was useful to someone and appreciated being thanked for it. I didn’t really have anything more profound to say than that.

This entry was posted in programming.

Crowding the trampoline

As most of you probably know by now, even though I don’t talk about them that much, I work for a company called Trampoline Systems. We’re a startup doing some interesting tech things. That’s not what this post is about.

We’re seeking series B funding at the moment, but it’s a difficult time to be doing it through the normal VC route, so now we’re trying something new: Crowdfunding. Rather than getting a few people to give us lots of money, let’s get lots of people to give us a little money. Alistair knows more about it than me, so I’ll refer you to him if you want to know the details.

There are a bunch of legal difficulties with this in terms of who the FSA will allow us to solicit funding from. In particular I’d be surprised if even 10% of people reading this were on the list. So, this isn’t a “Give us money” request. To be honest, even if it weren’t for the regulations it probably wouldn’t have been – other people in the company know more about the financial side of things and can say it better than I can.

What’s most interesting to me about the crowd funding isn’t actually the financial aspects. I mean, obviously ensuring the survival of the company is a good thing, but the crowd funding is interesting in a way that merely receiving a big chunk o’ VC funding wouldn’t have been (not that it would have been unwelcome!).

What’s interesting is the additional flexibility it buys. I’m big on the subject of open source and open information (I’m not a GNU style fanatic – I’m absolutely fine with closed source too. I believe in closing as much source as you need to and opening as much source as you can). There’s been a movement amongst the dev team (particularly me and Craig, our CTO) to see what we can extract from SONAR in the way of useful open source tools. Our term extraction code for example (which takes a blob of text and gives you useful fragments of text from it which make sense in isolation) is ripe for open sourcing. Unfortunately we’ve held off on it because it sounds like a much bigger chunk of our IP than it actually is, and we need to be super careful about how things look to our funders. This is understandable from their point of view, but somewhat disheartening from mine.

With crowd funding our hope (or at least my hope. This is all still under discussion) is that the larger group will be much more amenable to a policy of openness than the smaller. In many ways it’s much more in keeping with the style of the thing, and with less invested per person there’s less of a strong financial incentive to be risk averse and more of a reason to trust us with these decisions.

So, from my point of view, I’m quite looking forward to seeing what the future brings and, with any luck, it will include a few shiny new toys for you to play with.

This entry was posted in life, programming.

How packages work in Scala

THIS PIECE IS FULL OF LIES DO NOT TRUST IT

More accurately, its information is out of date and no longer valid. This describes the old behaviour of the Scala package system. Its behaviour has been different from this for some years now, as it turned out most people weren’t reaching the “acceptance” stage I describe below and after enough shouting the behaviour got changed. This is preserved solely for posterity. Do not rely on it for accurate information.

Original piece follows:

Every now and then someone discovers how packages work in Scala. This process typically passes through a number of stages.

  1. Confusion: “Hey, guys, I found this weird bug. Can you take a look?”
  2. Surprise: “What? It works like that? Really?”
  3. Denial: “No, I don’t believe you. This has to be a bug.”
  4. Anger: “Dear scala-debate. This is the worst feature in the entire world, and if you don’t agree with me you’re a big poopy head”
  5. Acceptance: “Actually, this is quite a neat feature”

Not everyone reaches step 5. Many stay in step 4 permanently, often because they’ve discovered that this interacts poorly with certain conventions they use.

This behaviour is particularly unfortunate because actually Scala’s package behaviour is quite nice. But people don’t seem to be willing to believe this and instead make up all sorts of behaviour which it doesn’t have and never has had and then get upset when the reality does not correspond to their fiction.

And so, in the hopes of dispelling some of this confusion, I bring to you the reality of how packages work in Scala. Some of this is very basic material, but I’m presenting it in case you’ve not explicitly thought about it in these terms as it will help with the leadup to the actually important part.

Identifiers

You have a bunch of identifiers in scope. These are names for things. It doesn’t matter what they’re names for: They could be vals, defs, packages, objects, etc. So for example suppose I have:

package foo;
object bar;
object baz{
   val kittens = "kittens";
}

Within this file, say within the object bar, we’ve got a bunch of identifiers in scope: we have foo, the package we are in; bar, an object; and baz, another object. We don’t have kittens in scope (except within the object baz).

Within the object baz, everything in scope at the outer level is in scope here, but we’ve introduced the additional identifier kittens.

Note that a package conceptually constitutes one “level”. Everything from your current package is in scope, regardless of how you split it up into files – I could have moved some of the objects above into separate files and nothing would have changed.

Top level identifiers

Packages like foo are “top level” – they live in the global scope. Any file can refer to the identifier foo.

Nesting of packages

In the same way we had an object inside a package and introduced a new scope, we can nest a package inside a package.

package mammals;

package rodents{ 
   class Rat;
}

This places the package “rodents” inside the package “mammals”. In exactly the same way the object did, this inherits everything from the outer scope (and remember: the scope of the package is the scope of everything in that package, regardless of which file it lives in). So in

package mammals;

class Cat;

package rodents{ 
   class Rat{
     def flee(moggy : Cat) = println("Help, help! Run away! It's " + moggy)
   }
}

the identifiers of the outer scope are available in the inner one.

But this sort of deeply nested package structure gets very ugly to write, so what one tends to do is separate it out to one package in a given file, even the nested ones, and so there’s syntax to support it:

package mammals.rodents;

class Rat{
  def flee(moggy : Cat) = println("Help, help! Run away! It's " + moggy)
}

This is exactly the same as the previous example except we’ve moved Cat to another file. It’s still in scope as before.

Members

Identifiers can have members. These are other identifiers which live on them and can be accessed with a “.”

For example, to refer to Rat from the package mammals we would refer to it as rodents.Rat.

Shadowing

You can reintroduce the same identifier at an inner level. Going back to our first example suppose we had written baz as

object baz{
   val bar = "kittens"
   val kittens = bar
}

Then kittens would still contain the string “kittens”, as it refers to the definition of bar in the current scope not the outside one. Outside of baz, bar would still refer to the object.

An important aspect of this: You can shadow packages just like anything else!

Suppose we have

package foo{
   object baz;
   package foo{
      object baz;

      object stuff{
         val it = foo.baz;
      }
   }
}

Then “it” points to the innermost baz, not the outermost one: We’ve shadowed the definition of foo.
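
The same lookup rule can be checked with a small runnable sketch. It uses nested objects in place of packages (an assumption made so that the snippet is a single self-contained unit), and the ShadowingSketch wrapper and the label fields are made up purely so there is something to inspect.

```scala
object ShadowingSketch {
  object foo {
    object baz { val label = "outer baz" }

    object foo {
      object baz { val label = "inner baz" }

      object stuff {
        // `foo` is looked up from the inside out, so this finds the inner foo
        // (a member of the outer foo) and therefore the inner baz.
        val it = foo.baz
      }
    }
  }
}
```

Reading ShadowingSketch.foo.foo.stuff.it.label gives back “inner baz”: the outer foo is still there, but nothing inside the inner one can reach it by that name.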

And this is where the problem lies.

Suppose I have

package net.liftweb{
   object AwesomeWebWidget{
      def doStuffWith(url : java.io.File) = ...
   }
}

and someone comes along (remember this doesn’t have to be in the same file – it can even be in a jar) and introduces

package net.java.kittens;

class Kitten;

Now the lift code will no longer work! The problem is that what we have actually looks like this:

package net{
   package java{
     package kittens{
       class Kitten;
     }
   }

   package liftweb{
      object AwesomeWebWidget{
         def doStuffWith(url : java.io.File) = ...
      }
   }
}

The problem is that we have a different java identifier in scope than the one we wanted. The java in java.io.File actually refers to the java identifier that we acquire from the net package, rather than the base java that lives in the root as desired. This is the problem that sparked the latest “discussion” in scala-debate on this subject.

The solutions

One thing which everyone immediately leaps to propose is to change the way imports work in Scala. Hopefully the above should have demonstrated that this wouldn’t help: I have not mentioned the word “import” anywhere in this explanation. So we can safely discard this as a non-solution.

The primary current solution is, unfortunately, a bit of an ugly one. When you want to say “the java at the root and I really damn mean it” you can refer to it as _root_.java.io.File. Adding this prefix to your fully qualified names forces them to resolve from the root package. Some people have taken to putting _root_ on all their imports to prevent this sort of accidental shadowing.

Personally I find this highly unnecessary. My preferred solution is to avoid the reverse domain name convention: not having something common as your top level package greatly reduces the chance of packages being accidentally injected into your scope like this. Since I don’t follow that convention, I rarely run into the negative aspects of this behaviour.
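
Here is a small sketch of the _root_ escape hatch in action. As an assumption for the sake of a single runnable snippet, nested objects stand in for the net, java and liftweb packages, which also means it behaves the same way on current compilers; the RootSketch wrapper, the decoy File class and the two vals are made up for illustration.

```scala
object RootSketch {
  object net {
    object java {
      object io {
        // A decoy that happens to share its name with the JDK class.
        class File
      }
    }

    object liftweb {
      // Plain `java` finds the nearest enclosing definition, which is net.java,
      // so this is the name of the decoy class.
      val shadowed: String = classOf[java.io.File].getName
      // `_root_` restarts the lookup from the root package, so this is the JDK class.
      val rooted: String = classOf[_root_.java.io.File].getName
    }
  }
}
```

RootSketch.net.liftweb.rooted is “java.io.File”, while shadowed is the compiler generated name of the decoy. Nothing in liftweb changed between the two lines except the prefix.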

Other solutions are currently under discussion in scala-debate, so some of this may be prone to change.

This entry was posted in programming.