
Blasting patents for fun and profit

My friend Dave Stark has just released his first commercial game.

It’s themed around a subject which I suspect may be dear to many of my audience’s hearts: specifically, setting fire to, shooting, and generally annihilating patents.

The game’s name? Patent Blaster.

It’s a neat little game. It plays a little bit like a particularly bad acid trip, but in a good way. Go check it out.


Falsifying hypotheses in python

I haven’t written any Scala in ages. Mostly I’m OK with this – there are still a few language features I miss from time to time, but nothing too serious.

One thing I do often miss though is Scalacheck. It’s really the nicest instance of Quickcheck style random testing I’ve seen (this is admittedly not based on a great deal of experience with such libraries other than Quickcheck itself and some exceedingly half-assed ports).

One of its nice features is that as well as randomly generating test data, it also minimizes the examples it finds. So rather than reporting a really complicated example as soon as it finds one that fails, it takes that example and attempts to generate something simpler from it.
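The core idea of minimization can be sketched in a few lines: repeatedly try simpler candidates derived from a failing example, and keep any candidate that still fails. This is only an illustration for integers, not Scalacheck’s (or hypothesis’s) actual algorithm:

```python
def shrink_int(n):
    """Yield progressively simpler candidates derived from n."""
    if n != 0:
        yield 0
    if n < 0:
        yield -n
    if abs(n) > 1:
        yield n // 2
    if n > 0:
        yield n - 1
    elif n < 0:
        yield n + 1

def minimize(value, fails):
    """Greedily replace a failing example with any simpler one that still fails."""
    while True:
        for candidate in shrink_int(value):
            if fails(candidate):
                value = candidate
                break
        else:
            # No simpler candidate still fails: we've hit a local minimum.
            return value

# Suppose random testing found 173 as a counterexample to "x < 100";
# minimization walks it down to the simplest value that still fails.
print(minimize(173, lambda x: x >= 100))  # → 100
```

The same scheme generalizes to lists (drop elements, shrink elements) and other types by swapping in a different shrink function.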

I was thinking about this today and decided to have a play with putting something like this together in python. Rather than build a full testing setup I thought I’d instead just build the basic core algorithm and then figure out how to integrate this with py.test and similar later.

Enter hypothesis, a library for generating and minimizing data and using it to falsify hypotheses.

In [1]: from hypothesis import falsify
 
In [2]: falsify(lambda x, y, z: (x + y) + z == x + (y + z), float, float, float)
Out[2]: (1.0, 2.0507190744664223, -10.940188909437985)
 
In [3]: falsify(lambda x: sum(x) < 100, [int])
Out[3]: ([6, 29, 65],)
 
In [4]: falsify(lambda x: sum(x) < 100, [int, float])
Out[4]: ([18.0, 82],)
 
In [5]: falsify(lambda x: "a" not in x, str)
Out[5]: ('a',)

The current state of this is “working prototype”. It’s reasonably well tested, and the external interface to falsify is probably not going to change all that much, but the internals are currently terrible and liable to change almost entirely. Still, I’m reasonably happy with it as a proof of concept.


Come work with me!

And some other people too I suppose.

Aframe, the company I work for, are currently in the process of hiring. We’ve brought on an extra front-end developer recently (well, he’ll be starting soon) and are now looking for an extra back-end developer.

Read the job spec and then drop us a line. If you want to know more before applying, feel free to ask questions about the job in the comments.


How to do expertise search without NLP

I used to work at a company called Trampoline Systems. At the time one of the key products we were working on was for a problem we called expertise search. The problem description is basically this: I’m a member of a large company, looking to solve a problem. How do I find people to whom I’m not directly connected who might know about this problem?

As a disclaimer: I am no longer affiliated with Trampoline Systems, and Trampoline Systems are no longer trying to solve this problem. I got a brief verbal approval that it was ok to post about this but otherwise this post has nothing to do with Trampoline except that they provide the context for it.

Traditionally this problem is solved with skills databases – people maintain a searchable index of the things they know about. You input search terms and get people back, via fairly normal full-text search.

The problem with this is that people do not, in fact, maintain their entries at all. If you’re lucky they put something in once and never update it. If you’re unlucky they don’t even do that.

Trampoline’s insight was that a huge amount of information of this sort is encoded in corporate email – everyone emails everyone about important stuff, and this encodes a lot of information about what they know. So we developed a system that would take a corporate email system and extract a skills database out of it.

The way it would work is this: it would process every email into a bag of n-gram “terms” which it thought might make good topics, using a custom term extractor that I wrote. It would then do some complicated network analytics on the emails to highlight which terms it thought might correspond to areas of expertise for a person. Finally it would email people once a week saying “I think you know about the following new topics. Do you, and is it ok if I share this information?”
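The actual term extractor isn’t described here, but the “bag of n-gram terms” step can be sketched roughly as follows. This is hypothetical illustration code, not Trampoline’s implementation – the stopword filter in particular is a crude stand-in for whatever heuristics a real extractor would use:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "for", "on"}

def candidate_terms(text, max_n=3):
    """Extract word n-grams (1..max_n words) as candidate topic terms,
    skipping any that start or end on a stopword."""
    words = re.findall(r"[a-z']+", text.lower())
    terms = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            terms[" ".join(gram)] += 1
    return terms

terms = candidate_terms("The civil engineering report on the bridge is due")
print(terms["civil engineering"])  # → 1
```

Aggregating these bags per sender over time is what gives you something to run the network analytics against.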

There were a couple of problems with this. The biggest problem (which this post is not about) was getting companies to not freak out about letting third party software read all their company’s email, even if it was firewalled off from the internet at large. But from a technical point of view the biggest problem was quite simply that the NLP side of things was a bit shit. Additionally it would often miss rare events – if you’d only talked about something once then it wouldn’t ever label you as knowledgeable about that subject, even if no one else in the company had talked about it at all. This meant that anyone searching for that subject would miss the information entirely.

I was thinking about a tangentially related problem earlier and had a brainwave: building a skills database is not actually an interesting or useful problem to solve. The email archive encodes a much richer body of information than we can extract into a summary profile, and we have direct access to it. Additionally, building a profile up front requires us to guess what people will want to find out about, which we don’t need to do because they tell us explicitly by searching for it in our system!

So here is how my replacement idea works. It does end up building a profile of sorts, but this is somewhat secondary.

The idea is this: We provide only a search interface. For the sake of this description it doesn’t support operators like OR or AND, but it could easily be extended to do so. You enter something like “civil engineering” or “kitten farming” into it. It immediately returns a list of users who have put terms related to this into their public profile (more on how that is built later). It also asks “Would you like me to try and find some more people?”. If you say yes, it schedules a background task which does the following:

  1. Searches all email for that term and uses the results to generate a graph of people.
  2. Runs some sort of centrality measure on this graph – it could just be node degree, or it might be an eigenvalue-based measure.
  3. Emails the top 5 people in this graph saying “Blah di blah wants to find people who know about kitten farming”.
  4. If all of them opt out (explained in a moment), emails the next 5 people. Wash, rinse, repeat until it doesn’t seem worth going on (e.g. you’ve covered 80% of the centrality measure and are into the long tail).
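Steps 1–4 might be sketched as follows, using plain node degree as the centrality measure. Everything here is a hypothetical stand-in – the shape of the email records, the field names, and the 80% coverage cutoff are all assumptions for illustration:

```python
from collections import Counter
from itertools import islice

def build_degree_centrality(emails, term):
    """Steps 1-2: search mail for the term and score each person by how
    many edges touch them in the graph of matching correspondence."""
    degree = Counter()
    for mail in emails:
        if term not in mail["body"].lower():
            continue
        participants = [mail["sender"]] + mail["recipients"]
        for person in participants:
            degree[person] += len(participants) - 1  # edges touching this node
    return degree

def candidates_in_batches(degree, opted_out, batch_size=5, coverage=0.8):
    """Steps 3-4: yield batches of the most central people, stopping once
    we've covered most of the total centrality and hit the long tail."""
    total = sum(degree.values())
    covered = 0
    ranked = (p for p, _ in degree.most_common() if p not in opted_out)
    while covered < coverage * total:
        batch = list(islice(ranked, batch_size))
        if not batch:
            return
        covered += sum(degree[p] for p in batch)
        yield batch

emails = [
    {"sender": "ann", "recipients": ["bob", "cal"], "body": "Kitten farming yields"},
    {"sender": "bob", "recipients": ["ann"], "body": "Re: kitten farming yields"},
    {"sender": "dee", "recipients": ["ed"], "body": "Quarterly budget"},
]
degree = build_degree_centrality(emails, "kitten farming")
for batch in candidates_in_batches(degree, opted_out=set()):
    print(batch)  # people to email about this search, most central first
```

An eigenvalue-based measure would just replace `build_degree_centrality`; the batching and opt-out loop stays the same.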

Note: A crucial feature is that it doesn’t tell you who it’s emailing, or indeed if it’s emailing anyone or that there are any emails that match. This is important for privacy.

The email these people receive lets them do one of several things:

  1. They can say “Yes, I know about this”. The system will email the searcher and put the search term in their public profile.
  2. They can opt out with “I don’t know about this, or don’t want to be listed as knowing about this”. Nothing happens except that they won’t get emails about this topic again.
  3. They can opt out with “I don’t know about this but this person does”, in which case the person they nominate gets a similar email.
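The three reply options above amount to a small dispatch over shared state. A minimal sketch (the storage structures and the in-memory outbox are hypothetical, standing in for a real profile store and mailer):

```python
from collections import defaultdict

profiles = defaultdict(set)   # person -> terms on their public profile
opt_outs = defaultdict(set)   # term -> people who never want to hear about it again
outbox = []                   # (recipient, message) pairs a real system would send

def handle_response(person, term, response, nominee=None):
    """Apply one of the three reply options to a weekly expertise email."""
    if response == "yes":
        profiles[person].add(term)  # term becomes searchable on their profile
        outbox.append(("searcher", f"{person} knows about {term}"))
    elif response == "opt out":
        opt_outs[term].add(person)  # excluded from future batches for this term
    elif response == "nominate":
        outbox.append((nominee, f"someone wants to find people who know about {term}"))

handle_response("ann", "kitten farming", "yes")
handle_response("bob", "kitten farming", "opt out")
handle_response("cal", "kitten farming", "nominate", nominee="dee")
print(profiles["ann"], opt_outs["kitten farming"])
```

Note that only the “yes” branch ever reveals anything to the searcher, which is what preserves the opt-in privacy property.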

The result is that the profile is based on what people are actually trying to find out about rather than what we think they might want to find out about. The “no, but this person does” option also allows you to detect expertise that is not encoded in the emails at all, and is particularly useful because high centrality in the email graph is much more likely to indicate that someone knows who the experts are than to be a measure of true expertise.

In summary: I think this replaces the hard NLP problem at the core of how we were trying to do expertise search with a much simpler text search problem, and does it while increasing the likelihood of finding the right people, retaining all the desirable opt-in privacy features and hopefully not significantly increasing the amount of manual maintenance required.


A heuristic for detecting expertise

Earlier tweet:

It seems to have struck a chord, so I thought I’d elaborate slightly.

Firstly, this is of course a heuristic for non-experts to determine expertise in a field with which they are only passingly familiar. If you are an expert then using your own lack of knowledge as a heuristic isn’t very useful, and you probably have much better ones to apply.

The basic idea is this:

It’s usually not obvious what the important features of a problem are. In particular what looks important is often different from what’s actually important, or requires you to address some more fundamental issue as part of dealing with it. Therefore your perception of relevance as a non-expert probably has an awful lot of false negatives.

Someone who has acquired expertise in a subject understands it a lot better, and thus has a much more finely tuned perception of what the relevant problems are. Therefore there are a lot of things they would consider relevant that you would not.

Examples:

  • Why is this developer talking about things like “refactoring” and “testing” which you don’t care about? You just want the software to get new features quickly and not crash!
  • Why is that security person asking all these strange questions about validation? You just want to make it so people can’t hack your site!
  • Why is the tennis coach obsessing about where your feet are? You just want to hit the ball!
  • Why is this cook talking about knife skills? You just want to know why your food isn’t cooking evenly!

I’m not 100% sure how useful the heuristic is. I suspect it may be too permissive in the sense that it’s easy to let babble through. e.g. “Why is this guru talking about chakras? You just want your cancer to go away!”. Perhaps it needs the secondary heuristic that they have to be able to explain why the detail is important?
