David R. MacIver's Blog: How to do expertise search without NLP

How to do expertise search without NLP

20 June 2012

I used to work at a company called Trampoline Systems. At the time one of the key products we were working on was for a problem we called expertise search. The problem description is basically this: I’m a member of a large company, looking to solve a problem. How do I find people to whom I’m not directly connected who might know about this problem?

As a disclaimer: I am no longer affiliated with Trampoline Systems, and Trampoline Systems are no longer trying to solve this problem. I got a brief verbal approval that it was ok to post about this but otherwise this post has nothing to do with Trampoline except that they provide the context for it.

Traditionally this problem is solved with skills databases - people maintain a searchable index of stuff they know about. You input search terms and get people out in a fairly normal full text search.

The problem with this is that people do not in fact maintain their entries at all. If you’re lucky they fill one in once and never update it. If you’re unlucky they don’t even do that.

Trampoline’s insight was that a huge amount of information of this sort is encoded in corporate email - everyone emails everyone about important stuff, and this encodes a lot of information about what they know. So we developed a system that would take a corporate email system and extract a skills database out of it.

The way it would work is that it would process every email into a bag of n-gram “terms” that it thought might make good topics, using a custom term extractor that I wrote. It would then do some complicated network analytics on the emails to highlight which terms might correspond to areas of expertise for a person. Finally, it would email people once a week saying “I think you know about the following new topics. Do you, and is it ok if I share this information?”
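To make “bag of n-gram terms” concrete, here is a minimal sketch of only the naive part of that step. It is not the custom extractor described above, which also tried to judge which terms made good topics; the function name `ngram_terms` and the `max_n` parameter are made up for illustration.

```python
import re
from collections import Counter


def ngram_terms(text, max_n=3):
    """Return a bag (multiset) of word n-grams of length 1..max_n.

    Hypothetical helper: it does no filtering for "good topics" and
    happily builds n-grams that span sentence boundaries.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    bag = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            bag[" ".join(words[i:i + n])] += 1
    return bag


if __name__ == "__main__":
    bag = ngram_terms("Kitten farming is fun. Kitten farming pays.")
    print(bag.most_common(5))
```

Even this toy version shows why the filtering is the hard bit: most of the n-grams it produces are noise, and deciding which ones are topics is exactly the NLP problem.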

There were a couple of problems with this. The biggest problem (which this post is not about) was getting companies to not freak out about letting third party software read all their company’s email, even if it was firewalled off from the internet at large. But from a technical point of view the biggest problem was quite simply that the NLP side of things was a bit shit. Additionally it would often miss rare events - if you’d only talked about something once then it wouldn’t ever label you as knowledgeable about that subject, even if no one else in the company had talked about it at all. This means that anyone searching for that subject would never find you.

I was thinking about a tangentially related problem earlier and had a brain wave: The core problem of building a skills database is not one that it is actually interesting or useful to solve. The email database encodes a much richer body of information than we can extract into a summary profile, and we have direct access to this. Additionally, building a profile in advance requires us to guess what people are interested in finding out about, which we don’t need to do because they’re explicitly searching for it in our system!

So here is how my replacement idea works. It does end up building a profile of sorts, but this is somewhat secondary.

The idea is this: We provide only a search interface. For the sake of this description it doesn’t support operators like OR or AND, but it could easily be extended to do so. You enter something like “civil engineering” or “kitten farming” into it. It returns a list of users immediately who have put terms related to this into their public profile (more on how that is built later). It also says “Would you like me to try and find some more people?”. If you say yes, it schedules a background task which does the following:

Searches all email for that term and uses this to generate a graph of people
Runs some sort of centrality measure on this graph - it could just be node degree, it might be an eigenvalue based measure.
Emails the top 5 people on this graph saying “Blah di blah wants to find people who know about kitten farming”.
If all of them opt out (explained in a moment), emails the next 5 people. Wash, rinse, repeat until it doesn’t seem worth going on (e.g. you’ve covered 80% of the centrality measure and are into the long tail).

Note: A crucial feature is that it doesn’t tell you who it’s emailing, or indeed if it’s emailing anyone or that there are any emails that match. This is important for privacy.
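The background task described above could be sketched roughly as follows. This is a sketch under several assumptions, not a real implementation: emails are plain dicts with hypothetical `sender`, `to` and `body` fields, a case-insensitive substring match stands in for a proper full text search, and the centrality measure is plain node degree (the simplest of the options mentioned). The function names are invented for this example.

```python
from collections import defaultdict


def build_graph(emails, term):
    """Link sender and recipients of every email that mentions `term`."""
    graph = defaultdict(set)
    needle = term.lower()
    for email in emails:
        if needle not in email["body"].lower():
            continue
        for recipient in email["to"]:
            if recipient != email["sender"]:
                graph[email["sender"]].add(recipient)
                graph[recipient].add(email["sender"])
    return graph


def ranked_by_degree(graph):
    """Rank people by number of distinct correspondents (ties by name)."""
    return sorted(graph, key=lambda person: (-len(graph[person]), person))


def contact_batches(graph, batch_size=5, coverage=0.8):
    """Yield successive batches of people to email.

    Stops once the people already yielded account for `coverage` of the
    total centrality, i.e. once we are into the long tail.
    """
    ranked = ranked_by_degree(graph)
    total = sum(len(graph[person]) for person in ranked)
    covered = 0
    for start in range(0, len(ranked), batch_size):
        if total == 0 or covered / total >= coverage:
            return
        batch = ranked[start:start + batch_size]
        covered += sum(len(graph[person]) for person in batch)
        yield batch


if __name__ == "__main__":
    emails = [
        {"sender": "alice", "to": ["bob", "carol"], "body": "Kitten farming update"},
        {"sender": "alice", "to": ["dave"], "body": "More on kitten farming"},
        {"sender": "erin", "to": ["alice"], "body": "A kitten farming question"},
        {"sender": "bob", "to": ["carol"], "body": "Lunch?"},
    ]
    graph = build_graph(emails, "kitten farming")
    for batch in contact_batches(graph, batch_size=2):
        print(batch)
```

Because `contact_batches` is a generator, the caller decides when to move on: it would only ask for the next batch once everyone in the current one has opted out.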

The email these people receive lets them do one of several things:

They can say “Yes, I know about this”. The system will then email the searcher about it and add the search term to the responder’s public profile.
They can opt out with “I don’t know about this or don’t want to be listed as knowing about this”. Nothing will happen except that they won’t get emails about this topic again.
They can opt out with “I don’t know about this but this person does”, in which case the person they nominate will get a similar email.
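The three responses above, plus the immediate profile lookup on search, could be modelled along these lines. Everything here is hypothetical: the class `ExpertiseIndex`, its method names and the answer strings are invented for the sketch, and an in-memory `outbox` list stands in for actually sending email. Treating a nomination as also opting the nominator out of the topic is my assumption, based on it being described as an opt-out.

```python
from collections import defaultdict


class ExpertiseIndex:
    """Toy in-memory model of the public profile and the email responses."""

    def __init__(self):
        self.profiles = defaultdict(set)   # term -> people who said yes
        self.opted_out = defaultdict(set)  # term -> people not to ask again
        self.outbox = []                   # (recipient, message) pairs

    def search(self, term):
        """Immediate result: only people who have confirmed this term."""
        return sorted(self.profiles.get(term, ()))

    def ask(self, person, term, searcher):
        """Queue a question, unless this person has already answered."""
        if person in self.opted_out[term] or person in self.profiles[term]:
            return
        self.outbox.append(
            (person, f"{searcher} wants to find people who know about {term}")
        )

    def respond(self, person, term, searcher, answer, nominee=None):
        if answer == "yes":
            self.profiles[term].add(person)
            self.outbox.append((searcher, f"{person} knows about {term}"))
        elif answer == "no":
            self.opted_out[term].add(person)
        elif answer == "nominate":
            # Assumption: nominating someone else also opts you out.
            self.opted_out[term].add(person)
            self.ask(nominee, term, searcher)
        else:
            raise ValueError(f"unknown answer: {answer!r}")


if __name__ == "__main__":
    index = ExpertiseIndex()
    index.ask("bob", "kitten farming", searcher="alice")
    index.respond("bob", "kitten farming", "alice", "nominate", nominee="carol")
    index.respond("carol", "kitten farming", "alice", "yes")
    print(index.search("kitten farming"))
```

Note that the searcher only ever hears about people who said yes, which is what keeps the whole thing opt-in.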

The result is that the profile is based on what people are actually trying to find out about rather than what we think they might want to find out about. The “no but this person does” also allows you to detect information that is not encoded in the emails, and is particularly useful because a high centrality in the email chain is much more likely to be an indicator that they know who the experts are than it is a measure of true expertise.

In summary: I think this replaces the hard NLP problem at the core of how we were trying to do expertise search with a much simpler text search problem, and does it while increasing the likelihood of finding the right people, retaining all the desirable opt-in privacy features and hopefully not significantly increasing the amount of manual maintenance required.

Comments

Ramesh Nethi on 2012-06-20 13:05:48:

Interesting post. I remember playing with Google’s product Aardvark [1] a few years back, though it is now discontinued. It used to work very similarly to the way you articulated. Instead of email, it used IM as the mechanism to query, confirm and add that topic as your skill area.

[1] http://en.wikipedia.org/wiki/Aardvark_(search_engine)

david on 2012-06-20 14:06:52:

Yeah, there are definitely similarities to Aardvark. I’d say there are some key differences too though.

The main difference is that this has a *much* larger corpus to draw upon than Aardvark does - it’s not just basing this on self-reported expertise and the questions you’ve answered in the past, it’s also able to draw on all the other emails you’ve sent. This gives it a much richer data set to play with.

The second is that because the user is explicitly in searching mode rather than question asking mode (they’re looking for people to talk to about this more than they are looking for the answer to a specific question) their input is different. They’re not asking “Why does my bridge keep falling down?”; they’re looking for people who know about civil engineering. This should be an easier problem to solve because you don’t have to do any classification on the questions.

In summary: My experience with Aardvark was that it was never very good at matching questions to people, and I think that was partly because it had a hard classification problem at its core and partly because of quite a paucity of data. I think this manages to sidestep both those problems.