Two-column morality

I don’t have a long, convincing essay for this. It’s more a “thought for the day”, but it’s something I’ve believed for a while.

A very tempting question to ask about a course of action is “Does it cause more good than harm?”

I think this may be a wrong question.

It’s nice to be able to look at the world in terms of there being harm and good and one harm point is worth minus one good point. If you score positive, winning! If not, losing!

I think this may be a wrong way of looking at things.

Harm and good are not necessarily paid in the same currency, or by the same people. They don’t neatly balance out. A life saved does not balance a life lost.

A course of action is better (or equivalent) if it causes no more harm and at least as much good. If it causes more harm and more good, or less harm and less good, I think the answer is much muddier.

How do you decide in that case? I don’t know, but I think the answer is inevitably going to be more complicated than a balance sheet.


Sieving out prime factorizations

For a Project Euler problem I had an algorithm whose first step was to find the prime factorization of the number you fed it. This was by far the slowest part of the operation (mostly because the rest of it was pretty heavily cached, so it only had to compute a much smaller number of distinct solutions).

I was running this function on the numbers from 1 to 10^8, and it occurred to me that there must be a simpler way to precompute the prime factorizations of all of these without doing the full factorization for each. After some thought I came up with the following solution. It’s in some sense the most obvious thing you can possibly do (at least given a basic knowledge of some of the algorithms around this), but I’d not seen it before and knowing it exists might save you some time in similar scenarios, so I thought I’d post it here.

It’s essentially a modified Sieve of Eratosthenes. The difference is what you store in the sieve and what you do in the crossing off operation. There’s also an additional step.

The idea is this: We maintain a list of prime factors found so far for each number.

We then loop over the numbers in order from 2 upwards. If, when we reach a number, its list of prime factors is empty, then it’s prime, and we start marking off as follows:

Call the current number p. First we mark off all multiples of p and add p to their lists. Then we mark off all multiples of p^2 and add p again. Then p^3, etc., until p^k > n.

Then we’re done, and move on to the next number.

That’s really all there is to it.

Here’s some python code that implements it:

def prime_factorizations(n):
  # sieve[m] accumulates the prime factors of m, with multiplicity.
  sieve = [[] for x in xrange(0, n)]
  for i in xrange(2, n):
    # An empty list means no smaller prime divides i, so i is prime.
    if not sieve[i]:
      q = i
      # Mark multiples of i, then i^2, then i^3, ..., so each number
      # gets i appended once for every power of i that divides it.
      while q < n:
        for r in xrange(q, n, q):
          sieve[r].append(i)
        q *= i
  return sieve

Here it is in action:

>>> prime_factorizations(15)
[[], [], [2], [3], [2, 2], [5], [2, 3], [7], [2, 2, 2], [3, 3], [2, 5], [11], [2, 2, 3], [13], [2, 7]]

Edit: In the comments, Rich points out that you can make this more space efficient by instead storing a single prime for each integer: specifically, the smallest prime that divides it. You can then reconstruct the desired list by iterated division – if the stored integer is equal to the current index, the index is prime. If not, the stored integer is one of its prime factors: divide by that factor and recurse. This might be a bit slower but is a lot more space efficient. Here’s some more Python demonstrating this:

def prime_factorizations(n):
  # sieve[m] will hold the smallest prime that divides m.
  sieve = [0 for x in xrange(0, n)]
  for i in xrange(2, n):
    # A zero entry means no smaller prime divides i, so i is prime.
    if not sieve[i]:
      sieve[i] = i
      for r in xrange(i, n, i):
        # Keep any smaller prime already recorded for r.
        sieve[r] = sieve[r] or i
  return sieve
 
def factor_list_from_sieve(sieve, n):
  # Rebuild the factorization of n by repeatedly dividing out the
  # smallest prime factor recorded in the sieve.
  results = []
  while n > 1:
    p = sieve[n]
    results.append(p)
    if p == n: break
    n = n / p
  return results

The first function returns the described array of smallest prime factors; the second reconstructs the full prime factor list from it.

Example:

>>> sieve = prime_factorizations(1000)
>>> factor_list_from_sieve(sieve, 992)
[2, 2, 2, 2, 2, 31]

Come work with me!

And some other people too I suppose.

Aframe, the company I work for, are currently in the process of hiring. We’ve brought on an extra front-end developer recently (well, he’ll be starting soon) and are now looking for an extra back-end developer.

Read the job spec and then drop us a line. If you want to know more before applying, feel free to ask questions about the job in the comments.


Randomly exploring github

I made a thing!

If you go to githubexplorer.com you will get a little toy for visiting random Github repos.

It doesn’t just select repos uniformly at random. It picks a user, then picks a random repo they like. As a result, popular and interesting repos should come up more often than uninteresting ones.
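For illustration, here’s a minimal sketch of that sampling scheme. The random_user and liked_repos functions are hypothetical stand-ins for whatever the site actually calls against the Github API:

import random

def random_repo(random_user, liked_repos):
  # random_user() returns an arbitrary user; liked_repos(user) returns
  # the repos they like. A repo liked by many users can be reached via
  # any of them, so popular repos come up proportionally more often.
  while True:
    user = random_user()
    repos = liked_repos(user)
    if repos:
      return random.choice(repos)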

Right now it’s very basic. I have some additional plans to use it as a test bed for some UI ideas I have for randomly exploring things.

(Also, right now it’s a little crashy. I’m having a problem with the way it’s deployed which I’ve yet to get to the bottom of. Just refresh the page and it should be fine.)


How to do expertise search without NLP

I used to work at a company called Trampoline Systems. At the time one of the key products we were working on was for a problem we called expertise search. The problem description is basically this: I’m a member of a large company, looking to solve a problem. How do I find people to whom I’m not directly connected who might know about this problem?

As a disclaimer: I am no longer affiliated with Trampoline Systems, and Trampoline Systems are no longer trying to solve this problem. I got a brief verbal approval that it was ok to post about this but otherwise this post has nothing to do with Trampoline except that they provide the context for it.

Traditionally this problem is solved with skills databases – people maintain a searchable index of stuff they know about. You put in search terms and get people out, in a fairly normal full-text search.

The problem with this is that in practice people don’t maintain their entries at all. If you’re lucky they fill them in once and never update them. If you’re unlucky they don’t even do that.

Trampoline’s insight was that a huge amount of information of this sort is encoded in corporate email – everyone emails everyone about important stuff, and this encodes a lot of information about what they know. So we developed a system that would take a corporate email system and extract a skills database out of it.

The way it worked was this: it would process every email into a bag of n-gram “terms” that it thought might make good topics, using a custom term extractor that I wrote, then do some complicated network analytics on the emails to highlight which terms it thought might correspond to areas of expertise for a person. It would then email people once a week saying “I think you know about the following new topics. Do you, and is it ok if I share this information?”
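As a toy illustration of that first stage (the real term extractor was custom and considerably smarter than this), turning text into a bag of n-gram terms looks roughly like:

def ngram_terms(text, max_n=3):
  # Emit every run of 1 to max_n consecutive words as a candidate
  # "term". A real extractor would filter these heavily; this only
  # shows the shape of the output.
  words = text.lower().split()
  terms = set()
  for n in xrange(1, max_n + 1):
    for i in xrange(len(words) - n + 1):
      terms.add(" ".join(words[i:i + n]))
  return terms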

There were a couple of problems with this. The biggest problem (which this post is not about) was getting companies to not freak out about letting third party software read all their company’s email, even if it was firewalled off from the internet at large. But from a technical point of view the biggest problem was quite simply that the NLP side of things was a bit shit. Additionally it would often miss rare events – if you’d only talked about something once then it would never label you as knowledgeable about that subject, even if no one else in the company had talked about it at all. This meant that anyone searching for that topic would miss the information.

I was thinking about a tangentially related problem earlier and had a brainwave: the core problem of building a skills database is not actually an interesting or useful one to solve. The email database encodes a much richer body of information than we can extract into a summary profile, and we have direct access to it. Additionally, building profiles requires us to guess what people are interested in finding out about, which we don’t need to do because they’re explicitly searching for it in our system!

So here is how my replacement idea works. It does end up building a profile of sorts, but this is somewhat secondary.

The idea is this: We provide only a search interface. For the sake of this description it doesn’t support operators like OR or AND, but it could easily be extended to do so. You enter something like “civil engineering” or “kitten farming” into it. It immediately returns a list of users who have put terms related to this into their public profiles (more on how those are built later). It also asks “Would you like me to try and find some more people?”. If you say yes, it schedules a background task which does the following:

  1. Searches all email for that term and uses this to generate a graph of people.
  2. Runs some sort of centrality measure on this graph – it could just be node degree, or it might be an eigenvalue-based measure (see the sketch after this list).
  3. Emails the top 5 people in this graph saying “Blah di blah wants to find people who know about kitten farming”.
  4. If all of them opt out (explained in a moment), emails the next 5 people. Wash, rinse, repeat until it doesn’t seem worth going on (e.g. you’ve covered 80% of the centrality measure and are into the long tail).
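For concreteness, here’s a minimal sketch of steps 1 and 2 using networkx. The emails_matching function is a hypothetical helper that yields (sender, recipients) pairs for emails mentioning the term:

import networkx as nx

def rank_people_for_term(term, emails_matching):
  # Step 1: build a graph of people from the emails that match the term.
  graph = nx.Graph()
  for sender, recipients in emails_matching(term):
    for recipient in recipients:
      graph.add_edge(sender, recipient)
  # Step 2: rank by a centrality measure. Degree centrality is the
  # simplest choice; nx.eigenvector_centrality(graph) would slot in
  # the same way.
  centrality = nx.degree_centrality(graph)
  return sorted(centrality, key=centrality.get, reverse=True)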

Note: A crucial feature is that it doesn’t tell you who it’s emailing, or indeed whether it’s emailing anyone at all, or whether any emails match. This is important for privacy.

The email these people receive lets them do one of several things:

  1. They can say “Yes, I know about this”. It will email the searcher about it and put the search term in their public profile.
  2. They can opt out with “I don’t know about this or don’t want to be listed as knowing about this”. Nothing will happen except that they won’t get emails about this topic again.
  3. They can opt out with “I don’t know about this but this person does”, in which case the person they nominate will get a similar email.

The result is that the profile is based on what people are actually trying to find out about rather than what we think they might want to find out about. The “no but this person does” option also lets you surface information that is not encoded in the emails, and is particularly useful because high centrality in the email graph is much more likely to indicate that someone knows who the experts are than to be a measure of true expertise.

In summary: I think this replaces the hard NLP problem at the core of how we were trying to do expertise search with a much simpler text search problem, and does it while increasing the likelihood of finding the right people, retaining all the desirable opt-in privacy features and hopefully not significantly increasing the amount of manual maintenance required.
