This is a bit of a “Dear lazyweb” post.
I frequently find myself having the following setup, and I’ve never really come up with a solution that I find terribly satisfactory or reusable:
I have a number of events and a number of labels, and a set of tags which associate labels with events that can be added or removed. Each event-label pair is unique – you can’t tag the same thing with the same label multiple times.
I want to support the following queries:
- Given a label, how many events has it occurred in?
- Given two labels, how often did they co-occur?
- Given a label, give me all labels which have co-occurred with it in at least N events, and the number of events they have co-occurred in
- Some sort of “dump” operation which will give me all tuples “label1, label2, label1_occurrences, label2_occurrences, cooccurrences” in a file or through some sort of streaming interface
I need to support quite large numbers of tags and events (the tags are usually words, and so eventually cover an appreciable proportion of the English language; the events are usually documents, and could easily number in the thousands). I’m OK with a time lag of even a few minutes between data coming in and becoming visible in the co-occurrences “table”.
I’ve previously done this in an SQL database, because there was usually one to hand. It works “ok”. It doesn’t perform brilliantly, and you end up with annoying trade-offs between performance, simplicity and consistency (that’s always going to be the case to a certain extent, but I felt that SQL was a particularly unsuitable medium here). Well, I’ve got to do it again, and this time there’s no database inherently built into the app already, so I’m looking further afield. I’m half-heartedly looking into using redis, but I don’t really expect it will actually be that much help.
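For reference, the SQL version is roughly this shape (a minimal sqlite3 sketch; the table, column and label names are just illustrative, not from any real schema):

```python
import sqlite3

# One tags table with a uniqueness constraint on (event, label), and the
# co-occurrence count done as a self-join on the event column.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE tags (
        event TEXT NOT NULL,
        label TEXT NOT NULL,
        UNIQUE (event, label)
    )
""")

for event, label in [
    ("doc1", "cat"), ("doc1", "dog"),
    ("doc2", "cat"), ("doc2", "dog"),
    ("doc3", "cat"),
]:
    db.execute("INSERT INTO tags (event, label) VALUES (?, ?)", (event, label))

# Query 1: how many events has a label occurred in?
(cat_count,) = db.execute(
    "SELECT COUNT(*) FROM tags WHERE label = ?", ("cat",)
).fetchone()

# Query 2: how often did two labels co-occur? (self-join on event)
(coocc,) = db.execute(
    """SELECT COUNT(*) FROM tags a JOIN tags b USING (event)
       WHERE a.label = ? AND b.label = ?""", ("cat", "dog")
).fetchone()

print(cat_count, coocc)  # 3 2
```

The self-join is where it starts to hurt at scale, which is most of why I’m shopping around.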
Any suggestions for alternatives?
Well, redis would work for keeping track of your statistics. Not sure if it’s the fastest or the best, but it’d be pretty damn easy to implement. I’d probably still keep the raw data in a SQL db, but the summaries would be size O(n_l^2) (n_l = number of labels), which is probably quite doable.
“label1,label2” = [label1_occ, label2_occ, coocc]
You’d probably want to store both “label1,label2” and “label2,label1”. Then just update the counts when records are added, edited or deleted. The benefit is you have instantaneous access to the stats; the downside is that inserts/updates/deletes take a little longer. Adding new labels shouldn’t be too bad either if you assume that a missing “l1,l2” pair means zero counts. Query 3 will always be worst-case O(number of labels it has co-occurred with), which is probably pretty good.
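Sketching the bookkeeping (with plain dicts standing in for redis — with real redis you’d do the same increments via INCR/HINCRBY on the same keys; all the names here are made up):

```python
from collections import defaultdict
from itertools import combinations

# Stand-ins for redis counters: per-label occurrence counts, plus
# co-occurrence counts keyed "label1,label2" with both orderings stored
# so either label can be looked up first.
occ = defaultdict(int)     # label -> number of events it appears in
coocc = defaultdict(int)   # "l1,l2" -> number of events containing both

def add_event(labels):
    """Record one new event with its (unique) set of labels."""
    labels = set(labels)
    for label in labels:
        occ[label] += 1
    for a, b in combinations(labels, 2):
        coocc[f"{a},{b}"] += 1
        coocc[f"{b},{a}"] += 1

add_event(["cat", "dog"])
add_event(["cat", "dog", "fish"])
add_event(["cat"])

print(occ["cat"], coocc["cat,dog"])  # 3 2
```

Deletes and edits would be the same increments with the sign flipped.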
All that said, I haven’t used redis a lot except as a backend to resque, which I have used a lot. So there may be some concurrency gotchas to consider that I’m not sure how redis would handle.
What about storing two sparse matrices? I don’t know of a good sparse matrix lib, but surely there is one.
Yes, the SQL approach you describe is more or less what I’ve done before. :-)
Fortunately it’s not actually the case that the data size is O(n_l^2) – it’s typically much smaller. n_l is large enough that its square starts to be a bit upsetting, but the actual number is more manageable. Still, the solution feels like a hack – it ends up being quite cumbersome to write, surprisingly difficult to reuse, and it didn’t perform amazingly, so I was shopping around for something better.
The sparse matrix approach might be a good one if there’s a good storage library for them. Or I suppose I could just do it in memory – the dataset isn’t *that* large.
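The in-memory version would probably just be a dict of counters – one sparse row per label (a quick sketch, nothing library-specific; all names are made up):

```python
from collections import Counter, defaultdict

# In-memory "sparse matrix": one Counter per label, holding co-occurrence
# counts with every label it has actually appeared alongside. Rows and
# cells for pairs that never co-occur simply don't exist.
matrix = defaultdict(Counter)

def tag_event(labels):
    labels = set(labels)
    for a in labels:
        for b in labels - {a}:
            matrix[a][b] += 1

def cooccurring_at_least(label, n):
    # Query 3: all labels co-occurring with `label` in at least n events.
    return {other: c for other, c in matrix[label].items() if c >= n}

tag_event(["cat", "dog"])
tag_event(["cat", "dog", "fish"])
tag_event(["cat", "fish"])

print(cooccurring_at_least("cat", 2))  # both 'dog' and 'fish' at 2
```

The dump operation is then just a walk over the rows.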
This is a problem which I think a search engine is fairly well placed to solve: at least, I know it could be done pretty efficiently with Xapian, and I imagine Lucene could handle it too.
Each event would be a document, and the labels would be terms.
1) is simply asking “how many documents is this term in?”, and the answer is one of the basic statistics stored (in Xapian: Database::get_termfreq(term)).
2) is simply asking “how many documents match A AND B?”, which can be computed quickly by running that query.
3) is the problem which is solved to implement faceted navigation: the query is the single term representing your label.
On the other hand, 4) might not work so well – while each row could be calculated quite efficiently, there will be a lot of rows, and there’s probably something more efficient which could be done if you do the calculation in bulk, rather than row-by-row.
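To make the mapping concrete, here’s the same idea as a toy pure-Python inverted index (a real engine like Xapian or Lucene keeps this on disk with proper posting-list structures; the data and names here are made up):

```python
from collections import defaultdict

# Toy inverted index: term -> set of document ids. Term frequency is the
# size of a posting list, and "A AND B" is the intersection of two lists.
index = defaultdict(set)

docs = {
    "doc1": ["cat", "dog"],
    "doc2": ["cat", "dog", "fish"],
    "doc3": ["cat"],
}
for doc_id, terms in docs.items():
    for term in terms:
        index[term].add(doc_id)

termfreq = len(index["cat"])                  # query 1: term frequency
and_count = len(index["cat"] & index["dog"])  # query 2: A AND B

# Query 3: facet-style counting over the documents matching a
# single-term query.
facets = defaultdict(int)
for doc_id in index["cat"]:
    for term in docs[doc_id]:
        if term != "cat":
            facets[term] += 1

print(termfreq, and_count, dict(facets))
```

Filtering the facet counts to those ≥ N gives exactly the co-occurring-labels query.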
I would think Solr would be able to do something like this fairly trivially. Events would be documents; tags could be fields (dynamic ones), or you could likely just have a single text field containing the unique tag words. Faceted dismax queries would likely be your best bet, performance-wise.
The returned facet counts should give you the numbers you’re looking for.
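Roughly, the request would use Solr’s standard facet parameters – something like this (the core name and field name are made up):

```
# all labels co-occurring with "cat" in at least 5 events
http://localhost:8983/solr/events/select?q=tags:cat&rows=0
    &facet=true&facet.field=tags&facet.mincount=5
```

rows=0 because you only want the counts, not the documents themselves.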