This is a bit of a “Dear lazyweb” post.
I frequently find myself having the following setup, and I’ve never really come up with a solution that I find terribly satisfactory or reusable:
I have a number of events and a number of labels, plus a set of tags, each of which associates a label with an event. Tags can be added or removed. Each event-label pair is unique – you can’t tag the same event with the same label multiple times.
I want to support the following queries:
- Given a label, how many events has it occurred in?
- Given two labels, how often did they co-occur?
- Given a label, give me all labels which have co-occurred with it in at least N events, and the number of events they have co-occurred in
- Some sort of “dump” operation which will give me all tuples “label1, label2, label1_occurrences, label2_occurrences, cooccurrences” in a file or through some sort of streaming interface
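To pin down what those queries mean, here is a minimal in-memory sketch in Python. It is only an illustration of the requirements, not a solution I'm happy with: the class and method names (`CooccurrenceIndex`, `add_tag`, `cooccurring` and so on) are made up for this example, it assumes everything fits in one process's memory, and it assumes labels are strings (or at least comparable with `<`).

```python
from collections import defaultdict


class CooccurrenceIndex:
    """Hypothetical in-memory index; counts are updated on every tag change."""

    def __init__(self):
        # event -> set of labels currently tagged on it
        self.labels_by_event = defaultdict(set)
        # label -> number of events it occurs in
        self.occurrences = defaultdict(int)
        # label -> other label -> number of events containing both
        self.cooccurrences = defaultdict(lambda: defaultdict(int))

    def add_tag(self, event, label):
        """Tag an event with a label; returns False if the pair already exists."""
        labels = self.labels_by_event[event]
        if label in labels:
            return False  # event-label pairs are unique
        for other in labels:
            self.cooccurrences[label][other] += 1
            self.cooccurrences[other][label] += 1
        labels.add(label)
        self.occurrences[label] += 1
        return True

    def remove_tag(self, event, label):
        """Remove a tag; returns False if the pair was not present."""
        labels = self.labels_by_event.get(event)
        if not labels or label not in labels:
            return False
        labels.remove(label)
        self.occurrences[label] -= 1
        for other in labels:
            self.cooccurrences[label][other] -= 1
            self.cooccurrences[other][label] -= 1
        return True

    def count(self, label):
        """Given a label, how many events has it occurred in?"""
        return self.occurrences.get(label, 0)

    def cooccurrence(self, a, b):
        """Given two labels, how often did they co-occur?"""
        return self.cooccurrences.get(a, {}).get(b, 0)

    def cooccurring(self, label, min_events):
        """Labels that co-occur with `label` in at least `min_events` events."""
        others = self.cooccurrences.get(label, {})
        return {other: n for other, n in others.items() if n >= min_events}

    def dump(self):
        """Stream (label1, label2, label1_occurrences, label2_occurrences, cooccurrences)."""
        for a, others in self.cooccurrences.items():
            for b, n in others.items():
                if a < b and n > 0:  # emit each unordered pair once
                    yield (a, b, self.occurrences[a], self.occurrences[b], n)
```

Usage would look something like `idx = CooccurrenceIndex()`, then `idx.add_tag("doc1", "cat")`, then `idx.cooccurring("cat", 2)`. Adding or removing a tag costs time proportional to the number of labels already on that event, and the pair counts take memory proportional to the number of distinct co-occurring pairs, which is exactly the part that gets uncomfortable at scale.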
I need to support quite large numbers of labels and events (the labels are usually words, and so eventually end up covering an appreciable proportion of the English language; the events are usually documents, but could easily number in the thousands). I’m OK with a time lag of even a few minutes between data coming in and being visible in the co-occurrences “table”.
I’ve previously done this in an SQL database, because there was usually one to hand. It works “ok”. It doesn’t perform brilliantly, and you end up with annoying trade-offs between performance, simplicity and consistency (that’s always going to be the case to a certain extent, but I felt that SQL was a particularly unsuitable medium here). Well, I’ve got to do it again, and this time there isn’t a database already built into the app, so I’m looking further afield. I’m half-heartedly looking into using Redis, but I don’t really expect it to be that much help.
Any suggestions for alternatives?