So, I have the following problem:
I have a bunch of terms and a bunch of documents. I want to maintain a database of which terms appear in which documents, and ideally a count of how many times those terms appear in the document. “appear” can mean a literal byte substring – no need for any sort of fuzzy matching. I want the following operations to be fast:
- Adding a term
- Adding a document
- Getting a list of all (Term, Document, Count) triples
- Getting the count for a specific (Term, Document) pair
- Fast removal of documents and/or terms
- Retrieval of all documents containing a given term or all terms in a given document
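Just as a baseline for comparison, the operations above can be sketched with a pair of nested hash maps kept in sync: a forward index (document → term counts) and an inverted index (term → document counts). This is a minimal, unoptimised illustration of the bookkeeping, not a production design; naive non-overlapping substring counting stands in for a real multi-pattern matcher, and all the names are made up for the sketch:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: a forward index (doc -> term counts) plus an inverted
// index (term -> doc counts), kept in sync on every mutation. Naive
// substring scanning stands in for a real multi-pattern matcher.
class TermDocIndex {
    private final Map<String, String> docs = new HashMap<>();                 // docId -> raw text
    private final Map<String, Map<String, Integer>> byDoc = new HashMap<>();  // docId -> (term -> count)
    private final Map<String, Map<String, Integer>> byTerm = new HashMap<>(); // term -> (docId -> count)

    // Count non-overlapping occurrences of term in text.
    private static int count(String text, String term) {
        int n = 0;
        for (int i = text.indexOf(term); i >= 0; i = text.indexOf(term, i + term.length())) n++;
        return n;
    }

    public void addTerm(String term) {
        if (byTerm.containsKey(term)) return;
        byTerm.put(term, new HashMap<>());
        for (Map.Entry<String, String> e : docs.entrySet()) {  // re-scan existing docs
            int c = count(e.getValue(), term);
            if (c > 0) record(term, e.getKey(), c);
        }
    }

    public void addDocument(String docId, String text) {
        docs.put(docId, text);
        byDoc.put(docId, new HashMap<>());
        for (String term : byTerm.keySet()) {  // scan the new doc for all known terms
            int c = count(text, term);
            if (c > 0) record(term, docId, c);
        }
    }

    private void record(String term, String docId, int c) {
        byTerm.get(term).put(docId, c);
        byDoc.get(docId).put(term, c);
    }

    public int getCount(String term, String docId) {
        return byTerm.getOrDefault(term, Map.of()).getOrDefault(docId, 0);
    }

    public void removeDocument(String docId) {
        docs.remove(docId);
        Map<String, Integer> terms = byDoc.remove(docId);
        if (terms != null) for (String t : terms.keySet()) byTerm.get(t).remove(docId);
    }

    public void removeTerm(String term) {
        Map<String, Integer> ds = byTerm.remove(term);
        if (ds != null) for (String d : ds.keySet()) byDoc.get(d).remove(term);
    }

    public Map<String, Integer> docsContaining(String term) { return byTerm.getOrDefault(term, Map.of()); }
    public Map<String, Integer> termsIn(String docId) { return byDoc.getOrDefault(docId, Map.of()); }
}
```

All the listed operations are cheap here except `addTerm`/`addDocument`, which pay a scan; that scan is exactly the part a proper automaton would speed up.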
It’s clearly relatively easy to construct something like this using a combination of an Aho-Corasick-style search automaton (I’m not totally confident one can be built incrementally, but worst comes to worst you can queue up a bunch of “patches” and amortize the rebuild cost at retrieval time), a dbm-style database for maintaining the counts, and an inverted index on the text, but I’m wondering if there’s some appropriate structure more optimised for this. Anyone know of anything?
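For reference on the matching side, a basic batch-built (non-incremental) Aho-Corasick automaton is short enough to sketch. This is a bare-bones illustration that returns per-pattern occurrence counts in a single pass over the text, with none of the incremental patching or amortisation discussed above:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Bare-bones Aho-Corasick: build a trie over the patterns, add failure
// links by BFS, then scan the text once, following failure links on
// mismatch and collecting matches via "dictionary" (output) links.
class AhoCorasick {
    private static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;   // longest proper suffix of this path that is also a trie path
        String word; // pattern ending exactly at this node, if any
        Node dict;   // nearest node along the fail chain that ends a pattern
    }

    private final Node root = new Node();

    public void add(String pattern) {
        Node n = root;
        for (char c : pattern.toCharArray())
            n = n.next.computeIfAbsent(c, k -> new Node());
        n.word = pattern;
    }

    // Must be called after all add() calls and before countAll().
    public void build() {
        Queue<Node> q = new ArrayDeque<>();
        for (Node child : root.next.values()) { child.fail = root; q.add(child); }
        while (!q.isEmpty()) {
            Node n = q.poll();
            n.dict = (n.fail.word != null) ? n.fail : n.fail.dict;
            for (Map.Entry<Character, Node> e : n.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = n.fail;
                while (f != null && !f.next.containsKey(c)) f = f.fail;
                child.fail = (f == null) ? root : f.next.get(c);
                q.add(child);
            }
        }
    }

    // One pass over the text; returns pattern -> occurrence count.
    public Map<String, Integer> countAll(String text) {
        Map<String, Integer> counts = new HashMap<>();
        Node n = root;
        for (char c : text.toCharArray()) {
            while (n != root && !n.next.containsKey(c)) n = n.fail;
            n = n.next.getOrDefault(c, root);
            for (Node m = (n.word != null) ? n : n.dict; m != null; m = m.dict)
                counts.merge(m.word, 1, Integer::sum);
        }
        return counts;
    }
}
```

On the classic example (patterns "he", "she", "his", "hers" against "ushers") this finds "she", "he" and "hers" once each. Adding a pattern after `build()` would require rebuilding the links, which is precisely where a patch-and-amortize scheme would slot in.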
Does Apache Lucene fit your requirements?
It might do. I honestly don’t know! It could certainly be used for the inverted index of documents. But I don’t know whether it offers anything that would, for example, let me quickly add a document to the index in a way that also checks whether it contains any of the existing keywords. I’ve used it a little in the past and haven’t seen anything to that effect that would be less work than, e.g., using an Aho-Corasick implementation.
Lucene fits the bill.
It’s fast to add documents. It doesn’t allow you to add terms on their own, but I’m not sure what adding a term without a document would mean. You can get the (term, doc, count) triples through a TermDocs iterator, which exposes the documents and frequency counts for each term.
What’s nice about its implementation is that it compresses the inverted index reasonably well and lets you run either in memory or over memory-mapped disk (on top of its Directory base class), so it’s easy to move between efficient and scalable. (Though on-disk is surprisingly efficient anyway, because much of the indexing happens in memory.)
Removal is fast, but optimizing the index after a bunch of adds and deletes, so that it’s at its most space-efficient and fastest to search, is much slower: it basically applies a merge over the whole on-disk index.
It’s also fielded, but it sounds like you don’t need this, so you could just use one field for all terms.
It obviously also does full document search with a pretty flexible query language, and it’s fast. It also lets you easily retrieve the term vectors for documents.
Lucene also allows for easily pluggable normalizers/tokenizers by writing an Analyzer implementation or using the existing ones.
Perhaps you could try another of the search libraries, such as Lemur (http://www.lemurproject.org/). AFAIK its inverted index also keeps track of the term count within each document as well as positions, and so does Lucene, methinks. Lemur is written in C++ with interfaces for Java and some other languages, and it’s rather fast.