<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Dear lazyweb: A problem on cached string searching</title>
	<atom:link href="http://www.drmaciver.com/2009/10/dear-lazyweb-a-problem-on-cached-string-searching/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.drmaciver.com/2009/10/dear-lazyweb-a-problem-on-cached-string-searching/</link>
	<description></description>
	<lastBuildDate>Mon, 06 Feb 2012 22:56:11 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Bob Carpenter</title>
		<link>http://www.drmaciver.com/2009/10/dear-lazyweb-a-problem-on-cached-string-searching/#comment-1071</link>
		<dc:creator>Bob Carpenter</dc:creator>
		<pubDate>Wed, 10 Feb 2010 18:19:37 +0000</pubDate>
		<guid isPermaLink="false">http://www.drmaciver.com/?p=3277#comment-1071</guid>
		<description>Lucene fits the bill.  

It&#039;s fast to add documents.  It doesn&#039;t allow you to add terms, but I don&#039;t know what you mean by adding a term without a doc.  You can get the (term,doc,count) pairs through an iterator of TermDocs, which contain terms, docs and frequency counts.  

What&#039;s nice about its implementation is that it compresses the inverted index reasonably well and allows you to run either in memory or on memory-mapped disk (on top of their Directory base class), so it&#039;s easy to move between the efficient setup and the scalable one.  (Though on-disk is surprisingly efficient because much of the indexing is in memory.)  

Removal is fast, but optimizing the index after a batch of adds and deletes, so that it is as compact and as fast to search as possible, is much slower: it basically applies a merge over the whole on-disk index.  

It&#039;s also fielded, but it sounds like you don&#039;t need this, so you could just use one field for all terms. 

It also does full-document search with a fairly flexible query language, and it&#039;s pretty fast.  It also lets you easily retrieve the term vectors for the documents.

Lucene also makes normalizers/tokenizers easy to plug in: write your own Analyzer implementation or use one of the existing ones.</description>
		<content:encoded><![CDATA[<p>Lucene fits the bill.  </p>
<p>It&#8217;s fast to add documents.  It doesn&#8217;t allow you to add terms, but I don&#8217;t know what you mean by adding a term without a doc.  You can get the (term,doc,count) pairs through an iterator of TermDocs, which contain terms, docs and frequency counts.  </p>
<p>What&#8217;s nice about its implementation is that it compresses the inverted index reasonably well and allows you to run either in memory or on memory-mapped disk (on top of their Directory base class), so it&#8217;s easy to move between the efficient setup and the scalable one.  (Though on-disk is surprisingly efficient because much of the indexing is in memory.)  </p>
<p>Removal is fast, but optimizing the index after a batch of adds and deletes, so that it is as compact and as fast to search as possible, is much slower: it basically applies a merge over the whole on-disk index.  </p>
<p>It&#8217;s also fielded, but it sounds like you don&#8217;t need this, so you could just use one field for all terms. </p>
<p>It also does full-document search with a fairly flexible query language, and it&#8217;s pretty fast.  It also lets you easily retrieve the term vectors for the documents.</p>
<p>Lucene also makes normalizers/tokenizers easy to plug in: write your own Analyzer implementation or use one of the existing ones.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ulises</title>
		<link>http://www.drmaciver.com/2009/10/dear-lazyweb-a-problem-on-cached-string-searching/#comment-1013</link>
		<dc:creator>Ulises</dc:creator>
		<pubDate>Tue, 27 Oct 2009 14:05:08 +0000</pubDate>
		<guid isPermaLink="false">http://www.drmaciver.com/?p=3277#comment-1013</guid>
		<description>Perhaps you could try another of the search libs, such as Lemur (http://www.lemurproject.org/). AFAIK the inverted index also keeps track of the term count in the document as well as position, and so does Lucene methinks. Lemur is written in C++ with interfaces for Java and some other languages and it&#039;s rather fast.</description>
		<content:encoded><![CDATA[<p>Perhaps you could try another of the search libs, such as Lemur (<a href="http://www.lemurproject.org/" rel="nofollow">http://www.lemurproject.org/</a>). AFAIK the inverted index also keeps track of the term count in the document as well as position, and so does Lucene methinks. Lemur is written in C++ with interfaces for Java and some other languages and it&#8217;s rather fast.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: david</title>
		<link>http://www.drmaciver.com/2009/10/dear-lazyweb-a-problem-on-cached-string-searching/#comment-1012</link>
		<dc:creator>david</dc:creator>
		<pubDate>Thu, 15 Oct 2009 16:34:13 +0000</pubDate>
		<guid isPermaLink="false">http://www.drmaciver.com/?p=3277#comment-1012</guid>
		<description>It might do. I honestly don&#039;t know! It could certainly be used for the inverted index of documents. But I don&#039;t know if it, for example, offers anything that would allow me to quickly add a document to the index in a way that also checked if it contained any of the existing keywords. I&#039;ve used it a little bit in the past and I&#039;ve not seen anything to that effect that would be less work than e.g. using an Aho-Corasick implementation.</description>
		<content:encoded><![CDATA[<p>It might do. I honestly don&#8217;t know! It could certainly be used for the inverted index of documents. But I don&#8217;t know if it, for example, offers anything that would allow me to quickly add a document to the index in a way that also checked if it contained any of the existing keywords. I&#8217;ve used it a little bit in the past and I&#8217;ve not seen anything to that effect that would be less work than e.g. using an Aho-Corasick implementation.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kevin</title>
		<link>http://www.drmaciver.com/2009/10/dear-lazyweb-a-problem-on-cached-string-searching/#comment-1011</link>
		<dc:creator>Kevin</dc:creator>
		<pubDate>Thu, 15 Oct 2009 16:23:08 +0000</pubDate>
		<guid isPermaLink="false">http://www.drmaciver.com/?p=3277#comment-1011</guid>
		<description>Does Apache Lucene fit your requirements?</description>
		<content:encoded><![CDATA[<p>Does Apache Lucene fit your requirements?</p>
]]></content:encoded>
	</item>
</channel>
</rss>

