<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments for David R. MacIver</title>
	<atom:link href="http://www.drmaciver.com/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.drmaciver.com</link>
	<description></description>
	<lastBuildDate>Wed, 10 Feb 2010 18:19:37 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>Comment on Dear lazyweb: A problem on cached string searching by Bob Carpenter</title>
		<link>http://www.drmaciver.com/2009/10/dear-lazyweb-a-problem-on-cached-string-searching/comment-page-1/#comment-1071</link>
		<dc:creator>Bob Carpenter</dc:creator>
		<pubDate>Wed, 10 Feb 2010 18:19:37 +0000</pubDate>
		<guid isPermaLink="false">http://www.drmaciver.com/?p=3277#comment-1071</guid>
		<description>Lucene fits the bill.  

It&#039;s fast to add documents.  It doesn&#039;t allow you to add terms, but I don&#039;t know what you mean by adding a term without a doc.  You can get the (term,doc,count) pairs through an iterator of TermDocs, which contain terms, docs and frequency counts.  

What&#039;s nice about its implementation is that it compresses the reverse index reasonably well and lets you run either in memory or on memory-mapped disk (on top of their Directory base class), so it&#039;s easy to move between the faster in-memory setup and the more scalable on-disk one.  (Though on-disk is surprisingly efficient because much of the indexing is done in memory.)  

Removal is fast, but optimizing the index after a bunch of adds/deletes, so that it&#039;s as compact and as fast to search as possible, is much slower: it basically applies a merge over the whole on-disk index.  

It&#039;s also fielded, but it sounds like you don&#039;t need this, so you could just use one field for all terms. 

It obviously also does full doc search with a pretty flexible query language and it&#039;s pretty fast.  It also lets you easily retrieve the term vectors for the documents.

Lucene also allows for easily pluggable normalizers/tokenizers by writing an Analyzer implementation or using the existing ones.</description>
		<content:encoded><![CDATA[<p>Lucene fits the bill.  </p>
<p>It&#8217;s fast to add documents.  It doesn&#8217;t allow you to add terms, but I don&#8217;t know what you mean by adding a term without a doc.  You can get the (term,doc,count) pairs through an iterator of TermDocs, which contain terms, docs and frequency counts.  </p>
<p>What&#8217;s nice about its implementation is that it compresses the reverse index reasonably well and lets you run either in memory or on memory-mapped disk (on top of their Directory base class), so it&#8217;s easy to move between the faster in-memory setup and the more scalable on-disk one.  (Though on-disk is surprisingly efficient because much of the indexing is done in memory.)  </p>
<p>Removal is fast, but optimizing the index after a bunch of adds/deletes, so that it&#8217;s as compact and as fast to search as possible, is much slower: it basically applies a merge over the whole on-disk index.  </p>
<p>It&#8217;s also fielded, but it sounds like you don&#8217;t need this, so you could just use one field for all terms. </p>
<p>It obviously also does full doc search with a pretty flexible query language and it&#8217;s pretty fast.  It also lets you easily retrieve the term vectors for the documents.</p>
<p>Lucene also allows for easily pluggable normalizers/tokenizers by writing an Analyzer implementation or using the existing ones.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Command line tools for NLP and Machine Learning by Kevin Brubeck Unhammer</title>
		<link>http://www.drmaciver.com/2009/04/command-line-tools-for-nlp-and-machine-learning/comment-page-1/#comment-1069</link>
		<dc:creator>Kevin Brubeck Unhammer</dc:creator>
		<pubDate>Mon, 01 Feb 2010 07:53:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.drmaciver.com/?p=507#comment-1069</guid>
		<description>Mustn&#039;t forget &lt;a href=&quot;http://software.wise-guys.nl/libtextcat/&quot; rel=&quot;nofollow&quot;&gt;libtextcat&lt;/a&gt;! 
It does &quot;N-Gram-Based Text Categorization&quot;, e.g. for language guessing, &quot;a task on which it is known to perform with near-perfect accuracy&quot;. Wonderful software.

Also, for the more rule-based crowd, the machine translation system Apertium is entirely command-line based (the main program is just a short shell script calling each module of a language pair pipeline), as is the &lt;a href=&quot;http://beta.visl.sdu.dk/constraint_grammar.html&quot; rel=&quot;nofollow&quot;&gt;vislcg3&lt;/a&gt; Constraint Grammar parser.</description>
		<content:encoded><![CDATA[<p>Mustn&#8217;t forget <a href="http://software.wise-guys.nl/libtextcat/" rel="nofollow">libtextcat</a>!<br />
It does &#8220;N-Gram-Based Text Categorization&#8221;, e.g. for language guessing, &#8220;a task on which it is known to perform with near-perfect accuracy&#8221;. Wonderful software.</p>
<p>Also, for the more rule-based crowd, the machine translation system Apertium is entirely command-line based (the main program is just a short shell script calling each module of a language pair pipeline), as is the <a href="http://beta.visl.sdu.dk/constraint_grammar.html" rel="nofollow">vislcg3</a> Constraint Grammar parser.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Tell us why your language sucks by david</title>
		<link>http://www.drmaciver.com/2008/02/tell-us-why-your-language-sucks/comment-page-1/#comment-1067</link>
		<dc:creator>david</dc:creator>
		<pubDate>Thu, 21 Jan 2010 12:27:44 +0000</pubDate>
		<guid isPermaLink="false">http://www.drmaciver.com/wordpress/?p=85#comment-1067</guid>
		<description>Absolutely, but sometimes the specific complaints are interesting. :-)</description>
		<content:encoded><![CDATA[<p>Absolutely, but sometimes the specific complaints are interesting. :-)</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Tell us why your language sucks by M.S. Babaei</title>
		<link>http://www.drmaciver.com/2008/02/tell-us-why-your-language-sucks/comment-page-1/#comment-1065</link>
		<dc:creator>M.S. Babaei</dc:creator>
		<pubDate>Thu, 21 Jan 2010 01:33:23 +0000</pubDate>
		<guid isPermaLink="false">http://www.drmaciver.com/wordpress/?p=85#comment-1065</guid>
		<description>&quot;There are only two kinds of languages: the ones people complain about and the ones nobody uses&quot; --Bjarne Stroustrup

http://www2.research.att.com/~bs/bs_faq.html#really-say-that</description>
		<content:encoded><![CDATA[<p>&#8220;There are only two kinds of languages: the ones people complain about and the ones nobody uses&#8221; &#8211;Bjarne Stroustrup</p>
<p><a href="http://www2.research.att.com/~bs/bs_faq.html#really-say-that" rel="nofollow">http://www2.research.att.com/~bs/bs_faq.html#really-say-that</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Tell us why your language sucks by bgsu_drew</title>
		<link>http://www.drmaciver.com/2008/02/tell-us-why-your-language-sucks/comment-page-1/#comment-1062</link>
		<dc:creator>bgsu_drew</dc:creator>
		<pubDate>Sat, 09 Jan 2010 02:29:29 +0000</pubDate>
		<guid isPermaLink="false">http://www.drmaciver.com/wordpress/?p=85#comment-1062</guid>
		<description>Java:
* Too low level - try doing common tasks such as converting a file to a string or validating XML against a schema.  Java is definitely not meant for scripting.
* Tools are maddeningly slow: try redeploying to an IBM WebSphere server.
* Constantly changing enterprise standards that make it feel like you are trying to hit a moving target... and none of which ever seem to get the API quite right.  Is EJB3 great?  I dunno, cause my company is stuck using EJB2 on WAS6.1 on J2EE v1.4.
* Java code is not elegant or fun.  Yeah, it&#039;s subjective, but there are no closures (Predicates don&#039;t count) and no literal collection initialization.  
* Many frameworks required to do anything useful and high barrier to entry.  Front End (Struts) + Data Persistence (Hibernate) + Wire it all together (Spring) + Build System (Maven) + App Server (Glassfish).  This is exactly why Rails and Grails are so attractive to web developers.</description>
		<content:encoded><![CDATA[<p>Java:<br />
* Too low level &#8211; try doing common tasks such as converting a file to a string or validating XML against a schema.  Java is definitely not meant for scripting.<br />
* Tools are maddeningly slow: try redeploying to an IBM WebSphere server.<br />
* Constantly changing enterprise standards that make it feel like you are trying to hit a moving target&#8230; and none of which ever seem to get the API quite right.  Is EJB3 great?  I dunno, cause my company is stuck using EJB2 on WAS6.1 on J2EE v1.4.<br />
* Java code is not elegant or fun.  Yeah, it&#8217;s subjective, but there are no closures (Predicates don&#8217;t count) and no literal collection initialization.<br />
* Many frameworks required to do anything useful and high barrier to entry.  Front End (Struts) + Data Persistence (Hibernate) + Wire it all together (Spring) + Build System (Maven) + App Server (Glassfish).  This is exactly why Rails and Grails are so attractive to web developers.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
