Archive for October, 2009

Dear lazyweb: A problem on cached string searching

Thursday, October 15th, 2009

So, I have the following problem:

I have a bunch of terms and a bunch of documents. I want to maintain a database of which terms appear in which documents, and ideally a count of how many times those terms appear in the document. “appear” can mean a literal byte substring – no need for any sort of fuzzy matching. I want the following operations to be fast:

  • Adding a term
  • Adding a document
  • Getting a list of all (Term, Document, Count) pairs
  • Getting the count for a specific (Term, Document) pair

Bonus features:

  • Fast removal of documents and/or terms
  • Retrieval of all documents containing a given term or all terms in a given document

It’s clearly relatively easy to construct something like this using a combination of an aho-corasick style search tree (I’m not totally confident such can be built incrementally, but worst comes to worst you can save a bunch of “patches” and amortize the cost at retrieval time), a dbm style database for maintaining the counts and an inverted index on the text, but I’m wondering if there’s some appropriate structure more optimised for this. Anyone know of anything?

Experiments on the theme of cauliflower and cheese

Tuesday, October 13th, 2009

I had a huge (and I really do mean huge) cauliflower from Abel and Cole that I wasn’t sure what to do with – it’s not a vegetable I normally use for much – so I figured I’d give Cauliflower and Cheese a try. It’s not exactly the most innovative recipe in the world and, while it’s one I like, not something I’m massively excited by. But why not? It’s not something I’d actually ever made before, and it’d keep well as a side dish.

I used the Good Housekeeping cauliflower and cheese recipe as the basis for this, but only loosely followed it. You could probably use any recipe for it that works well. But I tried the following variations on it:

  • I roasted the cauliflower instead of boiled it. Basically all I did was floret it as normal, put it in the dish and roast it covered at 200C for 20-30 minutes with a bit of butter and salt (ok, a lot of butter and a bit of salt). Then I just added the sauce and cheese as normal on top of that.
  • Additionally I added some leeks to the mix, roasting them along with the cauliflower.
  • I added halved cherry tomatoes at the same time as the sauce.

This worked incredibly well. It took a nice dish and made it excellent. I had this as the main meal, served with a whole grain bread, and I’ve been very happily eating leftovers of it since.

Am I being boring?

Sunday, October 11th, 2009

I had an enlightening conversation with Mark Wotton on twitter recently. It started when I gave the following advice:

Writing advice: You should have a constant mental process going asking “Is this bit I’m writing boring?”. If it is, delete or rewrite it.

I’d meant it to apply to prose. I was reading an article which took an important and exciting piece of information and made it deathly dull. I don’t want to link to the article, but I’m sure you can think of examples yourself.

Mark replied:

I thought you were talking about code for a second there – it actually works pretty well there, too.

This wasn’t a context in which I’d thought about it before.

I won’t quote the entire conversation here (you can check it out at the source if you care), but the conclusions from it were interesting.

A lot of code is boring, and sometimes this is ok. Most code does boring things – display a web page, convert data from this format to that format, write out an error message, etc. Boring tasks are ubiquitous and necessary to get things done.

But, like writing, you can write about things which are boring and you can write about things in boring ways.

What does boring writing look like? Well, it could contain a lot of repetition, it could take forever to get to the point making it non-obvious what it’s about, it could include a lot of irrelevancies…

Hmm. None of those sound like very good things to do when coding, do they?

As people are fond of saying, code should really be first about telling other people what it should do and secondarily about getting the computer to execute it. Coding is a form of writing, and as such it needs to keep the reader interested or their boredom will get in the way of their understanding (side note: this is not the same as making the reader work hard to understand it – not being boring is not the same as being overly clever).

So, it’s ok to write code that is boring because what it is trying to do is inherently boring, but you shouldn’t add unnecessary boredom to the code.

Hmm. That sounds familiar.

So, to borrow the terminology from Fred Brooks: There are two types of boredom. Essential boredom, which is inherent to the problem being solved, and accidental boredom, which is introduced by the programmer. Seek to minimize the latter.