Archive for the ‘Code’ Category

Filtering deleted documents with PostgreSQL rules

Friday, November 13th, 2009

I’m currently working on a Mysterious Project (coming soon to an internet near you) which involves a lot of user generated content (Yes, fine, slap a 2.0 on my name and call me “still in beta”). As such, it’s got all the usual problems with user generated content. In particular it has spam.

So, we need some sort of spam filtering in place to make sure we never show spam to users. But we don’t want to delete spam from the database – partly in case of mistakes, partly because we want to use the data for automated classification of spam.

Ok, this is easy enough to do. You add a flag “spam” to the table and don’t show the user anything flagged as spam.

The problem here is that this content gets used in all sorts of contexts, and it’s really annoying to have to add “where != spam” here.

No problem. We create a view. That’s what they’re for.

But this is slightly annoying: Basically all our access to content goes through this view, but modifications to content have to go through the original table. It would be really nice if we could have all updates and inserts going to the same thing we access the data from. “really nice” is partly aesthetic, but there’s also a boring practical reason: We’re using an ORM (ActiveRecord in fact. Sigh), and we’d like the ORM to access the filtered version, but we’d also like to be able to update the same objects.

Hang on. We’re using PostgreSQL. There’s an app… err. feature for that.

PostgreSQL has a feature called rules which allow you to change the meaning of various operations on a table (views in PostgreSQL are also tables). We can use these to make our view updateable. Let’s see how.

We’ll start with a slightly abstracted version of the problem. Instead of thinking about spam filtering we’ll concern ourselves with deleting posts. We want to retain the old posts but not show them:

david=# create sequence post_ids;
CREATE SEQUENCE
david=# create table unfiltered_posts(id int primary key default nextval('post_ids'),
david(#                                        body text,
david(#                                        deleted boolean not null default false);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "unfiltered_posts_pkey" for table "unfiltered_posts"
CREATE TABLE

So we first create our view that will be the posts which have not been deleted:

david=# create view posts as select * from unfiltered_posts where not deleted;
CREATE VIEW

Making sure everything’s working as expected:

david=# select * from posts;
 id | body | deleted
----+------+---------
(0 rows)

david=# insert into unfiltered_posts(body, deleted) values('I like kittens', false);
INSERT 0 1

david=# insert into unfiltered_posts(body, deleted) values('I don''t like kittens', true);
INSERT 0 1

david=# select * from posts;
 id |      body      | deleted
----+----------------+---------
  1 | I like kittens | f
(1 row)

david=# select * from unfiltered_posts ;
 id |         body         | deleted
----+----------------------+---------
  1 | I like kittens       | f
  2 | I don't like kittens | t
(2 rows)

So all working as expected: unfiltered_posts marked deleted don’t show up in the view.

But of course this was the bit we already knew how to do. What doesn’t work is inserting into the view:

david=# insert into posts(body) values('I am the very model of a modern major general');
ERROR:  cannot insert into a view
HINT:  You need an unconditional ON INSERT DO INSTEAD rule.

Indeed it doesn’t work. But it does give us a nice hint of what to do next.

david=# create or replace rule insert_into_posts as on insert to posts do instead insert into unfiltered_posts(body) values(NEW.body);
CREATE RULE

So, now we can insert into the view:

david=# insert into posts(body) values('I am the very model of a modern major general');
INSERT 0 1
david=# select * from posts;
 id |                     body                      | deleted
----+-----------------------------------------------+---------
  1 | I like kittens                                | f
  4 | I am the very model of a modern major general | f

This works, but I find it a bit ugly. The problem here is that you have to explicitly enumerate the fields in order for this to work. I couldn’t find a terribly satisfactory solution unfortunately. So if someone is reading this who knows more about postgresql than I do I’d love go get some hints.

The following does work as an alternative:

david=# create or replace rule insert_into_posts as on insert to posts do instead insert into unfiltered_posts values(NEW.*);
CREATE RULE

But the problem is that it plays badly with the defaults. If we try this we get:

david=# insert into posts(body) values('I am the very model of a modern major general');
ERROR:  null value in column "id" violates not-null constraint

The problem is that inserting null into a not-null column doesn’t replace null with the default value. It would be nice if it did as that would make this easy, but oh well (this isn’t postgresql specific behaviour. I’m not aware of any database where inserting null into a not null default blah column will work. Certainly MySQL does the same thing). You could probably make this work with a before insert or update trigger, but that’s a little gross.

An alternative version which offers slightly better functionality but still requires you to explicitly enumerate the columns in the rule is the following:

david=# create or replace rule insert_into_posts as on insert to posts do instead insert into unfiltered_posts values(coalesce(NEW.id, nextval('post_ids')), NEW.body, coalesce(NEW.deleted, false));
CREATE RULE
david=# insert into posts(body) values('I''ve information vegetable, animal, and mineral');
INSERT 0 1
david=# select * from posts;
 id |                      body                       | deleted
----+-------------------------------------------------+---------
  1 | I like kittens                                  | f
  3 | I am the very model of a modern major general   | f
  4 | I've information vegetable, animal, and mineral | f

This requires us to duplicate the defaults as well as the columns, which is rather annoying, but at least it works satisfactorily (note: Some of you will complain that I didn’t explicitly enumerate the columns in the insert into. This is deliberate – the view will break if I change the table structure in any interesting way. If I explicitly enumerated the column names it would instead silently do the wrong thing).

So, this works. We can do the same on update:


david=#   create or replace rule update_to_posts
david-#   as on update to posts
david-#   do instead
david-#      update unfiltered_posts
david-#      set id = coalesce(NEW.id, OLD.id),
david-#           body = coalesce(NEW.body, OLD.body),
david-#           deleted = coalesce(NEW.deleted, OLD.deleted)
david-#      where id = OLD.id;
CREATE RULE

david=# update posts set deleted = true where id = 4;
UPDATE 1
david=# select * from posts;
 id |                     body                      | deleted
----+-----------------------------------------------+---------
  1 | I like kittens                                | f
  3 | I am the very model of a modern major general | f
(2 rows)

david=# select * from unfiltered_posts;
 id |                      body                       | deleted
----+-------------------------------------------------+---------
  1 | I like kittens                                  | f
  2 | I don't like kittens                            | t
  3 | I am the very model of a modern major general   | f
  4 | I've information vegetable, animal, and mineral | t
(4 rows)

So now updating things in posts works. Note that if we try to update a filtered post it will not work:

david=# update posts set body = 'kittens' where id = 4;
UPDATE 0
david=# select * from unfiltered_posts ;
 id |                      body                       | deleted
----+-------------------------------------------------+---------
  1 | I like kittens                                  | f
  2 | I don't like kittens                            | t
  3 | I am the very model of a modern major general   | f
  4 | I've information vegetable, animal, and mineral | t
(4 rows)

And, finally, we want to hook deletion into it. Obviously we don’t want deletion to delete things from the underlying table but instead to set their deleted flag to be false:

david=# create or replace rule delete_posts
david-# as on delete to posts do instead
david-# update unfiltered_posts
david-# set deleted = true where id = OLD.id;
CREATE RULE
david=# select * from posts;
 id |                     body                      | deleted
----+-----------------------------------------------+---------
  1 | I like kittens                                | f
  3 | I am the very model of a modern major general | f
(2 rows)

david=# delete from posts where id = 3;
DELETE 0
david=# select * from posts;
 id |      body      | deleted
----+----------------+---------
  1 | I like kittens | f
(1 row)

david=# select * from unfiltered_posts;
 id |                      body                       | deleted
----+-------------------------------------------------+---------
  1 | I like kittens                                  | f
  2 | I don't like kittens                            | t
  4 | I've information vegetable, animal, and mineral | t
  3 | I am the very model of a modern major general   | t
(4 rows)

So there we have it: A view which we can insert into, update and delete. Despite the slight annoyances around default values, this is definitely a really neat feature. I look forward to exploring its use.

If you want to have a play with this, I’ve created a gist containing the table, view and rules.

Dear lazyweb: A problem on cached string searching

Thursday, October 15th, 2009

So, I have the following problem:

I have a bunch of terms and a bunch of documents. I want to maintain a database of which terms appear in which documents, and ideally a count of how many times those terms appear in the document. “appear” can mean a literal byte substring – no need for any sort of fuzzy matching. I want the following operations to be fast:

  • Adding a term
  • Adding a document
  • Getting a list of all (Term, Document, Count) pairs
  • Getting the count for a specific (Term, Document) pair

Bonus features:

  • Fast removal of documents and/or terms
  • Retrieval of all documents containing a given term or all terms in a given document

It’s clearly relatively easy to construct something like this using a combination of an aho-corasick style search tree (I’m not totally confident such can be built incrementally, but worst comes to worst you can save a bunch of “patches” and amortize the cost at retrieval time), a dbm style database for maintaining the counts and an inverted index on the text, but I’m wondering if there’s some appropriate structure more optimised for this. Anyone know of anything?

Am I being boring?

Sunday, October 11th, 2009

I had an enlightening conversation with Mark Wotton on twitter recently. It started when I gave the following advice:

Writing advice: You should have a constant mental process going asking “Is this bit I’m writing boring?”. If it is, delete or rewrite it.

I’d meant it to apply to prose. I was reading an article which took an important and exciting piece of information and made it deathly dull. I don’t want to link to the article, but I’m sure you can think of examples yourself.

Mark replied:

I thought you were talking about code for a second there – it actually works pretty well there, too.

This wasn’t a context in which I’d thought about it before.

I won’t quote the entire conversation here (you can check it out at the source if you care), but the conclusions from it were interesting.

A lot of code is boring, and sometimes this is ok. Most code does boring things – display a web page, convert data from this format to that format, write out an error message, etc. Boring tasks are ubiquitous and necessary to get things done.

But, like writing, you can write about things which are boring and you can write about things in boring ways.

What does boring writing look like? Well, it could contain a lot of repetition, it could take forever to get to the point making it non-obvious what it’s about, it could include a lot of irrelevancies…

Hmm. None of those sound like very good things to do when coding, do they?

As people are fond of saying, code should really be first about telling other people what it should do and secondarily about getting the computer to execute it. Coding is a form of writing, and as such it needs to keep the reader interested or their boredom will get in the way of their understanding (side note: this is not the same as making the reader work hard to understand it – not being boring is not the same as being overly clever).

So, it’s ok to write code that is boring because what it is trying to do is inherently boring, but you shouldn’t add unnecessary boredom to the code.

Hmm. That sounds familiar.

So, to borrow the terminology from Fred Brooks: There are two types of boredom. Essential boredom, which is inherent to the problem being solved, and accidental boredom, which is introduced by the programmer. Seek to minimize the latter.

Open source term extraction

Monday, August 17th, 2009

This is just a quick announcement to let people know that we’ve open sourced our JRuby library for term extraction. You can get the code from my github page.

Unlike a lot of term extraction libraries, this doesn’t take any stance as to the “significance” of the terms it extracts. It’s purely about looking at the syntax and determining where good boundaries for terms are. There are a couple reasons for this, but basically we’ve found that it’s more effective to separate the two steps and makes it easier to tinker around with them independently. The criteria for “interestingness” of terms seem to be largely distinct from those for terms which simply make sense linguistically. So we have a two stage pipeline, one which extracts semantically meaningful terms and one which determines what terms are actually interesting in the context of the document. The second step is much more complicated, and we’re not open sourcing that (yet? probably not any time soon, if ever. Even if we wanted to, it relies on a lot more global information across the document corpus and so is very tied in with how SONAR operates, making it much harder to isolate).

So, how does it work? Black magic and voodoo!

Actually, no. It’s pretty straightforward. It builds on top of the excellent OpenNLP library, using its tools for part of speech tagging, sentence splitting (a much harder problem than you’d imagine) and phrase chunking. It’s currently a rules based system on top of there, as while you’re figuring things out it makes much more sense to stick with something so easily fine tunable. Our expectation is that we’ll gradually start replacing bits of it with machine learning based techniques as we start to hit the limitations of a rules based system, but for now it’s working pretty well.

Let’s have an example. If we feed the second paragraph of this post into the term extractor, we get the following terms back:

term extraction libraries
stance
terms
syntax
good boundaries
couple reasons
two steps
steps
criteria
interestingness
sense
two stage pipeline
stage pipeline
semantically meaningful terms
context
context of the document
document
second step
open sourcing
time
document corpus
SONAR

Hope you find this useful. Let us know if you build anything cool with it!

Open sourcing Pearson’s Correlation calculations

Wednesday, April 22nd, 2009

As you might recall, I did some articles on calculating Pearson’s in SQL.

It turns out that this is a hilariously bad idea. The performance you get for it is terrible when the numbers get large. Switching to PostgreSQL seemed to help a bit here, but even then the numbers are not great (and we still aren’t planning on a port to PostgreSQL anyway). So we needed to find a better solution. Doing it in memory would be fast, but it would just fall over on a large dataset.

Anyway, after some tinkering around I came up with a slightly unholy solution. It’s a mix of bash, awk, standard unix tools and Java (the Java parts may be rewritten in something else later). The design is such that much of the heavy lifting is offloaded to sort, which is offline so doesn’t need to load the whole dataset into memory, and processes things in a line oriented manner. This lets it get by with a very reasonable memory usage and, in my fairly informal tests, to perform about 50 times faster than the SQL version.

We’re releasing the code under a BSD license and making it available on github. It’s in a bit of a rough state at the moment, but is usable as is.