Author Archives: david

I want ONE MEELYUN sentences

I’ve been planning to do some work on my term extractor to make it a bit smarter. It’s currently a rule-based system on top of various machine learning tools. This is perfectly legitimate, but it’s starting to hit the limitations of that approach. I’d like to experiment with a more intelligent approach using machine learning more directly.

To do this, though, I need a training set. My plan is to build a first pass by running the existing version over some sentence corpus and then editing the output to taste.

Of course, to do this I need a decent sentence corpus. So today I set out to generate one. It was a lot fiddlier than it should have been, but I think in the end I’ve got a decent one.

I’m presumably not the only person to need something like this, so I’m making a largish sample of it available. It’s not hard to generate yourself but it’s something of a pain, so maybe I can save you some effort.

So, here you go. A bzipped list of one million random sentences from Wikipedia.

The format is obvious: Plain text, one sentence per line.
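Since the file is just bzip2-compressed text with one sentence per line, reading it back takes very little code. Here is a minimal sketch in Python; the function name `read_sentences` and the file name in the usage note are made up for illustration, and UTF-8 encoding is an assumption rather than something the dump guarantees.

```python
import bz2
from typing import Iterator


def read_sentences(path: str) -> Iterator[str]:
    """Yield one sentence per non-empty line of a bzip2-compressed text file.

    Assumption: the file is UTF-8 encoded. Undecodable bytes are replaced
    rather than raising, since the data is known to contain some noise.
    """
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            sentence = line.rstrip("\n")
            if sentence:
                yield sentence
```

Something like `for sentence in read_sentences("sentences.txt.bz2"): ...` (with whatever the downloaded file is actually called) streams the corpus without decompressing it to disk first.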

I make no guarantees about the quality of the data (there’s definitely some noise), and I definitely don’t claim this to be a statistically fair sample of Wikipedia. But initial impressions are that it’s a reasonably good list. Certainly it should be good enough for my purposes.

This entry was posted in computational linguistics, programming.

Shaving yaks and finding feeds

So I had some interesting ideas I wanted to play with to do with keeping on top of streams of information.

Of course, I needed some streams of information to keep on top of in order to do this. I decided to go with my RSS feeds (the other obvious source being Twitter).

To do that I needed a database of feed entries. So I created a small program to do that (I really should just have used feed-bag, but there were some things I wanted to tweak and integrate so I didn’t).

Unfortunately, for whatever reason, I ended up with a lot of URLs in my OPML that pointed to sites rather than feeds, or to something invalid. I’m not sure offhand if this was an import problem or a problem in the Google Reader export.

So, I thought, let’s do our damnedest to correct URLs: If it points to a site do feed discovery, follow redirects, etc. It can’t be that hard.

Cue me getting very angry. Suffice it to say, if you do what I did and foolishly expect people on the web to follow standards you are very mistaken.

Anyway, after much hacking around trying to get this to work I decided to codify the various tricks into a library so you don’t have to share my anger. I’ve called this library feedify. This is very rude of me as there’s another Ruby library called feedify, but given that it hit 0.0.1 in January 2008 and hasn’t been updated since then I don’t feel too bad about stomping on its namespace.
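For a sense of what the well-behaved case looks like, here is a rough Python sketch of standard feed autodiscovery: scan a page for `<link rel="alternate">` tags that declare an RSS or Atom type and resolve their `href` against the page URL. This is not feedify’s code, and the names `FeedLinkParser` and `discover_feeds` are invented for the example. Fetching the page, following redirects and coping with sites that ignore the convention (the hard part) are all left out.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

# MIME types conventionally used to advertise feeds in <link> tags.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised via <link rel="alternate" ...> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.feeds: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None for bare attributes; treat as empty.
        attributes = {name: (value or "") for name, value in attrs}
        rel = attributes.get("rel", "").lower().split()
        mime = attributes.get("type", "").lower().strip()
        href = attributes.get("href", "")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Relative hrefs are resolved against the page's own URL.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html: str, base_url: str) -> List[str]:
    """Return the feed URLs a page advertises, in document order."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

Given the HTML of a page that follows the convention, `discover_feeds(html, page_url)` returns the advertised feed URLs; an empty list means you are into fallback-trick territory.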

Additionally I’ve put up an HTTP interface to it. If you go to http://feedify.merobe.com/feed/(some url) then it will try to find a feed associated with that URL and redirect you to it. You can also run this service yourself – it’s included in the GitHub project.

This is all very rough and liable to change at the moment. If you have any bug reports of URLs it misses or gets wrong I’d be very interested to receive them.

This entry was posted in programming.

Filtering deleted documents with PostgreSQL rules

I’m currently working on a Mysterious Project (coming soon to an internet near you) which involves a lot of user generated content (Yes, fine, slap a 2.0 on my name and call me “still in beta”). As such, it’s got all the usual problems with user generated content. In particular it has spam.

So, we need some sort of spam filtering in place to make sure we never show spam to users. But we don’t want to delete spam from the database – partly in case of mistakes, partly because we want to use the data for automated classification of spam.

Ok, this is easy enough to do. You add a flag “spam” to the table and don’t show the user anything flagged as spam.

The problem here is that this content gets used in all sorts of contexts, and it’s really annoying to have to add “where not spam” to every query.

No problem. We create a view. That’s what they’re for.

But this is slightly annoying: Basically all our access to content goes through this view, but modifications to content have to go through the original table. It would be really nice if we could have all updates and inserts going to the same thing we access the data from. “really nice” is partly aesthetic, but there’s also a boring practical reason: We’re using an ORM (ActiveRecord in fact. Sigh), and we’d like the ORM to access the filtered version, but we’d also like to be able to update the same objects.

Hang on. We’re using PostgreSQL. There’s an app… err. feature for that.

PostgreSQL has a feature called rules, which allows you to change the meaning of various operations on a table (in PostgreSQL, rules can be attached to views as well as tables). We can use these to make our view updateable. Let’s see how.

We’ll start with a slightly abstracted version of the problem. Instead of thinking about spam filtering we’ll concern ourselves with deleting posts. We want to retain the old posts but not show them:

david=# create sequence post_ids;
CREATE SEQUENCE
david=# create table unfiltered_posts(id int primary key default nextval('post_ids'), 
david(#                                        body text, 
david(#                                        deleted boolean not null default false);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "unfiltered_posts_pkey" for table "unfiltered_posts"
CREATE TABLE

So we first create our view that will be the posts which have not been deleted:

david=# create view posts as select * from unfiltered_posts where not deleted;
CREATE VIEW

Making sure everything’s working as expected:

david=# select * from posts;
 id | body | deleted 
----+------+---------
(0 rows)

david=# insert into unfiltered_posts(body, deleted) values('I like kittens', false);
INSERT 0 1

david=# insert into unfiltered_posts(body, deleted) values('I don''t like kittens', true);
INSERT 0 1

david=# select * from posts;
 id |      body      | deleted 
----+----------------+---------
  1 | I like kittens | f
(1 row)

david=# select * from unfiltered_posts ;
 id |         body         | deleted 
----+----------------------+---------
  1 | I like kittens       | f
  2 | I don't like kittens | t
(2 rows)

So all working as expected: unfiltered_posts marked deleted don’t show up in the view.

But of course this was the bit we already knew how to do. What doesn’t work is inserting into the view:

david=# insert into posts(body) values('I am the very model of a modern major general');
ERROR:  cannot insert into a view
HINT:  You need an unconditional ON INSERT DO INSTEAD rule.

Indeed it doesn’t work. But it does give us a nice hint of what to do next.

david=# create or replace rule insert_into_posts as on insert to posts do instead insert into unfiltered_posts(body) values(NEW.body);
CREATE RULE

So, now we can insert into the view:

david=# insert into posts(body) values('I am the very model of a modern major general');
INSERT 0 1
david=# select * from posts;
 id |                     body                      | deleted 
----+-----------------------------------------------+---------
  1 | I like kittens                                | f
  3 | I am the very model of a modern major general | f
(2 rows)

This works, but I find it a bit ugly. The problem here is that you have to explicitly enumerate the fields in order for this to work. Unfortunately I couldn’t find a terribly satisfactory solution. So if someone is reading this who knows more about PostgreSQL than I do, I’d love to get some hints.

The following does work as an alternative:

david=# create or replace rule insert_into_posts as on insert to posts do instead insert into unfiltered_posts values(NEW.*);
CREATE RULE

But the problem is that it plays badly with the defaults. If we try this we get:

david=# insert into posts(body) values('I am the very model of a modern major general');
ERROR:  null value in column "id" violates not-null constraint

The problem is that inserting null into a not-null column doesn’t replace null with the default value. It would be nice if it did, as that would make this easy, but oh well (this isn’t PostgreSQL-specific behaviour. I’m not aware of any database where inserting null into an ordinary not null default blah column will work. MySQL behaves the same way for ordinary columns, although it does special-case AUTO_INCREMENT columns). You could probably make this work with a before insert or update trigger, but that’s a little gross.

An alternative version which offers slightly better functionality but still requires you to explicitly enumerate the columns in the rule is the following:

david=# create or replace rule insert_into_posts as on insert to posts do instead insert into unfiltered_posts values(coalesce(NEW.id, nextval('post_ids')), NEW.body, coalesce(NEW.deleted, false));
CREATE RULE
david=# insert into posts(body) values('I''ve information vegetable, animal, and mineral');
INSERT 0 1
david=# select * from posts;
 id |                      body                       | deleted 
----+-------------------------------------------------+---------
  1 | I like kittens                                  | f
  3 | I am the very model of a modern major general   | f
  4 | I've information vegetable, animal, and mineral | f
(3 rows)

This requires us to duplicate the defaults as well as the columns, which is rather annoying, but at least it works satisfactorily (note: Some of you will complain that I didn’t explicitly enumerate the columns in the insert into. This is deliberate – the view will break if I change the table structure in any interesting way. If I explicitly enumerated the column names it would instead silently do the wrong thing).

So, this works. We can do the same on update:

david=#   create or replace rule update_to_posts 
david-#   as on update to posts 
david-#   do instead 
david-#      update unfiltered_posts 
david-#      set id = coalesce(NEW.id, OLD.id), 
david-#           body = coalesce(NEW.body, OLD.body), 
david-#           deleted = coalesce(NEW.deleted, OLD.deleted) 
david-#      where id = OLD.id;
CREATE RULE

david=# update posts set deleted = true where id = 4;
UPDATE 1
david=# select * from posts;
 id |                     body                      | deleted 
----+-----------------------------------------------+---------
  1 | I like kittens                                | f
  3 | I am the very model of a modern major general | f
(2 rows)

david=# select * from unfiltered_posts;
 id |                      body                       | deleted 
----+-------------------------------------------------+---------
  1 | I like kittens                                  | f
  2 | I don't like kittens                            | t
  3 | I am the very model of a modern major general   | f
  4 | I've information vegetable, animal, and mineral | t
(4 rows)

So now updating things in posts works. Note that if we try to update a filtered post it will not work:

david=# update posts set body = 'kittens' where id = 4;
UPDATE 0
david=# select * from unfiltered_posts ;
 id |                      body                       | deleted 
----+-------------------------------------------------+---------
  1 | I like kittens                                  | f
  2 | I don't like kittens                            | t
  3 | I am the very model of a modern major general   | f
  4 | I've information vegetable, animal, and mineral | t
(4 rows)

And, finally, we want to hook deletion into it. Obviously we don’t want deletion to delete things from the underlying table but instead to set their deleted flag to true:

david=# create or replace rule delete_posts 
david-# as on delete to posts do instead 
david-# update unfiltered_posts 
david-# set deleted = true where id = OLD.id;
CREATE RULE
david=# select * from posts;
 id |                     body                      | deleted 
----+-----------------------------------------------+---------
  1 | I like kittens                                | f
  3 | I am the very model of a modern major general | f
(2 rows)

david=# delete from posts where id = 3;
DELETE 0
david=# select * from posts;
 id |      body      | deleted 
----+----------------+---------
  1 | I like kittens | f
(1 row)

david=# select * from unfiltered_posts;
 id |                      body                       | deleted 
----+-------------------------------------------------+---------
  1 | I like kittens                                  | f
  2 | I don't like kittens                            | t
  4 | I've information vegetable, animal, and mineral | t
  3 | I am the very model of a modern major general   | t
(4 rows)

So there we have it: A view which we can insert into, update and delete from. Despite the slight annoyances around default values, this is definitely a really neat feature. I look forward to exploring its use.
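For anyone without a PostgreSQL install to hand, the same soft-delete idea can be tried out in SQLite from Python. To be clear, this is a different mechanism: SQLite has no rules, so the sketch below swaps in INSTEAD OF triggers on the view. The table, view and trigger names mirror the ones above, the helper name `make_connection` is made up, and none of this is meant to suggest the two systems behave identically.

```python
import sqlite3

SCHEMA = """
CREATE TABLE unfiltered_posts(
    id INTEGER PRIMARY KEY,            -- SQLite assigns ids automatically
    body TEXT,
    deleted INTEGER NOT NULL DEFAULT 0 -- 0 = visible, 1 = soft-deleted
);

CREATE VIEW posts AS SELECT * FROM unfiltered_posts WHERE NOT deleted;

-- Inserts into the view go to the underlying table, using its defaults.
CREATE TRIGGER insert_into_posts INSTEAD OF INSERT ON posts
BEGIN
    INSERT INTO unfiltered_posts(body) VALUES (NEW.body);
END;

-- Updates through the view only ever see (and touch) visible rows.
CREATE TRIGGER update_to_posts INSTEAD OF UPDATE ON posts
BEGIN
    UPDATE unfiltered_posts
    SET body = NEW.body, deleted = NEW.deleted
    WHERE id = OLD.id;
END;

-- Deleting from the view just sets the flag on the underlying row.
CREATE TRIGGER delete_posts INSTEAD OF DELETE ON posts
BEGIN
    UPDATE unfiltered_posts SET deleted = 1 WHERE id = OLD.id;
END;
"""


def make_connection() -> sqlite3.Connection:
    """Return an in-memory database with the table, view and triggers set up."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


if __name__ == "__main__":
    conn = make_connection()
    conn.execute("INSERT INTO posts(body) VALUES ('I like kittens')")
    conn.execute("DELETE FROM posts WHERE id = 1")
    print(conn.execute("SELECT * FROM posts").fetchall())
    print(conn.execute("SELECT * FROM unfiltered_posts").fetchall())
```

Because the insert trigger only passes `body` through, it has the same "enumerate the columns yourself" drawback as the first rule above.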

If you want to have a play with this, I’ve created a gist containing the table, view and rules.

This entry was posted in Code, SQL.

Potato and butternut squash “pizza”

I originally described this dish as somewhere between a gratin and a deep dish pizza. It turns out this description is the result of my massively misunderstanding what a gratin is.

So instead it’s more like a deep dish pizza would be if you replaced the base with thinly sliced potatoes and butternut squash.

As with so much I make, it’s the result of my being bored and wanting to make something nice, but having neither any inclination to shop (also I’m currently visiting my parents, so shopping is a lot harder than it normally would be) nor any idea what I wanted to make.

In this case, the result is really good.

How it works:

The base

We make the dish in a deep dish pizza tray. The base is made out of thinly sliced potato and butternut squash (I use a food processor for the slicing).

Oil the tray (I used olive oil) and cover the bottom with sliced potatoes. Drizzle them with oil and sprinkle with salt and rosemary. Cover with the butternut squash and drizzle that with oil and sprinkle with salt (I think the rosemary will burn if you put it on top, so I didn’t put any on). Roast on a high heat for 25 minutes.

The sauce

The sauce I made is as follows. Really any tomato sauce would work, but this worked particularly well.

Basically: Puree a small onion, fry it with olive oil and about a tbsp of dark brown sugar. Once it’s caramelised a bit, add a can of chopped tomatoes, a dash of red wine vinegar, some paprika and herbes de Provence and let it reduce.

The assembly

Once the base has roasted for 25 minutes, take it out of the oven, cover it in the tomato sauce and then cover that in grated cheese. The cheeses I used were mostly Comté, with a little bit of Manchego. Then put it back in the oven for another 15 minutes, still at the roasting temperature.

Then take it out of the oven and eat it. All of it.

There were three of us having this – myself and my two parents. I expected there to be plenty of leftovers. Instead there was a very cleaned out pizza dish.

This entry was posted in Food.

Dear lazyweb: A problem on cached string searching

So, I have the following problem:

I have a bunch of terms and a bunch of documents. I want to maintain a database of which terms appear in which documents, and ideally a count of how many times those terms appear in the document. “appear” can mean a literal byte substring – no need for any sort of fuzzy matching. I want the following operations to be fast:

  • Adding a term
  • Adding a document
  • Getting a list of all (Term, Document, Count) triples
  • Getting the count for a specific (Term, Document) pair

Bonus features:

  • Fast removal of documents and/or terms
  • Retrieval of all documents containing a given term or all terms in a given document

It’s clearly relatively easy to construct something like this using a combination of an Aho-Corasick style search tree (I’m not totally confident such a tree can be built incrementally, but worst comes to worst you can save a bunch of “patches” and amortize the cost at retrieval time), a dbm style database for maintaining the counts and an inverted index on the text, but I’m wondering if there’s some appropriate structure more optimised for this. Anyone know of anything?
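To pin down the interface being asked for, here is a deliberately naive in-memory baseline in Python. It is not the optimised structure the question is after: adding a term rescans every document and adding a document rescans every term, where an Aho-Corasick automaton would make one pass. The class and method names (`TermIndex`, `add_term` and so on) are invented for the sketch, and counting overlapping matches is an assumption about what “count” should mean.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def count_occurrences(haystack: bytes, needle: bytes) -> int:
    """Count occurrences of needle in haystack, overlapping ones included."""
    if not needle:
        return 0
    count = 0
    start = haystack.find(needle)
    while start != -1:
        count += 1
        start = haystack.find(needle, start + 1)
    return count


class TermIndex:
    """Naive cache of (term, document) -> count for literal byte substrings."""

    def __init__(self) -> None:
        self.terms: Set[bytes] = set()
        self.documents: Dict[str, bytes] = {}
        self.counts: Dict[Tuple[bytes, str], int] = {}
        self.docs_by_term: Dict[bytes, Set[str]] = defaultdict(set)
        self.terms_by_doc: Dict[str, Set[bytes]] = defaultdict(set)

    def _record(self, term: bytes, doc_id: str) -> None:
        n = count_occurrences(self.documents[doc_id], term)
        if n:
            self.counts[(term, doc_id)] = n
            self.docs_by_term[term].add(doc_id)
            self.terms_by_doc[doc_id].add(term)

    def add_term(self, term: bytes) -> None:
        if term in self.terms:
            return
        self.terms.add(term)
        for doc_id in self.documents:  # O(total document size) per new term
            self._record(term, doc_id)

    def add_document(self, doc_id: str, text: bytes) -> None:
        if doc_id in self.documents:
            raise ValueError(f"document {doc_id!r} already indexed")
        self.documents[doc_id] = text
        for term in self.terms:  # one scan of the text per known term
            self._record(term, doc_id)

    def count(self, term: bytes, doc_id: str) -> int:
        return self.counts.get((term, doc_id), 0)

    def all_counts(self) -> List[Tuple[bytes, str, int]]:
        return [(term, doc, n) for (term, doc), n in self.counts.items()]
```

Lookups and the full listing are cheap here because everything is precomputed; all the cost sits in the two `add_*` methods, which is exactly where a cleverer structure would have to earn its keep.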

This entry was posted in Code.