Category Archives: Code

Sleep as an Droid export data

This is possibly the single most niche piece of software I have ever written…

I use a sleep tracker called “Sleep as an Droid”. I wish to be able to get data out of it. Fortunately, it has a data export feature. Unfortunately, its data export format is a bit bizarre (it’s CSV, Jim, but not as we know it). So I decided to convert it to a tolerable-structured JSON output before using it, and have put it on the web for anyone to use. source is also available on github if you care.

This entry was posted in Code on by .

Playing with a new language: Clay

There’s been a distinct lack of interesting language related posts around here recently.

Why? Well because I’ve been mostly writing Ruby and C. Neither of which languages fills me with an urge to post HEY GUYS, I FOUND THIS COOL THING. Ruby because I vaguely dislike it, so I resent even the kindof cool things I find as I use it, C because my use of C tends towards the extremely straightforward so C posts tend to be about the result rather than the method (arguably that’s what programming posts should be like anyway, but that’s a separate issue).

But anyway, I’ve been window shopping around for a new language to learn. My list of features I wanted in said language were:

  1. Good performance
  2. Higher level than C
  3. Preferably statically typed
  4. Generic collections
  5. Didn’t run on the JVM

The closest to this on the list of languages I already knew was Haskell, and I’ve been using it a little bit, but it and I don’t really get on for a variety of reasons that are beyond the scope of this blog post.

I was pretty much thinking it was going to be O’Caml (which is another language in the “vaguely dislike” category. I want to like it, I really do, but somehow I can never manage) and C++ (oh dear god no).

So, anyway, my ears perked up when clay came up on reddit the other day. I had already been vaguely aware of it, but somehow it dropped off my radar before I did anything about it.

So I’ve been spending today playing with it, and the results are two little projects: An interval tree implementation and a very rudimentary start on some client bindings to PostgreSQL. Neither are written with any particular purpose in mind, ‘though both are something I might want to use at some point, it’s really just a way to get a feel for the language.

I’m pleasantly impressed. Is it going to be the Next Big Language? Who knows. Certainly not me. But despite a lot of stupid questions on my part it was a pleasant experience, and I’m definitely going to continue playing with it.

What follows are some initial impressions. Bear in mind it’s the result of only a day of playing with it, so they may be incomplete or outright wrong:

The Good

Here are some of the things I liked:

Good support for generic programming

Like it says on the label really. This is one of the things the language bills itself as being good at, so if it weren’t it would be dead on launch.

I’ve had a reasonable play around with it, and it works well. The entire system is basically a giant (mostly) statically resolved multimethod system. Given my fondness for multimethods, that’s a plus as far as I’m concerned. Runtime polymorphism comes in through an extensible variant system: You can declare variants in one module and add new ones in another.

Anyway, the result is that there are generic collections and they Just Work, exceptions make use of the extensible variant system and seem well behaved.

There’s no object orientation support and no subtyping. As far as I’m concerned, that’s a good thing.

Decent FP primitives

So I haven’t actually had much of a play with these yet believe it or not, but they’re there and seem to work. lambdas don’t implement non-local return (‘though I suppose you could do this with an exception), but do capture their environment. The main place I’ve used this is in the test system.

clay bind-gen

clay bind-gen takes a C header file and gives you a clay library. It’s a bit finicky, and obviously the code is far from idiomatic, but it does work and works well. It took me about 10 minutes to get this working on the libpq header and port a C example of using it to clay. Hopefully this means that it’s rather easy to build on the available C ecosystem.

Decentish standard library

All quite basic at the moment, but it already annoys me less than Scala’s used to when it was at a much later stage in the game than this. Granted that’s because Scala could get away with it by falling back on the JVM ecosystem and clay can’t, but it’s a good start. It’s pleasantly reminiscent of the sort you’d expect with a high level language, but seems well tuned for clay’s suitability as a systems language.

Valgrind works flawlessly

As per title. Clay is an unsafe language – it has manual memory management (which, thanks to things like stack allocation and deterministic destructors, is much less of a pain than in C, but is still definitely not hard to get wrong), pointer arithmetic, no bounds checking on array access.

The only reason I trust myself to write C is because valgrind exists. I would have had a very hard time trusting myself to write clay if it didn’t work on it. Fortunately, no problems.

Friendly and intelligent IRC channel

I’m a big fan of IRC as a learning environment. I’ve been on IRC on a fairly regular basis in various channels for 10-15 years (‘though rarely any given one for more than 4 or 5. I have community attention span issues), and so the first thing I do when checking out a new tech or language is long onto their IRC channel. Clay’s does not disappoint. It’s small, but there are a bunch of smart people in there, and they were very patient and helpful with my stupid questions.

The bad

Confusing semantics

Generally there’s a good reason for it, and things are still in flux so I think many of these will improve, but there have been a bunch of cases where I’ve gone “Wait, what?”. Some of these are just the result of my never having used a language with non-trivial value semantics before (tree = tree.child generates an invalid read), some of them are gotchas which you’ll probably only be caught by once but are still pretty ugly (it’s very easy to think you’ve redefined a constructor when you haven’t because the type arguments are part of the constructor’s name), and a few others that slip my mind at the moment. The semantics of how copying, moving and assignment work are not so much confusing as very easy to get wrong. I’m still not 100% sure I understand how variants interact with their members.

I think I’m largely on top of it now, and I expect it will become clearer as I write more of it, but I definitely spent a good chunk of today quite bewildered.

Basically zero documentation

Well, what do you expect? The language is young and still massively in flux. I’m not surprised there’s no documentation and, as mentioned, the community is very helpful in making up for its lack, but it definitely makes the learning process much harder.

The language is young and still massively in flux

As mentioned above. This is definitely an experimental language – the compiler seems to currently be undergoing a rewrite which will, amongst other things, be changing the backend target away from LLVM and to C. The syntax is expected to change. The semantics are expected to change. The language designers will, and should, feel free to break backwards compatibility.

Error messages

As well as the usual problems with new languages and quality of error messages, clay’s error messages, both at runtime and compile time, are very short on line numbers. This can make it very hard to track down what’s going wrong. It wasn’t drastically difficult with what I was doing today, but they were both rather small projects. I don’t know how hard it will be on larger projects, but I’m hoping that some of this (particularly compile time error messages) will be fixed by the time I have to worry about that.

Late discovery of errors

It’s not quite as bad as C++ template errors are reputed to be, but generally you don’t discover an error in a function until you try to use that function. This is somewhat deliberate, as it ties in with how the genericity works, but when combined with the line numbers problem can be quite frustrating. Additionally the support for giving the compiler hints to get better error messages is a bit rudimentary: You can define compile time functions which constrain arguments, but that’s about it.

The ugly

Keywords vs operators

Clay doesn’t have much C compatibility in terms of its operators: the bitwise operators (in particular <<, >>, |, & and ^) are replaced by functions. The shortcutting boolean operators are replaces with keywords and and or. It’s not a major problem, but it annoyed me vaguely when porting from C.

Keywords

Clay has a fair few keywords: ref, forward, lvalue, rvalue, lambda, and, or, overload, procedure, variant, record, callbyname… probably more.

It’s not a massive problem, but I tend to regard keyword surplus as a sign of insufficiently well factored features. I’ve yet to form an opinion on whether or not this is the case here.

Everything is by reference

I don’t yet know if I like this or not, but Clay’s calling convention passes everything as a mutable reference. So assigning to a function argument assigns in the calling context. Really. This can be quite useful, particularly given that copying a clay value can be a very expensive operation, but I find it deeply disconcerting. I feel like it may grow on me. We’ll see.

A few minor syntax weirdnesses

e.g. one that bugged me earlier is the syntax for return type declaration on a function: foo(a : Blargh) ReturnType. No colon on the return type, but colon on the arguments.

Conclusion

Overall, definitely more good than bad. I intend to keep playing with it and see how I feel after a while doing so.

This entry was posted in Code on by .

Reading video frame by frame with ffmpeg

So I’ve been playing around with scene detection. It’s really more of a NIH task that I’m doing for my own amusement than it is a serious tool I expect to be used, but it’s a good way to expand my knowledge of video and I have a few good ideas which don’t seem to have been used before, so it’s crazy but it Just Might Work.

One of the things I need to do for scene detection is read a video frame by frame and compare subsequent frames. My initial hacked use ffmpeg to turn the video into a sequence of images on disk, ran through them as they were generated them and deleted old ones.

As you can probably imagine this was slow, cumbersome and remarkably hard to get right.

“Oh, hey” thought I. “ffmpeg makes all this stuff available as libraries: libavformat and libavcodec. That will let me do this efficiently!”.

So I started playing around with examples and reading through the documentation. Excuse me, did I say documentation? I meant header files.

Oh.
My.
God.

I mean no (large amount of) disrespect to the authors when I say this: They have created a piece of software which, by and large, works very well. And I’m sure that a lot of the complexity of the API is essential rather than accidental if you’re, say, writing a video player rather than a dumb frame processor.

But, that being said, the contents of the header files are remarkably like getting a lecture on the botany of trees when what you want is a map out of the forest. Apparently it all makes sense if you’ve seen the mpeg4 spec. Apparently writing actual documentation would be a patent minefield. Certainly I have no clue what’s going on.

I tried basing my code on examples from the internet. Unfortunately it looks like the API has moved under them – the examples have been half patched up by other people around, but in the versions I got closest to working they appeared to be doing the wrong thing. The arguments to certain functions were suspicious, and the results were just wrong. The right thing to do might have been to fix this, but I genuinely had no idea how the code was working, so it would have been far from easy to debug it.

So, at this point I largely considered myself defeated by libav* and started thinking about other ways one could do it.

“What I really want”, I thought, “is some sort of server program where I can just feed it a file and then read the frames off in some sensible binary format. That way I’d be insulated from most of the pain of this”.

“Hey, ffmpeg can write its output to a pipe, can’t it?”

After that, the rest was history:

Step 1: Pick some binary format which is easy to read pixel RGB data out of. It will never live on disk, and ease of use speed of parsing is more important than efficiency. Easy, obvious choice: ppm. It’s basically designed for that.
Step 2: Figure out how to get ffmpeg to write a stream of ppm files to its stdout. This turns out to be easy:

Step 3: Figure out how to read a stream of ppm files from a pipe. libnetpbm to the rescue! The only minor issue I had was determining whether we were at the end of file without stomping on netpbm’s toes, so the code contains a slightly weird step where it does a getc to check if it’s at eof and then does an ungetc if we’re not. Other than that, it’s textbook netpbm processing code taken straight from the examples:

This took me all of about half an hour to figure out, after most of a day wrestling with libavcodec, and it works pretty well. The performance is decent. I don’t know how it compares to using libavcodec directly as I haven’t benchmarked (due to not having a working example with libavcodec), but it’s orders of magnitude faster than my previous file system based hack, and the code is a hell of a lot cleaner, so I’m happy.

This entry was posted in Code on by .

Filtering deleted documents with PostgreSQL rules

I’m currently working on a Mysterious Project (coming soon to an internet near you) which involves a lot of user generated content (Yes, fine, slap a 2.0 on my name and call me “still in beta”). As such, it’s got all the usual problems with user generated content. In particular it has spam.

So, we need some sort of spam filtering in place to make sure we never show spam to users. But we don’t want to delete spam from the database – partly in case of mistakes, partly because we want to use the data for automated classification of spam.

Ok, this is easy enough to do. You add a flag “spam” to the table and don’t show the user anything flagged as spam.

The problem here is that this content gets used in all sorts of contexts, and it’s really annoying to have to add “where != spam” here.

No problem. We create a view. That’s what they’re for.

But this is slightly annoying: Basically all our access to content goes through this view, but modifications to content have to go through the original table. It would be really nice if we could have all updates and inserts going to the same thing we access the data from. “really nice” is partly aesthetic, but there’s also a boring practical reason: We’re using an ORM (ActiveRecord in fact. Sigh), and we’d like the ORM to access the filtered version, but we’d also like to be able to update the same objects.

Hang on. We’re using PostgreSQL. There’s an app… err. feature for that.

PostgreSQL has a feature called rules which allow you to change the meaning of various operations on a table (views in PostgreSQL are also tables). We can use these to make our view updateable. Let’s see how.

We’ll start with a slightly abstracted version of the problem. Instead of thinking about spam filtering we’ll concern ourselves with deleting posts. We want to retain the old posts but not show them:

david=# create sequence post_ids;
CREATE SEQUENCE
david=# create table unfiltered_posts(id int primary key default nextval('post_ids'), 
david(#                                        body text, 
david(#                                        deleted boolean not null default false);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "unfiltered_posts_pkey" for table "unfiltered_posts"
CREATE TABLE

So we first create our view that will be the posts which have not been deleted:

david=# create view posts as select * from unfiltered_posts where not deleted;
CREATE VIEW

Making sure everything’s working as expected:

david=# select * from posts;
 id | body | deleted 
----+------+---------
(0 rows)

david=# insert into unfiltered_posts(body, deleted) values('I like kittens', false);
INSERT 0 1

david=# insert into unfiltered_posts(body, deleted) values('I don''t like kittens', true);
INSERT 0 1

david=# select * from posts;
 id |      body      | deleted 
----+----------------+---------
  1 | I like kittens | f
(1 row)

david=# select * from unfiltered_posts ;
 id |         body         | deleted 
----+----------------------+---------
  1 | I like kittens       | f
  2 | I don't like kittens | t
(2 rows)

So all working as expected: unfiltered_posts marked deleted don’t show up in the view.

But of course this was the bit we already knew how to do. What doesn’t work is inserting into the view:

david=# insert into posts(body) values('I am the very model of a modern major general');
ERROR:  cannot insert into a view
HINT:  You need an unconditional ON INSERT DO INSTEAD rule.

Indeed it doesn’t work. But it does give us a nice hint of what to do next.

david=# create or replace rule insert_into_posts as on insert to posts do instead insert into unfiltered_posts(body) values(NEW.body);
CREATE RULE

So, now we can insert into the view:

david=# insert into posts(body) values('I am the very model of a modern major general');
INSERT 0 1
david=# select * from posts;
 id |                     body                      | deleted 
----+-----------------------------------------------+---------
  1 | I like kittens                                | f
  4 | I am the very model of a modern major general | f

This works, but I find it a bit ugly. The problem here is that you have to explicitly enumerate the fields in order for this to work. I couldn’t find a terribly satisfactory solution unfortunately. So if someone is reading this who knows more about postgresql than I do I’d love go get some hints.

The following does work as an alternative:

david=# create or replace rule insert_into_posts as on insert to posts do instead insert into unfiltered_posts values(NEW.*);
CREATE RULE

But the problem is that it plays badly with the defaults. If we try this we get:

david=# insert into posts(body) values('I am the very model of a modern major general');
ERROR:  null value in column "id" violates not-null constraint

The problem is that inserting null into a not-null column doesn’t replace null with the default value. It would be nice if it did as that would make this easy, but oh well (this isn’t postgresql specific behaviour. I’m not aware of any database where inserting null into a not null default blah column will work. Certainly MySQL does the same thing). You could probably make this work with a before insert or update trigger, but that’s a little gross.

An alternative version which offers slightly better functionality but still requires you to explicitly enumerate the columns in the rule is the following:

david=# create or replace rule insert_into_posts as on insert to posts do instead insert into unfiltered_posts values(coalesce(NEW.id, nextval('post_ids')), NEW.body, coalesce(NEW.deleted, false));
CREATE RULE
david=# insert into posts(body) values('I''ve information vegetable, animal, and mineral');
INSERT 0 1
david=# select * from posts;
 id |                      body                       | deleted 
----+-------------------------------------------------+---------
  1 | I like kittens                                  | f
  3 | I am the very model of a modern major general   | f
  4 | I've information vegetable, animal, and mineral | f

This requires us to duplicate the defaults as well as the columns, which is rather annoying, but at least it works satisfactorily (note: Some of you will complain that I didn’t explicitly enumerate the columns in the insert into. This is deliberate – the view will break if I change the table structure in any interesting way. If I explicitly enumerated the column names it would instead silently do the wrong thing).

So, this works. We can do the same on update:

 


david=#   create or replace rule update_to_posts 
david-#   as on update to posts 
david-#   do instead 
david-#      update unfiltered_posts 
david-#      set id = coalesce(NEW.id, OLD.id), 
david-#           body = coalesce(NEW.body, OLD.body), 
david-#           deleted = coalesce(NEW.deleted, OLD.deleted) 
david-#      where id = OLD.id;
CREATE RULE

david=# update posts set deleted = true where id = 4;
UPDATE 1
david=# select * from posts;
 id |                     body                      | deleted 
----+-----------------------------------------------+---------
  1 | I like kittens                                | f
  3 | I am the very model of a modern major general | f
(2 rows)

david=# select * from unfiltered_posts;
 id |                      body                       | deleted 
----+-------------------------------------------------+---------
  1 | I like kittens                                  | f
  2 | I don't like kittens                            | t
  3 | I am the very model of a modern major general   | f
  4 | I've information vegetable, animal, and mineral | t
(4 rows)

So now updating things in posts works. Note that if we try to update a filtered post it will not work:

david=# update posts set body = 'kittens' where id = 4;
UPDATE 0
david=# select * from unfiltered_posts ;
 id |                      body                       | deleted 
----+-------------------------------------------------+---------
  1 | I like kittens                                  | f
  2 | I don't like kittens                            | t
  3 | I am the very model of a modern major general   | f
  4 | I've information vegetable, animal, and mineral | t
(4 rows)

And, finally, we want to hook deletion into it. Obviously we don’t want deletion to delete things from the underlying table but instead to set their deleted flag to be false:

david=# create or replace rule delete_posts 
david-# as on delete to posts do instead 
david-# update unfiltered_posts 
david-# set deleted = true where id = OLD.id;
CREATE RULE
david=# select * from posts;
 id |                     body                      | deleted 
----+-----------------------------------------------+---------
  1 | I like kittens                                | f
  3 | I am the very model of a modern major general | f
(2 rows)

david=# delete from posts where id = 3;
DELETE 0
david=# select * from posts;
 id |      body      | deleted 
----+----------------+---------
  1 | I like kittens | f
(1 row)

david=# select * from unfiltered_posts;
 id |                      body                       | deleted 
----+-------------------------------------------------+---------
  1 | I like kittens                                  | f
  2 | I don't like kittens                            | t
  4 | I've information vegetable, animal, and mineral | t
  3 | I am the very model of a modern major general   | t
(4 rows)

So there we have it: A view which we can insert into, update and delete. Despite the slight annoyances around default values, this is definitely a really neat feature. I look forward to exploring its use.

If you want to have a play with this, I’ve created a gist containing the table, view and rules.

This entry was posted in Code, SQL on by .

Dear lazyweb: A problem on cached string searching

So, I have the following problem:

I have a bunch of terms and a bunch of documents. I want to maintain a database of which terms appear in which documents, and ideally a count of how many times those terms appear in the document. “appear” can mean a literal byte substring – no need for any sort of fuzzy matching. I want the following operations to be fast:

  • Adding a term
  • Adding a document
  • Getting a list of all (Term, Document, Count) pairs
  • Getting the count for a specific (Term, Document) pair

Bonus features:

  • Fast removal of documents and/or terms
  • Retrieval of all documents containing a given term or all terms in a given document

It’s clearly relatively easy to construct something like this using a combination of an aho-corasick style search tree (I’m not totally confident such can be built incrementally, but worst comes to worst you can save a bunch of “patches” and amortize the cost at retrieval time), a dbm style database for maintaining the counts and an inverted index on the text, but I’m wondering if there’s some appropriate structure more optimised for this. Anyone know of anything?

This entry was posted in Code on by .