So I’ve been playing around with scene detection. It’s really more of an NIH project I’m doing for my own amusement than a serious tool I expect anyone to use, but it’s a good way to expand my knowledge of video, and I have a few good ideas which don’t seem to have been tried before, so it’s crazy but it Just Might Work.
One of the things I need to do for scene detection is read a video frame by frame and compare subsequent frames. My initial hack used ffmpeg to turn the video into a sequence of images on disk, read through them as they were generated, and deleted the old ones.
As you can probably imagine, this was slow, cumbersome and remarkably hard to get right.
“Oh, hey,” thought I. “ffmpeg makes all this stuff available as libraries: libavformat and libavcodec. That will let me do this efficiently!”
So I started playing around with examples and reading through the documentation. Excuse me, did I say documentation? I meant header files.
Oh.
My.
God.
I mean no (large amount of) disrespect to the authors when I say this: They have created a piece of software which, by and large, works very well. And I’m sure that a lot of the complexity of the API is essential rather than accidental if you’re, say, writing a video player rather than a dumb frame processor.
But, that being said, the contents of the header files are remarkably like getting a lecture on the botany of trees when what you want is a map out of the forest. Apparently it all makes sense if you’ve seen the MPEG-4 spec. Apparently writing actual documentation would be a patent minefield. Certainly I have no clue what’s going on.
I tried basing my code on examples from the internet. Unfortunately it looks like the API has moved out from under them – the examples have been half patched up by other people since, but in the versions I got closest to working they appeared to be doing the wrong thing. The arguments to certain functions were suspicious, and the results were just wrong. The right thing to do might have been to fix this, but I genuinely had no idea how the code was working, so it would have been far from easy to debug.
So, at this point I largely considered myself defeated by libav* and started thinking about other ways one could do it.
“What I really want”, I thought, “is some sort of server program where I can just feed it a file and then read the frames off in some sensible binary format. That way I’d be insulated from most of the pain of this”.
…
“Hey, ffmpeg can write its output to a pipe, can’t it?”
After that, the rest was history:
Step 1: Pick some binary format which is easy to read pixel RGB data out of. It will never live on disk, so ease and speed of parsing matter more than compactness. Easy, obvious choice: ppm. It’s basically designed for that.
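(For the curious: a binary ppm frame is nothing more than a short text header followed by raw RGB bytes, so “parsing” it barely deserves the name. A 640×480 frame looks like this, with the dimensions obviously just illustrative:)

    P6
    640 480
    255
    <640 * 480 * 3 bytes of raw RGB data, one byte per channel>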
Step 2: Figure out how to get ffmpeg to write a stream of ppm files to its stdout. This turns out to be easy:
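Something along these lines does the trick (with input.avi standing in for whatever file you’re actually processing; the trailing - sends the output to stdout):

    ffmpeg -i input.avi -f image2pipe -vcodec ppm -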
Step 3: Figure out how to read a stream of ppm files from a pipe. libnetpbm to the rescue! The only minor issue I had was determining whether we were at the end of the file without stomping on netpbm’s toes, so the code contains a slightly weird step where it does a getc to check if it’s at EOF and then an ungetc if we’re not. Other than that, it’s textbook netpbm processing code taken straight from the examples:
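Here’s a rough sketch of that loop, assuming libnetpbm’s classic ppm_readppm() interface, with the actual frame comparison elided:

    #include <stdio.h>
    #include <ppm.h>   /* libnetpbm; on some systems this is <netpbm/ppm.h> */

    int main(int argc, char *argv[]) {
        int cols, rows;
        pixval maxval;

        ppm_init(&argc, argv);

        for (;;) {
            /* Peek one byte to check for end of stream, then put it
               back so netpbm's parser sees the stream untouched. */
            int c = getc(stdin);
            if (c == EOF)
                break;
            ungetc(c, stdin);

            /* Read one whole frame as a rows x cols array of pixels. */
            pixel **pixels = ppm_readppm(stdin, &cols, &rows, &maxval);

            /* Frame comparison goes here. Channel values come out via
               PPM_GETR / PPM_GETG / PPM_GETB on pixels[row][col]. */

            ppm_freearray(pixels, rows);
        }
        return 0;
    }

Compile against -lnetpbm and pipe the ffmpeg command above into it, and it chews through frames one at a time.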
This took me all of about half an hour to figure out, after most of a day wrestling with libavcodec, and it works pretty well. The performance is decent. I don’t know how it compares to using libavcodec directly as I haven’t benchmarked (due to not having a working example with libavcodec), but it’s orders of magnitude faster than my previous file system based hack, and the code is a hell of a lot cleaner, so I’m happy.