Archive for December, 2010

Reading video frame by frame with ffmpeg

Tuesday, December 28th, 2010

So I’ve been playing around with scene detection. It’s really more of a NIH task that I’m doing for my own amusement than it is a serious tool I expect to be used, but it’s a good way to expand my knowledge of video and I have a few good ideas which don’t seem to have been used before, so it’s crazy but it Just Might Work.

One of the things I need to do for scene detection is read a video frame by frame and compare subsequent frames. My initial hack used ffmpeg to turn the video into a sequence of images on disk, ran through them as they were generated and deleted old ones.

As you can probably imagine this was slow, cumbersome and remarkably hard to get right.

“Oh, hey” thought I. “ffmpeg makes all this stuff available as libraries: libavformat and libavcodec. That will let me do this efficiently!”.

So I started playing around with examples and reading through the documentation. Excuse me, did I say documentation? I meant header files.

Oh.
My.
God.

I mean no (large amount of) disrespect to the authors when I say this: They have created a piece of software which, by and large, works very well. And I’m sure that a lot of the complexity of the API is essential rather than accidental if you’re, say, writing a video player rather than a dumb frame processor.

But, that being said, the contents of the header files are remarkably like getting a lecture on the botany of trees when what you want is a map out of the forest. Apparently it all makes sense if you’ve seen the mpeg4 spec. Apparently writing actual documentation would be a patent minefield. Certainly I have no clue what’s going on.

I tried basing my code on examples from the internet. Unfortunately it looks like the API has moved under them – the examples have been half patched up by other people, but in the versions I got closest to working they appeared to be doing the wrong thing. The arguments to certain functions were suspicious, and the results were just wrong. The right thing to do might have been to fix this, but I genuinely had no idea how the code was working, so it would have been far from easy to debug it.

So, at this point I largely considered myself defeated by libav* and started thinking about other ways one could do it.

“What I really want”, I thought, “is some sort of server program where I can just feed it a file and then read the frames off in some sensible binary format. That way I’d be insulated from most of the pain of this”.

“Hey, ffmpeg can write its output to a pipe, can’t it?”

After that, the rest was history:

Step 1: Pick some binary format which is easy to read RGB pixel data out of. It will never live on disk, so ease of use and speed of parsing are more important than space efficiency. Easy, obvious choice: ppm. It’s basically designed for that.
Step 2: Figure out how to get ffmpeg to write a stream of ppm files to its stdout. This turns out to be easy:

Step 3: Figure out how to read a stream of ppm files from a pipe. libnetpbm to the rescue! The only minor issue I had was determining whether we were at the end of file without stomping on netpbm’s toes, so the code contains a slightly weird step where it does a getc to check if it’s at eof and then does an ungetc if we’re not. Other than that, it’s textbook netpbm processing code taken straight from the examples:
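
For anyone who wants to try steps 2 and 3 without libnetpbm, here is a minimal sketch in Python rather than the C described above. It makes some assumptions that are not from the post: that ffmpeg's image2pipe output format with the ppm encoder is what produces the stream, that every frame is a binary ("P6") ppm with a maxval of at most 255, and the function names are purely illustrative. Reading an empty magic number stands in for the getc/ungetc end-of-file check.

```python
import subprocess
from typing import BinaryIO, Iterator, List, Tuple


def ffmpeg_ppm_command(path: str) -> List[str]:
    # Assumed invocation (not taken from the post): write every decoded
    # frame to stdout as a binary PPM image.
    return ["ffmpeg", "-i", path, "-f", "image2pipe", "-vcodec", "ppm", "-"]


def open_video(path: str) -> subprocess.Popen:
    # Hypothetical helper: start ffmpeg with its stdout connected to a pipe.
    return subprocess.Popen(ffmpeg_ppm_command(path), stdout=subprocess.PIPE)


def _read_token(stream: BinaryIO) -> bytes:
    """Read one whitespace-delimited header token, skipping '#' comments."""
    token = b""
    while True:
        ch = stream.read(1)
        if ch == b"":
            if token:
                return token
            raise EOFError("stream ended inside a PPM header")
        if ch == b"#" and not token:
            # A comment runs to the end of the line.
            while ch not in (b"\n", b""):
                ch = stream.read(1)
            continue
        if ch.isspace():
            if token:
                return token  # exactly one whitespace byte ends the token
            continue
        token += ch


def read_frames(stream: BinaryIO) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (width, height, rgb_bytes) for each P6 image in the stream."""
    while True:
        magic = stream.read(2)
        if magic == b"":
            return  # clean end of stream between two frames
        if magic != b"P6":
            raise ValueError("expected a binary PPM (P6) header")
        width = int(_read_token(stream))
        height = int(_read_token(stream))
        maxval = int(_read_token(stream))
        if maxval > 255:
            raise ValueError("only one byte per colour sample is handled")
        size = width * height * 3
        pixels = stream.read(size)
        if len(pixels) != size:
            raise EOFError("stream ended inside a frame")
        yield width, height, pixels
```

Usage would be along the lines of `for width, height, pixels in read_frames(open_video("input.mov").stdout): ...`, comparing each frame against the previous one as it arrives.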

This took me all of about half an hour to figure out, after most of a day wrestling with libavcodec, and it works pretty well. The performance is decent. I don’t know how it compares to using libavcodec directly as I haven’t benchmarked (due to not having a working example with libavcodec), but it’s orders of magnitude faster than my previous file system based hack, and the code is a hell of a lot cleaner, so I’m happy.

Dear Commenters: Frankly, I’m tired of you

Friday, December 24th, 2010

Hi there,

I see you want to write a comment on my blog. That’s awesome! It’s really great that after reading the article you have interesting and constructive feedback to deliver.

…oh, you didn’t read the article? Just the title? That’s fine I suppose. I’m sure you’ve got something useful to contribute anyway.

Not so much, huh?

Every time I get a new email saying “There’s a new comment on your blog” I wince and think “I wonder what this guy has got wrong”. Maybe some of it is my fault – there have definitely been cases where people who I believe have actually read the article have been confused as to the point – but I’ve seen enough of it on other people’s blogs to know that actually for the most part this is just what people on the internet are like.

I don’t care if people disagree with me if they do so reasonably. Bob knows I’m often wrong. But every time I have to deal with someone who hasn’t read the article, or who is more interested in setting up strawmen, or who will come up with an entirely new “reason” why I am wrong which contradicts all the previous ones he gave that I have just spent time refuting, I feel sad inside and become a little less likely to write articles.

So, I’m done. New posts will not have comments enabled. Old posts will have comments disabled as and when I have reason to do so. If you want to respond to my post, you can do it by one of the myriad communication and discussion mediums the internet has to offer: I’m easily available on twitter or email, and you can show the entire world how clever you are in the discussions on reddit, ycombinator or the like if it’s there.

To those of you who wrote worthwhile comments, of which there were a small but vocal minority, thanks. It was appreciated. Sorry to take this away from you.

Removing silent tracks from a video: A bug and the hack that killed it

Saturday, December 18th, 2010

So, as you might have gathered, I now work for a company called Aframe (lower case f. Very fussy about that. As if I didn’t spend enough time correcting people’s usage of my name). We do video stuff.

Of course, muggins here gets to be the one responsible for an awful lot of that video stuff. We have in house expertise, but they’re largely people with a huge amount of video experience and no programming experience. So this has been a bit of a learning experience for me, and it’s far from over yet.

Anyway, one of the things we do is convert all the video we get into a version you can view on the website. It’s not really our major selling feature, but it’s the basic feature on which we build a hell of a lot of other stuff.

We’ve had this collection of video from a particular customer for a while with the embarrassing feature that the version we were producing for the website had no sound. I couldn’t figure it out at all – the original had 8 audio channels, which seemed likely to be the source of the problem, but no matter which one I played in vlc it was always silent.

Actually… turns out that was just vlc lying to us and not switching channels. Sigh. (I’d file a bug report, but channel switching clearly works in other contexts and we can’t share the video, so I don’t know how best to do that).

We revisited this yesterday and verified that actually there are perfectly good audio tracks on channels 3 and 4, but the other 6 are silent and we were defaulting to using tracks 1 and 2. The camera in question has a variety of different ways of taking audio in, and which tracks are non-silent depends on which audio inputs you use. Of course, all the channels are there regardless of what you do, and there’s no metadata (that I could find) marking which ones contain sound.

It’s easy enough for our transcode process to select the right tracks, but combining the audio tracks is harder (some transcoders support it, some don’t). So ideally what we would do is just remove the silent audio tracks (you probably guessed this from the title).

But how do you do that? There’s no metadata saying which are silent, so you have to figure out which audio tracks are silent and which contain sound. In an automated way that is – listening to it is no good – and with the audio tracks in an arbitrary codec.

I pondered this for a little while and came up with a solution. It’s duct tape programming in the extreme, but actually it works very nicely:

We have no metadata. So we have to use the data. We don’t know what the format is, so we have to convert it to a standard one. What standard format would make it really easy to detect whether a track is silent? Well… an uncompressed one. Pick a random simple uncompressed audio format? Let’s try wav.

It turns out that detecting a silent wav file is trivially easy: It consists overwhelmingly of 0 bytes. There’s a brief header at the beginning, and then the rest of the file is just a big long chain of 0s. So for each audio track we generate a wav, check what percentage of it is 0 bytes, and if it’s >= 99.9% we declare it to be silent.
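
A minimal sketch of that check, in Python. This is not the code in the repo; the function names are made up for illustration, and it assumes the wav holds signed PCM samples (16-bit, say), where silence really is all zero bytes. An 8-bit unsigned wav encodes silence as 0x80, so it would need a different test.

```python
from typing import BinaryIO

# The ">= 99.9%" cut-off described above.
SILENCE_THRESHOLD = 0.999


def zero_byte_fraction(stream: BinaryIO, chunk_size: int = 1 << 16) -> float:
    """Return the fraction of bytes in the stream that are zero."""
    zeros = 0
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        zeros += chunk.count(b"\x00")
        total += len(chunk)
    # An empty stream is treated as "not silent" rather than dividing by zero.
    return zeros / total if total else 0.0


def is_silent(stream: BinaryIO, threshold: float = SILENCE_THRESHOLD) -> bool:
    """Declare a track silent if nearly all of its wav bytes are zero."""
    return zero_byte_fraction(stream) >= threshold
```

You would call it as `is_silent(open("track3.wav", "rb"))` on each extracted track. The header bytes are counted along with everything else, which is why the threshold sits just below 100%.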

Once we’ve determined which tracks are silent it’s then a simple matter (admittedly a simple matter which took me a reasonable amount of wrestling with command line options and asking for help on IRC) to use ffmpeg to cut out the silent tracks. The invocation looks approximately like:

ffmpeg -i original.mov -y -ac 2 -map 0.0 -map 0.3 -map 0.4 -acodec copy -vcodec copy output.mov -newaudio

-ac sets the number of audio channels the output should have, the map options say which channels from the source to use (the order is significant) and -newaudio says “Don’t fail in mysterious and incomprehensible ways as a result of my having tried to change the audio channels”.
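
If you want to generate that invocation from the list of non-silent tracks, something like the following Python sketch would do it. The function name and parameters are hypothetical, and it simply mirrors the option syntax shown above; the one guess it makes is that you need one -newaudio per audio track beyond the first, which matches the example but which I haven't checked for other track counts. Newer ffmpeg releases may not accept these options in this form.

```python
from typing import List, Sequence


def build_remove_silent_command(
    src: str,
    dst: str,
    video_stream: int,
    audio_streams: Sequence[int],
    channels: int = 2,
) -> List[str]:
    """Build an ffmpeg argument list that keeps only the given streams."""
    if not audio_streams:
        raise ValueError("at least one non-silent audio stream is required")
    cmd = ["ffmpeg", "-i", src, "-y", "-ac", str(channels)]
    # Order matters: the video stream first, then each audio stream to keep.
    for index in [video_stream, *audio_streams]:
        cmd += ["-map", f"0.{index}"]
    cmd += ["-acodec", "copy", "-vcodec", "copy", dst]
    # Assumption: one -newaudio for every audio stream after the first.
    cmd += ["-newaudio"] * (len(audio_streams) - 1)
    return cmd
```

Calling `build_remove_silent_command("original.mov", "output.mov", 0, [3, 4])` reproduces the argument list of the command above, ready to hand to `subprocess.run`.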

In the unlikely event that anyone who isn’t us will find this useful, I’ve started collecting the results of experiments like this into a repo on github. The code for this post is there as “removesilent”. It’s not the exact code we’re using at work (as that’s better integrated into our system), but it’s pretty close.