I’ve been hacking on a little side project this weekend. It’s still very much a work in progress, but I’d like to tell you a bit about the process of making it, because it’s always fun to show off your baby even if it’s an ugly one. Also, some of the implementation details may be of independent interest.
What is it? It’s a set of bindings for calling jq from python. I’m a huge fan of jq, I quite like python, and I thought I’d see if I could get my two friends to be buddies. Well, their friendship is still a pretty tentative one, but they seem to be getting along.
Unfortunately, I discovered while writing this blog post that a similar thing already exists. It takes quite a different approach, and I think I prefer mine (although I confess the Cython does look rather nicer than my C binding). At any rate, I learned a moderate amount about jq and about writing more complicated C bindings by doing this, so I’m perfectly OK with having done the work anyway.
This is especially true because my actual initial interest was in seeing if I could use jq inside of postgres, but as I’m unfamiliar with both the jq internals and with writing postgres C extensions, I figured something that would let me familiarise myself with one at a time might be helpful. I’ve recently been writing a bunch of python code that uses ctypes to call C libraries I’ve written, so this seemed like a natural choice.
I initially prototyped this by just calling the executable. This worked a treat and took me about 10 minutes. Binding to the C library took… longer.
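For the curious, that first prototype was essentially just this sort of thing – a reconstruction rather than the actual code, and it obviously requires the jq binary to be on your PATH:

```python
import json
import shutil
import subprocess


def run_jq(program, values):
    """Shell out to the jq executable: feed it newline-separated JSON
    values on stdin, collect compact (-c) JSON results from stdout."""
    stdin = "\n".join(json.dumps(v) for v in values)
    result = subprocess.run(
        ["jq", "-c", program],
        input=stdin,
        capture_output=True,
        text=True,
        check=True,
    )
    return [json.loads(line) for line in result.stdout.splitlines()]
```

It’s hard to beat that for ten minutes of work, which is rather the point: the pain described below is all about doing the same thing in-process.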
The problem is that libjq isn’t really designed as a library. It’s essentially the undocumented extracted internals of a command line program. As such, this is more or less how I’ve had to treat it. In particular my primary entry point essentially emulates the command line program.
There are three interesting files in my bindings:
- pyjq/__init__.py contains the high level API for accessing jq from python. Basically you can define filters which transform iterables and that’s it.
- pyjq/binding.py contains the lower level bindings to jq.
- pyjq/compat.c is where a lot of the real work is. I’ve essentially created a wrapper that emulates the jq command line utility’s main function.
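To give a feel for the shape of that high-level API, here’s a pure-python stand-in – the real thing of course hands the program to jq rather than interpreting it in python, and the actual pyjq names may differ:

```python
def make_filter(program):
    """Illustrative stand-in for a jq-backed filter: it only supports
    trivial '.key' programs, purely to show the iterable-in,
    iterable-out shape of the API described above."""
    key = program.lstrip(".")

    def apply(values):
        for value in values:
            yield value[key]

    return apply
```

So a filter is just a function from an iterable of values to an iterable of results, and that really is all the high-level API amounts to.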
The basic implementation strategy is this:
We define a jq_compat object which maintains a jq_state plus a bunch of other state necessary for the binding. We’re essentially defining a complete wrapper API in C, with which we then interact as lightly as possible from python.
Among the things the compat object is responsible for is maintaining two growable buffers, output and error. These are (slightly patched) vpools, used in a way that looks suspiciously like stdout and stderr. When the binding emits data, it writes lines of JSON to its output buffer. When errors occur, it writes error messages to its error buffer (it also sets a flag saying that an error has occurred).
There isn’t a vpool for stdin – I initially had one, but then realised it was completely unnecessary and took it out. The jq CLI essentially loops over stdin in blocks and feeds them to a parser instance, which may then write to stdout and stderr. In the binding we invert this control: we maintain a parser on the compat object, which we feed via the method jq_compat_write (unlike the jq CLI, we always write complete blocks to it). This causes it to immediately parse the input values, which get passed to the jq_state. We then check whether this causes the state to emit any objects, and if so write them to the output buffer.
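The inverted control flow is easiest to see in a pure-python sketch – the real parsing happens in C via jq’s parser and the values go through a jq_state, whereas here I just use python’s JSON decoder and record the values:

```python
import json


class ParserSketch:
    """Sketch of the inverted control flow: each write() receives a
    complete block, parses every JSON value in it, and appends the
    results to an output buffer (standing in for the vpool the real
    binding writes to after running values through the jq_state)."""

    def __init__(self):
        self.decoder = json.JSONDecoder()
        self.output = []

    def write(self, block):
        block = block.strip()
        pos = 0
        while pos < len(block):
            # raw_decode parses one value and tells us where it ended,
            # so concatenated values parse incrementally.
            value, pos = self.decoder.raw_decode(block, pos)
            self.output.append(value)  # the real binding runs jq here
            while pos < len(block) and block[pos].isspace():
                pos += 1
```

The point is simply that the caller pushes data in and results accumulate, rather than the parser pulling from stdin.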
On the python side, what happens is we call jq.write(some_object). This gets serialised to JSON (I’d like to provide a more compact representation for passing between the python and C sides of the binding, but given that jq is primarily about JSON and the python JSON library is convenient, this seemed a sensible choice) and handed to the compat object’s write method, with newlines separating values.
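In sketch form the write path is just this, with compat_write standing in for the ctypes-bound C entry point:

```python
import json


def write_value(compat_write, obj):
    """Serialise a python value to JSON and hand the C side one
    complete, newline-terminated value. compat_write is a stand-in
    for the ctypes-bound jq_compat_write."""
    compat_write(json.dumps(obj).encode("utf-8") + b"\n")
```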
From the python side, when we want to iterate, we first read everything pending from the compat’s output buffer. We split this into lines, JSON deserialise each one, and put the results in a FIFO queue implemented as a simple rotating array (this is so that we can resume iteration if we break midway through).
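A sketch of that read side, with the C buffer replaced by a callable that drains pending output – names are illustrative, and I’ve used a deque where the real thing is a rotating array:

```python
import json
from collections import deque


class OutputReader:
    """Drain whatever has been written to the output buffer, parse the
    newline-separated JSON values, and queue them so that iteration
    can be broken off and resumed without losing values."""

    def __init__(self, read_pending):
        self.read_pending = read_pending  # drains the pending output
        self.queue = deque()

    def __iter__(self):
        return self

    def __next__(self):
        if not self.queue:
            for line in self.read_pending().splitlines():
                if line.strip():
                    self.queue.append(json.loads(line))
        if not self.queue:
            raise StopIteration
        return self.queue.popleft()
```

Because already-parsed values sit in the queue rather than being consumed from the buffer one at a time, breaking out of a loop and starting a new one picks up exactly where you left off.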
That’s pretty much it from an implementation point of view, but there’s one clever (and slightly silly) thing I’d like to mention: the bindings allow you to intercept calls to the C library and generate a C program from them. This is used by the test framework. First the python tests run, and as part of this they generate a C program which performs the same sequence of C calls. This is then compiled and run under valgrind. Why? Well, mostly because it makes it much easier to isolate errors – I’ve never had much luck with running python under valgrind, even with suppressions, and this circumvents that. It also helps to confirm whether a problem is caused by crossing the python/C boundary.
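A sketch of how that call recording might look – the recorded function names and the generated C boilerplate here are purely illustrative, not the actual pyjq test harness:

```python
class CallRecorder:
    """Record a sequence of calls into the C binding and replay them
    as a standalone C program, suitable for compiling and running
    under valgrind. Function names are illustrative."""

    def __init__(self):
        self.lines = []

    def record(self, func_name, *args):
        c_args = ", ".join(self._c_literal(a) for a in args)
        self.lines.append(f"    {func_name}(compat, {c_args});")

    @staticmethod
    def _c_literal(arg):
        if isinstance(arg, str):
            escaped = arg.replace("\\", "\\\\").replace('"', '\\"')
            return f'"{escaped}"'
        return str(arg)

    def c_program(self):
        body = "\n".join(self.lines)
        return (
            '#include "compat.h"\n\n'
            "int main(void) {\n"
            "    jq_compat *compat = jq_compat_new();\n"
            f"{body}\n"
            "    jq_compat_free(compat);\n"
            "    return 0;\n"
            "}\n"
        )
```

Each intercepted call appends one line of C, so by the end of a python test run you have a complete main() reproducing exactly what the binding did, minus python.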
I don’t know if this is a project I’m going to continue with – I’m not sure it’s especially useful, even setting aside the existing bindings – but it’s been an interesting experiment.