Updates to intmap

So in my previous post on intmap I said that I thought the API was likely to be pretty stable, as I’d put a bunch of thought into it. Then I completely changed the memory ownership model. Lol, programmers, eh?

Previously, when the API returned an intmap, that intmap was your problem: you had to explicitly release it and nothing else would do it for you. I’ve changed this – now everything will do that for you and you have to explicitly ask it not to. Basically, every operation that creates a new intmap from other intmaps will release its arguments, so if you want to keep a copy of an argument around you have to explicitly acquire it before passing it in.
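In rough terms it looks like this (intmap_union and intmap_singleton are the real functions from the example further down; intmap_acquire and intmap_release are stand-ins for whatever the refcounting calls are actually named):

  /* Sketch only: intmap_acquire / intmap_release are placeholder
   * names for the refcount operations. */
  intmap base = intmap_singleton(a, 1UL, 2UL, "xy");

  /* intmap_union consumes both of its intmap arguments, so take an
   * extra reference to base if we want to go on using it. */
  intmap u = intmap_union(a, intmap_acquire(base),
                             intmap_singleton(a, 2UL, 2UL, "zw"));

  /* base is still valid thanks to the acquire; both it and u still
   * need an explicit release eventually. */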

Why does it do this?

Well, two reasons. The main one is that it lets the code tell whether it owns its arguments and can therefore mutate them without anyone noticing – if the refcount on an argument is 1 then this function holds the only reference and can sneakily update it in place. This is pretty cool, and is at least partly a shameless attempt to find things that the Haskell implementation can’t easily do, to make up for my lack of GC-level performance on small allocations.
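The intmap version of this doesn’t exist yet, but the shape of the trick looks something like the following, shown on a toy refcounted buffer rather than the real intmap structures (all the names here are made up):

  /* Toy illustration of the refcount-is-1 trick; none of this is
   * intmap's real internals, the names are made up. A "set" that
   * consumes its argument mutates in place when it holds the only
   * reference and only copies when the value is shared. */
  #include <stdlib.h>
  #include <string.h>

  typedef struct buf {
    int refcount;
    size_t len;
    unsigned char data[];
  } *buf;

  static buf buf_new(size_t len) {
    buf b = calloc(1, sizeof(*b) + len);
    b->refcount = 1;
    b->len = len;
    return b;
  }

  static buf buf_acquire(buf b) { b->refcount++; return b; }

  static void buf_release(buf b) {
    if (--b->refcount == 0) free(b);
  }

  /* Consumes b; returns a buffer equal to b except data[i] == v. */
  static buf buf_set(buf b, size_t i, unsigned char v) {
    if (b->refcount == 1) {       /* sole owner: update in place */
      b->data[i] = v;
      return b;
    }
    buf copy = buf_new(b->len);   /* shared: copy, drop our share */
    memcpy(copy->data, b->data, b->len);
    copy->data[i] = v;
    buf_release(b);
    return copy;
  }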

I haven’t actually implemented that bit yet; I’ve just changed the API. But in making that change I found a surprising thing: the code that actually uses the intmaps becomes a lot nicer.

Previously it was basically impossible to build up intmaps as compound expressions, because every sub-expression was something you had to hold on to in order to release it later, and an anonymous temporary buried inside a larger expression is exactly the thing you can’t hold on to. Now you can write things like

  intmap q = intmap_union(
    a,
    intmap_singleton(a, 17909118457709389247UL, 8UL, "5bb951ab"),
    intmap_singleton(a, 10680653996055428951UL, 2UL, "01")
  );

and the memory management is taken care of for you – you still have to free q, but the singletons that went into its construction are cleaned up automatically. That’s not a huge win in this example, but for more complicated expressions it seems to help a lot.
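For contrast, under the old ownership model the same value needed every intermediate bound to a name so it could be released afterwards (intmap_release again standing in for whatever the release call is actually named):

  /* Old model, sketched: nothing released the sub-expressions for
   * you, so they all had to be named and released by hand. */
  intmap s1 = intmap_singleton(a, 17909118457709389247UL, 8UL, "5bb951ab");
  intmap s2 = intmap_singleton(a, 10680653996055428951UL, 2UL, "01");
  intmap q  = intmap_union(a, s1, s2);
  intmap_release(s1);
  intmap_release(s2);
  /* ... use q, then release it too. */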

In the interests of disclosure, I should say that I got this idea from jq’s model – its jv internal values (these represent any JSON value) are reference counted and have this ownership model. That’s at least partly because it considers arrays and objects to be immutable and treats them as copy-on-write, but when it holds the only reference to a value like this it doesn’t actually have to copy it.
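From memory, using jv looks roughly like this (check jv.h for the exact signatures): functions like jv_array_append consume the values they’re given, and jv_copy is the explicit “I still want this” operation.

  #include <jv.h>

  jv x = jv_string("hello");
  /* jv_array_append consumes both the array and the value; the
   * jv_copy keeps x alive for use afterwards. */
  jv arr = jv_array_append(jv_array(), jv_copy(x));
  /* ... use x and arr ... */
  jv_free(x);
  jv_free(arr);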

Implementing this was interesting. And by interesting I mean painful. It turns out that changing the memory ownership model of a C API is at least as hard as getting it right in the first place. All I can say is, thank god for tests.

I’ve changed my testing model slightly. testgen still has a big place in it, but there’s also test.c, a set of deterministic tests that run first and are slightly easier to deal with. Whenever there’s a testgen failure I extract a test from it and add it to test.c.

I’ve also simplified the output from testgen – to my great surprise it turns out that 4000-line C programs that are a nested mess of 50 interdependent variables are not that easy to minimize by hand, and I didn’t really want to write a minimizer. I’ve changed the output so that it mostly just builds up a single example through a combination of operations and then makes a bunch of asserts about that one example. Each individual test tests a bit less, but the result seems to be that it finds errors around run 20 rather than run 2. Given that it runs a thousand times, I’m OK with that. Of course, I wrote that and then noticed it had failed on run 519, so maybe there’s still a bit of work to go there.
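The generated tests now look roughly like this; a mock-up of the shape rather than real testgen output, with intmap_size and intmap_contains invented as stand-ins for whatever the query functions actually are:

  /* Mock-up of the new testgen shape: one value built up through a
   * chain of operations, then a batch of asserts about it. The
   * query helpers here are invented for illustration. */
  intmap x = intmap_singleton(a, 3UL, 2UL, "ab");
  x = intmap_union(a, x, intmap_singleton(a, 7UL, 2UL, "cd"));
  x = intmap_union(a, x, intmap_singleton(a, 9UL, 2UL, "ef"));

  assert(intmap_size(x) == 3);
  assert(intmap_contains(x, 7UL));
  assert(!intmap_contains(x, 11UL));

  intmap_release(x);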
