Category Archives: Code

Compressed ranges for intmap

A feature I’ve been wondering about – an extension to the original data structure that intmap is based on – is whether one can reasonably give it a more efficient representation for contiguous ranges: rather than having multiple large objects for every single key in the range, maybe we could just store the start and end of the range? Essentially, a hybrid between this very trivial integer set idea and intmap’s patricia trie, with an additional node type for a contiguous range.

A complicating factor for this is that intmap has not just keys but also values, and you don’t want to store a contiguous range with distinct values as an array, because then you lose a lot of the cheap update properties that are nice about intmap in the first place – e.g. insertion becomes O(n) instead of O(log(n)).

But there turns out to be a relatively easy and nice compromise: Storing compressed ranges mapping to a single value. For the use case where intmap is basically an intset this works out pretty well, and for use cases where you have a lot of distinct values there is essentially no performance degradation (you get a little bit due to the extra checks, but not much).

The way ranges are incorporated into intmap is relatively simple: most operations have some special treatment for ranges. For example, if you intersect two ranges with the same value then the result is immediately a new range with that value; similarly for unioning two ranges if they overlap. Subtracting ranges from things is usually much more efficient, because it’s very easy to tell whether all the keys in a map lie in a range (the prefix of the patricia trie defines an upper and lower bound) – and there are a whole bunch more operations along these lines.

And when there’s nothing sensible to do with a range – e.g. we’re removing a hole from the middle of a range, or unioning together two ranges that won’t coalesce, etc – we just turn it into a split node with two ranges as children. Often we’ll then want to recurse and will end up expanding more range nodes, but for many operations if we start with a range we’ll still only end up with a tree with O(log(n)) nodes for O(n) keys – e.g. if we insert or delete a key from a range.
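
To make that concrete, here’s roughly the shape of the thing – the node layout and names below are illustrative rather than intmap’s actual representation, but the idea is that a range node stores its bounds plus a single value, and removing a key from the middle of one splits it into at most two ranges under a split node:

#include <stdint.h>
#include <stdlib.h>

/* Illustrative node layout only, not intmap's real representation. */
typedef enum { NODE_SPLIT, NODE_RANGE } node_kind;

typedef struct node {
  node_kind kind;
  /* NODE_RANGE: every key in [lo, hi] maps to the one value below. */
  uint64_t lo, hi;
  const char *value;
  /* NODE_SPLIT: two children, as in the underlying patricia trie. */
  struct node *left, *right;
} node;

static node *make_range(uint64_t lo, uint64_t hi, const char *value) {
  node *n = calloc(1, sizeof(node));
  n->kind = NODE_RANGE; n->lo = lo; n->hi = hi; n->value = value;
  return n;
}

static node *make_split(node *left, node *right) {
  node *n = calloc(1, sizeof(node));
  n->kind = NODE_SPLIT; n->left = left; n->right = right;
  return n;
}

/* Removing a key from the middle of a range punches a hole in it: the
   result is a split node whose children are the (at most) two surviving
   ranges, so the whole thing still only costs O(1) nodes. */
node *range_delete(node *r, uint64_t key) {
  if (key < r->lo || key > r->hi) return r;        /* key not in the range */
  if (r->lo == r->hi) { free(r); return NULL; }    /* the range vanishes */
  if (key == r->lo) { r->lo++; return r; }         /* just trim an end */
  if (key == r->hi) { r->hi--; return r; }
  node *right = make_range(key + 1, r->hi, r->value);
  r->hi = key - 1;                                 /* reuse r as the left part */
  return make_split(r, right);
}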

The code for this has ended up a bit hairy in places (honestly, intmap could use a bit of a cleanup already I think), but it appears to work rather well, so I’m quite pleased with the result.

In writing this, testmachine has been an absolute godsend. I found so many bugs in the implementation with it.

The process for writing things against testmachine is that I provide a parallel implementation of the functionality in python, except that implementation can be as basic as I like. In the parallel python representation, intmaps are just dicts mapping ints to strings, and the operations are correspondingly simple, with little concern for efficiency.

Then it finds bugs. Almost inevitably. This is a combination of two factors:

  1. intmap code is quite tricky to get right. I’m emulating a lot of things in C that C isn’t designed to do, and C is tricky to get right in the first place, and frankly datastructure implementations are always a bit of a pain.
  2. testmachine is almost magically good at finding bugs.

I’m not kidding about point 2. I’ve more than doubled the size of the deterministic test suite as part of this chunk of work, and the new tests are all from things testmachine has found. As I opined on Twitter earlier: when you first encounter Quickcheck-style testing it feels like you’ve been given a magical bug finding tool. This is like that experience all over again, except the baseline has been moved up to Quickcheck.


Introducing THE TEST MACHINE

(cue evil laughter)

At this point I’m mostly just using intmap as a breeding ground for new testing strategies. And it’s working.

Today I’ve created a testing strategy that I suspect is entirely novel, quite possibly because it was too ludicrous for anyone else to try. Specifically, I wrote a programming language to test my code, and I’m generating random programs in that language.

OK, that’s a total exaggeration. What I really wrote was a stack machine with a simple text format to serialize programs against it.

Here’s an example program:

empty
singleton	7703584101918615599	'9ab29f4316'
check_equal
assert_not_check

I introduced a bug where empty intmaps would always compare equal and this is the program that demonstrates it (produced entirely automatically).

The stack machine is simply a stack of intmaps plus a check bool. It’s defined in C, along with a whole bunch of operations on it. This definition is then exported to python via a shared object, and the python code drives it with random programs.
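
The state really is about that small. Here’s a sketch of the sort of thing (the struct and the intmap calls are my simplified guesses at the shape, not the actual testmachine code, and the real machine checks that the stack is deep enough before running an instruction):

#include <stdbool.h>
#include <stddef.h>

/* Simplified, assumed view of the intmap API for the sake of the sketch. */
typedef struct intmap intmap;
extern intmap *intmap_empty(void);
extern bool intmap_equal(intmap *a, intmap *b);

/* The whole machine: a stack of intmaps plus a single check flag. */
typedef struct {
  intmap *stack[1024];
  size_t height;
  bool check;
} testmachine;

/* A few illustrative instruction handlers. */
void tm_empty(testmachine *m) {
  m->stack[m->height++] = intmap_empty();
}

void tm_check_equal(testmachine *m) {
  /* Compare the top two maps and record the answer in the check flag. */
  m->check = intmap_equal(m->stack[m->height - 1], m->stack[m->height - 2]);
}

bool tm_assert_not_check(testmachine *m) {
  /* The generated program above asserts that the check came out false;
     returning false here is a test failure. */
  return !m->check;
}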

Why a stack machine, you ask?

Well, for one very simple reason: minimizing the C programs output by testgen.py is a nightmare. I’m going to be looking into better tools (e.g. C-Reduce) to minimize those test cases, but I wanted to try something with a higher level representation that was just easier to minimize in the first place.

And it works. Minimizing a testmachine program is really easy. So easy I was able to write a program to do it in less time than it normally takes me to hand minimize a single testgen test case.

The reduction process is as follows:

  1. Delete some instructions
  2. Simulate the program that remains.
    1. Any instructions that are no longer valid for the state the stack is in when they’re reached are dropped
    2. Any checks for a particular state are corrected for the new program

Currently in step 1 we try deleting single instructions, and if that doesn’t work, adjacent pairs of instructions. If we succeed at reducing a program to another failing program we then recursively try to minimize that one. It’s a very simple greedy process that typically reduces the length of the failing programs by a factor of 10 or more (most failing programs fail in the first 100 steps; the minimizer typically produces programs of fewer than ten steps).
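
The greedy part is genuinely tiny. Here’s a self-contained sketch of the delete-and-retry loop – the still_fails predicate is a stand-in for step 2, i.e. re-simulating the candidate program (dropping invalid instructions, correcting checks) and re-running it:

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for step 2: simulate the candidate program (dropping instructions
   that are no longer valid, correcting checks) and report whether it still
   fails. In the real thing this runs the stack machine. */
extern bool still_fails(const int *program, size_t length);

/* Greedily try deleting single instructions, restarting from the shorter
   program whenever a deletion keeps it failing. (Trying adjacent pairs when
   this gets stuck works the same way and is omitted here.) */
size_t minimize(int *program, size_t length) {
  bool progress = true;
  while (progress) {
    progress = false;
    for (size_t i = 0; i < length; i++) {
      int removed = program[i];
      memmove(program + i, program + i + 1, (length - i - 1) * sizeof(int));
      if (still_fails(program, length - 1)) {
        length--;                 /* keep the shorter program, start over */
        progress = true;
        break;
      }
      /* Put the instruction back and try the next position. */
      memmove(program + i + 1, program + i, (length - i - 1) * sizeof(int));
      program[i] = removed;
    }
  }
  return length;
}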

Does it work? Well, it’s found two non-trivial bugs in intmap already – one a memory leak, the other a correctness bug – and it seems to catch every bug I’ve tried deliberately introducing, and produces extremely short, clear examples to demonstrate them.

It’s also fast compared to testgen, which is a big benefit because it means it can run more tests. It turns out that not having a C compiler and valgrind in the inner loop means your tests run much faster. Who knew?

All told, I think this idea is a serious win. I may be trying to generalise it and introduce something similar into hypothesis in the not too distant future.


Domain specific evaluation orders

Advance warning: I’m super pleased with this idea. I apologise if I’m unnecessarily smug about it as a result.

My extremely basic benchmark for intmap is inserting 10 million integers counting down and then printing out the map’s length. I’ve been trying to get it to beat a similar benchmark written in Haskell. I was doing pretty well getting incremental improvements with custom memory allocators (I’m about 80% sure the main benefit Haskell has here is its garbage collector) and would probably have eventually got it down to about the same speed as the Haskell version by more or less emulating its memory allocation patterns (and yes, I’m aware of the humour of trying to optimise C code to be as fast as Haskell code).

Then I had a clever idea. Screw “If you can’t beat ’em, join ’em”. If you can’t beat ’em, cheat.

Fundamentally the problem with this benchmark is that building up an intmap through progressive insertions is the wrong evaluation order (note though that it’s the evaluation order of IntMap.fromList in Haskell, so this isn’t as contrived as it sounds). Ideally what you want to do is always be merging small maps together.

I decided to steal a merge strategy from timsort. The way timsort handles merges is that it maintains a stack of things to merge, \(x_1\) through \(x_n\) (with \(x_1\) at the bottom), satisfying the following two invariants:

  1. \(|x_i| \geq |x_{i+1}|\)
  2. \(|x_i| \geq |x_{i+1}| + |x_{i+2}|\)

Whenever pushing something onto the stack would violate these rules we perform stack collapsing operations by merging near the head (if the first is violated we merge the head into its predecessor; if the second is violated we merge the 2nd and 3rd elements together and move the head up one).

This means that the size of the largest element is at least the Fibonacci number of the stack height, which means that you only need quite a small stack to hold more than you can possibly fit. I use a stack of size 128, which is excessive (this could fit an intmap of size up to \(2^{88}\), and intmap only supports 64-bit keys), but that’s partly because I want to be able to build up larger intermediate stacks.
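
To make the collapse rule concrete, here’s a tiny self-contained simulation that tracks only the sizes on the stack (the real stack holds intmaps and merging means unioning them; here merging is just adding sizes, with the bottom of the stack at index 0):

#include <stdint.h>
#include <stdio.h>

#define MAX_STACK 128

static uint64_t stack[MAX_STACK];   /* sizes only; index 0 is the bottom */
static size_t height;

/* Restore the two invariants near the top of the stack after a push.
   "Merging" two maps here is just adding their sizes. */
static void collapse(void) {
  for (;;) {
    if (height >= 2 && stack[height - 2] < stack[height - 1]) {
      /* Invariant 1 violated: merge the head into its predecessor. */
      stack[height - 2] += stack[height - 1];
      height--;
    } else if (height >= 3 &&
               stack[height - 3] < stack[height - 2] + stack[height - 1]) {
      /* Invariant 2 violated: merge the 2nd and 3rd elements together
         and move the head down a slot. */
      stack[height - 3] += stack[height - 2];
      stack[height - 2] = stack[height - 1];
      height--;
    } else {
      return;
    }
  }
}

static void push(uint64_t size) {
  stack[height++] = size;
  collapse();
}

int main(void) {
  /* Push a million singletons: the stack stays logarithmically short. */
  for (int i = 0; i < 1000000; i++) push(1);
  printf("stack height after a million singletons: %zu\n", height);
  return 0;
}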

I’ve introduced an intmap_builder type which basically maintains this stack so you can build up a whole bunch of unions like this without worrying about the evaluation order. Changing my performance benchmark so that it used this strategy immediately blew the Haskell implementation out of the water by a factor of 2 (I was previously about 60% slower).

But that’s properly cheating. I’m now winning the benchmark because I’m not executing the same program! That’s not remotely reasonable.

But… what if I could make it so that when I do execute the same program, behind the scenes it executes the faster version with intmap_builder?

How could I perform this black magic?

It turns out that I have an advantage that the Haskell version doesn’t: I can tell when I don’t care about intermediate results.

There are two features of the intmap API which are quite nice: firstly, everything is reference counted. Secondly, all union operations implicitly decref their arguments (the way you’re supposed to treat this is that every intmap has a uniqueness type which is absorbed by arithmetic operations, and that if you want to reuse a value you must pass it to copy. The abstraction is a little leakier than this).

Why is this relevant? Because it means that if an intmap with a refcount of 1 is passed to a union operation we know that we do not care about this intmap for any other reason than the result of this union. In particular we are free to mutate it arbitrarily because any other references to it are now invalid. Woo!
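
In code the test is about as simple as it sounds. A sketch – the field and function names here are illustrative, not intmap’s actual internals:

#include <stddef.h>

/* Illustrative layout: the only part that matters here is the refcount. */
typedef struct map {
  size_t refcount;
  /* ... the actual contents of the map ... */
} map;

/* Assumed helpers standing in for the real implementation. */
extern map *union_in_place(map *owned, map *other);
extern map *union_fresh(const map *a, const map *b);

map *map_union(map *a, map *b) {
  if (a->refcount == 1) {
    /* Union absorbs its arguments, so a refcount of 1 means this call holds
       the only reference to a: nobody else can observe it any more, and we
       are free to update it in place rather than allocating a new map. */
    return union_in_place(a, b);
  }
  /* a is shared with someone else: build the result without touching it. */
  return union_fresh(a, b);
}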

And this means that if we can disguise an intmap_builder as an intmap then when we know that it’s an owned argument of a union we can just push the right hand side onto it and the evaluation order would be sorted out. That’s pretty cool.

We then need to make sure that as soon as you incref it or need to do anything to it that observes its actual value it turns into a real intmap and not a stack of intmaps, but that’s not too bad. There’s a bunch of book-keeping to do around copying and redirects (we can’t always replace an intmap with the result of its union, because we might not own the result of its union). If you care about the details you can read the commit.

It works, too. It’s not quite as fast as just using the intmap_builder directly, mainly because there’s a lot of additional checking that has to be performed (I expect to be able to optimise this somewhat, though it will always be slower), but it’s still significantly faster than the Haskell version.

But details aside, this is I think an instance of a very powerful general technique: When you have a uniqueness type system and a bunch of algebraic rules, you can tell when you don’t care about the result of a sub-expression and then you can use your algebraic rules to massively rearrange the expression for a more efficient evaluation order. This is essentially a query optimiser: We have an algebraic expression which we want to evaluate in the most efficient way possible and we apply a bunch of rules to achieve this effect. The neat thing is that because of the way the API is structured we can apply this query optimiser without the user having to know about it: Every intmap value can be either a (somewhat reduced) abstract set of operations for combining intmaps, or it can be a concrete intmap with all the benefits thereof. It remains the former until you need it to be the latter, at which point the query is fully and transparently evaluated.

I’m not sure this idea works without a great deal of low-level access to your implementation of laziness. If every operation is destructive and the copy operation returns a pair rather than leaving the original untouched then you can make it work, but the fact that you can force within a non-destructive function like size or equal is quite powerful and makes the API much nicer to use. I’d very much like to see an implementation of this sort of idea in a more functional language, as I’m currently not really sure how it could work.

In this case the algebraic rule I used was the associativity of union (note that order is preserved: intmap unions are not commutative, because they have values and the union is right biased), but there are other ones. In particular I plan to implement similar strategies for intersection and difference: intersection is associative, but additionally, in a collection of n things being intersected, all but the leftmost one can be commuted (because we don’t care about their values, but we do care about the values of the leftmost one). This in particular lets us do intersection by adopting a smallest-first evaluation order – we always intersect the left hand item with the smallest of the remaining items. This is both intrinsically more efficient and potentially lets us shortcut much earlier, because the smallest one may reduce the left hand item to empty, at which point we just release the remaining values, return empty, and delight in our O(lots) speed-up.
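
None of this is implemented yet, but a sketch of what that smallest-first evaluation order might look like (the intmap_* declarations here are assumed and simplified):

#include <stddef.h>
#include <stdint.h>

/* Assumed, simplified API for the sake of the sketch. */
typedef struct intmap intmap;
extern uint64_t intmap_size(const intmap *m);
extern intmap *intmap_intersect(intmap *left, intmap *right); /* absorbs both */
extern void intmap_release(intmap *m);

/* Intersect acc (whose values we care about) with everything in rest.
   The order of rest doesn't matter, so always take the smallest remaining
   operand, and bail out as soon as the accumulator is empty. */
intmap *intersect_smallest_first(intmap *acc, intmap **rest, size_t n) {
  while (n > 0 && intmap_size(acc) > 0) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
      if (intmap_size(rest[i]) < intmap_size(rest[best])) best = i;
    acc = intmap_intersect(acc, rest[best]);
    rest[best] = rest[--n];                  /* drop the operand we just used */
  }
  /* Either acc went empty or we ran out of operands: release the leftovers. */
  for (size_t i = 0; i < n; i++) intmap_release(rest[i]);
  return acc;
}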

I haven’t thought through the consequences of this in detail yet, but I’m pretty excited about this idea. It’s essentially a general purpose query optimiser for a certain class of purely functional datastructures, and I think it might have the potential to be a seriously good one.


Updates to intmap

So in my previous post on intmap I said that I thought the API was likely to be pretty stable, as I’d put a bunch of thought into it. Then I completely changed the memory ownership model. Lol, programmers, eh?

Previously, when the API returned an intmap, that intmap was your problem: you had to explicitly release it and nothing else would do it for you. I’ve changed this – now, in fact, everything will do that for you and you have to explicitly ask it not to. Basically, every operation that creates a new intmap from other intmaps will release its arguments, and you have to explicitly acquire an argument before passing it in if you want to still have a copy of it left around.
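
So the two cases look something like this – intmap_acquire is my guess at a name for the explicit acquire, and the signatures are simplified:

  /* Give x away: the union releases it, so x must not be used afterwards. */
  intmap u = intmap_union(a, x, y);

  /* Keep x around: acquire it first, so the reference the union releases
     isn't the last one. (intmap_acquire is a guess at the name.) */
  intmap v = intmap_union(a, intmap_acquire(x), z);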

Why does it do this?

Well, two reasons. My main reason for wanting to do this is that it allows the code to tell when it owns its arguments and can mutate them without anyone noticing – if the refcount on an argument is 1 then this function owns it and can sneakily update it behind the scenes. This is pretty cool, and is at least partly a shameless attempt to find things that the Haskell implementation can’t easily do, to make up for my lack of GC-level performance at small allocations.

I haven’t actually implemented that bit yet – I’ve just changed the API. But in making that change I found a surprising thing: the code that actually uses intmaps becomes a lot nicer.

Previously it was basically impossible to build up intmaps as compound expressions because you had to be sure you were going to release the sub-expressions. Now you can write things like

  intmap q = intmap_union(
    a,
    intmap_singleton(a, 17909118457709389247UL, 8UL, "5bb951ab"),
    intmap_singleton(a, 10680653996055428951UL, 2UL, "01")
  );

and the memory management will be taken care of for you – you still have to free q, but the singletons that went into its construction are sorted out for you. That’s not a huge win in this example, but for more complicated examples it seems to help a lot.

In the interests of disclosure, I should say that I got this idea from jq’s model – its internal jv values (these are basically any JSON value) are reference counted and have this ownership model. This is at least partly because it considers arrays and objects to be immutable and treats them as copy on write, but when it owns the value like this it doesn’t actually copy them.

Implementing this was interesting. And by interesting I mean painful. It turns out that changing the memory ownership model of a C API is at least as hard as getting it right in the first place. All I can say is, thank god for tests.

I’ve changed my testing model slightly. testgen still has a big place in it, but there’s also test.c, which is a set of deterministic tests that run first and are slightly easier to deal with. Whenever there’s a testgen failure I extract a test from it and add it to test.c. I’ve also simplified the output from testgen – to my great surprise it turns out that 4000-line C programs which are a nested mess of 50 dependent variables are not that easy to minimize by hand, and I didn’t really want to write a minimizer. I’ve changed the output so that it mostly just builds up a single example through a combination of operations and then makes a bunch of asserts about that one example. Each individual test tests a bit less, but the result seems to be that it finds errors around run 20 rather than run 2. Given that it runs a thousand times I’m OK with that. Of course, I wrote that and then noticed that it had just failed on run 519, so maybe there’s still a bit of work to go there.


You probably shouldn’t do this, but it works really well

I just added the following config to a ruby test suite:

RSpec.configure do |config|
  config.before :each do
    GC.disable
  end
 
  config.after :each do
    GC.enable
  end
end

It’s pretty tragic how well this works. It knocked about 40% off the runtime of the test suite for our Padrino + DataMapper app running on 1.8.7 (yes, I know. I’m sorry. I actually discovered this as part of my work on getting it off 1.8.7), and a friend is reporting a more than 50% drop for his Rails app running on 1.9.3.

Why? Well, I added this as an experiment when I noticed how much time the test suite was spending in the GC (I initially ran with GC turned off entirely. This ate about as much memory as you’d expect). I’m assuming it’s a combination of MRI’s GC being really quite shit and the fact that test setup tends to produce a lot of garbage.

This is kinda an awful solution to this problem, but I must admit I’m finding it hard to argue with the results. If you can afford to throw memory at the problem it might actually be worth doing.

But still… ouch.
