
A personal history of interest in programming

Apparently some guy steeped in Silicon Valley startup culture has made some ignorant comment about how everyone who is good at programming started from a young age. This is my surprised face.

If you don’t know who I’m talking about, don’t worry about it. He’s not important. This is a pretty common narrative though, and I thought it might be worth holding up my personal journey to being interested in programming as a piece of anecdata against it.

I’m basically the kind of obsessive programmer these criteria are supposed to select for. I’ve spent my Christmas holidays working on persistent functional data-structures in C, and I spend a huge amount of time and thought on the practice and theory of programming. Also I’ve got all the privilege such criteria implicitly select for – I’m white, cis male, and painfully middle class.

You don’t need any of these things to be a good programmer. I mean, you sure as hell don’t need the pile o’ privilege I’ve got, but you also don’t need the obsession. It can be helpful, but it can also be harmful – you put a group of obsessive people together and often what you end up with is hilarious tunnel vision which loses sight of the real problem. I mention these things not because I think they’re important, but to demonstrate that even if you buy into the dominant narrative I am a counter-example. I look and act like a “real” programmer.

Want to know when I got started programming? Well, that’s hard to say exactly. I did little bits and pieces of it as a kid (more on that later), but I wasn’t really very interested in it.

Want to know when I got interested in programming? That’s much easier. 2007. I was 23, about 10 years older than I’m “supposed” to have been interested in programming in order to be good at it.

How did this happen?

Well, I’ve always liked computers. My family have had computers since I was quite young (see “privilege” for details) – I remember having a Windows 3.1 computer before I moved to the UK, so looking at the release date I must have had one almost as soon as it came out. Before that we had a DOS computer.

I really enjoyed these, but not really for programming. I played games and used the word processor. We had some games written in QBasic, and I vaguely poked at the code to see if I could make changes, but I didn’t really know what I was doing and it was pretty much just editing at random and seeing what happened. I got bored quickly and didn’t pursue it further.

Years later at school we did some Logo. It sure was fun drawing pretty pictures on the screen, and the little turtle was adorable.

Later yet we got the internet. This was pretty amazing! You could talk to people! Who were far away! And write stuff and people would read it! Also there were multi-player games, those were pretty cool. I spent a lot of time on various MUDs. I thought about learning to program so I could make a MUD or to build add-ons to one. I think I briefly even started contributing to a friend’s nascent MUD, but it all seemed like hard work and again I got bored pretty quickly.

Eventually it came time to choose my A-levels. I thought about doing a computing A-level, but my chemistry teacher persuaded me that I could learn to computer at any time (he was right) and that I should do chemistry instead (oh ye gods was he wrong). So I managed to go another couple of years without ever learning to program.

At some point during this time I created a website. That isn’t to say I learned HTML and CSS mind you. I used some sort of WYSIWYG website creation tool from Mozilla and uploaded it to my btinternet account. I can’t really remember what I put there – I know I published some really terrible transhumanist fiction on it, but there were a bunch of other things too. I do remember the look though: White text on some sort of horrendous patterned background. Classy web design it was not.

At some point our btinternet account got cancelled and I lost the website (no backups of course). When I looked a few years ago bits of it were still available through the internet archive, but I honestly don’t remember the URL at this point.

I got to university and somehow found myself in a maths degree (which is another story). I now had my own computer which my parents had bought me (*cough* privilege *cough*)!

It had Windows ME on it. :-(

After much frustration with ME, my friend Tariq Khokhar persuaded me to give this Linux thing a try in preference to, erm, “acquiring” a copy of Windows 2000 and using that instead.

This worked out pretty well. I’m not going to say it was a flawless experience, but as a cocky teenager I was totally up for doing low-effort things that made me feel much cleverer than I actually was, and running this frustrating OS on the desktop certainly achieved that. This was back in 2001. 12 years later, I’m still being frustrated by this decision but find all the other options equally frustrating.

Then, finally, I was forced to actually learn to program.

Cambridge has (had?) as part of its maths degree a thing called “CATAM”. Basically they gave you your choice of three problems and you had to go learn to computer and then do them. I don’t actually remember what my CATAM projects were – I know the first one was something to do with matrix maths, but I’m not really sure what.

You could do them in whatever language you liked – they vaguely encouraged C (sadists), but they didn’t really mind if you did it in something else – I know at least one person did it in Excel.

I briefly looked into programming in C, but Tariq persuaded me that I would much rather do it in ML, which is what the computer scientists were learning for their intro to CS course. I went along to a few of their lectures, skimmed some of the notes, and bought a copy of ML for the Working Programmer, and learned enough from it to solve the CATAM problems.

To be clear: at this point programming to me meant that I wrote up some code in a text editor (gedit I think. I don’t remember when I started using vim) and then cut and pasted it into the REPL (I think I was using Moscow ML because it was the one the course recommended), at which point I would copy the answer out of the REPL and put it in my coursework.

That CATAM project finished, I did fine, and I immediately stopped programming again. I mean why would I keep it up? As far as I was concerned this was just a thing I learned to do to solve some maths problems. None of the maths problems I had to solve needed programming, so I wasn’t going to be programming.

I picked it up again for my second CATAM project the next year, but I didn’t really learn anything new about programming, I just did the same thing I did last time: Write in a text editor, copy and paste into the REPL. I might have switched to vim by then, but probably not.

At some point during this period I made a second website. My friend Michael, who I knew from IRC, persuaded me to do it “properly” this time. I learned what HTML was, copied some CSS from someone’s free designs elsewhere on the internet, and wrote about 5 lines of PHP (which I probably copied from somewhere) to build a hilarious navigation system based on GET variables so I didn’t have to write the same header and footer everywhere and could use includes. I was supposed to use mod_rewrite or something to get proper URLs but I never really bothered. Given my experiences with mod_rewrite since, it’s possible I tried to make it work and couldn’t. I uploaded this to my account on the student computing society computers. At this point I’d learned enough about the command line from my forced immersion in it that I was able to figure out SSH and stuff, but I don’t think I did much more with it than moving some files around on the server.

Then I got to the end of my degree and suddenly didn’t know what to do with my life.

Everyone tells you that a maths degree is super desirable and is great for getting jobs. This is a lie. If you have other useful skills then people get very excited about the fact that you’re also a mathematician. If you don’t, no-one really cares.

Except banks. Finance was very keen to hire mathematicians. I, on the other hand, was very keen not to work in finance (because dad was a banker and it didn’t look like much fun. My political objections came later).

I shopped around Cambridge for a job for about 6 months (moving back in with my parents about halfway through that) but found that for some reason Cambridge had quite a surplus of overqualified people with no useful skills and I was one of them. Finding a job didn’t go so well.

At some point a friend mentioned that his company were hiring for developers and didn’t care if you actually knew what you were doing because they were pretty sure they could teach you. They were London based, and I wasn’t very keen on London, and I wasn’t really sure I wanted to be a developer, but I’d been job hunting for 6 months and it’d got pretty depressing so I figured it couldn’t hurt to talk to them. They were going to be at a careers fair in the computer science department, so I went along to that with a copy of my hilariously empty CV to talk to them.

Nothing came of talking to that company, but while I was there I figured I might as well talk to a bunch of the others. I ended up talking to two other companies, both London based and hiring for developers but happy to teach people, so I interviewed with them.

Apparently I interviewed pretty well – the ML I’d learned and my maths degree were enough that although I didn’t really know what a binary search was I was able to figure it out with a little prompting, and I somehow managed to muddle my way through the other problems. One of them mostly tested problem solving, which I was good at, and the other had some toy programming problems, which I managed to do OK on.

One of them offered me a job off the back of this. The other one was a little more hesitant about hiring someone who basically couldn’t program, so asked me to write them a program to prove that I could. I went with the company that didn’t require me to demonstrate that I knew what I was doing, mostly because I didn’t.

So, well, at that point I was moving to London to become a software developer. Woo? I wasn’t really that interested in programming, and I rather hated London, but hey it was a job.

I wasn’t very good at it at first, unsurprisingly.

I started doing some basic front-end stuff. I was thoroughly overconfident in my abilities with HTML and CSS off the back of the small amount I’d learned already. I also got tasked with setting up a server, because I knew Linux so I must know what I was doing, right? I had no idea what I was doing and wasn’t very good at asking for help and after about 4 days of butchering it we got a server that… probably did the right thing? While this was going on I was learning to write Java (which the project I was working on was written in). It seemed pretty easy – the syntax was weirdly cumbersome, and it seemed a bit limited in comparison to ML, but it wasn’t hard to get things done in it. I doubt I really knew what I was doing, but this didn’t seem to matter terribly. (Fragment of conversation I remember from this time period: “Hey, can I use recursion in Java?” “Yes, but you probably shouldn’t”).

But I stumbled my way through into competence, and then necessity forced me into actually figuring out what was going on and starting to think about how to do things better. At some point my friend Dave joined the company I was working at, and we bonded over talking about work and computer science while waiting for our 10-minute builds to run so we could observe the tiny change to the HTML we’d just made (sadly I’m not exaggerating that).

At some point the persistent nagging feeling that things could be done better than this mixed with talking about CS with Dave and I grudgingly started to realise that this programming thing was actually quite interesting.

Concurrently with this, some friends on IRC had persuaded me to learn Haskell. I forget what their reasoning was, but I mostly went along with it. I already knew ML, which is basically a simpler and less puritanical version of Haskell (the module system is better, but remember that my experience of ML was from copying and pasting stuff into the REPL. I didn’t even know what a module was), so learning Haskell was pretty straightforward.

At this point I had basically two bodies of knowledge that were failing to connect – I had these cool languages I did maths programming in and this Java thing which I used to get stuff done. I started hunting around in my spare time for a middle ground, and after a false start with Nice ended up with Scala.

But work was at this point pretty frustrating. For a variety of reasons I wasn’t really enjoying it much, so I decided to look for a new job. I ended up at a company called Trampoline Systems, where I got to work with good people on interesting problems. The company eventually went a bit wrong and most of us left (it still exists, albeit in a very different form, but only one person from my time there remains), but by then it was enough: I’d become quite interested in the practice of programming, and I’d discovered I could find interesting jobs doing it.

The rest is, as they say, history. Despite my long history of not getting around to learning to program, I’d become a software developer anyway. I discovered I was quite good at it and it was quite interesting, so I’ve stuck around since.

Would things have worked out better if I’d learned earlier? I don’t know. I feel like the practice of solo programming is very different from the practice of programming in a team. What helped me was that I’d already developed a lot of habits of thought and problem solving before I ever really brought them to bear on the practice of software development. I’m sure if I’d learned to program earlier it wouldn’t have hurt, but I think I’d probably have a very different perspective on things today, and it’s unclear if it would be a better one.

I think the thing to bear in mind is that software development isn’t really about programming. It’s about solving problems with computers. Programming is generally the tool you use to achieve that goal, and for some tasks you need to be better at it than others, but ultimately what you need is to be good at problem solving (not silly abstract logic problems, but actual problems that people have that computers can help solve). This is a skill you can develop doing almost anything, and if you’re already good at it then you can probably become a good software developer remarkably quickly, even if you’re as uninterested in programming as I used to be.


Reevaluating some testing philosophy

Over the past year or so I’ve started to have serious doubts about some of my previous attitudes on testing. I still think it’s good, I still don’t particularly believe in TDD, but I also think some of my previous approaches and opinions are a bit misguided.

This post is about one such doubt, which I encountered just today and which is causing me to re-evaluate some things. I previously strongly held the following two opinions:

  1. You should test to the public API rather than having your tests depending on internals.
  2. Randomized quickcheck style testing is awesome. Although you probably want to turn failing quickcheck tests into deterministic tests, and sometimes it’s worth writing tests that are hard to write in this manner, quickcheck will probably do a better job of testing your code than you will.

These two stances turn out to be in conflict. It’s not impossible to reconcile them, but it requires a significant amount of work. I think that amount of work might be worth doing because it makes your code better, but it’s worth keeping an eye on.

As I’ve mentioned previously, my testing strategy for intmap is as follows:

  1. I have a randomized test case generator that models the behaviour the library should have and generates test cases to check that it does.
  2. I have a set of deterministic tests that run before the randomized ones. The main benefit of these is they’re reproducible and a hell of a lot faster. They’re mostly extracted from failing randomized test cases.

Today I was writing some exceedingly trivial performance tests so that I could do some profiling and testing of whether some of the optimisations I was performing were actually wins (at least in these benchmarks the answer is that some of them really are, some of them really aren’t). Then I wrote a new benchmark and it segfaulted. Given that the test coverage was supposed to be pretty comprehensive, and the test suite was passing, this was pretty disappointing.

How did this happen?

Well the proximate cause of the segfault was that my allocator code was super buggy because I’ve never written an allocator before and it turns out that writing allocators is harder than I thought. If you’re interested in the specifics, here is the commit that fixes it. But why didn’t the tests catch it?

Randomized testing essentially relies on two things in order to work:

  1. It has a not-tiny probability of triggering any given bug
  2. It runs enough times that this not-tiny probability is inflated into a significant probability of triggering it.

What does this end probability of triggering a bug look like? Let’s do some maths!

Define:

  • q(b) is the probability of a single run triggering bug b
  • t is the average amount of time it takes for a run
  • T is the total amount of time we have to devote to running randomized tests
  • p(b) is the probability of a set of test runs finding the bug.

If we have \(T\) time and each run takes \(t\), then we make approximately \(\frac{T}{t}\) runs (this isn’t really right, but assume low variance in the length of each run). Then the probability of finding this bug is \(p(b) = 1 - (1 - q(b))^{\frac{T}{t}} \approx q(b) \frac{T}{t}\) (the approximation holding when \(q(b)\frac{T}{t}\) is small).
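To put illustrative numbers on this (made up for scale, not measured): suppose a bug has \(q(b) = 10^{-5}\), a run takes \(t = 10\)ms, and you give the suite \(T = 100\) seconds. That’s \(10^4\) runs, so \(p(b) \approx 0.1\). Make each run ten times slower and \(p(b)\) drops to about \(0.01\), at which point the bug will almost certainly survive your test suite.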

This formula basically shows the two reasons why testing only the public API is sometimes going to produce much worse results than testing internal APIs.

The first is simple: In most cases (and very much in this one), testing the public API is going to be slower than just testing the internal API. Why? Well, because it’s doing a lot of stuff that isn’t in the internal API. Not doing stuff is faster than doing stuff. If the difference isn’t huge, this doesn’t matter, but in my case I was doing a phenomenal amount of book-keeping and unrelated stuff, so the number of iterations I was performing was much lower than it would have been if I’d just been testing the allocator directly.

The second is somewhat more subtle: Testing the public API may substantially reduce \(q(b)\). If it reduces it to 0 and your testing coverage is good enough that it can trigger any conceivable usage of the public API, who cares. A bug in the internal API that can never be triggered by the public API is a non-issue. The danger case is when it reduces it to small enough that it probably won’t be caught in your testing, because things which aren’t caught in testing but are not impossible will almost certainly happen in production – in the most harmless case, your users will basically fuzz-test your API for you by throwing data you never expected at it, in the most harmful case your users are actively adversarial and are looking for exploits.

How does this happen?

Essentially it happens because q(b) is intimately dependent on both b and the shape of your distribution. The space of all valid examples is effectively infinite (in reality it’s limited by computer memory, but it’s conceptually infinite), which means that it’s impossible to have a uniform distribution, which means that your distribution is going to be peaky – it’s going to cluster around certain smaller regions with high probability.

In fact, this peakiness is not just inevitable but it’s desirable, because some regions are going to be more buggy than others: If you’ve tested what happens on a few thousand random integers between 0 and a billion, testing a few thousand more random ones is probably not going to be very useful. But you probably want to make sure you test 0 and 1 too, because they’re boundary cases and thus more likely to trigger bugs.

So this is what you basically want to do with randomized testing: Arrange it so that you have lots of peaks in places that are more likely to trigger bugs. Most randomized testing does this by basically just generating tests that cluster around edge cases with relatively high probability.
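For example, here’s roughly what a boundary-biased generator for 64-bit keys might look like in C. To be clear, this is an illustration of the idea, not intmap’s actual testgen (which is a python program that emits C):

  #include <stdint.h>
  #include <stdlib.h>

  /* A deliberately peaky distribution over 64-bit keys: boundary and
     small values come up far more often than uniform sampling would
     produce them, because that's where the bugs tend to live. */
  uint64_t biased_key(void) {
    switch (rand() % 8) {
      case 0: return 0;
      case 1: return 1;
      case 2: return UINT64_MAX;
      case 3: return (uint64_t)rand() % 16;  /* small keys collide a lot */
      default:                               /* otherwise roughly uniform */
        return ((uint64_t)rand() << 48) ^ ((uint64_t)rand() << 24)
               ^ (uint64_t)rand();
    }
  }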

The problem is that edge cases in your public API don’t necessarily translate into edge cases in your private API. In my case, I was doing lots of intmap unions and intersections, which is really good for triggering edge cases in the basic intmap logic, but this was mostly just translating into really very dull uses of the allocator – it rarely created a new pool and mostly just shuffled stuff back and forth from the free list.

If I had been testing the allocator directly then I would have tuned a test generator that exercised it more thoroughly – by not restricting myself to the sort of allocation patterns I can easily generate from intmaps I would have found these interesting bugs much sooner.

In the short-term I’ve solved this by simply writing some deterministic tests for exercising the allocator a bit better.
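Something like the following, for example – note that the pool_* interface here is a stand-in of my own invention rather than the real allocator API:

  #include <assert.h>
  #include <stddef.h>

  /* Hypothetical pool allocator interface, for illustration only. */
  typedef struct pool pool;
  pool *pool_new(size_t object_size);
  void *pool_alloc(pool *p);
  void  pool_release(pool *p, void *object);
  void  pool_free(pool *p);

  /* Hit the paths the intmap tests rarely exercised: allocate several
     pools' worth of objects, then free and reallocate in an awkward
     order so both the free list and pool creation get a workout. */
  void test_allocator_churn(void) {
    pool *p = pool_new(32);
    void *objs[1024];
    for (int i = 0; i < 1024; i++) { objs[i] = pool_alloc(p); assert(objs[i]); }
    for (int i = 0; i < 1024; i += 2) pool_release(p, objs[i]);
    for (int i = 0; i < 1024; i += 2) { objs[i] = pool_alloc(p); assert(objs[i]); }
    for (int i = 0; i < 1024; i++) pool_release(p, objs[i]);
    pool_free(p);
  }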

In the long-term though I think the solution is clear: the allocator needs to be treated in every way as if it were a public API. It may not really be public – its intent and optimisations are so tailored that I’m not expecting it to be useful to anyone else – but any bugs lurking in it are going to eventually make their way into the public API, and if I don’t test it directly the hard-to-trigger ones are just going to lurk undiscovered until the worst possible moment.

Fortunately I’d already factored out the pool code into its own thing. I hadn’t done this for any especially compelling reason – it’s just that the code was already getting quite long and I wanted to break it up into separate files – but it’s going to be very useful now, because this is the sort of thing you need to do in order to reconcile my original two beliefs: factor any code you want to test on its own out into its own library. This is generally a good design principle anyway.

Does this mean that the two principles are compatible after all, as long as you’re writing good code in the first place? Well… kinda. But only if you define “good code” as “code that doesn’t have any internal-only APIs”. At this point the first principle is satisfied vacuously – you’re not testing your internal APIs because you don’t have any. I’m not sure that’s wrong, but it feels a bit extreme, and I think it only works because I’ve changed the definition of what an internal API looks like.


Domain specific evaluation orders

Advance warning: I’m super pleased with this idea. I apologise if I’m unnecessarily smug about it as a result.

My extremely basic benchmark for intmap is inserting 10 million integers counting down and then printing out the map’s length. I’ve been trying to get it to beat a similar benchmark on Haskell code. I was doing pretty well getting incremental improvements with custom memory allocators (I’m about 80% sure the main benefit Haskell has here is its garbage collector) and would probably have eventually got it down to about the same speed as Haskell by more or less emulating its memory allocation patterns (and yes, I’m aware of the humour of trying to optimise C code to be as fast as Haskell code).

Then I had a clever idea. Screw “If you can’t beat ’em, join ’em”. If you can’t beat ’em, cheat.

Fundamentally the problem with this benchmark is that building up an intmap through progressive insertions is the wrong evaluation order (note though that it’s the evaluation order of IntMap.fromList in Haskell, so this isn’t as contrived as it sounds). Ideally what you want to do is always be merging small maps together.

I decided to steal a merge strategy from timsort. The way timsort handles merges is that it maintains a stack of things to merge, \(x_1\) through \(x_n\), which maintains the following two invariants:

  1. \(|x_i| \geq |x_{i+1}|\)
  2. \(|x_i| \geq |x_{i+1}| + |x_{i+2}|\)

Whenever pushing something onto the stack would violate these invariants we perform stack-collapsing operations by merging near the head (if the first is violated we merge the head into its predecessor; if the second is violated we merge the 2nd and 3rd elements together and move the head up one).

This means that the size of the largest element is at least the Fibonacci number corresponding to the stack height, which means that you only need quite a small stack to hold more than you can possibly fit. I use a stack of size 128, which is excessive (this could fit an intmap of size up to \(2^{88}\), and intmap only supports 64-bit keys), but that’s partly because I want to be able to build up larger intermediate stacks.

I’ve introduced an intmap_builder type which basically maintains this stack so you can build up a whole bunch of unions like this without worrying about the evaluation order. Changing my performance benchmark so that it used this strategy immediately blew the Haskell implementation out of the water by a factor of 2 (I was previously about 60% slower).
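In sketch form the push operation looks something like this. The field names, the allocator plumbing and the length bookkeeping are my reconstruction rather than intmap_builder’s actual internals (in particular, summing lengths only gives an upper bound when unions overlap, which is fine for balancing purposes):

  typedef struct {
    intmap maps[128];   /* pending merges, largest at the bottom */
    size_t lens[128];   /* (approximate) sizes, for the invariants */
    int height;
  } builder;

  void builder_push(allocator *a, builder *b, intmap m, size_t len) {
    b->maps[b->height] = m;
    b->lens[b->height] = len;
    b->height++;
    for (;;) {
      int h = b->height;
      if (h >= 2 && b->lens[h-2] < b->lens[h-1]) {
        /* first invariant violated: merge the head into its predecessor */
        b->maps[h-2] = intmap_union(a, b->maps[h-2], b->maps[h-1]);
        b->lens[h-2] += b->lens[h-1];
        b->height--;
      } else if (h >= 3 && b->lens[h-3] < b->lens[h-2] + b->lens[h-1]) {
        /* second invariant violated: merge the 2nd and 3rd elements
           together and slide the head down one slot */
        b->maps[h-3] = intmap_union(a, b->maps[h-3], b->maps[h-2]);
        b->lens[h-3] += b->lens[h-2];
        b->maps[h-2] = b->maps[h-1];
        b->lens[h-2] = b->lens[h-1];
        b->height--;
      } else break;
    }
  }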

But that’s properly cheating. I’m now winning the benchmark because I’m not executing the same program! That’s not remotely reasonable.

But… what if I could make it so that when I do execute the same program, behind the scenes it executes the faster version with intmap_builder?

How could I perform this black magic?

It turns out that I have an advantage that the Haskell version doesn’t: I can tell when I don’t care about intermediate results.

There are two features of the intmap API which are quite nice: firstly, everything is reference counted. Secondly, all union operations implicitly decref their arguments (the way you’re supposed to treat this is that every intmap has a uniqueness type which is absorbed by arithmetic operations, and that if you want to reuse a value you must pass it to copy. The abstraction is a little leakier than this).

Why is this relevant? Because it means that if an intmap with a refcount of 1 is passed to a union operation we know that we do not care about this intmap for any other reason than the result of this union. In particular we are free to mutate it arbitrarily because any other references to it are now invalid. Woo!

And this means that if we can disguise an intmap_builder as an intmap, then when we know it’s an owned argument of a union we can just push the right hand side onto it and the evaluation order sorts itself out. That’s pretty cool.
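Sketching the idea (every name here – the tag field, the union member, the strict variant – is invented for illustration; the real code has rather more book-keeping):

  intmap intmap_union(allocator *a, intmap x, intmap y) {
    if (x->refcount == 1) {
      /* We own x outright: nothing else can observe it, so it is
         free to secretly become a merge stack instead of a map. */
      if (x->tag != TAG_BUILDER)
        x = wrap_in_builder(a, x);
      builder_push(a, &x->u.builder, y, intmap_length(y));
      return x;  /* still disguised as an ordinary intmap */
    }
    /* Shared: we actually have to compute the union now. */
    return intmap_union_strict(a, x, y);
  }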

We then need to make sure that as soon as you incref it or need to do anything to it that observes its actual value it turns into a real intmap and not a stack of intmaps, but that’s not too bad. There’s a bunch of book-keeping to do around copying and redirects (we can’t always replace an intmap with the result of its union, because we might not own the result of its union). If you care about the details you can read the commit.

It works, too. It’s not quite as fast as just using the intmap_builder directly, mainly because there’s a lot of additional checking that has to be performed (I expect to be able to optimise this somewhat, though it will always be slower), but it’s still significantly faster than the Haskell version.

But details aside, this is I think an instance of a very powerful general technique: When you have a uniqueness type system and a bunch of algebraic rules, you can tell when you don’t care about the result of a sub-expression and then you can use your algebraic rules to massively rearrange the expression for a more efficient evaluation order. This is essentially a query optimiser: We have an algebraic expression which we want to evaluate in the most efficient way possible and we apply a bunch of rules to achieve this effect. The neat thing is that because of the way the API is structured we can apply this query optimiser without the user having to know about it: Every intmap value can be either a (somewhat reduced) abstract set of operations for combining intmaps, or it can be a concrete intmap with all the benefits thereof. It remains the former until you need it to be the latter, at which point the query is fully and transparently evaluated.

I’m not sure this idea works without a great deal of low-level access to your implementation of laziness. If every operation is destructive and the copy operation returns a pair rather than leaving the original untouched then you can make it work, but the fact that you can force within a non-destructive function like size or equal is quite powerful and makes the API much nicer to use. I’d very much like to see an implementation of this sort of idea in a more functional language, as I’m currently not really sure how it could work.

In this case the algebraic rule I used was the associativity of union (note that order is preserved: intmap unions are not commutative because they have values and the union is right biased), but there are other ones. In particular I plan to implement similar strategies for intersection and difference: intersection is associative, but additionally in a collection of n things being intersected, all but the leftmost one can be commuted (because we don’t care about their values, but we do care about the values of the leftmost one). This in particular lets us do intersection by adopting a smallest-first evaluation order – we always intersect the leftmost item with the smallest of the remaining items. This is both intrinsically more efficient and potentially lets us shortcut much earlier, because the smallest one may reduce the leftmost item to empty, at which point we just release the remaining values, return empty, and delight in our O(lots) speed-up.
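I haven’t written the intersection version yet, but the smallest-first loop might look roughly like this (all names hypothetical, in the same spirit as the sketches above):

  /* Intersect `left` with every map in rest[0..n-1], smallest remaining
     operand first, stopping early if the result goes empty. The order
     of `rest` doesn't matter because only left's values are kept. */
  intmap intersect_smallest_first(allocator *a, intmap left, intmap *rest, int n) {
    while (n > 0 && !intmap_is_empty(left)) {
      int s = 0;
      for (int i = 1; i < n; i++)
        if (intmap_length(rest[i]) < intmap_length(rest[s])) s = i;
      left = intmap_intersect(a, left, rest[s]);
      rest[s] = rest[--n];  /* drop the consumed operand */
    }
    while (n > 0) intmap_release(rest[--n]);  /* early exit: release the rest */
    return left;
  }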

I haven’t thought through the consequences of this in detail yet, but I’m pretty excited about this idea. It’s essentially a general purpose query optimiser for a certain class of purely functional datastructures, and I think it might have the potential to be a seriously good one.


Updates to intmap

So in my previous post on intmap I said that I thought the API was likely to be pretty stable, as I’d put a bunch of thought into it. Then I completely changed the memory ownership model. Lol, programmers, eh?

Previously, when the API returned an intmap that intmap was now your problem: you had to explicitly release it and nothing else would do it for you. I’ve changed this – now everything will do that for you, and you have to explicitly ask it not to. Basically, every operation that creates a new intmap from other intmaps will release its arguments, and you have to explicitly acquire an argument before passing it to the function if you want to still have a copy of it left around.
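Concretely, keeping hold of an argument now looks something like this (I’m assuming the incref operation is spelled intmap_acquire; I’m illustrating the ownership rule rather than quoting the header):

  intmap_acquire(m);                 /* bump the refcount: we want to keep m */
  intmap u = intmap_union(a, m, n);  /* consumes one reference to m, all of n */
  /* ... m and u are both still valid here, and both need releasing ... */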

Why does it do this?

Well, two reasons. My main reason for wanting to do this is that it allows the code to tell if it owns its arguments and can mutate them without anyone noticing – if the refcount on an argument is 1 then this function owns it and can sneakily update it behind the scenes. This is pretty cool, and is at least partly a shameless attempt to try to find things that the Haskell implementation can’t easily do to make up for my lack of GC level performance at small allocations.

I haven’t actually implemented that bit yet. I’ve just changed the API. But in implementing that I found a surprising thing: The code that actually calls the intmaps becomes a lot nicer.

Previously it was basically impossible to build up intmaps as compound expressions because you had to be sure you were going to release the sub-expressions. Now you can write things like

  intmap q = intmap_union(
    a,
    intmap_singleton(a, 17909118457709389247UL, 8UL, "5bb951ab"),
    intmap_singleton(a, 10680653996055428951UL, 2UL, "01")
  );

and the memory management will be taken care of for you – you still have to free q, but the singletons that went into its construction are sorted out for you. That’s not a huge win in this example, but for more complicated examples it seems to help a lot.

In the interests of disclosure, I should say that I got this idea from jq’s model – its internal jv values (these are basically any JSON value) are reference counted and have this ownership model. This is at least partly because it considers arrays and objects to be immutable and treats them as copy-on-write, but when it owns a value like this it doesn’t actually copy it.

Implementing this was interesting. And by interesting I mean painful. It turns out that changing the memory ownership model of a C API is at least as hard as getting it right in the first place. All I can say is, thank god for tests.

I’ve changed my testing model slightly. testgen still has a big place in it, but there’s also test.c, which is a set of deterministic tests that run first and are slightly easier to deal with. Whenever there’s a testgen failure I extract a test from that and add it to test.c. I’ve also simplified the output from testgen – to my great surprise it turns out that 4000-line C programs which are a nested mess of 50 dependent variables are not that easy to minimise by hand, and I didn’t really want to write a minimiser. I’ve changed the result so that it mostly just builds up a single example through a combination of operations and then makes a bunch of asserts about that one example. Each individual test tests a bit less, but the result seems to be that it finds errors around run 20 rather than run 2. Given that it runs a thousand times I’m OK with that. Of course, I wrote that and just noticed that it had failed on run 519, so maybe there’s still a bit of work to go there.


New project: IntMap implementation in C

So I’ve been hacking on a new project recently. It’s an implementation of “Fast Mergeable Integer Maps” (known to Haskell and Scala programmers as IntMap) in C. The implementation is available on Github under a 2-clause BSD license. I’ve not done any work on packaging it yet, mostly because I’ve no idea how to package a C library (I suspect the answer is something like sacrifice a goat to acquire knowledge of how autotools works).

It’s currently pretty minimal. The core of the API is there, with support for lookup, insert, union and enumeration, and I’ve put some thought into how it looks, so what’s there should be pretty stable in terms of keeping the same API around.

Developing it has been… interesting. It’s rather shaken my confidence in my ability to write non-trivial C programs, truth be told. I’ve made pretty much every mistake I possibly could have – it’s mostly been trivial stuff (using the wrong variable, getting checks the wrong way around, etc) with a few interesting things (getting pointer addition wrong due to incorrect pointer types), but my testing process has been full of segfaults.

In the end I’m pretty confident I’ve caught most of the bugs in it, due to the testing process I’ve adopted (more on that later), but I’m a little worried about the error density – C isn’t a language that is forgiving of programmer error, and apparently I rely a lot more on that forgiveness in my normal programming than I’d previously been aware. Clearly, if I want to be writing C code on a regular basis I’m going to have to up my testing game if nothing else.

Part of the problem here is that this library ends up emulating both garbage collection (through reference counting) and pattern matching (through unions and a type tag) in C, both of which are fiddly and error prone to do, and even in Haskell the algorithms used are conceptually clear but a little tricky to implement correctly. All this adds up to a pile of edge cases that are easy to get wrong, and apparently I wasn’t so good at not getting them wrong.
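To make that concrete, the shape in question is something like the following – an illustrative reconstruction of a node for this sort of structure, not the library’s actual definitions:

  #include <stdint.h>

  /* GC emulated with a refcount; pattern matching emulated with a
     type tag switched over a union, one member per constructor. */
  typedef struct node {
    uint32_t refcount;
    enum { TAG_EMPTY, TAG_LEAF, TAG_BRANCH } tag;
    union {
      struct { uint64_t key; void *value; } leaf;
      struct { uint64_t prefix;        /* high bits shared by both sides */
               uint64_t branch_bit;    /* first bit where they differ */
               struct node *left, *right; } branch;
    } u;
  } node;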

So I turned to my standard tool for getting data-structures correct when there are lots of fiddly edge cases: QuickCheck style testing. Initially I did this using Python bindings to intmap and hypothesis. The problem with doing so, I found, was that the result is really quite slow, so it was difficult to get thorough enough coverage. This was especially so under valgrind (Python, it turns out, does not run well under valgrind). I could have solved this the same way I did in pyjq by generating a C program for each test case, but it wasn’t going to interact well with hypothesis and doesn’t solve the slowness problem.

So what do? Well, I thought about using QuickCheck style testing in C – there is an existing C quickcheck implementation, but frankly that sounds pretty hellish.

So I ended up with a different approach: rather than generating random data, I generate random programs which should all be valid uses of the library. The essential idea is to generate a collection of intmaps, together with a parallel implementation in Python that lets us determine what should be in them. This then allows us to test a wide variety of assertions based on what we know should be in there.
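Schematically, a generated program is a long straight-line sequence of library calls with asserts derived from the Python model. This fragment is made up in the style of the real output (which also releases everything explicitly, since at this point nothing does that for you):

  intmap m1 = intmap_singleton(a, 17UL, 1UL, "x");
  intmap m2 = intmap_singleton(a, 42UL, 1UL, "y");
  intmap m3 = intmap_union(a, m1, m2);
  /* the model knows m3 must contain exactly these two keys */
  assert(intmap_length(m3) == 2);
  assert(!intmap_equal(m1, m3));
  intmap_release(m1); intmap_release(m2); intmap_release(m3);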

The generated programs aren’t pretty. Here’s an example. However, even an individual one produces pretty good coverage, and randomly generating 1000 of them should be more than enough to find most bugs. I’m sure there are bits it misses (for example it’s not currently very good at testing all the branches of equality – it can in principle do so, but I think some of the edge cases are too low-probability), and it’s not impossible that there are still some subtle ones lurking in there, but I’m an awful lot more confident about the quality of this code than I was before I wrote this program generator.

It’s currently still quite slow, but that’s because it’s doing a thousand runs of compiling 4000-line C programs (twice – once in debug, once not) and running each of those programs (twice – once in valgrind, once not). With all that work, it should hopefully be reasonably good at making up for my failings as a C programmer.
