A parable about problem solving in software development

I’ve told a lot of people this story over the years. Mostly whilst drunk. The responses are usually pretty similar – hilarity, incredulity and just a little bit of “There but for the grace of god go I” style sympathy. I’ve had multiple requests to write it up, so I’ve finally acquiesced.

It’s a parable about what happens when you’re always solving the problem right in front of you rather than questioning whether this is a problem you actually need to be solving. It’s unfortunately mostly true. I’ve anonymized and fictionalized bits of it, mostly to protect the innocent and the guilty (and occasionally to make it a better story), but 90% of what I describe in the following really happened, and 90% of that happened more or less as I described it (as best as I can remember). If you know me, you can probably figure out where it happened. If you don’t know me, I’m not going to tell you.

At the beginning of our story we had one central API off which maybe a dozen smaller apps and services hung. The API controlled our data storage and the operations you could reasonably perform on it, and generally encapsulated our model. It, and each of the apps, lived in their own source control repo and were deployed separately.

This API was implemented via JSON-RPC over HTTP. It wasn’t RESTful, but maybe it was a bit RESTy. RESTish perhaps.

It kinda worked. It wasn’t perfect, but it was at least vaguely functional.

We essentially had two problems with it:

Each of the apps talking to it had written its own client library (or was just including raw HTTP calls straight in the code)
It was quite slow

As well as the core API, we also had a message queuing system. It was pretty good. We didn’t use it for a lot – just some job queueing and notifications to send to the users – but it worked well for that. We’d had a few problems with the client libraries, but they were easy to fix.

At some point it occurred to one of us that the reason our HTTP API was slow was of course that HTTP was slow. So clearly the best solution was to replace our slow shitty HTTP RPC with our hot new message queue based RPC. What could go wrong!

Well, you know, it didn’t really go wrong. It mostly worked. It was… a bit strange, but it basically worked. We wrote an event driven server which implemented most of what we were doing with the HTTP API (including all the blocking calls we were making to our ORM. Oops). It polled a message queue, and each client would create its own message queue to receive responses on. A client would then post a message to the server, which would reply on the client’s message queue (I think there were some tags added to the messages to make sure things lined up. I hope there were, otherwise this all sounds horribly precarious).
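For readers who haven’t seen RPC over a message queue, here is a minimal sketch of the shape described above. It is not the code from the story: it is Python for illustration, in-process `queue.Queue` objects stand in for the real message broker, and `handle`, `Client` and the `add` method are all hypothetical names. The `id` field plays the role of the tags that line responses up with requests.

```python
import queue
import threading
import uuid

# Assumption: an in-process queue stands in for the real message broker.
server_queue = queue.Queue()


def handle(method, params):
    """Toy dispatch standing in for the real API methods (hypothetical)."""
    if method == "add":
        return params["a"] + params["b"]
    raise ValueError(f"unknown method: {method}")


def serve_forever():
    """Poll the server queue and reply on the queue named in each request."""
    while True:
        message = server_queue.get()
        try:
            result = handle(message["method"], message["params"])
            reply = {"id": message["id"], "result": result}
        except Exception as exc:
            reply = {"id": message["id"], "error": str(exc)}
        message["reply_to"].put(reply)


class Client:
    def __init__(self):
        # Each client owns the queue it receives responses on.
        self.reply_queue = queue.Queue()

    def call(self, method, **params):
        request_id = str(uuid.uuid4())  # the tag that pairs request and response
        server_queue.put({
            "id": request_id,
            "method": method,
            "params": params,
            "reply_to": self.reply_queue,
        })
        while True:
            reply = self.reply_queue.get(timeout=5)
            if reply["id"] != request_id:
                continue  # reply to some earlier call; not ours
            if "error" in reply:
                raise RuntimeError(reply["error"])
            return reply["result"]


# Daemon thread so the sketch exits cleanly without a shutdown protocol.
threading.Thread(target=serve_forever, daemon=True).start()

client = Client()
answer = client.call("add", a=2, b=3)
```

Without the `id` check, a slow reply to one call could be mistaken for the answer to the next, which is exactly the precariousness the parenthetical worries about.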

This was basically unproblematic. It might even have been a slight improvement on our previous system. RPC over message queue is a legitimate tactic after all. We of course didn’t have any benchmarks because why would you benchmark sweeping changes you make in the name of performance, but it was at the very least not obviously worse than the previous system.

Our next problem was the various client libraries that we were reimplementing everywhere. This was obviously stupid. Code reuse is good, right?

So we rationalized them, pulled them all out into their own repo, and produced a client package. You used it by installing it on your system (which was just a single command using the packaging system we were using), and then you could talk to an API server. It was straightforward enough.

So we’d solved our reimplementing problems, and we were at least claiming we’d solved our performance problems (and maybe we even had. At this late stage I honestly couldn’t tell you).

Thing is… it turned out that this was actually quite irritating to develop against.

It had already been a little painful before, but now in order to add a feature you had to do all the following steps:

Make a change to the server code
Make a change to the client library code
Make a change to the application code
Restart the server (no code reloading in our custom daemon)
Install the client library on your system
Restart your application (no code reloading when a system package changes)

We decided to solve the first two problems first.

We noticed that a lot of the code between the client and the server was duplicated anyway (similar structure on each side after all). So we ended up commonizing it and putting the client library in the repo with the server. Not all of the server code was needed in the client library obviously, but it was much easier to just put it all in one directory and have a flag that let you test if you were running in client or server mode. So now at least all the changes you had to make to both client and server were in one repo, and they might even have been the same code.

At the moment what we have here is a slightly baroque architecture, but it’s fundamentally not that much worse than many you’d encounter in the wild. It’s not good, but looking at it from the outside you can sort of see where we’re coming from. What follows next is the point at which it all starts to go completely tea party.

You see, code duplication between client and server was still a problem.

In particular, data model duplication. If you had a Kitten model, you needed a Kitten model in both the client and the server and you needed to maintain both. This was quite a nuisance.

At this point some bright spark (it wasn’t me, I swear) realised something: Our ORM supported highly pluggable backends. They didn’t even need to be SQL – there were examples of people using it for document storage databases, even REST APIs. We had this API server, why not make it an ORM backend?

And if we’re doing that, can we do it in a way that reuses the models we’re already using? We’re already detecting if we’re running in client or server mode, can’t we just have it use a different backend in the two cases?

Well, of course we can.

Of course, the really nice thing about having an ORM is how you can chain things and build rich queries. So we do want to support the full range of query syntax for the ORM.

A weekend of caffeine fuelled development from this one guy later, we all arrived on a Monday morning to find a grand new vision in place. Here’s how it worked:

We have the same ORM models on both client and server
If we are in client mode, our backend uses the JSON-RPC server rather than talking to the database
Given a query object, we do a JSON-RPC call to the corresponding backend methods on the server. This returns a bunch of models

Simple, right?

I’m going to unpack that.

I make a bunch of method calls to the ORM
This generates a Query object
We pass this Query object to a custom JSON serializer that has to support the full range of subtypes of Query
We send that JSON over a message queue
Our server pops the JSON off a message queue, deserializes it and calls a custom method to build a Query object
This Query object is passed to the ORM backend
The ORM backend converts the query object into an SQL query
The database adapter executes that SQL query and returns a bunch of rows
Those rows get wrapped as model objects
Those model objects get serialized as JSON and passed across the client message queue
The client pops the model JSON from the message queue
The client parses the JSON and wraps the resulting array of hashes as models

…yeah.
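To make the number of hops concrete, here is a deliberately tiny sketch of that round trip. Everything in it is an assumption made for illustration: the `Query` class supports only equality filters and one ordering (the real one had to cover the ORM’s full query syntax), an in-memory list stands in for the database and the SQL step, and a plain function call stands in for the two message queue hops.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Query:
    """Hypothetical stand-in for an ORM query object."""
    model: str
    filters: dict = field(default_factory=dict)
    order: Optional[str] = None

    def filter(self, **conditions):
        return Query(self.model, {**self.filters, **conditions}, self.order)

    def order_by(self, column):
        return Query(self.model, dict(self.filters), column)


@dataclass
class Kitten:
    id: int
    name: str
    colour: str


def serialize_query(query):
    return json.dumps(
        {"model": query.model, "filters": query.filters, "order": query.order}
    )


def deserialize_query(payload):
    data = json.loads(payload)
    return Query(data["model"], data["filters"], data["order"])


# Assumption: an in-memory table replaces the database and the SQL generation.
TABLES = {
    "kitten": [
        {"id": 1, "name": "Tom", "colour": "grey"},
        {"id": 2, "name": "Ada", "colour": "black"},
        {"id": 3, "name": "Bob", "colour": "grey"},
    ]
}


def run_on_server(payload):
    """Server side: JSON -> Query -> rows -> JSON."""
    query = deserialize_query(payload)
    rows = [
        row for row in TABLES[query.model]
        if all(row[key] == value for key, value in query.filters.items())
    ]
    if query.order:
        rows = sorted(rows, key=lambda row: row[query.order])
    return json.dumps(rows)


def fetch(query):
    """Client side: Query -> JSON -> (queue hops) -> JSON -> models."""
    payload = serialize_query(query)
    response = run_on_server(payload)  # stands in for both message queue hops
    return [Kitten(**row) for row in json.loads(response)]


kittens = fetch(Query("kitten").filter(colour="grey").order_by("name"))
names = [kitten.name for kitten in kittens]
```

Even in this toy form, every query is built, flattened to text, rebuilt, executed, flattened again and rebuilt again, and each of those conversions is a place for a subtype or an edge case to go missing.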

Anyway, we arrive on a Monday morning to find this all in place and broadly working (“There are just a few details to polish”).

And, you know what? We decided to roll with it. We were quite irritated with the status quo, and this clearly would make our lives easier – there was an awful lot less code to write when we wanted to add a feature and boy did we need to add features. So although we were probably a little suspicious, we decided to let that slide.

Of course… you see that long pipeline over there? Lot of moving parts isn’t it? Many of them, custom crap we’ve written. I bet that’s going to break, don’t you?

Of course it broke. A lot.

And naturally, as seems to happen, muggins here gets to be the guy in charge of fixing those bugs (how did this happen? I don’t know. I think the problem is that I don’t step back fast enough when the call for volunteers arrives. Or maybe people have an uncanny knack for spotting I’m actually quite good at it despite my best efforts to pretend I’m not).

One of the most common sources of bugs was user error. Specifically it was user error that was made really easy by our setup.

It required three steps to push a change to the code to your application: You had to restart the server, you had to install the package, you had to restart your application. If you forgot any one of those three steps, your client and server code would be out of sync (and remember how much of this was shared) and the resulting errors would be subtle and confusing. This frequently drove people to despair.

Remember how I believe in “Fail early, fail often”? It turns out I’ve believed this for some time (the first evidence I can find of my thinking along these lines comes from 2007. That would have been within about a year of my learning to program).

So the solution I hit upon for the problem was “Well, don’t do that then”. When a server or a client started up, it would create a signature that was a (MD5 I think) hash of all its code. This would then be transmitted along with every RPC call, and if the server detected that the client’s hash differed from its own it would instead respond with an error saying “No, you’re running the wrong client code. I’m not going to talk to you”. Unsubtle, but effective in making the error clear.
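A rough sketch of that check, under stated assumptions: it is Python, it hashes `.py` files under a directory (the original hashed whatever its own source files were), and `code_signature` and `handle_call` are hypothetical names. MD5 is used only because the story recalls it; here it is a change detector, not a security measure.

```python
import hashlib
import tempfile
from pathlib import Path


def code_signature(root):
    """Hash every source file under root, in sorted order so the result is stable."""
    digest = hashlib.md5()
    for path in sorted(Path(root).rglob("*.py")):
        # Include the relative path so that renaming a file changes the signature.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def handle_call(server_signature, request):
    """Refuse any request whose code signature differs from the server's."""
    if request.get("signature") != server_signature:
        return {"error": "client code does not match server code"}
    return {"result": "ok"}  # the real RPC dispatch would happen here


# Demonstration with two throwaway directories playing server and client.
with tempfile.TemporaryDirectory() as server_dir, \
        tempfile.TemporaryDirectory() as client_dir:
    Path(server_dir, "models.py").write_text("VERSION = 2\n")
    Path(client_dir, "models.py").write_text("VERSION = 1\n")  # stale install
    server_signature = code_signature(server_dir)

    stale = handle_call(server_signature, {"signature": code_signature(client_dir)})

    Path(client_dir, "models.py").write_text("VERSION = 2\n")  # reinstalled
    fresh = handle_call(server_signature, {"signature": code_signature(client_dir)})
```

Computing the signature once at startup, as described, also means a process that has not been restarted keeps announcing its old signature, so a forgotten restart is caught in the same way as a forgotten install.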

This solved the immediate problem, and we decided it was good enough.

Most of the next six months (when I wasn’t doing feature dev) I was fixing bugs with the pipeline – this particular obscure query was crashing our deserializer. This one query was somehow generating 17MB of JSON data and the parser didn’t like that very much. That sort of thing.

During this time people were getting increasingly irritated with the dev process. It was all very well having those errors be detected, but what you really wanted was for those errors to be fixed. And to not have to do three slow steps to make a simple change.

This was when my true contribution to our little Lovecraftian beauty came in.

“Well”, I reasoned, “the server has all the code, right? And the client needs all the code? And the server is already sending data to the client…”

So.

The package remained as a tiny shim library that needed to be installed to talk to the server, but it included very little code (it still checked the code MD5, but this now basically never changed).

Here is the code loading protocol:

On startup, the client would make its first RPC call. This was a “Hey, give me the code” call. The server would reply with a list of file paths and their source code
The client would create a temporary directory and write all the files into that temporary directory
The client would add that temporary directory to the load path and require the entry point to the library
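For the morbidly curious, the three steps look roughly like this. This is a sketch in Python rather than the original code, so “load path” becomes `sys.path` and “require” becomes `importlib.import_module`; `fetch_code` is a hypothetical stand-in for the “give me the code” RPC call, and the `remote_kittens` package is invented for the example.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def fetch_code():
    """Stand-in for the first RPC call: maps relative file paths to source code."""
    return {
        "remote_kittens/__init__.py": "from remote_kittens.models import describe\n",
        "remote_kittens/models.py": (
            "def describe(name):\n"
            "    return f'{name} is a kitten'\n"
        ),
    }


def load_remote_code(files, entry_point):
    """Write the files to a temporary directory and import the entry point."""
    target = Path(tempfile.mkdtemp(prefix="remote-code-"))
    for relative_path, source in files.items():
        destination = target / relative_path
        destination.parent.mkdir(parents=True, exist_ok=True)
        destination.write_text(source)
    sys.path.insert(0, str(target))  # put the directory on the load path
    importlib.invalidate_caches()    # make sure the new files are seen
    return importlib.import_module(entry_point)


library = load_remote_code(fetch_code(), "remote_kittens")
message = library.describe("Tom")
```

The sketch trusts the server completely: it never checks that the paths stay inside the temporary directory, and it never cleans that directory up. Anything resembling this in real use would need both, which is part of why it belongs in a cautionary tale.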

This removed the install step: The client would forever and always be running the latest version of the code, because it fetched it from the server at start up. We still had to restart the server and the client, but at least one of the more irritating and easy to forget steps was removed.

I don’t think we ever implemented code reloading, though it’s obvious how we could have – on code changes, the server would just have to broadcast the changed files, which could again be written to the file system and reloaded.

Fortunately better judgement prevailed before we hit that point.

We were coming up to the first major release we’d have with all this infrastructure in place.

It was obviously not going to go well.

The site was dramatically slow in comparison to its previous “This is too slow!” HTTP incarnation. Why? Because it turns out that serializing and deserializing lots of ORM queries and models is really fucking slow! When we had the HTTP implementation in place we were a bit more careful about what we were doing, but this was all behind the scenes and invisible to us and mostly out of our hands.

It was also still quite buggy. Despite my best efforts to keep the whole thing reliable and functioning – I’d patched a lot of bugs – we kept finding new ones. The problem wasn’t in fixing individual bugs, it was that the core architecture was basically a disaster.

One night while wrestling with insomnia I had a revelation.

“OH MY GOD. IT’S JUST A LIBRARY”.

A weekend of caffeine fuelled development from me later, everyone arrived on a Monday morning to find a grand new vision in place. Here’s how it worked:

Everything lived in a single repo.
Everything that was previously server code was now just sitting in a single library that everything put directly on their load path.
Everything talked to the database directly, via that library.

That’s. It.

It took a little bit of time to get it stable after that – there were a lot of places where our bug workarounds now became bugs in their own right. There were a few days where it was touch and go – this was about a month before release and there was some serious head scratching and concerned moments where we thought we were going to have to release it in its previous form after all. But we got there, and the result was unsurprisingly both faster and more reliable than what came before it.

Obviously this is how we should have done it in the first place. It’s not just obvious in retrospect, it should have been obvious in the beginning. We were just too focused on fixing this one problem with our current system rather than calling the system itself into question to see it.

The project structure has changed a bit since then, but as far as I know this is still essentially how it looks, and I imagine how it will continue to look indefinitely.

Unless someone decided that what was really needed is to abstract out some part of the database access into an RPC server. I hope no one did that, but I’m a little afraid to ask and find out.

9 thoughts on “A parable about problem solving in software development”

Paul Chiusano March 7, 2013 at 12:13 pm

Related to this, there’s a post I really like discussing why it is usually a bad idea to try to achieve modularity and information hiding via _runtime_ isolation: http://fare.livejournal.com/142410.html There’s a bunch of other stuff in the post, so quoting the relevant bit here:

“Brilliant operating system designers have argued that microkernels can simplify software development because factoring an operating system into chunks that are isolated at runtime allows to make each component simpler. But the interesting constant when you choose between ways to factor your system and compare the resulting complexity is not the number of components, but the overall functionality that the system does or doesn’t provide. Given the desired functionality, run-time isolation vastly increases the programmer-time and run-time complexity of the overall system by introducing context switches and marshalling between chunks of equivalent functionality across the two factorings. Compile-time modularity solves the problem better; given an expressive enough static type system, it can provide much finer-grained robustness than run-time isolation, without any of the run-time or programmer-time cost. And even without such a type system, the simplicity of the design allows for much fewer bugs, whereas the absence of communication barriers allows for higher-performance strategies. ”

Basically, that convoluted pipeline you described was a really inefficient way of achieving some modularity and information hiding that could have been done more with properly factored regular ol’ code, living in the same process.
1. david (Post author) March 7, 2013 at 12:17 pm
  
  “Basically, that convoluted pipeline you described was a really inefficient way of achieving some modularity and information hiding that could have been done more with properly factored regular ol’ code, living in the same process.”
  
  Which indeed was the end result once we saw the error of our ways. :-)
  1. Paul Chiusano March 7, 2013 at 12:36 pm
    
    Yes, I hope that didn’t come across like I was suggesting you didn’t get to that same conclusion yourselves… :)
    
    Anyway, it’s funny how things like this seem very obvious in retrospect, but meanwhile many man months of effort get wasted laboring away at less effective ways of expressing some piece of functionality. It has happened to me repeatedly that I’ll have an ‘Aha! Wish I’d thought of that six months ago’ moment. There’s the realization that I have wasted literally months of work and would have been better off simply not going to work for a month or so, and having the crucial insight occur to me while hiking in the mountains or something. If only I could control when that happens better… :) Experience definitely helps, and I also think it helps to have a diverse team of opinionated people to work closely with, so that someone is more likely to pipe up with a “wait, why the fuck aren’t we just doing X?”
Nate F March 7, 2013 at 1:47 pm

Oh my god, when did you start working at my company?

We’ve gone through pretty much this exact set of steps, except we actually did write the dynamic code loading part (not that we loaded everything that way, we just used it to issue patches to production without needing to upgrade the whole client and server). I’d actually proposed something quite similar to changing the client to run directly against the server’s D…. but instead we’ve decided to rewrite everything from scratch to be (use your best reverberating God-voice here) ***In The Cloud***, because that’s what people do these days. Whee!