David R. MacIver's Blog: Programmer at Large: What's that noise?

Programmer at Large: What's that noise?

20 January 2017

After Sam and I had chatted for a while I returned to a pod to resume work. I settled in and brought up my work space and began to review where I was.

My particular troubles of the moment had all started with a bug report:

Reported-By: jad-nic [上子/Kamiko]
Title: What's that noise??
Bounty: 400 millivote.
Body: Every now and then there's a weird noise coming from behind the interior wall
in Pod 11. It's a sort of WHOOSH followed by a flushing noise and a gurgle. I've heard
it three times now, about a megasecond apart each time [timestamps attached].

It's not a big deal but I keep worrying about it. Could someone check it out and
figure out what it is? Fix it if it's easy or important, but I really just want to
know what's going on.

The bounty wasn’t huge, but it was a pretty decent amount. It was nice to encounter people who actually cared about plumbing.

While reviewing I made a note that I should say hi when we were both awake at the same time. It would get ship off my back a bit, and they worked in hydroponics so we’d have plenty of things I was actually good at to talk about.

Anyway, nobody from the official plumbing team was ever going to look at this - the ship is full of strange noises, and they’ve got much higher priority things to worry about. I don’t know offhand what but it’s basically guaranteed that if you work in plumbing you have high priority things to worry about.

But it sounded interesting (to me) and weird, which makes it exactly the sort of problem I get to work on. I love being a programmer at large, and not just because I have to talk to fewer people this way.

So before I last went to sleep I’d set up some monitoring on the microphones in the wall near the pod to listen for the noise then sample and bucket all the semi-relevant data before and after it. After a fifty megasecond nap I woke up to sift through what it had gathered.

During this time about 70 gigaprocesses in the plumbing system had crashed. A bit on the high side, but well within normal variation.

I’d picked the most plausible candidate - a process category with about ten thousand members, all of which crashed shortly before the noise and which normally didn’t crash in quite such a correlated way - and was now digging in to why it was failing.

Which was easy enough: It was appending data to some buffer on its local disk. The buffer was capped at quite a reasonable size - 512MiB - but the process just tried to grow the buffer past that and crashed when it couldn’t rather than doing anything sensible. Nothing very surprising so far, and the system handled it automatically by restarting the process from a fresh state with an empty buffer, at which point the whole cycle began again.

Unfortunately that left me none the wiser about what this process did, which is why I was now having to learn enough C++ to figure that out.

Subject: C++
Category: Programming language, text based.
Lineage: Pre-diaspora, began as a dialect of C in 1983.
Common Tags: Archaic, Esoteric, Low Complexity, Annoying.
Normalised Rating: Best not if you can avoid it.

The wiki entry on it was less than encouraging. Hopefully I wasn’t going to have to learn very much of it.

The relevant line where the process crashed was the following:

client->sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
if(-1 == client->sock) {
    LOG_ERROR << "failed to connect to StatsD";
}

“Wiki, what’s statsd?”

“Historical records suggest that it was a centralised service for metrics aggregation popular in pre-diaspora software. The term subsequently became generic and many cultures use it or a derived word for the general concept.”

OK. So we have two questions now:

Firstly, why can’t it connect to statsd?

Secondly, what does logging an error have to do with appending to a buffer on disk?

Let’s focus on the proximate cause for now and start with the second.

“Ide, show me what this line with LOG_ERROR on it does.”

It expanded the results of the call. Ah. It wasn’t really a log at all.

“Ide, translation note: LOG_ERROR in this context may actually mean that it appends error data to a disk buffer and may not result in any logging occurring.”

“Noted”

“Wiki, what’s a file?”

“In the traditional software practices of pre-diaspora cultures, it was common to arrange a system’s data in a single named hierarchy of disk buffers. The term ‘file’ was used interchangeably for names in this hierarchy and the buffers themselves.”

“What? Why did they organise data that way? Didn’t it get confusing?”

“Historical consensus is that it probably seemed like a good idea at the time and that yes it did.”

“OK. So why is this logging call adding data to a file?”

“Many traditional software practices consider this to be how logging is performed. When problems occurred a human would inspect the contents manually to determine what had happened.”

“Did that work?”

“No.”

“Ugh. Is there a policy on what we do about this sort of archaic logging?”

“Either a separate process converts the buffer into log events periodically or we intercept the calls to write to the buffer and log them directly instead of writing them. Recent consensus is that the latter is preferable.”

The buffer was getting written to, so if we were doing anything about it at all then it had to be the first one.

“OK. Ide, do we have any other processes inspecting this buffer?”

“No, the buffer only grants access rights to the current process.”

I was starting to suspect I was fighting a zombie - a process that had no useful purpose any more but was still shambling around getting in people’s way and occasionally eating their brains.

“Ide, when was the last time someone looked at this process?”

“About 9.7 gigaseconds ago”

9.7 gigaseconds ago... oh no.

“You mean ship launch?”

“No, about 5 megaseconds before ship launch.”

“Show me any annotation they left.”

From: nod-sid 1 [Tulela]
Subject: What is this?
Body: I don't know what this is doing. It looks like a zombie? Whatever,
the relevant systems all seem to be working fine and there's like a billion
things still to do for launch. Add figuring this out to the backlog.

Did you just tell me to go waste myself, Tulela? I believe I did, Arthur.

“OK, give me a snapshot of the buffer contents at point of crash.”

As I’d been starting to suspect, the contents looked like the following:

failed to connect to StatsD
failed to connect to StatsD
...
fail

fail indeed.

“Ide, can you automatically convert this code to do actual logging?”

“Yes.”

“OK, do so.”

“Done.”

I sighed. OK, back to the first question. Why can’t it connect to statsd?”

A socket was apparently another word for a network connection. So it was trying to create an outgoing network connection. The most likely explanation was obvious.

“Ide, does this process have rights to create network connections?”

“No.”

Right.

“Ide, show me what’s left if you remove all code that has no effect other than writing to a private disk buffer or can’t run without the process creating a network connection.”

#include <iostream>

int main(){
    std::cout << "Starting up event monitoring system!" << std::endl;
    return 0;
}

Right. Good. It was definitely a zombie and I was going to get to shoot it in the head. That was always satisfying.

So that left just one question: Why was this zombie self-destructing right before the actual event I was interested in?

“Ide, show me the most unusual function on the call stack at point of crash.”

void logAnomalousTemperatureEvent(double t_delta, double time_delta){
    ...
}

So the process crashing was never a cause of the problem at all. It was a symptom.

“Is there a corresponding set of interesting temperature events in our log roughly at the same time as these calls?”

“Yes”

OK. So, in conclusion I had learned two things:

The first was that this was definitely a zombie and was not giving us a useful signal we didn’t otherwise have here, but at least I could kill it.

The second was that this whole thing was otherwise a complete waste of time.

From: vic-taf [Arthur]
Subject: Zombie monitoring service in the plumbing system
Priority: Low
Proposed action: Staged shutdown of service '077a58e24e34b6543da8100c8541a8dd'
Body: This service doesn't do anything and and as far as I can tell has
never done anything except create noise. Development log attached as evidence.

Next chapter: Are you serious?

This chapter is also mirrored at Archive of Our Own. If you liked this and want to read more like it, support my writing on Patreon! Patreon supporters get access to early drafts of upcoming chapters. Although honesty compels me to admit that right at this minute there are no drafts of upcoming chapters available. I’ll probably write more this weekend)

Comments

Alex Willmer on 2017-01-20 22:05:54:

Regarding C++, did you mean “Hopefully I was_n’t_ going to have to learn very much of it.”?

david on 2017-01-20 22:29:44:

Yes, thanks.

Irenes on 2017-01-21 09:39:13:

So uh, still no answer to the plumbing question (except that it’s a temperature event).

Beautiful. :)

david on 2017-01-21 10:12:04:

Answering the plumbing question will span more or less the entirety of the first “book”. It turns out that debugging a 10k year old software system that in large part communicates with itself solely through interactions with the physical environment is a time consuming task.

(Yes, it’s an entire book where the major antagonists are the plumbing system and the protagonist’s social anxiety. I’m not kidding around with this slice of life thing).

Irenes on 2017-01-21 11:20:17:

That’s awesome. Very much looking forward to it.

Andrew Suffield on 2017-01-21 16:22:24:

I’m a software engineer who works on large systems with a high level of legacy code, and this story is an experience I’ve had many times.

pozorvlak on 2017-01-26 12:01:51:

If the wiki lists C++ (C++2089, no less) as “low complexity”, I dread to think what a “high complexity” language looks like.

david on 2017-01-26 12:18:56:

Someone’s line about this story was “I enjoy how consistently dystopic all the software in your world building is”.

It turns out that quite a lot of complexity can accrue when you’ve had 10k years of software development and have an awful lot of automation to lean on.