David R. MacIver's Blog: Against human readability (part 1 of many?)

Against human readability (part 1 of many?)

27 May 2012

I had the following conversation with Tony Garnock-Jones on twitter:

I still don’t think I can put together a coherent argument about this, but I figure I’ll give it a shot. I may expand on this with a better post later once I’ve thought about it some more and can be bothered to write some code for it. Right now this is basically a brain dump rather than an attempt at a reasoned argument.

Firstly: This is about specifically formats for data interchange between programs. Instant messages, emails, document formats, etc. It is explicitly not about things for which direct human editing is an actual explicit requirement (config files, markup, programming languages, etc).

Secondly: I don’t actually know what I’m talking about, in the sense that I’ve done relatively little work in defining data formats. This is mostly from the point of view of looking in from the outside.

Anyway, onwards to the actual point.

The thesis here is this: Human readability of protocols should be a non-goal in protocol design and indeed should probably be considered actively harmful.

Why?

Basically I’d claim that the following is the only thing you should be optimising for: The ability to create quality implementations.

Quality is obviously a useless weasel word. So to be more precise I would consider the following to be the primary features of a quality implementation, in order of decreasing importants:

Correctness
Consistency
Performance
Simplicity

Depending on the use-case the order of the last two might swap. Also the three are not independent - simplicity helps with both correctness and performance.

Consistency needs some explanation: It’s specifically about different implementations behaving the same way. If there is ambiguity in your specification it is very easy for every implementation to end up doing its own thing. You really don’t want that.

By imposing additional constraints you are limiting your ability to optimise - unless it happens that the optimal solution (or even a near-optimal one) is human readable, it’s very likely that by imposing the constraint of human readability you are going to hurt one or more of the above.

Vague generalities aside, now for some specifics.

I’d like to consider the following example formats for representing an N-ary tree structure with each node being a string or an n-way branch.

The first is an S expression. (something (like this) where (branches are (represented by brackets)).

The second is a tagged prefix notation. A node is represented as: [Tag byte | Integer Length | Payload ]. Tag byte is 0 for a branch and 1 for a string. The integer length is an integer (how large depends on requirements. 32-bit is probably enough. 64-bit is certainly adequate) encoded in network order which represents either the number of child nodes or the number of characters in the string. Payload then stores either the child nodes or the characters.

Compare the difference between the two:

The first is designed for a human to read. It’s pretty easy for a computer to parse - you can hand roll a parser in half an hour or so (made up number caused by the fact that I couldn’t be bothered to write code for this post).

The second is not human readable. You could trivially convert it to a plain text format: Say by using space separation rather than byte adjacency and encoding the integer in decimal. e.g. you could write (foo (bar baz)) as “branch 2 string 3 foo branch 2 string bar string baz”. But let’s be honest, it may be text but it’s not exactly human readable.

What’s the advantage of this way? Well, firstly, S expressions may be easy to parse but this is trivial. In pseudo code:

   readNode(in): 
     tag = readByte(in)
     length = readInt(in)
     if tag == BRANCH_TAG
       return Branch(readNode(in) length times)
     else if tag == STRING_TAG
       return String(readBytes(in, length))
     else
       raise "Bad tag", tag

There, done.

Encoding things in prefix notation is ridiculously easy for a computer to parse. Additionally, from a performance point of view, it lets you determine the size of your data structures before you read them in their entirety. Does this matter? Well, it might make some performance difference in any language, but one place where it does make a big difference is when you’re implementing your parser in C where it’s a bit of a pain in the ass to keep resizing things and where the consequences of getting the resizing wrong can be so unpredictable.

Does this matter? I mean, obviously it matters if you’re writing your implementation in C, but if you’re not should you care about how easy it is to parse in C?

Sadly, I would argue that the answer is emphatically yes and would like to propose the following maxim:

Any successful data format will acquire at least one parser written in C. At least one of these parsers will see widespread adoption.

Why? Well, if your data format is widely used people will want to use it from different languages. This might be C, but it’ll more likely be something like Ruby or Python (likely Java too, but that’s not a relevant example for this argument), and at some point someone will decide performance matters. And when that happens, they will write an implementation in C and bind to it from their language.

So now we have a C implementation of your format. Great! Right?

Not so fast.

Here’s the thing to bear in mind:

Your data format is an attack surface

Any user input is an attack surface of course, but anything where you’re doing complex processing on the input is especially so. When the parser is written in an unsafe language like C it is especially vulnerable. Obviously the person who is writing the parser needs to bear this in mind, but importantly so do you. The simpler you make parsing in C, the easier you make it to avoid vulnerabilities.

This is what Language-theoretic Security is about, and they can tell you far more about it than I can because they actually know what they’re talking about.

Now let’s consider another advantage of the binary format.

Consider the following problem:

I have an encoding of a tree. I now want to take the contents of this encoding and write it as a string node inside another tree. How do I encode the result?

In the binary format the answer is simple: The same way you would anything else. You write a string tag. You write the length of the string. You write the contents of the string. There is literally nothing you need to think about that makes this complicated.

In the S-expression format? Well, I’ve actually not specified it well enough that this is even possible. You need to define string escaping rules. So now we need to expand our format. Say by allowing quoting. So we’d encode (foo bar) as the string node “(foo bar)”.

Now how do I encode the format of this? Is ““(foo bar)”” a single string node or a string node followed by a branch followed by a string node? How about “” (foo bar) ““?

Ok. So we need escaping. This isn’t so bad. This just becomes “\” (foo bar) \““. We can live with that. And one more time we encode it as”\“\\\” (foo bar) \\\“\”“, because we also had to define a backslash escaping rule.

Is this terrible? No. It’s kinda annoying though. It significantly impacts the property we wanted in the first place - human readability - and complicates our parser a fair bit. As well as requiring extra code for the escaping, we take a performance hit because we need to add a significant chunk of extra logic per character we process.

Additionally, we’ve made our code more complicated around the very area that is the biggest attack surface: User submitted data. The rest of the code is probably things which are mostly under our control (which doesn’t make them not problems, but merely less of one), but the payload contents of our format is the bit that most likely contains things random users have submitted. A malicious user can take advantage of bugs in your escaping implementation and use it to craft messages you didn’t mean to send.

Here’s another example where this is problematic:

Given a list of files, encode the list is a tree containing one node per file, each containing the file name and the contents of the file as children.

In our binary format, what do we do? Well, absolutely the obvious thing. There’s nothing to think about here.

In our S-expression format, what do we do? Well, we simply run the file contents through our escaping scheme and oh hang on, some of the characters in this jpg file are non-printing because it’s not actually text.

So our escaping scheme needs to now worry about binary data. Say by base 64 encoding anything that contains non-printing characters. More complication, more of a performance hit. Worse: The encoded data is now potentially substantially larger (well, 37% larger) than the original data.

I know this is scope creep, but it’s relevant scope creep: Consider email, or http. Both of these were not originally designed for sending binary data, both are used for that and suffer from the base 64 problem. By using a binary protocol up front you’ve completely future proofed yourself from this problem: Literally nothing had to be done to take into account different payload types.

This leads to an additional maxim:

If your data format has to care about what the contents of its payload look like then you are creating problems for yourself

If you want to be able to embed arbitrary binary data in your format then your format is by construction not human readable.

I have some more issues in mind, but I need to think about them some more in order to produce something which is even half coherent, so I’ll leave you with these for now.

Comments

@ndy on 2012-05-28 21:58:16:

Here are my more-or-less coherent notes from IRC. I’ve certainly written more eleqouently in other instances. ;-)

(21:38:14) The topic for #againstreadability is: Binary vs Text: Round 1, fight! | /2012/05/against-human-readability-part-1-of-many/
(21:44:51) andyjpb: good blog post
(21:44:53) andyjpb: well argued
(21:44:58) andyjpb: “As well as requiring extra code for the escaping, we take a performance hit because we need to add a significant chunk of extra logic per character we process.”
(21:45:00) andyjpb: dingding!
(21:45:24) andyjpb: you’ve covered the rule you need for escaping
(21:45:49) andyjpb: another rule you might need is a field delimiter such as : (but you’re using sexprs, so less so)
(21:46:13) andyjpb: you don’t take a performance hit for this as it detecting those 2 characters take an if statement each
(21:46:22) andyjpb: and the branch predictor is pretty good at optimising that
(21:47:11) andyjpb: have you read the parts of TAOUP where ESR argues for text formats
(21:47:30) andyjpb: I also note that text/binary !=== human readable / not human readable
(21:47:55) andyjpb: ones view on the text/binary thing tends to hinge on the tools one uses or doesn’t use
(21:48:03) andyjpb: if one uses tools then it tends to be a text editor
(21:48:45) andyjpb: if one doesn’t use tools then one tends to prefer binary formats because they’ve never really leveraged the flexibility of having a text (but not necessarily particularly human readable : see xml) representation
(21:49:12) andyjpb: both formats are vunerable to complexity balloons
(21:49:21) andyjpb: it’s much more obvious in a txt format
(21:50:44) andyjpb: at genie we had a similar serialisers/deserialiser as you talk about in the “binary” protocol
(21:51:06) andyjpb: this interfaced to one bit that didn’t deal with counted strings (but was otherwise binary safe)
(21:51:13) andyjpb: so there’s lots of aspects to consider
(21:52:40) andyjpb: it’s also important to have that serialisation/deserialisation boundary: without it the tempation is to just do “absolutely the obvious thing. There’s nothing to think about here.”
(21:53:01) andyjpb: then you end up with tight coupling between your internal datastructures and your longer term storage structures
(21:53:33) andyjpb: and by “long term” I don’t just mean “disk”... i mean anything that involves comms between two different versions of the binary
(21:53:42) andyjpb: so it could be over a wire
(21:54:30) andyjpb: if you serialise/deserialise then you end up with a well specified interface... if you don’t end up with something well specified then i guess that’s your problem: if you try to avoid it then it bites you later anyway
(21:55:18) andyjpb: things like http and smtp and the other rfc822 protocols have surived remarkably well in the face of many hundreds of implementations on many and varied different designs of computer and OS
(21:55:44) andyjpb: it’s too easy to design a “binary” serialiser that has, for example, word size issues.