I had the following conversation with Tony Garnock-Jones on twitter:
I still don’t think I can put together a coherent argument about this, but I figure I’ll give it a shot. I may expand on this with a better post later once I’ve thought about it some more and can be bothered to write some code for it. Right now this is basically a brain dump rather than an attempt at a reasoned argument.
Firstly: This is about specifically formats for data interchange between programs. Instant messages, emails, document formats, etc. It is explicitly not about things for which direct human editing is an actual explicit requirement (config files, markup, programming languages, etc).
Secondly: I don’t actually know what I’m talking about, in the sense that I’ve done relatively little work in defining data formats. This is mostly from the point of view of looking in from the outside.
Anyway, onwards to the actual point.
The thesis here is this: Human readability of protocols should be a non-goal in protocol design and indeed should probably be considered actively harmful.
Basically I’d claim that the following is the only thing you should be optimising for: The ability to create quality implementations.
Quality is obviously a useless weasel word. So to be more precise I would consider the following to be the primary features of a quality implementation, in order of decreasing importants:
Depending on the use-case the order of the last two might swap. Also the three are not independent – simplicity helps with both correctness and performance.
Consistency needs some explanation: It’s specifically about different implementations behaving the same way. If there is ambiguity in your specification it is very easy for every implementation to end up doing its own thing. You really don’t want that.
By imposing additional constraints you are limiting your ability to optimise – unless it happens that the optimal solution (or even a near-optimal one) is human readable, it’s very likely that by imposing the constraint of human readability you are going to hurt one or more of the above.
Vague generalities aside, now for some specifics.
I’d like to consider the following example formats for representing an N-ary tree structure with each node being a string or an n-way branch.
The first is an S expression. (something (like this) where (branches are (represented by brackets)).
The second is a tagged prefix notation. A node is represented as: [Tag byte | Integer Length | Payload ]. Tag byte is 0 for a branch and 1 for a string. The integer length is an integer (how large depends on requirements. 32-bit is probably enough. 64-bit is certainly adequate) encoded in network order which represents either the number of child nodes or the number of characters in the string. Payload then stores either the child nodes or the characters.
Compare the difference between the two:
The first is designed for a human to read. It’s pretty easy for a computer to parse – you can hand roll a parser in half an hour or so (made up number caused by the fact that I couldn’t be bothered to write code for this post).
The second is not human readable. You could trivially convert it to a plain text format: Say by using space separation rather than byte adjacency and encoding the integer in decimal. e.g. you could write (foo (bar baz)) as “branch 2 string 3 foo branch 2 string bar string baz”. But let’s be honest, it may be text but it’s not exactly human readable.
What’s the advantage of this way? Well, firstly, S expressions may be easy to parse but this is trivial. In pseudo code:
readNode(in): tag = readByte(in) length = readInt(in) if tag == BRANCH_TAG return Branch(readNode(in) length times) else if tag == STRING_TAG return String(readBytes(in, length)) else raise "Bad tag", tag
Encoding things in prefix notation is ridiculously easy for a computer to parse. Additionally, from a performance point of view, it lets you determine the size of your data structures before you read them in their entirety. Does this matter? Well, it might make some performance difference in any language, but one place where it does make a big difference is when you’re implementing your parser in C where it’s a bit of a pain in the ass to keep resizing things and where the consequences of getting the resizing wrong can be so unpredictable.
Does this matter? I mean, obviously it matters if you’re writing your implementation in C, but if you’re not should you care about how easy it is to parse in C?
Sadly, I would argue that the answer is emphatically yes and would like to propose the following maxim:
Any successful data format will acquire at least one parser written in C. At least one of these parsers will see widespread adoption.
Why? Well, if your data format is widely used people will want to use it from different languages. This might be C, but it’ll more likely be something like Ruby or Python (likely Java too, but that’s not a relevant example for this argument), and at some point someone will decide performance matters. And when that happens, they will write an implementation in C and bind to it from their language.
So now we have a C implementation of your format. Great! Right?
Not so fast.
Here’s the thing to bear in mind:
Your data format is an attack surface
Any user input is an attack surface of course, but anything where you’re doing complex processing on the input is especially so. When the parser is written in an unsafe language like C it is especially vulnerable. Obviously the person who is writing the parser needs to bear this in mind, but importantly so do you. The simpler you make parsing in C, the easier you make it to avoid vulnerabilities.
This is what Language-theoretic Security is about, and they can tell you far more about it than I can because they actually know what they’re talking about.
Now let’s consider another advantage of the binary format.
Consider the following problem:
I have an encoding of a tree. I now want to take the contents of this encoding and write it as a string node inside another tree. How do I encode the result?
In the binary format the answer is simple: The same way you would anything else. You write a string tag. You write the length of the string. You write the contents of the string. There is literally nothing you need to think about that makes this complicated.
In the S-expression format? Well, I’ve actually not specified it well enough that this is even possible. You need to define string escaping rules. So now we need to expand our format. Say by allowing quoting. So we’d encode (foo bar) as the string node “(foo bar)”.
Now how do I encode the format of this? Is “”(foo bar)”” a single string node or a string node followed by a branch followed by a string node? How about “” (foo bar) “”?
Ok. So we need escaping. This isn’t so bad. This just becomes “\” (foo bar) \””. We can live with that. And one more time we encode it as “\”\\\” (foo bar) \\\”\””, because we also had to define a backslash escaping rule.
Is this terrible? No. It’s kinda annoying though. It significantly impacts the property we wanted in the first place – human readability – and complicates our parser a fair bit. As well as requiring extra code for the escaping, we take a performance hit because we need to add a significant chunk of extra logic per character we process.
Additionally, we’ve made our code more complicated around the very area that is the biggest attack surface: User submitted data. The rest of the code is probably things which are mostly under our control (which doesn’t make them not problems, but merely less of one), but the payload contents of our format is the bit that most likely contains things random users have submitted. A malicious user can take advantage of bugs in your escaping implementation and use it to craft messages you didn’t mean to send.
Here’s another example where this is problematic:
Given a list of files, encode the list is a tree containing one node per file, each containing the file name and the contents of the file as children.
In our binary format, what do we do? Well, absolutely the obvious thing. There’s nothing to think about here.
In our S-expression format, what do we do? Well, we simply run the file contents through our escaping scheme and oh hang on, some of the characters in this jpg file are non-printing because it’s not actually text.
So our escaping scheme needs to now worry about binary data. Say by base 64 encoding anything that contains non-printing characters. More complication, more of a performance hit. Worse: The encoded data is now potentially substantially larger (well, 37% larger) than the original data.
I know this is scope creep, but it’s relevant scope creep: Consider email, or http. Both of these were not originally designed for sending binary data, both are used for that and suffer from the base 64 problem. By using a binary protocol up front you’ve completely future proofed yourself from this problem: Literally nothing had to be done to take into account different payload types.
This leads to an additional maxim:
If your data format has to care about what the contents of its payload look like then you are creating problems for yourself
If you want to be able to embed arbitrary binary data in your format then your format is by construction not human readable.
I have some more issues in mind, but I need to think about them some more in order to produce something which is even half coherent, so I’ll leave you with these for now.