Category Archives: programming

Against human readability (part 1 of many?)

I had the following conversation with Tony Garnock-Jones on twitter:

I still don’t think I can put together a coherent argument about this, but I figure I’ll give it a shot. I may expand on this with a better post later once I’ve thought about it some more and can be bothered to write some code for it. Right now this is basically a brain dump rather than an attempt at a reasoned argument.

Firstly: This is specifically about formats for data interchange between programs. Instant messages, emails, document formats, etc. It is explicitly not about things for which direct human editing is an explicit requirement (config files, markup, programming languages, etc.).

Secondly: I don’t actually know what I’m talking about, in the sense that I’ve done relatively little work in defining data formats. This is mostly from the point of view of looking in from the outside.

Anyway, onwards to the actual point.

The thesis here is this: Human readability of protocols should be a non-goal in protocol design and indeed should probably be considered actively harmful.

Why?

Basically I’d claim that the following is the only thing you should be optimising for: The ability to create quality implementations.

Quality is obviously a useless weasel word. So to be more precise, I would consider the following to be the primary features of a quality implementation, in order of decreasing importance:

  1. Correctness
  2. Consistency
  3. Performance
  4. Simplicity

Depending on the use-case the order of the last two might swap. Also, these properties are not independent – simplicity helps with both correctness and performance.

Consistency needs some explanation: It’s specifically about different implementations behaving the same way. If there is ambiguity in your specification it is very easy for every implementation to end up doing its own thing. You really don’t want that.

By imposing additional constraints you limit your ability to optimise – unless the optimal solution (or even a near-optimal one) happens to be human readable, imposing the constraint of human readability is very likely to hurt one or more of the above.

Vague generalities aside, now for some specifics.

I’d like to consider the following example formats for representing an n-ary tree structure in which each node is either a string or an n-way branch.

The first is an S-expression: (something (like this) where (branches are (represented by brackets))).

The second is a tagged prefix notation. A node is represented as [tag byte | length | payload]. The tag byte is 0 for a branch and 1 for a string. The length is an integer (how large depends on requirements – 32 bits is probably enough, 64 is certainly adequate) encoded in network byte order, which represents either the number of child nodes or the number of bytes in the string. The payload then stores either the child nodes or the characters.
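
To make that concrete, here’s a minimal sketch of a writer for such a format in Python. This is my own illustration rather than a spec – the names, the 32-bit length and the byte-level details are all assumptions:

    import struct

    BRANCH_TAG, STRING_TAG = 0, 1

    def write_node(out, node):
        # A node is either a bytes payload (string node) or a list of nodes (branch).
        if isinstance(node, bytes):
            out.write(bytes([STRING_TAG]) + struct.pack(">I", len(node)) + node)
        else:
            out.write(bytes([BRANCH_TAG]) + struct.pack(">I", len(node)))
            for child in node:
                write_node(out, child)

So write_node(out, [b"foo", [b"bar", b"baz"]]) produces the encoding of (foo (bar baz)).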

Compare the two:

The first is designed for a human to read. It’s pretty easy for a computer to parse – you can hand-roll a parser in half an hour or so (a made-up number, because I couldn’t be bothered to write code for this post).

The second is not human readable. You could trivially convert it to a plain text format, say by using space separation rather than byte adjacency and encoding the integers in decimal. E.g. you could write (foo (bar baz)) as "branch 2 string 3 foo branch 2 string 3 bar string 3 baz". But let’s be honest: it may be text, but it’s not exactly human readable.

What’s the advantage of doing it this way? Well, firstly, S-expressions may be easy to parse, but this is trivial. The whole parser is something like:

    import struct

    BRANCH_TAG, STRING_TAG = 0, 1

    def read_node(stream):
        # Read one length-prefixed node from a binary stream.
        tag = stream.read(1)[0]
        (length,) = struct.unpack(">I", stream.read(4))
        if tag == BRANCH_TAG:
            return [read_node(stream) for _ in range(length)]
        elif tag == STRING_TAG:
            return stream.read(length)
        else:
            raise ValueError("Bad tag: %d" % tag)

There, done.

Things encoded in this prefix notation are ridiculously easy for a computer to parse. Additionally, from a performance point of view, it lets you determine the size of your data structures before you read them in their entirety. Does this matter? Well, it might make some performance difference in any language, but one place where it makes a big difference is when you’re implementing your parser in C, where it’s a bit of a pain in the ass to keep resizing things and where the consequences of getting the resizing wrong can be so unpredictable.

Does this matter? I mean, obviously it matters if you’re writing your implementation in C, but if you’re not should you care about how easy it is to parse in C?

Sadly, I would argue that the answer is emphatically yes and would like to propose the following maxim:

Any successful data format will acquire at least one parser written in C. At least one of these parsers will see widespread adoption.

Why? Well, if your data format is widely used people will want to use it from different languages. This might be C, but it’ll more likely be something like Ruby or Python (likely Java too, but that’s not a relevant example for this argument), and at some point someone will decide performance matters. And when that happens, they will write an implementation in C and bind to it from their language.

So now we have a C implementation of your format. Great! Right?

Not so fast.

Here’s the thing to bear in mind:

Your data format is an attack surface

Any user input is an attack surface of course, but anything where you’re doing complex processing on the input is especially so. When the parser is written in an unsafe language like C it is especially vulnerable. Obviously the person who is writing the parser needs to bear this in mind, but importantly so do you. The simpler you make parsing in C, the easier you make it to avoid vulnerabilities.

This is what Language-theoretic Security is about, and they can tell you far more about it than I can because they actually know what they’re talking about.

Now let’s consider another advantage of the binary format.

Consider the following problem:

I have an encoding of a tree. I now want to take the contents of this encoding and write it as a string node inside another tree. How do I encode the result?

In the binary format the answer is simple: The same way you would anything else. You write a string tag. You write the length of the string. You write the contents of the string. There is literally nothing you need to think about that makes this complicated.
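
In Python terms it really is just the three writes, under the same assumed layout as my earlier sketch (one tag byte, a 32-bit big-endian length):

    import struct

    STRING_TAG = 1  # assumed tag value, as in the earlier sketch

    def write_string_node(out, payload):
        # payload is arbitrary bytes -- here, another tree's complete encoding.
        out.write(bytes([STRING_TAG]))
        out.write(struct.pack(">I", len(payload)))
        out.write(payload)

The payload is never inspected, so it makes no difference whether it happens to contain tags, lengths or anything else.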

In the S-expression format? Well, I’ve actually not specified it well enough for this to even be possible. You need to define string escaping rules, so now we need to expand our format – say by allowing quoting. So we’d encode (foo bar) as the string node "(foo bar)".

Now how do I encode that in turn? Is ""(foo bar)"" a single string node, or a string node followed by a branch followed by a string node? How about "" (foo bar) ""?

Ok. So we need escaping. This isn’t so bad. This just becomes "\" (foo bar) \"". We can live with that. And one more time we encode it as "\"\\\" (foo bar) \\\"\"", because we also had to define a backslash escaping rule.
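
For comparison, here’s a rough sketch of what the quoting logic might look like once you go down this route. This is just one possible scheme, not anything the format above actually specifies:

    def escape(payload):
        # Escape backslashes first, then quotes, then wrap the result in quotes.
        return '"' + payload.replace("\\", "\\\\").replace('"', '\\"') + '"'

    def read_quoted(source, pos):
        # Read one quoted string starting at source[pos], which must be a quote.
        # Note the per-character branching the parser now has to do.
        assert source[pos] == '"'
        out, i = [], pos + 1
        while source[i] != '"':
            if source[i] == "\\":
                i += 1  # skip the backslash and take the escaped character
            out.append(source[i])
            i += 1
        return "".join(out), i + 1

Every byte of payload now goes through that per-character loop, once per level of embedding.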

Is this terrible? No. It’s kinda annoying though. It significantly impacts the property we wanted in the first place – human readability – and complicates our parser a fair bit. As well as requiring extra code for the escaping, we take a performance hit because we need to add a significant chunk of extra logic per character we process.

Additionally, we’ve made our code more complicated around the very area that is the biggest attack surface: User submitted data. The rest of the code is probably things which are mostly under our control (which doesn’t make them not problems, but merely less of one), but the payload contents of our format is the bit that most likely contains things random users have submitted. A malicious user can take advantage of bugs in your escaping implementation and use it to craft messages you didn’t mean to send.

Here’s another example where this is problematic:

Given a list of files, encode the list as a tree containing one node per file, each containing the file name and the contents of the file as children.

In our binary format, what do we do? Well, absolutely the obvious thing. There’s nothing to think about here.
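
A sketch of “the obvious thing”, reusing the assumed layout from my earlier sketches (the helper names are mine):

    import struct

    BRANCH_TAG, STRING_TAG = 0, 1

    def write_string_node(out, payload):
        out.write(bytes([STRING_TAG]) + struct.pack(">I", len(payload)) + payload)

    def write_file_tree(out, paths):
        # One branch per file: (name, contents), with contents as raw bytes.
        out.write(bytes([BRANCH_TAG]) + struct.pack(">I", len(paths)))
        for path in paths:
            out.write(bytes([BRANCH_TAG]) + struct.pack(">I", 2))
            write_string_node(out, path.encode("utf-8"))
            with open(path, "rb") as f:
                write_string_node(out, f.read())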

In our S-expression format, what do we do? Well, we simply run the file contents through our escaping scheme and oh hang on, some of the characters in this jpg file are non-printing because it’s not actually text.

So our escaping scheme now needs to worry about binary data too – say by base-64 encoding anything that contains non-printing characters. More complication, more of a performance hit. Worse: the encoded data is now potentially substantially larger (well, 37% larger) than the original data.

I know this is scope creep, but it’s relevant scope creep: consider email, or HTTP. Neither was originally designed for sending binary data, both are used for that, and both suffer from the base-64 problem. By using a binary protocol up front you’ve completely future-proofed yourself against this problem: literally nothing had to be done to take different payload types into account.

This leads to an additional maxim:

If your data format has to care about what the contents of its payload look like then you are creating problems for yourself

If you want to be able to embed arbitrary binary data in your format then your format is by construction not human readable.

I have some more issues in mind, but I need to think about them some more in order to produce something which is even half coherent, so I’ll leave you with these for now.


The proper approach to fixing bugs

This is a concept I’ve been preaching recently. It’s very simple.

A bug indicates a failure to understand something. It might be a failure to understand what some code does, or it might be a failure to understand what the code should do.

If you simply try to make the bug go away, that failure to understand something will likely remain, and it will likely lead to new bugs.

What you must instead do is figure out what that failure of understanding was and rectify it. This will not only fix the bug, it will also prevent future bugs that would have emerged from that failure.


pageme: A gem for invoking pagers from ruby

I got annoyed with being unable to page large amounts of data inside IRB, so I thought “Surely there must be a gem for this”. It turns out there isn’t. I mean there are extended versions of IRB that satisfy this specific need, but there’s no gem for “pipe this output to a pager”. Various gems do their own paging in various ways.

So I wrote one. It’s got a pretty simple API, but it seems to nicely do what I want: You can page strings to it, you can page from a block, you can redirect STDOUT and STDERR to a pager, etc.

It works on JRuby, which is apparently novel – no one seems to have got that to work properly before. I can’t take too much credit for it – I merely observed that the posix-spawn gem works well enough in JRuby to launch a pager on a file, then applied sufficient creative evil to bootstrap that into paging arbitrary things. The way it works is rather a hack – instead of piping directly to the less process it uses a unix domain socket to communicate with a netcat process piped into it. Really. And yes, a FIFO would be better, but for one reason or another I had a hell of a time getting it to work.

Charles Nutter has got an experimental version working with Java 7’s process builder, which I hope to be able to incorporate, but I haven’t yet, and I’ll still have to leave a fallback approach in place for Java pre-7.

So far my cross-platform testing has been extremely thin on the ground. I’ve tried MRI 1.8.7, MRI 1.9.2, JRuby 1.6.2 and JRuby-head, but only on my (Linux) laptop. It probably doesn’t work entirely correctly under OS X (I’d expect the non-JRuby version to, as it doesn’t do anything terribly controversial, but it might not). There’s no chance at all that it will work under Windows.

My current plan for taking it to 1.0 is basically to get anyone who is interested in using it to do the cross-platform testing for me, see if anything jumps out at me as wrong with the API, and figure out a way to do automated testing (I haven’t yet, because it’s by its nature mostly about interaction; I can definitely test some bits of it though).

Patches and suggestions welcome.


“Agda is now mainstream”

Various people on the internets have been pointing out that the results for the “this is a mainstream language” statement on Hammer Principle are a bit weird. In particular, Agda was showing up at number 10 for it, which, if you know of Agda, you’ll realise is ridiculous – and if you don’t, well, exactly.

It’s not there any more, as I tweaked the algorithm slightly to prune statement/item pairs with very little data, but the basic problem – that this particular statement is a confusing one to rank – remains. The problem is alluded to in the text of the page, but I expect no one reads that text anyway. Oh well. So, quoted here:

For items where we’ve got a lot of responses the numbers should be pretty good. For the rest you can probably just expect them to hang around the middle of the rankings wondering what they’re doing here.

This is because the way the ranking algorithm works is that it sorts by an initial score and then looks for evidence to move items from that point. If there’s little data that score will be about in the middle and there’ll be little evidence to move it.

This is true on all the pages, but the problem with this particular one is that lack of data is correlated with the correct position of the language in the ranking. This means that particularly non-mainstream languages will lack data and thus will be sitting there in the middle, whereas only mildly non-mainstream languages will be parked down at the bottom because there was a lot of evidence that they weren’t mainstream.

The main problem here is not really the ranking algorithm (although it could be improved. I have an improvement in the works, but it’ll take a while to plumb in), it’s that the site doesn’t make it clear when an item’s position in a list is really more of a rough guess than a reality. I don’t currently have any good ideas as to how to indicate that, so the status quo will probably remain for now.

It’s also worth noting that the aggregate rankings are really more of an amusement than a reliable basis for decision making (especially now that the comparison pages provide you with more hard data). They should inspire, and they should lead your line of inquiry, but certainly don’t take them as rock-solid fact. Even if they captured and presented the data perfectly, you’re still just looking at a bunch of answers to subjective questions from a bunch of randoms on the internet. Fun and interesting, but not necessarily representative of The Truth.


A green build on my weighted FAS solver

A momentous occasion has just, err, occurred. I got a green build on my weighted feedback arc set solver. This has never happened before.

This might seem like a confusing concept. How can it never have passed its own test suite?

Well, the test suite has always been more aspirational than reality-based. The way it effectively works is this: it has a lot of test data and it tracks the best solution found for each file, along with its score. The quality test is that the score of the solution found is within 5% of the best known solution (side-note: part of why the tests are now green is that I raised this threshold a bit – it was originally at 1%; most tests still hit that tighter threshold). A side-effect of this is that improvements to the solver will often make the tests harder to pass, because they will find better solutions (and, surprisingly, this is still happening – I’ve spent enough CPU cycles and algorithmic tweaks that I keep thinking I’ve found all the optimal solutions, only to be surprised by an improvement. The improvements are rarely large, but they still keep coming).

The result was that I would often make improvements to the FAS solver and then back them out because they were too slow, but leave the test data behind, and the tests would be left failing as the FAS solver dreamed of the long-lost days when it was cleverer.

Additionally, all the tests are capped at the fairly arbitrary maximum acceptable runtime of one minute.

At this point I should mention that finding even an approximation (I forget to within what bound – I think it’s something like half) of the optimal score for weighted FAS is an NP-hard problem, and that even calculating the score of a given solution is \(O(n^2)\). So this isn’t an easy problem to solve.
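
For reference, scoring is the easy-but-quadratic part. In the usual minimum feedback arc set formulation (which I’m assuming is what’s being scored here), the score of an ordering is the total weight of arcs that point backwards in it – a sketch, with my own names:

    def fas_score(order, weights):
        # weights maps (u, v) to the weight of the arc u -> v.
        # A backward arc is one whose target is placed before its source.
        # Lower is better; this is plainly O(n^2) in the number of nodes.
        total = 0
        for i, u in enumerate(order):
            for v in order[:i]:
                total += weights.get((u, v), 0)
        return total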

I’d largely stalled on development on it, but today I decided to go back to it and after a day of tinkering I’ve managed to make a surprising amount of progress. At the start of the day it was painfully slow on several test cases (several minutes), non-deterministic and badly failing several of its quality goals. Now it is deterministic, runs each test case in under a minute and gets within 2.5% of the best known result on all of them. For added pleasantry, the latest version has even provided the best known result for several of them, and the code is shorter and clearer than it was at the start of the day.

How does it work?

It’s pretty simple. In the end it’s ended up looking hilariously close to the algorithm I threw together in an hour from a random paper for The Right Tool (now Hammer Principle) a few years ago.

The process is basically a mix of the following techniques:

  1. Initial scoring. This is basically a Markov chain scoring system as before. The details are slightly different, but not really in an interesting way – the idea is still to transition to better nodes and use an approximation of the stable distribution as a score
  2. What I refer to in the code as “forcing connectivity”. This is today’s clever idea, and it works really well, but it probably shouldn’t, and the fact that it appears to might be a test data artifact. If you’re feeling generous it could be described as a heuristic approach, but a fairer description would perhaps be an unprincipled hack. It’s based on the observation that in sparse data sets you often end up with long chunks of ties, and these almost completely defeat local searches and make it hard to improve things. The forcing pass runs through the array one element at a time and, if an element is unconnected to the next one, pulls forward the next element to which it is connected (it doesn’t concern itself with ordering beyond that – the element might be strictly greater or less than it, it doesn’t care)
  3. Local sort, as before. Basically an insertion sort on the hopelessly inconsistent ordering the tournament induces. This had largely disappeared from the code – it was implicitly happening anyway, but there was no explicit local sort stage – and bringing it back in was a surprisingly big win
  4. Window optimisation. Basically, you can brute force the problem for small arrays. Window optimisation is brute forcing every subrange of some length. This is applied at various lengths in order to get mixing
  5. Single move optimisation. This is a really expensive process, but wins big enough that it’s worth it. Basically we try every possible move of a single element from one point to another and apply any that improve the score (a rough sketch of this follows the list)
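
As an illustration of that last step, here’s roughly what a single-move pass could look like – an unoptimised sketch reusing the fas_score function from above and rescoring each candidate from scratch, purely to show the shape of the search rather than how the real solver does it:

    def single_move_pass(order, weights):
        # Try moving each element to every other position; keep any move that
        # improves the score, and repeat until no single move helps.
        best = fas_score(order, weights)
        improved = True
        while improved:
            improved = False
            for i in range(len(order)):
                for j in range(len(order)):
                    if i == j:
                        continue
                    candidate = order[:i] + order[i + 1:]
                    candidate.insert(j, order[i])
                    score = fas_score(candidate, weights)
                    if score < best:
                        order, best, improved = candidate, score, True
        return order, best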

It’s surprising how well this combination works – I’ve tried a whole bunch of cleverer things, including various random search approaches, simulated annealing, block sorting (basically: build a new tournament out of chunks of the existing one and recurse) and a variety of local search techniques, but they generally seemed to be significantly slower and/or less effective.

Update: The test suite is now failing again. This is partly because I’ve tightened the requirements (down to within 3% rather than 5% of best known result), partly because after some tinkering with code I’ve managed to improve the best known quality of several test cases. This is actually quite pleasant, as it shows that there are some nicely tunable parameters in here.
