# A (rewritten) manifesto for error reporting

So I wrote A manifesto for error reporting. I stand by it entirely, but it did end up more of a diatribe than a manifesto, and it mixed implementation details with end results. This post contains largely the same information but with less anger and hopefully clearer presentation.

### The Manifesto

This is a manifesto for how errors should be reported by software to technical people whose responsibility it is to work with said software – it is primarily focused on the information that programmers need, but it’s going to be a lot of help to ops people as well. The principles described in here apply equally whether you’re writing a library or an application. They do not apply to how you should report errors to a non-technical end-user. That’s an entirely different problem.

This is primarily about how errors appear when represented in text formats – either through some sort of alerting mechanism or logs. It doesn’t cover more advanced tools like debuggers and live environments. Textual reports of errors are a lowest common denominator across multiple languages and are important to get right even if you have better tools.

The guiding principle you should follow is that the way you report errors is important, and that you should think as carefully about how you convey information in failure cases as how you behave in non-failure cases. A moderate amount of careful forethought at this point can prevent a vast amount of effort and frustration at a later date.

In particular, when crafting software you should think about what information the person who is attempting to debug the problem is going to need. This information primarily takes three forms:

1. What, specifically, is the problem that occurred?
2. What has triggered this problem?
3. Where in the code has this problem occurred?

If you bear these three questions in mind, and make sure to provide enough information to answer them, you will be in good shape. What follows is some specific advice for helping people answer these questions.

#### Be as specific as you can in your error messages

Your error messages should not be too long – a sentence is typically more than enough. They should however be descriptive, and tell you what happened.

A bad example of an error message is:

Invalid state


Better is:

Transaction aborted


Better yet:

Cannot commit an aborted transaction


Rather than merely telling you that the state is invalid, the error message tells you which invalid state you were in and what it is preventing you from doing.
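As a minimal sketch of what this looks like in code (Python here; the `Transaction` and `TransactionError` names are made up for illustration and not taken from any particular library), the check names both the state you are in and the operation it blocks:

```python
class TransactionError(Exception):
    """Raised when a transaction is used in a way its state forbids."""


class Transaction:
    def __init__(self):
        self.state = "open"  # one of "open", "aborted", "committed"

    def abort(self):
        self.state = "aborted"

    def commit(self):
        # Say which invalid state we are in and what it prevents,
        # rather than a bare "Invalid state".
        if self.state == "aborted":
            raise TransactionError("Cannot commit an aborted transaction")
        if self.state == "committed":
            raise TransactionError("Cannot commit a transaction that is already committed")
        self.state = "committed"
```

Calling `commit()` after `abort()` raises a `TransactionError` carrying the third message above.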

#### Error messages should contain pertinent information about the values that produced them

This is not a good error message:

Index out of range


This is:

Index 8 is out of range for array of length 7


You could also do

Index 8 is out of range for array [1,2,3,4,5,6,7]


The problem with this is that if the array gets very large then so does the error message. So while error messages should contain information about the values that generated them, they do not need to contain the entire value: only enough information about it to say why it triggered this error.

Another error message you shouldn’t generate from the exact value:

Failure to process credit card number XXXX XXX XXX XXX


Even ignoring the specific laws around processing credit card numbers, you should obviously not be logging confidential or secret information about users like this.

So there are reasons why your error messages can’t always sensibly contain the full values that triggered them. That being said, it’s much easier to recreate a problem if you can recreate the exact value, so it’s a good default to include more rather than less, and you should certainly be including some.
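One way to get both, sketched in Python: include the index, the length, and a bounded rendering of the value. The helper names `describe` and `get_item`, and the 60 character limit, are arbitrary choices for this example.

```python
def describe(value, limit=60):
    """Return repr(value), shortened to at most `limit` characters."""
    text = repr(value)
    if len(text) <= limit:
        return text
    return text[: limit - 3] + "..."


def get_item(items, index):
    """Return items[index], with an informative error if index is out of range."""
    if not 0 <= index < len(items):
        # Include the offending index, the length, and a bounded view of the data.
        raise IndexError(
            f"Index {index} is out of range for array of length {len(items)} "
            f"(contents: {describe(items)})"
        )
    return items[index]
```

Small arrays are shown in full and large ones are cut off, so the message stays roughly a sentence long either way. Anything confidential would still need to be masked before it got this far.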

#### Error messages should locate where in the code they occur

In an ideal world, every error message would come with a complete stack trace that says exactly the chain of calls that it went through to get there. If absolutely necessary, and if you’re generating good and expressive error messages, it’s sufficient to include just the file and line number where the error occurred, but it’s not perfect and gives you much less information about how the problem was triggered.

The reason this is so important is that determining where the problem occurs in code is one of the first steps of any debugging process, so you can save a lot of time and effort for the person debugging by doing this for them at the point of the error.

In most languages, if you are using exceptions, you get pretty close to this by default. In C or C++ you can apparently also do this with the backtrace function, which glibc and several other Unix C libraries provide (it is not part of POSIX itself).

Additionally, you should make a best effort to include stack traces when crossing process boundaries through RPC mechanisms: If a remote procedure can reasonably report a stack trace, it should report a stack trace and you should include that in your error report.
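As an illustration of how cheap this is when the language supports it, here is a small Python sketch that attaches the full stack trace to a textual error report using the standard `traceback` module. The functions `parse_quantity` and `error_report` are invented for the example.

```python
import traceback


def parse_quantity(text):
    return int(text)  # raises ValueError for input such as "three"


def error_report(raw_quantity):
    """Try to parse; on failure return a text report that includes the stack trace."""
    try:
        parse_quantity(raw_quantity)
    except ValueError:
        # format_exc() renders the exception currently being handled,
        # including each frame's file, line number and function name.
        return f"Could not parse quantity {raw_quantity!r}\n{traceback.format_exc()}"
    return None
```

The same text could be returned across an RPC boundary and included in the caller's own report.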

#### You should not mask lower level errors

It is common to wrap lower level errors in high level ones. It is also common to alter the display of errors in code you’re calling – e.g. in testing frameworks.

When you do either of these things the golden rule you must follow is that you should not remove information from the lower level errors, as they may be the most informative information the developer debugging the problem has about what actually went wrong.

In particular, if you are rethrowing exceptions you need to take steps to ensure that you include the original stack trace and error message (in many languages it is possible to alter the stack trace of the exception you’re throwing, and you can use this to chain the stack traces together).

Additionally you should never remove stack trace elements for display (it is acceptable to e.g. compress adjacent lines into a single one with a counter for repetitions. It’s OK to change the display, but not to remove information).
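In Python, for example, exception chaining does most of this for you. The sketch below wraps a low level error in a higher level one without discarding it; `ConfigError`, `read_port` and `render` are hypothetical names used only for illustration.

```python
import traceback


class ConfigError(Exception):
    """High level error raised when configuration cannot be read."""


def read_port(settings):
    try:
        return int(settings["port"])
    except (KeyError, ValueError) as exc:
        # "from exc" stores the original exception as __cause__, so its
        # message and stack trace are kept alongside the new one.
        raise ConfigError(f"Could not read port from settings {settings!r}") from exc


def render(exc):
    """Format an exception, including any chained cause, as text."""
    return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
```

Rendering the resulting `ConfigError` shows the original `ValueError` or `KeyError` first, followed by the wrapping error, so nothing from the lower level is lost.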

#### Error conditions should not be covered up

It is often tempting to believe that it is the code’s responsibility to attempt to cover for an error and keep on working regardless. Sometimes this is even viable and true. Sometimes however an error is more likely to be a sign of developer error which should be addressed sooner rather than later, and even when it is not an obvious developer error it is likely a symptom of something going genuinely wrong.

As a consequence, unless an error condition is genuinely routine (a rough rule of thumb here would be “Can reasonably be expected to happen multiple times a day and we’re not going to do anything about that”), it should be reported. It is fine for the code to recover from the error and attempt to proceed regardless, but the error needs to be logged. Even if it’s not a problem that needs fixing, it may end up as symptomatic of other problems.
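A minimal Python sketch of "recover, but still report", using the standard `logging` module. The logger name and the `fetch_with_fallback` helper are assumptions made up for this example.

```python
import logging

logger = logging.getLogger("profile_service")  # hypothetical logger name


def fetch_with_fallback(primary, fallback, key):
    """Return primary(key); if that raises, log the failure and use fallback(key)."""
    try:
        return primary(key)
    except Exception:
        # Recovering is fine, but the failure is still recorded along with
        # its stack trace (logger.exception logs at ERROR level with exc_info).
        logger.exception("Primary lookup failed for key %r; using fallback", key)
        return fallback(key)
```

The caller still gets an answer, and whoever reads the logs later can see that the primary path is failing and why.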

#### Errors should be reported when you enter an invalid state, not just when you attempt to operate whilst in one

One of the most common errors to see in a Java application is a NullPointerException. In Ruby it’s similarly common to see a NameError or a NoMethodError.

Inevitably this is because a value has been allowed to enter somewhere that it shouldn’t be permitted.

Other forms of invalid state are also possible, but they basically all come down to the same thing: Your error is not caused by what you are currently doing, it is caused by what has come before. Your debugging now has to backtrack to find the point at which the object was put into an invalid state, because where the error appears to be occurring is of no help to you.

The solution to this is to validate your state when it changes: If data is only permitted to be within a certain range of values, check that it belongs to that range of values when you set or change it. This means that the problem will be caught at the point where it occurs rather than the point where it causes problems.
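Here is a small Python sketch of validating on change, using a property setter. The `Account` class and its rules are invented for illustration.

```python
class Account:
    """Keeps a balance that must always be a non-negative int."""

    def __init__(self, owner, balance=0):
        self.owner = owner
        self.balance = balance  # goes through the setter below

    @property
    def balance(self):
        return self._balance

    @balance.setter
    def balance(self, value):
        # Fail at the point of assignment, where the stack trace points at
        # the code that supplied the bad value.
        if not isinstance(value, int):
            raise TypeError(f"Balance for {self.owner!r} must be an int, got {value!r}")
        if value < 0:
            raise ValueError(f"Balance for {self.owner!r} must be >= 0, got {value}")
        self._balance = value
```

Assigning `None` fails immediately with a `TypeError` at the assignment, instead of surfacing much later as an unrelated looking error in whatever code next does arithmetic on the balance.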

#### Recap

In summary:

1. Be as specific as you can in your error messages
2. Error messages should contain pertinent information about the values that produced them
3. Error messages should locate where in the code they occur
4. You should not mask lower level errors
5. Error conditions should not be covered up
6. Errors should be reported when you enter an invalid state, not just when you attempt to operate whilst in one

If you do all these things, your applications and libraries will be much easier to debug and maintain, and the people who have to do so will thank you.

# A parable about problem solving in software development

I’ve told a lot of people this story over the years. Mostly whilst drunk. The responses are usually pretty similar – hilarity, incredulity and just a little bit of “There but for the grace of god go I” style sympathy. I’ve had multiple requests to write it up, so I’ve finally acquiesced.

It’s a parable about what happens when you’re always solving the problem right in front of you rather than questioning whether this is a problem you actually need to be solving. It’s unfortunately mostly true. I’ve anonymized and fictionalized bits of it, mostly to protect the innocent and the guilty (and occasionally to make it a better story), but 90% of what I describe in the following really happened, and 90% of that happened more or less as I described it (as best as I can remember). If you know me, you can probably figure out where it happened. If you don’t know me, I’m not going to tell you.

At the beginning of our story we had one central API off which maybe a dozen smaller apps and services hung. The API controlled our data storage and the operations you could reasonably perform on it, and generally encapsulated our model. It, and each of the apps, lived in their own source control repo and were deployed separately.

This API was implemented via JSON-RPC over HTTP. It wasn’t RESTful, but maybe it was a bit RESTy. RESTish perhaps.

It kinda worked. It wasn’t perfect, but it was at least vaguely functional.

We essentially had two problems with it:

1. Each of the apps talking to it had written its own client library (or was just including raw HTTP calls straight in the code)
2. It was quite slow

As well as the core API, we also had a message queuing system. It was pretty good. We didn’t use it for a lot – just some job queueing and notifications to send to the users – but it worked well for that. We’d had a few problems with the client libraries, but they were easy to fix.

At some point it occurred to one of us that the reason our HTTP API was slow was of course that HTTP was slow. So clearly the best solution was to replace our slow shitty HTTP RPC with our hot new message queue based RPC. What could go wrong!

Well, you know, it didn’t really go wrong. It mostly worked. It was… a bit strange, but it basically worked. We wrote an event driven server which implemented most of what we were doing with the HTTP API (including all the blocking calls we were making to our ORM. Oops). It polled a message queue, clients would create their own message queue to receive responses on. Then a client would post a message to the server which would reply on the client’s message queue (I think there were some tags added to the messages to make sure things lined up. I hope there were, otherwise this all sounds horribly precarious).

This was basically unproblematic. It might even have been a slight improvement on our previous system. RPC over message queue is a legitimate tactic after all. We of course didn’t have any benchmarks because why would you benchmark sweeping changes you make in the name of performance, but it was at the very least not obviously worse than the previous system.

Our next problem was the various client libraries that we were reimplementing everywhere. This was obviously stupid. Code reuse is good, right?

So we rationalized them, pulled them all out into their own repo, and produced a client package. You used it by installing it on your system (which was just a single command using the packaging system we were using), and then you could talk to an API server. It was straightforward enough.

So we’d solved our reimplementing problems, and we were at least claiming we’d solved our performance problems (and maybe we even had. At this late stage I honestly couldn’t tell you).

Thing is… it turned out that this was actually quite irritating to develop against.

It had already been a little painful before, but now in order to add a feature you had to do all the following steps:

1. Make a change to the server code
2. Make a change to the client library code
3. Make a change to the application code
5. Install the client library on your system

We decided to solve the first two problems first.

We noticed that a lot of the code between the client and the server was duplicated anyway (similar structure on each side after all). So we ended up commonizing it and putting the client library in the repo with the server. Not all of the server code was needed in the client library obviously, but it was much easier to just put it all in one directory and have a flag that let you test if you were running in client or server mode. So now at least all the changes you had to make to both client and server were in one repo, and they might even have been the same code.

At the moment what we have here is a slightly baroque architecture, but it’s fundamentally not that much worse than many you’d encounter in the wild. It’s not good, but looking at it from the outside you can sort of see where we’re coming from. What follows next is the point at which it all starts to go completely tea party.

You see, code duplication between client and server was still a problem.

In particular, data model duplication. If you had a Kitten model, you needed a Kitten model in both the client and the server and you needed to maintain both. This was quite a nuisance.

At this point some bright spark (it wasn’t me, I swear) realised something: Our ORM supported highly pluggable backends. They didn’t even need to be SQL – there were examples of people using it for document storage databases, even REST APIs. We had this API server, why not make it an ORM backend?

And if we’re doing that, can we do it in a way that reuses the models we’re already using? We’re already detecting if we’re running in client or server mode, can’t we just have it use a different backend in the two cases?

Well, of course we can.

Of course, the really nice thing about having an ORM is how you can chain things and build rich queries. So we do want to support the full range of query syntax for the ORM.

A weekend of caffeine fuelled development from this one guy later, we all arrived on a Monday morning to find a grand new vision in place. Here’s how it worked:

1. We have the same ORM models on both client and server
2. If we are in client mode, our backend uses the JSON-RPC server rather than talking to the database
3. Given a query object, we do a JSON-RPC call to the corresponding backend methods on the server. This returns a bunch of models

Simple, right?

I’m going to unpack that.

1. I make a bunch of method calls to the ORM
2. This generates a Query object
3. We pass this Query object to a custom JSON serializer that has to support the full range of subtypes of Query
4. We send that JSON over a message queue
5. Our server pops the JSON off a message queue, deserializes it and calls a custom method to build a Query object
6. This Query object is passed to the ORM backend
7. The ORM backend converts the query object into an SQL query
8. The database adapter executes that SQL query and returns a bunch of rows
9. Those rows get wrapped as model objects
10. Those model objects get serialized as JSON and passed across the client message queue
11. The client pops the model JSON from the message queue
12. The client parses the JSON and wraps the resulting array of hashes as models

…yeah.

Anyway, we arrive on a Monday morning to find this all in place and broadly working (“There are just a few details to polish”).

And, you know what? We decided to roll with it. We were quite irritated with the status quo, and this clearly would make our lives easier – there was an awful lot less code to write when we wanted to add a feature and boy did we need to add features. So although we were probably a little suspicious, we decided to let that slide.

Of course… you see that long pipeline over there? Lot of moving parts isn’t it? Many of them, custom crap we’ve written. I bet that’s going to break, don’t you?

Of course it broke. A lot.

And naturally, as seems to happen, muggins here gets to be the guy in charge of fixing those bugs (how did this happen? I don’t know. I think the problem is that I don’t step back fast enough when the call for volunteers arrives. Or maybe people have an uncanny knack for spotting I’m actually quite good at it despite my best efforts to pretend I’m not).

One of the most common sources of bugs was user error. Specifically it was user error that was made really easy by our setup.

It required three steps to push a change to the code to your application: You had to restart the server, you had to install the package, you had to restart your application. If you forgot any one of those three steps, your client and server code would be out of sync (and remember how much of this was shared) and the resulting errors would be subtle and confusing. This frequently drove people to despair.

Remember how I believe in “Fail early, fail often”? It turns out I’ve believed this for some time (the first evidence I can find of my thinking along these lines comes from 2007. That would have been within about a year of my learning to program).

So the solution I hit upon for the problem was “Well, don’t do that then”. When a server or a client started up, it would create a signature that was an (MD5, I think) hash of all its code. This would then be transmitted along with every RPC call, and if the server detected that the client’s hash differed from its own it would instead respond with an error saying “No, you’re running the wrong client code. I’m not going to talk to you”. Unsubtle, but effective in making the error clear.
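The idea can be sketched in a few lines of Python. This is an illustration of the mechanism only, not the code we actually ran: the function names, the choice to hash file paths as well as contents, and the restriction to `.py` files are all assumptions of the sketch.

```python
import hashlib
from pathlib import Path


def read_sources(root):
    """Map each .py file under `root` (by relative path) to its bytes."""
    base = Path(root)
    return {p.relative_to(base).as_posix(): p.read_bytes() for p in base.rglob("*.py")}


def signature_of(files):
    """Hash file paths and contents, in sorted path order, into one hex digest."""
    digest = hashlib.md5()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(files[path])
    return digest.hexdigest()


def check_client(server_signature, client_signature):
    """Refuse to serve a client whose code differs from the server's."""
    if client_signature != server_signature:
        raise RuntimeError(
            f"Client code signature {client_signature} does not match "
            f"server signature {server_signature}; refusing request"
        )
```

Each side would compute `signature_of(read_sources(...))` once at startup, the client would send its signature with every call, and the server would run `check_client` before doing anything else.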

This solved the immediate problem, and we decided it was good enough.

Most of the next six months (when I wasn’t doing feature dev) I was fixing bugs with the pipeline – this particular obscure query was crashing our deserializer. This one query was somehow generating 17MB of JSON data and the parser didn’t like that very much. That sort of thing.

During this time people were getting increasingly irritated with the dev process. It was all very well having those errors be detected, but what you really wanted was for those errors to be fixed. And to not have to do three slow steps to make a simple change.

This was when my true contribution to our little Lovecraftian beauty came in.

“Well”, I reasoned, “the server has all the code, right? And the client needs all the code? And the server is already sending data to the client…”

So.

The package remained as a tiny shim library that needed to be installed to talk to the server, but it included really very little code (it still checked the code MD5, but this now basically never changed).

1. On startup, the client would make its first RPC call. This was a “Hey, give me the code” call. The server would reply with a list of file paths and their source code
2. The client would create a temporary directory and write all the files into that temporary directory
3. The client would add that temporary directory to the load path and require the entry point to the library

This removed the install step: The client would forever and always be running the latest version of the code, because it fetched it from the server at start up. We still had to restart the server and the client, but at least one of the more irritating and easy to forget steps was removed.
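For the curious, the three client steps above look roughly like this when transliterated into Python. This is purely to show the mechanism: `load_fetched_code` is a made up name, `sys.path` stands in for the load path, and the `files` mapping stands in for whatever the "give me the code" call returned.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def load_fetched_code(files, entry_point):
    """Write `files` ({relative path: source text}) to a temp dir and import `entry_point`."""
    target = Path(tempfile.mkdtemp(prefix="fetched-code-"))
    for relative_path, source in files.items():
        destination = target / relative_path
        destination.parent.mkdir(parents=True, exist_ok=True)
        destination.write_text(source, encoding="utf-8")
    # Put the temporary directory at the front of the import path,
    # then import the library's entry point from it.
    sys.path.insert(0, str(target))
    importlib.invalidate_caches()
    return importlib.import_module(entry_point)
```

Seeing it written down this compactly makes it clearer how little stood between us and running whatever the server happened to send.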

I don’t think we ever implemented code reloading, though it’s obvious how we could have – on code changes, the server would just have to broadcast the changed files, which could again be written to the file system and reloaded.

Fortunately better judgement prevailed before we hit that point.

We were coming up to the first major release we’d have with all this infrastructure in place.

It was obviously not going to go well.

The site was dramatically slow in comparison to its previous “This is too slow!” HTTP incarnation. Why? Because it turns out that serializing and deserializing lots of ORM queries and models is really fucking slow! When we had the HTTP implementation in place we were a bit more careful about what we were doing, but this was all behind the scenes and invisible to us and mostly out of our hands.

It was also still quite buggy. Despite my best efforts to keep the whole thing reliable and functioning – I’d patched a lot of bugs – we kept finding new ones. The problem wasn’t in fixing individual bugs, it was that the core architecture was basically a disaster.

One night while wrestling with insomnia I had a revelation.

“OH MY GOD. IT’S JUST A LIBRARY”.

A weekend of caffeine fuelled development from me later, everyone arrived on a Monday morning to find a grand new vision in place. Here’s how it worked:

1. Everything lived in a single repo.
2. Everything that was previously server code was now just sitting in a single library that everything put directly on their load path.
3. Everything talked to the database directly, via that library.

That’s. It.

It took a little bit of time to get it stable after that – there were a lot of places where our bug workarounds now became bugs in their own right. There were a few days where it was touch and go – this was about a month before release and there was some serious head scratching and concerned moments where we thought we were going to have to release it in its previous form after all. But we got there, and the result was unsurprisingly both faster and more reliable than what came before it.

Obviously this is how we should have done it in the first place. It’s not just obvious in retrospect, it should have been obvious in the beginning. We were just too focused on fixing this one problem with our current system rather than calling the system itself into question to see it.

The project structure changed a bit over the time since then, but as far as I know this is still essentially how it looks, and I imagine how it will continue to look indefinitely.

Unless someone decided that what was really needed is to abstract out some part of the database access into an RPC server. I hope no one did that, but I’m a little afraid to ask and find out.

So I have access to my twitter archive now. I was very excited by this, and then a month later I still hadn’t done anything with it.

I decided to fix this.

Everything I’m doing today is on a Linux system (running Mint, if you must know). You will need the following things installed to follow all of it.

atool
This is basically an archive management tool. Actually you don’t need this at all, but I used it as part of set up and it’s totally worth having on every system you run so I’m mentioning it anyway
git
Again, this is strictly optional, but you might want to replace it with some other VCS. When I’m working with a bunch of files I like to put them under version control, so any destructive operations I perform (deliberately or inadvertently) can easily be backed out of. I like git, so I used git, but there’s no reason to use it specifically over just about anything else here.
moreutils
I use this in precisely one step, but it’s quite useful for that step. You can easily find a way to do without it.
wget
jq
This is absolutely the core utility I’m using and you need it installed in order to do anything useful with this post

    david@volcano-base ~ $ mkdir -p data
    david@volcano-base ~ $ cd data/
    david@volcano-base ~/data $ aunpack ~/Downloads/tweets.zip
    Archive:  /home/david/Downloads/tweets.zip
      inflating: Unpack-2420/data/js/tweets/2013_03.js
    (etc)
    tweets.zip: extracted to `tweets' (multiple files in root)
    david@volcano-base ~/data $ cd tweets/
    css/  data/  img/  js/  lib/
    david@volcano-base ~/data $ cd tweets/data/js/tweets/
    david@volcano-base ~/data/tweets/data/js/tweets $ ls
    2008_04.js  2008_09.js  2009_01.js  (etc)

You’re now in a directory with lots of .js files.

Before we do anything, let’s put everything under git.

    david@volcano-base ~/data/tweets/data/js/tweets $ git init
    Initialized empty Git repository in /home/david/data/tweets/data/js/tweets/.git/
    david@volcano-base ~/data/tweets/data/js/tweets $ git add *.js
    david@volcano-base ~/data/tweets/data/js/tweets $ git commit -m "Initial commit of data files"
    [master (root-commit) 9aeb0dc] Initial commit of data files
     59 files changed, 699462 insertions(+)
     create mode 100644 2008_04.js
    (etc)

Now let’s take a look at what we have here:

    david@volcano-base ~/data/tweets/data/js/tweets $ head -n25 2008_04.js
    Grailbird.data.tweets_2008_04 = 
    [ {
      "source" : "web",
      "entities" : {
        "user_mentions" : [ ],
        "media" : [ ],
        "hashtags" : [ ],
        "urls" : [ ]
      },
      "geo" : {
      },
      "id_str" : "799851705",
      "text" : "Yay for google whoring.",
      "id" : 799851705,
      "created_at" : "Tue Apr 29 21:30:07 +0000 2008",
      "user" : {
        "name" : "David R. MacIver",
        "screen_name" : "DRMacIver",
        "protected" : false,
        "id_str" : "14368342",
        "profile_image_url_https" : "https://si0.twimg.com/profile_images/2609387884/1tu53xdcpssixve5o09m_normal.jpeg",
        "id" : 14368342,
        "verified" : false
      }
    }, {

First thing we see: These aren’t JSON files. They are actually JavaScript. Fortunately they’re formatted in such a way that we can easily turn them into JSON:

    david@volcano-base ~/data/tweets/data/js/tweets $ for file in *.js; do tail -n+2 $file | sponge $file; done
    david@volcano-base ~/data/tweets/data/js/tweets $ git diff | head
    diff --git a/2008_04.js b/2008_04.js
    index 5737052..2046162 100644
    --- a/2008_04.js
    +++ b/2008_04.js
    @@ -1,4 +1,3 @@
    -Grailbird.data.tweets_2008_04 = 
     [ {
       "source" : "web",
       "entities" : {
    diff --git a/2008_06.js b/2008_06.js
    david@volcano-base ~/data/tweets/data/js/tweets $ git commit -a -m "remove initial assignment line so all our files are valid javascript"
    [master 8d41481] remove initial assignment line so all our files are valid javascript
     59 files changed, 59 deletions(-)

What’s going on here?

Well, from the tail man page:

    -n, --lines=K
           output the last K lines, instead of the last 10; or use -n +K to output lines starting with the Kth

So tail -n+2 outputs all lines starting with the second.

sponge is from moreutils. According to its man page:

sponge reads standard input and writes it out to the specified file. Unlike a shell redirect, sponge soaks up all its input before opening the output file. This allows constructing pipelines that read from and write to the same file.

So what we’re doing in this loop is for each javascript file we’re stripping off the first line, buffering it up and then writing it back to the original file.

(I expect there’s also a sed one liner to do this, but this was easier than looking up what it was)

Right. Now for something actually interesting!

For a starting point, I’ve often wondered how much I’ve actually written on twitter. So let’s do that.

    david@volcano-base ~/data/tweets/data/js/tweets $ cat *.js | jq -r '.[] | .text' | wc -w
    326212

Before I analyze what I’ve just done, I’m going to marvel at the fact that that’s a metric fuckton of words. Depending on how you count that’s about three novels tweeted. I don’t know if that says good or bad things about me.

Now let’s unpack the command.

First the uninteresting bits: cat *.js concatenates all the js files and spews them to stdout, wc -w counts the number of words fed to its stdin. You knew that though.

Now let’s talk about the jq command, which is the interesting bit. To be clear: I’m learning jq as I write this (I’ve been meaning to for a while and then didn’t), so I really don’t know that much about it, and any of what I say may be wrong.

As I understand it, the jq model is that everything in it is a stream of JSON values, and that’s also how it parses its STDIN. This is why concatenating the JSON files and feeding it to jq works: It parses one value from STDIN, then another, then another. So we’re starting with a stream of arrays.
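To make the "stream of JSON values" idea concrete, here is a rough Python analogue. It says nothing about how jq is actually implemented; it only shows that a text consisting of several JSON values back to back can be read one value at a time. The function name `iter_json_values` is made up for this sketch.

```python
import json


def iter_json_values(text):
    """Yield each top-level JSON value found in `text`, one after another."""
    decoder = json.JSONDecoder()
    position = 0
    while True:
        # Skip any whitespace between values.
        while position < len(text) and text[position].isspace():
            position += 1
        if position >= len(text):
            return
        # raw_decode parses one value and reports where it stopped.
        value, position = decoder.raw_decode(text, position)
        yield value
```

Fed the concatenation of two files that each contain one array, this yields two arrays, which is the "stream of arrays" we start with.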

We’re then building up a filter. ‘.’ is simply the filter which pipes its input to its output, but by modifying it as ‘.[]’ we get a filter which accepts a stream of arrays and unpacks them by reading each array then streaming it to the output one at a time. So we’ve taken our stream of arrays and turned it into a stream of the array contents. Let’s verify that:

    david@volcano-base ~/data/tweets/data/js/tweets $ < 2008_04.js jq '.[]' | head -n50
    {
      "user": {
        "verified": false,
        "id": 14368342,
        "profile_image_url_https": "https://si0.twimg.com/profile_images/2609387884/1tu53xdcpssixve5o09m_normal.jpeg",
        "id_str": "14368342",
        "protected": false,
        "screen_name": "DRMacIver",
        "name": "David R. MacIver"
      },
      "created_at": "Tue Apr 29 21:30:07 +0000 2008",
      "id": 799851705,
      "text": "Yay for google whoring.",
      "id_str": "799851705",
      "geo": {},
      "entities": {
        "urls": [],
        "hashtags": [],
        "media": [],
        "user_mentions": []
      },
      "source": "web"
    }
    {
      "user": {
        "verified": false,
        "id": 14368342,
        "profile_image_url_https": "https://si0.twimg.com/profile_images/2609387884/1tu53xdcpssixve5o09m_normal.jpeg",
        "id_str": "14368342",
        "protected": false,
        "screen_name": "DRMacIver",
        "name": "David R. MacIver"
      },
      "created_at": "Tue Apr 29 10:36:22 +0000 2008",
      "id": 799412882,
      "text": "I get more and more sick of Java with every passing day. Impressive, given that I'm not even writing it any more...",
      "id_str": "799412882",
      "geo": {},
      "entities": {
        "urls": [],
        "hashtags": [],
        "media": [],
        "user_mentions": []
      },
      "source": "web"
    }
    {
      "user": {
        "verified": false,
        "id": 14368342,

(I’ve switched to just using the 2008_04.js file because we’re only looking at a small amount of data)

We do indeed get a sequence of JSON objects one after another (note no commas or array markers).

We then build up a pipe inside the jq language. The “.text” filter reads its input stream, looks up the text property on it and outputs the result as follows.

    david@volcano-base ~/data/tweets/data/js/tweets $ < 2008_04.js jq '.[] | .text' | head -n10
    "Yay for google whoring."
    "I get more and more sick of Java with every passing day. Impressive, given that I'm not even writing it any more..."
    "Unleashing my inner interior decorator."
    "Having trouble with the twitter UI, which is just embarassing given how simple it is. :-)"
    "giving this twitter thing another try."

The final bit to explain here is the -r flag. From the man page:

--raw-output / -r: With this option, if the filter’s result is a string then it will be written directly to standard output rather than being formatted as a JSON string with quotes. This can be useful for making jq filters talk to non-JSON-based systems.

Indeed it can:

    david@volcano-base ~/data/tweets/data/js/tweets $ < 2008_04.js jq -r '.[] | .text' | head -n10
    Yay for google whoring.
    I get more and more sick of Java with every passing day. Impressive, given that I'm not even writing it any more...
    Unleashing my inner interior decorator.
    Having trouble with the twitter UI, which is just embarassing given how simple it is. :-)
    giving this twitter thing another try.

So we’ve chained all these together to get the actual text. Now let’s save that text so we can do a bit more analysis on it:

    david@volcano-base ~/data/tweets/data/js/tweets $ head all_tweets.txt
    Yay for google whoring.
    I get more and more sick of Java with every passing day. Impressive, given that I'm not even writing it any more...
    Unleashing my inner interior decorator.
    Having trouble with the twitter UI, which is just embarassing given how simple it is. :-)
    giving this twitter thing another try.
    No, seriously. I mean it. Why do people use ASP? Every site I've encountered using it has driven me insane with its awfullness
    Barack Obama is the eleventh doctor!
    Brillig and the Slithy Toves would make a great band name
    According to Victoria I am well positioned to be the antichrist
    I know understand why tea and crumpets are our traditional fare
    david@volcano-base ~/data/tweets/data/js/tweets $ git add all_tweets.txt
    david@volcano-base ~/data/tweets/data/js/tweets $ git commit all_tweets.txt -m "Just the text for all tweets"
    [master df1e778] Just the text for all tweets
     1 file changed, 20553 insertions(+)
     create mode 100644 all_tweets.txt

And a sanity check:

    david@volcano-base ~/data/tweets/data/js/tweets $ wc -w all_tweets.txt
    326212 all_tweets.txt

Good, the same answer.

Now, let’s ask another interesting question: How much have I written not counting @replies?

Answer: Not nearly so much.

    david@volcano-base ~/data/tweets/data/js/tweets $ grep -v '@' all_tweets.txt | wc -w
    88966

So apparently most of my twitter usage is conversations: We go from about 3 novels to a short novel or long novella if I remove them.

What’s going on here?

Well, I’m doing a text search on all_tweets.txt. grep ‘@’ gives me all lines which contain an @, and then the -v flag inverts the sense of the match:

    david@volcano-base ~/data/tweets/data/js/tweets $ grep '@' all_tweets.txt | head
    @benaud I could. It's kinda hacked together and purely for sending messages. Still want it?
    @t_a_w JQuery doesn't meet my needs in two crucial ways: a) I'm not doing something in the browser. b) It left survivors.
    @t_a_w JQuery fails my needs in two crucial ways. Firstly, I'm not trying to do something in the browser. Secondly, it left survivors.
    @gnufied It's possible I was being slightly facetious...
    @gnufied But that is not object orientated! Encapsulation! You are a bad programmer and should go back to writing C with global variables.
    @t_a_w I'll have you know that the book I just bought covers the state of the art up to at *least* 1990.
    @jherber Not sure.I don't think arrays are the issue so much as the general collections API. Certainly there are too damn many toList calls.
    @jherber Oh, hey. I'd forgotten about the embarrassing slowness of List.sort. Thanks for reminding me. I'll add fix that.
    @mikesten Right, but there are two reports. The one I sent a link to is everything on expertise-identification, but there are other ones.
    @mikesten The full super-scary thing or just the work applicable ones? :-)
    david@volcano-base ~/data/tweets/data/js/tweets $ grep -v '@' all_tweets.txt | head
    Yay for google whoring.
    I get more and more sick of Java with every passing day. Impressive, given that I'm not even writing it any more...
    Unleashing my inner interior decorator.
    Having trouble with the twitter UI, which is just embarassing given how simple it is. :-)
    giving this twitter thing another try.
    No, seriously. I mean it. Why do people use ASP? Every site I've encountered using it has driven me insane with its awfullness
    Barack Obama is the eleventh doctor!
    Brillig and the Slithy Toves would make a great band name
    According to Victoria I am well positioned to be the antichrist
    I know understand why tea and crumpets are our traditional fare
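To see the two halves of this side by side on something small, here is a minimal sketch. The three sample lines are made up for illustration and stand in for the real all_tweets.txt.

```shell
# Hypothetical sample lines standing in for all_tweets.txt.
sample='Unleashing my inner interior decorator.
@gnufied It is possible I was being slightly facetious...
giving this twitter thing another try.'

# Lines that contain an @ somewhere.
with_at=$(printf '%s\n' "$sample" | grep '@')

# -v inverts the match: lines that contain no @ at all.
without_at=$(printf '%s\n' "$sample" | grep -v '@')

printf 'with @:\n%s\n\nwithout @:\n%s\n' "$with_at" "$without_at"
```

Every input line ends up in exactly one of the two outputs, which is why the word counts of the two sets add up to the count for the whole file.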

We could have also worked this out with jq:

    david@volcano-base ~/data/tweets/data/js/tweets $ cat *.js | jq -r '.[] | select(.entities.user_mentions | length == 0) | .text' | wc -w
    91071

What’s going on here?

What we’ve done is we’ve added a filter in the middle of our two previous filters. This is a select filter, as explained here in the manual. For each row of input, it passes that to the filter it’s wrapping (note that the thing inside is a filter!), then reads all the output from that filter until it either runs out of output or it finds a true value. Here we only ever output one value (whether the length of the user_mentions array is 0), but here’s an illustration of what happens if you output multiple values:

    david@volcano-base ~/data/tweets/data/js/tweets $ echo -e '[false]\n[true]\n[false,true]\n[true,false]' | jq 'select(.[])'
    [
      true
    ]
    [
      false,
      true
    ]
    [
      true,
      false
    ]

Observe that the arrays with any true value in them get returned but the arrays with all false values don’t.

An interesting thing to note: These answers are not exactly the same. I have more words written with no user mentions than I do words from tweets with no @s in them. What’s going on?

Let’s find out:

    david@volcano-base ~/data/tweets/data/js/tweets $ cat *.js | jq -r '.[] | select(.entities.user_mentions | length == 0) | select(.text | contains("@")) | .text' | grep -o '@[[:alnum:]_]\+' | sort | uniq -c | sort -nr
    68 @njbartlett
    19 @sylviazygalo
    8 @Lars_Westergren
    7 @burningodzilla
    4 @jeanfinds
    3 @a_y_alex
    2 @torsemaciver
    2 @Ms_Elevator
    2 @charlesarmstrong
    2 @cambellmichael
    2 @benreesman
    2 @allbery_b
    1 @zarkonnen
    1 @trampodevs
    1 @tooPsychedout
    1 @timperret
    1 @thatsoph
    1 @nuart11
    1 @missbossy
    1 @mikevpotts
    1 @michexile
    1 @MetalBeetleLtd
    1 @mccraicmccraig
    1 @MatthijsAKrul
    1 @m
    1 @lucasowen85
    1 @lisascott869
    1 @lauren0donnell
    1 @Knewton_Pete
    1 @kittenhuffers
    1 @jnbartlet
    1 @itomicsam
    1 @iamreddaveMaximum
    1 @georgie_guy
    1 @geezusfreeek
    1 @GarethAugustus
    1 @drmaciver
    1 @debashshg
    1 @dbasishg
    1 @CTerry1985
    1 @communicaI
    1 @cipher3d
    1 @carnotwitk

These are the usernames that appear in tweets that have no user_mentions in them. Based on a few random samples, none of them appear to be valid twitter user names. Some of them are obviously typos, some of them I recognise as people who have changed their usernames or deleted their accounts. So I think that explains that.

Now let’s explain the command. The jq command should hopefully be obvious – it’s just more of the same, but I’ve added a new select (this time using contains for string matching).

The remainder is interesting though. The grep string ‘@[[:alnum:]_]\+’ is a POSIX regular expression which matches strings which start with @ and then are followed by any combination of alphanumeric characters and underscores. So e.g. “@foobarbaz” matches, as does “@foobar123”, or “@foo_bar_baz”. The -o flag says to only print the matches rather than the lines containing matches as is grep’s normal behaviour. So e.g.

    david@volcano-base ~/data/tweets/data/js/tweets $ <all_tweets.txt grep -o '@[[:alnum:]_]\+' | head
    @benaud
    @t_a_w
    @t_a_w
    @gnufied
    @gnufied
    @t_a_w
    @jherber
    @jherber
    @mikesten
    @mikesten

What’s going on with the bit after the grep?

As a whole unit, ‘sort | uniq -c | sort -nr’ means “Give me a tabulated count of all the lines I feed in here, ordered by reverse frequency”.

Taking it apart what happens is this:

First we feed the results into sort. This, err, sorts them (in dictionary order). We then pass the output for that to uniq.

What uniq classically does is it removes all consecutive lines which are the same. So for example:

    david@volcano-base ~/data/tweets/data/js/tweets $ <all_tweets.txt grep -o '@[[:alnum:]_]\+' | head | uniq
    @benaud
    @t_a_w
    @gnufied
    @t_a_w
    @jherber
    @mikesten

Note that it doesn’t remove all duplicates: Only adjacent ones. That’s why we had to sort first.

Adding the -c flag causes it to count up those adjacent lines:

    david@volcano-base ~/data/tweets/data/js/tweets $ <all_tweets.txt grep -o '@[[:alnum:]_]\+' | head | uniq -c
    1 @benaud
    2 @t_a_w
    2 @gnufied
    1 @t_a_w
    2 @jherber
    2 @mikesten

So between the sorting and the counting, this gives us our tallies.

We then sort again to put it in order. However we don’t want to sort numbers by their string value (this would put e.g. 2 after 11), so the -n flag to sort tells it to sort numerically (I don’t actually know what it does with the rest of the text. I think it just pulls off the first number and uses that). This would put it in ascending numerical order, so the -r flag reverses that.
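Putting the three stages together on a tiny input makes the tallying easier to follow. This is a minimal sketch; the list of mentions is invented for illustration.

```shell
# Hypothetical sample input: one mention per line, deliberately unsorted.
mentions='@t_a_w
@benaud
@t_a_w
@gnufied
@t_a_w
@gnufied'

# sort groups identical lines together, uniq -c counts each group,
# and sort -nr orders the counts numerically, largest first.
tally=$(printf '%s\n' "$mentions" | sort | uniq -c | sort -nr)
printf '%s\n' "$tally"
```

The counts come out as 3 for @t_a_w, 2 for @gnufied and 1 for @benaud. The exact amount of leading padding that uniq -c puts before each count varies between implementations.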

OK. So, the text is actually a more reliable guide of tweets than the metadata if I just want to know who I’m tweeting at. And I do. So let’s get that.

    david@volcano-base ~/data/tweets/data/js/tweets $ <all_tweets.txt grep -o '@[[:alnum:]_]\+' | sort | uniq -cd | sort -nr | head
    793 @angusprune
    641 @LovedayBrooke
    503 @petermacrobert
    412 @pozorvlak
    360 @stef
    339 @bluealchemist
    322 @drsnooks
    313 @reyhan
    271 @communicating
    239 @zarkonnen

I’m not sure if that’s the exact answer I would have expected, but it’s not terribly surprising.

Now I’d like to take a look at what words I use. In order to do this I’ll use grep again to extract words from the all_tweets.txt file:

    david@volcano-base ~/data/tweets/data/js/tweets $ grep -o '\b[[:alpha:]]\+\b' all_tweets.txt | head
    Yay
    for
    google
    whoring
    I
    get
    more
    and
    more
    sick

The ‘\b’ special character there is the only thing new in this (also [:alpha:], but that just means any alphabet character). ‘\b’ means word boundary.

One thing stands out here: There are going to be a lot of really uninteresting words in this, like “I” and “more” and “and”. In natural language processing, these are typically called stop words. We want to remove those from the calculation.

So what do we do?

Well, first we find a list of stop words. Some brief googling led me to this file, which seems to be good enough. Let’s fetch it:

    david@volcano-base ~/data/tweets/data/js/tweets $ wget https://stop-words.googlecode.com/svn/trunk/stop-words/stop-words/stop-words-english1.txt
    --2013-03-03 13:01:39-- https://stop-words.googlecode.com/svn/trunk/stop-words/stop-words/stop-words-english1.txt
    Resolving stop-words.googlecode.com (stop-words.googlecode.com)... 2a00:1450:400c:c05::52, 173.194.78.82
    Connecting to stop-words.googlecode.com (stop-words.googlecode.com)|2a00:1450:400c:c05::52|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 4812 (4.7K) [text/plain]
    Saving to: `stop-words-english1.txt'

    100%[=====================================================================================================================================================================================>] 4,812 --.-K/s in 0.001s

    2013-03-03 13:01:39 (8.29 MB/s) - `stop-words-english1.txt' saved [4812/4812]

Now! An annoying bit. What came next was mysteriously not working for me. It turns out that there were some problems with this file.

The first reason it was not working was that a BOM (Byte Order Mark) at the beginning of this file was confusing poor grep. So first we have to strip the BOM like so:

    david@volcano-base ~/data/tweets/data/js/tweets $ tail -c+4 stop-words-english1.txt | sponge stop-words-english1.txt

-c works like -n except that it functions on characters (really, bytes) instead of lines. So what we’re doing here is stripping out the first three bytes of the file, which are the BOM.
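Here is a minimal sketch of the BOM stripping on a throwaway file. The file contents are made up, and it writes to a second temporary file instead of using sponge, purely so that it runs without moreutils installed.

```shell
# Build a hypothetical sample file that starts with a UTF-8 BOM (bytes EF BB BF).
bom_file=$(mktemp)
printf '\357\273\277zero\n' > "$bom_file"

# tail -c +4 starts output at byte 4, skipping the three BOM bytes.
# Writing to a second file avoids truncating the input while it is being read.
clean_file=$(mktemp)
tail -c +4 "$bom_file" > "$clean_file"

# Show the first three bytes of each file in hex to compare.
head -c 3 "$bom_file" | od -An -tx1
head -c 3 "$clean_file" | od -An -tx1
```

The first hex dump shows the BOM bytes and the second shows the start of the actual text. Redirecting straight back onto the input file would empty it before tail got to read it, which is the problem sponge solves in the original command.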

The second reason it was not working was to do with line endings.

Basically, line endings are actually special characters which say “Here! Start a new line!”. Unfortunately there is some disagreement as to just what those special characters are. Unix considers it to be a single newline character, traditionally represented as ‘\n’, whereas Windows (and the web) consider it to be two characters ‘\r\n’ – the first being a carriage return, meaning “Go back to the beginning of the line”.

We’re going to be feeding these lines to grep to use as patterns, and grep is of the unix, so it wants its new lines to be pure ‘\n’ and will consider ‘\r’ to be part of the pattern to be matched on. So we need to remove those:

    david@volcano-base ~/data/tweets/data/js/tweets $ sed -i 's/\r//' stop-words-english1.txt

The pattern we’re giving sed says “replace the first carriage return on each line with an empty string”. The -i flag means “And then replace the file with the results of this”. Normally it would write the results to the console.

Some poking at the file caused me to realise it also doesn’t consider “I” to be a stop word. I don’t know why. Let’s fix that:

    david@volcano-base ~/data/tweets/data/js/tweets $ echo i >> stop-words-english1.txt
    david@volcano-base ~/data/tweets/data/js/tweets $ tail stop-words-english1.txt
    you'd
    you'll
    your
    you're
    yours
    yourself
    yourselves
    you've
    zero
    i

General lesson here: Real world data is often messy. If your code isn’t working make sure the bug isn’t in your data.

Now our stop words are ready to use. Let’s add them to the git repo:

    david@volcano-base ~/data/tweets/data/js/tweets $ git add stop-words-english1.txt
    david@volcano-base ~/data/tweets/data/js/tweets $ git commit stop-words-english1.txt -m "File containing stop words"
    [master bec4115] File containing stop words
     1 file changed, 636 insertions(+)
     create mode 100644 stop-words-english1.txt

We’re now prepared to use the stop word list:

    david@volcano-base ~/data/tweets/data/js/tweets $ grep -o '\b[[:alpha:]]\+\b' all_tweets.txt | grep -xvi -f stop-words-english1.txt | head
    Yay
    google
    whoring
    sick
    Java
    passing
    day
    Impressive
    m
    writing

grep -f means “Use the lines from this file as patterns and match on any of them”

The flag -v as before means “invert the match”, i.e. only give us things that don’t match the pattern. -i tells it to match case insensitively and -x tells it to only match things which match the whole line (So e.g. the fact that we have a single character ‘i’ in the list shouldn’t exclude words containing i).
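The following sketch shows both the line-ending problem and the flags working together. The stop-word list and the words are made up, and it strips carriage returns with tr -d '\r' rather than sed, as a stand-in that does not depend on how a particular sed treats ‘\r’.

```shell
# Hypothetical stop-word list saved with Windows (CRLF) line endings.
stop_crlf=$(mktemp)
printf 'i\r\nand\r\nmore\r\n' > "$stop_crlf"

# Hypothetical words, one per line, as produced by grep -o.
words='I
get
more
and
sick'

# With a stray \r at the end of every pattern, no line matches,
# so nothing is filtered out.
unfiltered=$(printf '%s\n' "$words" | grep -xvi -f "$stop_crlf")

# Strip the carriage returns, then filter again:
# -f reads patterns from the file, -x only accepts whole-line matches,
# -i ignores case, and -v keeps the lines that did not match.
stop_lf=$(mktemp)
tr -d '\r' < "$stop_crlf" > "$stop_lf"
filtered=$(printf '%s\n' "$words" | grep -xvi -f "$stop_lf")

printf '%s\n' "$filtered"
```

After the clean-up only “get” and “sick” survive. Note that “sick” is kept even though it contains an “i”, because -x insists on the pattern matching the whole line.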

Annoyingly the word m is still coming through. Rather than add every single-character word to the stop words, let’s just change our search to only match words of three letters or longer:

    david@volcano-base ~/data/tweets/data/js/tweets $ grep -o '\b[[:alpha:]]\{3,\}\b' all_tweets.txt | grep -xvi -f stop-words-english1.txt | head
    Yay
    google
    whoring
    sick
    Java
    passing
    day
    Impressive
    writing
    Unleashing

Replacing the ‘\+’ with ‘\{3,\}’ has made the pattern only match words which are at least 3 characters long.

Now, let’s actually get the answer:

    david@volcano-base ~/data/tweets/data/js/tweets $ grep -o '\b[[:alpha:]]\{3,\}\b' all_tweets.txt | grep -xvi -f stop-words-english1.txt | sort | uniq -c | sort -nr | head
    1573 http
    1134 don
    901 people
    826 good
    794 angusprune
    641 LovedayBrooke
    610 time
    525 work
    504 petermacrobert
    484 bit

Err. Clearly there are some problems here.

First problem is that fairly obviously some of these are @replies showing up as words. Apparently word boundary (‘\b’) doesn’t mean what I thought it did. It’s also clearly extracting things like http from http:// and don from don’t.
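A small sketch makes the behaviour of ‘\b’ visible. The sample line is invented, and ‘\b’ is a GNU grep extension, so this assumes GNU grep.

```shell
# Hypothetical sample line containing a contraction, a URL and an @reply.
line="don't read http://example.com @angusprune"

# \b matches between a word character (letter, digit or underscore) and
# anything else, so apostrophes, colons, dots and @ all count as boundaries.
words=$(printf '%s\n' "$line" | grep -o '\b[[:alpha:]]\{3,\}\b')
printf '%s\n' "$words"
```

This prints don, read, http, example, com and angusprune, one per line: the @ and the apostrophe are treated as boundaries, so the username and the front half of “don’t” both pass as words.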

Here’s my replacement solution:

    david@volcano-base ~/data/tweets/data/js/tweets $ grep -o '[^[:space:]]\{3,\}\b' all_tweets.txt | grep -v '@' | grep -xvi -f stop-words-english1.txt | sort | uniq -c | sort -nr | head -n25
    887 people
    810 good
    600 time
    519 work
    443 problem
    361 code
    350 bad
    335 Yeah
    327 idea
    300 pretty
    300 point
    287 bit
    285 day
    279 find
    278 lot
    275 wrong
    250 coffee
    246 hard
    229 read
    224 feel
    221 thought
    216 twitter
    197 today
    191 long
    190 works

Basically instead of looking for purely alphabetic words we look for any sequences of non whitespace characters. We then filter out things with @ in them afterwards.

The results here turn out to be… really uninteresting. About the only things that look remotely specific to me on here are “coffee” and “code”, both of which it’s true I do care about quite a lot. Mmm. Tasty, tasty code.

Before I go, here’s one more thing:

    david@volcano-base ~/data/tweets/data/js/tweets $ cat *.js | jq -r '.[] | .entities.urls[] | .expanded_url' | head
    http://ncommandments.com/40
    http://twitpic.com/43pl3t
    http://twitpic.com/43nccl
    http://ncommandments.com/893
    http://ncommandments.com/42
    http://twitpic.com/4dnzhr
    http://yfrog.com/h2797xpj
    http://twitpic.com/4a1t1z
    http://twitpic.com/49yzs2
    http://twitpic.com/49euse

Every URL you’ve ever posted to twitter (well, it would be without the head at the end to truncate it). Just more chaining of jq filters – we unpack the arrays, then we get the URLs off the object as an array, then we unpack that, then we get the .expanded_url off the url objects.

And that’s about it for now. I can’t think of anything else I particularly want to do. Time line analysis might be interesting – i.e. what’s changed over time (particularly in terms of who I tweet at), but I’m not very interested in doing that right now so I think I’ll leave it there.

Questions? Anything you’d like to know how to do?

# A file that should exist in all your ruby projects

Alright, the title is maybe a little more general than is strictly valid. I’m making the assumption that if you’re writing ruby then you are, like me, the sort of trend bucking nonconformist that does exactly the same thing as all the other trend bucking non-conformists.

Specifically I’m assuming that if you are using ruby you’re also using bundler (you should be. It is the way and the truth and the light. There is no salvation save through bundler), and you are using git. If you’re not using the former you should fix that (did I mention you should fix that?). If you’re not using the latter then this might be worth reading anyway, but the specific file you’re going to need is different.

Anyway, at the root of your git repo you should have a file called “.gitattributes”. That file may contain various things, but the line it needs to contain is

    Gemfile.lock -merge

What’s going on here?

Well, the Gemfile.lock is basically a compiled and pinned down version of your Gemfile. You’re supposed to commit it to your repo so as to get a consistent gem environment across all the different platforms you run on.

The problem comes when you’re working with other people, or even just on different branches, and you make changes to the Gemfile (or even just the Gemfile.lock) on each of those different branches. You might get away with it, but there’s a good chance that when you merge the branches it will silently just merge your Gemfile.lock files. This is because the lock file is a text format so git assumes it’s safe to merge.

Sometimes this will cause you no problems, or will cause you problems that you notice very quickly. The problem is that often it will produce a Gemfile.lock that confuses bundler into working in some cases and much later down the line you will get really confusing bundler errors when you try to use it in a slightly different context (we’ve e.g. found that if you have two incompatible versions of a gem in your Gemfile.lock it can cause confusingly different results depending on what’s installed on your system).

So what does this .gitattributes do? Simple: It tells git never to merge your Gemfile.locks. For merge purposes it treats them as binary files and will generate a conflict at this point, thus localising the error to where the problem occurred instead of some distant point down the line.
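One way to check that the attribute is being picked up is to ask git directly. This is a minimal sketch, assuming git is installed; it uses a throwaway repository in a temporary directory rather than a real project.

```shell
# Create a scratch repository so nothing in a real project is touched.
repo=$(mktemp -d)
git -C "$repo" init -q

# The one line this post is about.
printf 'Gemfile.lock -merge\n' > "$repo/.gitattributes"

# git check-attr reports the value of an attribute for a path.
# The file does not have to exist yet for the attribute to apply.
attr=$(git -C "$repo" check-attr merge -- Gemfile.lock)
printf '%s\n' "$attr"
```

This prints "Gemfile.lock: merge: unset", which is how git reports an attribute that has been switched off with a leading minus sign.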