A (rewritten) manifesto for error reporting

So I wrote A manifesto for error reporting. I stand by it entirely, but it ended up more of a diatribe than a manifesto, and it mixed implementation details with end results. This post contains largely the same information, but with less anger and hopefully a clearer presentation.

The Manifesto

This is a manifesto for how errors should be reported by software to the technical people whose responsibility it is to work with that software. It is primarily focused on the information that programmers need, but it’s going to be a lot of help to ops people as well. The principles described here apply equally whether you’re writing a library or an application. They do not apply to how you should report errors to a non-technical end-user. That’s an entirely different problem.

This is primarily about how errors appear when represented in text formats – either through some sort of alerting mechanism or logs. It doesn’t cover more advanced tools like debuggers and live environments. Textual reports of errors are a lowest common denominator across multiple languages and are important to get right even if you have better tools.

Think about how people will debug your code

The guiding principle you should follow is that the way you report errors is important, and that you should think as carefully about how you convey information in failure cases as how you behave in non-failure cases. A moderate amount of careful forethought at this point can prevent a vast amount of effort and frustration at a later date.

In particular, when crafting software you should think about what information the person who is attempting to debug the problem is going to need. This information primarily takes three forms:

  1. What, specifically, is the problem that occurred?
  2. What has triggered this problem?
  3. Where in the code has this problem occurred?

If you bear these three questions in mind, and make sure to provide enough information to answer them, it will stand you in good stead. What follows is some specific advice for helping people answer these questions.

Be as specific as you can in your error messages

Your error messages should not be too long – a sentence is typically more than enough. They should however be descriptive, and tell you what happened.

A bad example of an error message is:

Invalid state

Better is:

Transaction aborted

Better yet:

Cannot commit an aborted transaction

Rather than merely telling you that the state is invalid, the error message tells you which invalid state you were in and what it is preventing you from doing.
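As a minimal sketch of how that last message might be produced, here is a toy transaction in Python. The Transaction and TransactionError names, and the set of states, are invented for illustration and are not taken from any particular library.

```python
class TransactionError(Exception):
    """Raised when a transaction is used in a way its state does not allow."""


class Transaction:
    def __init__(self):
        self.state = "open"

    def abort(self):
        self.state = "aborted"

    def commit(self):
        # Name both the state we are in and the operation it is blocking,
        # rather than reporting a generic "Invalid state".
        if self.state == "aborted":
            raise TransactionError("Cannot commit an aborted transaction")
        if self.state == "committed":
            raise TransactionError(
                "Cannot commit a transaction that is already committed"
            )
        self.state = "committed"
```

Each invalid state gets its own message, so the person reading the log does not have to guess which one applied.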

Error messages should contain pertinent information about the values that produced them

This is not a good error message:

Index out of range

This is:

Index 8 is out of range for array of length 7

You could also do:

Index 8 is out of range for array [1,2,3,4,5,6,7]

The problem with this is that if the array gets very large then so does the error message. So while error messages should contain information about the values that generated them, they do not need to contain the entire value: only enough information about it to say why it triggered this error.
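One way to get both behaviours is to include the value but cap how much of it is printed. The sketch below is in Python; the summarise and get_item helpers and the 40 character limit are arbitrary choices made for this example.

```python
def summarise(value, limit=40):
    """Return repr(value), truncated so a huge value cannot flood the log."""
    text = repr(value)
    if len(text) <= limit:
        return text
    return f"{text[:limit]}... ({len(text)} characters in total)"


def get_item(items, index):
    if not 0 <= index < len(items):
        # Report the offending index, the length, and a bounded view of the data.
        raise IndexError(
            f"Index {index} is out of range for array of length {len(items)}: "
            f"{summarise(items)}"
        )
    return items[index]
```

For a short list the whole value appears in the message. For a list of a thousand elements only the first few appear, along with a note of how much was left out.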

Another error message you shouldn’t generate from the exact value:

Failure to process credit card number XXXX XXX XXX XXX

Even ignoring the specific laws around processing credit card numbers, you should obviously not be logging confidential or secret information about users like this.

So there are reasons why your error messages can’t always sensibly contain the full values that triggered them. That being said, it’s much easier to recreate a problem if you can recreate the exact value, so it’s a good default to include more rather than less, and you should certainly be including some.
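For secret values, a compromise is to log a masked form that still lets you tell one value from another. The Python sketch below assumes that the last four digits of a card number are acceptable to log in your environment, which you should check against whatever rules apply to you. The mask_card_number and process_payment functions are hypothetical, and process_payment always fails purely to show the message.

```python
class CardProcessingError(Exception):
    pass


def mask_card_number(card_number):
    """Replace all but the last four digits with 'X' (assumed safe to log)."""
    digits = [c for c in card_number if c.isdigit()]
    if len(digits) <= 4:
        # Too short to reveal anything safely, so mask everything.
        return "X" * len(digits)
    return "X" * (len(digits) - 4) + "".join(digits[-4:])


def process_payment(card_number, amount_pence):
    # Stand-in for a real payment call, which is outside the scope of this sketch.
    raise CardProcessingError(
        f"Failure to process payment of {amount_pence} pence "
        f"with card {mask_card_number(card_number)}"
    )
```

The message still identifies which card and which amount were involved, without the full number ever reaching the logs.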

Error messages should locate where in the code they occur

In an ideal world, every error message would come with a complete stack trace that says exactly the chain of calls that it went through to get there. If absolutely necessary, and if you’re generating good and expressive error messages, it’s sufficient to include just the file and line number where the error occurred, but it’s not perfect and gives you much less information about how the problem was triggered.

The reason this is so important is that determining where the problem occurs in code is one of the first steps of any debugging process, so you can save a lot of time and effort for the person debugging by doing this for them at the point of the error.

In most languages, if you are using exceptions, you get pretty close to this by default. In C or C++ you can apparently also do this with the backtrace function on systems that provide it (glibc on Linux, for example), although it is not part of POSIX itself.

Additionally, you should make a best effort to include stack traces when crossing process boundaries through RPC mechanisms: If a remote procedure can reasonably report a stack trace, it should report a stack trace and you should include that in your error report.
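In Python, for instance, the standard logging module will attach the stack trace for you if you log from inside an exception handler. This is a minimal sketch: parse_port, load_config, run and the logger name are made up for the example, and the log is written to an in-memory buffer only so that the snippet is self-contained.

```python
import io
import logging

log_output = io.StringIO()
logger = logging.getLogger("error_report_demo")
logger.addHandler(logging.StreamHandler(log_output))
logger.propagate = False  # keep the demo output out of the root logger


def parse_port(text):
    return int(text)


def load_config(settings):
    return parse_port(settings["port"])


def run(settings):
    try:
        return load_config(settings)
    except ValueError:
        # logger.exception logs at ERROR level and appends the full traceback
        # of the exception currently being handled.
        logger.exception("Could not load config with port=%r", settings.get("port"))
        return None
```

Calling run({"port": "eighty"}) leaves a log entry containing the message, the offending value, and the chain of calls through load_config and parse_port that led to the failure.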

You should not mask lower level errors

It is common to wrap lower level errors in high level ones. It is also common to alter the display of errors in code you’re calling – e.g. in testing frameworks.

When you do either of these things, the golden rule you must follow is that you should not remove information from the lower level errors, as they may be the best information the developer debugging the problem has about what actually went wrong.

In particular, if you are rethrowing exceptions you need to take steps to ensure that you include the original stack trace and error message (in many languages it is possible to alter the stack trace of the exception you’re throwing, and you can use this to chain the stack traces together).

Additionally, you should never remove stack trace elements for display. It is acceptable to, for example, compress adjacent repeated lines into a single one with a counter for repetitions: it’s OK to change the display, but not to remove information.
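Python supports this kind of wrapping directly through exception chaining with raise ... from. In the sketch below, ConfigError, read_timeout and describe are hypothetical names chosen for the example.

```python
import traceback


class ConfigError(Exception):
    pass


def read_timeout(settings):
    try:
        return float(settings["timeout"])
    except (KeyError, ValueError) as exc:
        # "from exc" stores the original exception as __cause__, so its
        # message and stack trace are kept alongside the higher level error.
        raise ConfigError(f"Invalid timeout setting in {settings!r}") from exc


def describe(exc):
    """Format an exception, including every exception chained beneath it."""
    return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
```

Formatting the ConfigError with describe prints the original ValueError or KeyError first, then the higher level error, so nothing from the lower level is lost.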

Error conditions should not be covered up

It is often tempting to believe that it is the code’s responsibility to attempt to cover for an error and keep on working regardless. Sometimes this is even viable and true. Sometimes however an error is more likely to be a sign of developer error which should be addressed sooner rather than later, and even when it is not an obvious developer error it is likely a symptom of something going genuinely wrong.

As a consequence, unless an error condition is genuinely routine (a rough rule of thumb here would be “Can reasonably be expected to happen multiple times a day and we’re not going to do anything about that”), it should be reported. It is fine for the code to recover from the error and attempt to proceed regardless, but the error needs to be logged. Even if it’s not a problem that needs fixing, it may turn out to be symptomatic of other problems.
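A small Python sketch of recovering while still reporting: the code falls back to a default but logs what it ignored and why. The retries setting, the default of 3, and the logger name are all assumptions made for this example, and the in-memory log buffer is there only to keep the snippet self-contained.

```python
import io
import logging

recovery_log = io.StringIO()
logger = logging.getLogger("recovery_demo")
logger.addHandler(logging.StreamHandler(recovery_log))
logger.propagate = False  # keep the demo output out of the root logger

DEFAULT_RETRIES = 3


def retries_from(settings):
    raw = settings.get("retries", DEFAULT_RETRIES)
    try:
        return int(raw)
    except (TypeError, ValueError):
        # Recover, but leave a record of the bad value and the stack trace.
        logger.warning(
            "Ignoring invalid retries value %r; using default %d",
            raw,
            DEFAULT_RETRIES,
            exc_info=True,
        )
        return DEFAULT_RETRIES
```

The caller never sees a failure, but anyone investigating odd retry behaviour later will find the bad value in the log.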

Errors should be reported when you enter an invalid state, not just when you attempt to operate whilst in one

One of the most common errors to see in a Java application is a NullPointerException. In Ruby it’s similarly common to see a NameError or a NoMethodError.

Almost always this is because a value has been allowed into somewhere that it shouldn’t be permitted.

Other forms of invalid state are also possible, but they basically all come down to the same thing: your error is not caused by what you are currently doing, it is caused by what has come before. Your debugging now has to backtrack to find the point at which the object was put into an invalid state, because where the error appears to be occurring is of no help to you.

The solution to this is to validate your state when it changes: If data is only permitted to be within a certain range of values, check that it belongs to that range of values when you set or change it. This means that the problem will be caught at the point where it occurs rather than the point where it causes problems.
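As a sketch of what validating on change can look like in Python, a property setter can check the value every time it is assigned, including in the constructor. The Order class and its permitted range of 1 to 1000 are invented for the example.

```python
class Order:
    def __init__(self, quantity):
        self.quantity = quantity  # goes through the validating setter below

    @property
    def quantity(self):
        return self._quantity

    @quantity.setter
    def quantity(self, value):
        # Reject bad values at the moment they are assigned, so the stack
        # trace points at the code that supplied them.
        if not isinstance(value, int):
            raise TypeError(f"Order quantity must be an int, got {value!r}")
        if not 1 <= value <= 1000:
            raise ValueError(
                f"Order quantity must be between 1 and 1000, got {value}"
            )
        self._quantity = value
```

With this in place, Order(None) fails immediately with a message naming the bad value, instead of producing an unrelated looking error much later when the quantity is first used.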

Recap

In summary:

  1. Think about how people will debug your code
  2. Be as specific as you can in your error messages
  3. Error messages should contain pertinent information about the values that produced them
  4. Error messages should locate where in the code they occur
  5. You should not mask lower level errors
  6. Error conditions should not be covered up
  7. Errors should be reported when you enter an invalid state, not just when you attempt to operate whilst in one

If you do all these things, your applications and libraries will be much easier to debug and maintain, and the people who have to do so will thank you.
