A manifesto for error reporting

So I do a lot of debugging.

It’s not because I write a lot of broken code so much as that I seem to be a go to guy for “Something mysterious is happening. Can you lend a hand?”

That’s fine. It’s something I’m reasonably good at, and sometimes it’s even enjoyable.

But it means I have a bit of a triggering issue which will get me very angry.

“What is that triggering issue, David? Tell us, please!” I hear you say.

Well, I’m glad you asked!

That issue is very simple: Bad error reporting.

A thing developers don’t seem to understand is that what happens when things go wrong is every bit as important as what happens when things go right. Possibly more important.

If you don’t realise this, when things go wrong you will feel the wrath of myself and all the ops people who have to deal with your software floating through the air, trying to set you on fire with our minds.

While I’m reasonably sure psychic powers aren’t actually a thing, do you really want to take that chance?

So, if you don’t want to experience spontaneous geek rage induced combustion, here is some helpful advice for you to follow.

First, a word on process. When something goes wrong, the question I am asking is “How can I make this not go wrong?”. In order to answer this, I must first answer the following questions:

  1. Where has it gone wrong?
  2. What has gone wrong with it?
  3. Why has it gone wrong?

Your job as a writer of software is to make it as easy as possible for me to answer these three questions.

Next a note of context:

How am I attempting to answer this question?

Well, in an ideal world, I’m attempting to answer it because I have a nice precise test case which reproduces the problem.

However, I first need to get to that point, and in order to get to that point I need enough information to give a pretty good answer to the first two questions. An entire application is not a test case, especially not if it’s in a complicated deployment environment. I need enough information about where it has gone wrong to extract a smaller test case and I need enough information about what has gone wrong to put that test case in a state where it will demonstrate the problem.

So what I’m actually looking at initially is almost certainly a log file. It’s OK if this log file is really the screen of a console, but the point is that something, somewhere, has given me a textual record that says “Hey, something’s gone a bit wrong. Here’s some info for you to look at”.

There is a possibility that if you’re writing an application or a framework or something you have deliberately avoided producing such a textual record of anything, or are piping your errors to /dev/null or something. Hopefully this is not the case, because if it is you don’t need to worry about spontaneous combustion because whomever has to deploy and maintain your code has probably already tracked you down to your home address and killed you in your sleep. No jury would convict.

So, from now on, I’m assuming you’ve done the decent thing and there’s some way of going from errors that occur to logs of such errors.

What can you do to make everyone’s lives easier?

Error messages

Obviously the prerequisite of this is that you actually tell me something in your error message. You’d never just write an error message that said “Something went wrong”, right? So assuming you’ve already got error messages that tell me roughly what went wrong, here is how to have error messages that tell me exactly what went wrong:

If your error message is triggered by a value, for the love of god include that value in your error message.

People don’t seem to do this. I don’t understand why. It’s very simple.

Don’t do:

   error "Bad argument"

Do do:

   error "Bad argument: #{argument.inspect}"

Even better if you tell me exactly why it is invalid:

   error "You can't curdle a frobnitzer which has already been curdled: #{argument.inspect}"

(Side note: All examples here will be in ruby, because that’s mostly what I’ve been working with when this has been pissing me off. The examples should be easily portable and the principles are language agnostic).

That’s it. You’ve already made my life at least 27% simpler with this one step.

Why is this important?

It’s important because tracking data flow is hard. It’s entirely possible that the function you’ve raised an error in is about 50 calls deep. I can probably track down what has been passed to it eventually after carefully looking through calls and such-like, but I shouldn’t need to. If you are not including the value in your error message then you have exactly the information I need at your finger tips and are failing to tell me. That’s kinda a dick move.

Exceptions are awesome. Do more of those

You know what are great?

Exceptions. Exceptions are great.

I mean obviously I’d rather if your code isn’t throwing exceptions, but I’d rather it’s not throwing exceptions because it doesn’t need to because everything is going swimmingly, not because it wouldn’t throw them if something went wrong.

Why are exceptions great?

Exceptions are nice for structuring error handling in code, they provide good classification for error recoveries, etc. etc.

That’s not what I care about here.

Exceptions contain one thing that elevates them to the status of patron saint of people who have to debug problems.

They carry a stack trace.

It’s like a glorious little audit trail that points the finger at exactly where the problem occurred. If you’ve followed the previous instructions and given them a good error message too then you’ve probably told me exactly what I need to know to reproduce the problem (there are some, ahem, exceptions to this which I will get on to later, but this is true most of the time).

Side note: I know this isn’t true in all languages. e.g. C++ exceptions don’t carry stack traces (I think) and Haskell ones have less than useful stack traces due to lazy evaluation. You have my sympathies. Everyone else, no excuses.

Further, they carry exactly the information I want to appear in the log on top of that: An error category and a message. An exception which bubbles up to the log file is my best friend for problem debugging.

Some specific notes on exceptions:

If you see an exception, say an exception

Never hide exceptions from me. Ever.

If you catch an exception, I need to know about it unless you’re really goddamn sure I don’t (examples where you may validly be goddamn sure I don’t include Python’s StopIteration and any other exceptions used for control flow. Yes this is a valid thing to do).

I don’t care if you send an email, dump it in a log file, whatever you want. I just need to know about it, and I need to know at the very least the exception class, the exception method and for the love of god the exception stack trace.

Thou Shalt Not Fuck With The Stack Trace

A lot of frameworky things (rails, rspec, etc. I define framework as any library or application where the usage pattern is “Don’t call our code, we’ll call yours”) think that exceptions are confusing and unhelpful. They might show you some of the stack trace, but you really don’t want the whole thing do you? Here, let us filter out those unhelpful bits.

NO.

NO NO NO NO NO NO NO NO NO NO.

NO.

Bad developer. Wrist slap. No cookie.

The chances that you actually correctly understand what is the important bit of the stack trace are effectively zero. Even if you somehow manage to correctly understand this, you are removing important context. The lack of that context will confuse me more than its presence. If I ever find you are doing this I will simply have to do everything again with the “stop lying to me you bastard” flag turned on.

And that’s terrible.

Except…

There is one case in which fucking with the stack trace is not only permitted but also mandatory.

It is OK to add more information to the stack trace.

In particular, if there is another stack trace involved you should also include that.

I often see code like this:

begin
   ...
rescue LowLevelException => e
   raise MyLibrarySpecificException(e.message)
end

Please take my outrage as read.

It doesn’t look like you’re doing it but you are once again fucking with the stack trace. Remember what I said about not doing that?

It’s OK to wrap exceptions. I understand the reasoning for doing it, and it’s often a good idea.

However: Your language almost certainly gives you the capability to override the stack trace. When you are wrapping an exception you must do this so that it includes the original stack trace. Ideally you would include both back traces, so your logs would contain something like:

MyLibrarySpecificException: Wrapped LowLevelException: "A message"
   this
   error
   was 
   thrown
   here
   -- WRAPPED BACKTRACE --
   the 
   original
   error
   was 
   thrown
   there

The details don’t matter. The point is: Include both back traces if you can, include only the original stack trace of the exception you’re wrapping if you absolutely must.

Here’s an example of how you can do that in Ruby:

class MyLibrarySpecificException < StandardError
  attr_reader :wrapped_exception
 
  def initialize(wrapped_exception)
    super("Wrapped #{wrapped_exception.class}: #{wrapped_exception.message}")
    @wrapped_exception = wrapped_exception  
  end
 
  def backtrace
    super + ["---WRAPPED EXCEPTION---"] + wrapped_exception.backtrace
  end
end

Enough of exceptions. Some more general principles.

If something goes wrong, tell me

This rant isn’t about Padrino, but it was a triggering point for it.

One of Padrino’s more interesting behaviours is that if you have a syntax error in one of your controller files it won’t fail to start. Instead what will happen is it will log a warning, continue loading and then just go “Eh, I don’t know anything about that” if you try to use routes defined in a controller it failed to load.

This is not helpful.

A common design principle seems to be that you should attempt to do the right thing – recover from errors, guess what the user meant, etc.

This is really not helpful.

The problem with fuzzy behaviour is that it produces fuzzy results. Postel’s Law is deeply unhelpful for library design: Code which you are running should be correct. If it’s a bit wrong, you should not attempt to run it, you should error out and make me fix my code.

This is because errors in code are signs of error in thought. The chances of my accidentally calling your code with the wrong value is much higher than the chances of me deliberately being a bit sloppy (and if I’m deliberately being a bit sloppy it’s OK to slap my wrist and punish me for it). Code which is doing the wrong thing is going to be a problem now or a problem later, and I’d much rather you told me it was a problem now so I can fix it now rather than having to locate it later.

On the subject of “now rather than later”.

Validate early, validate often

Suppose I write the following code:

class HelpfulHashWrapper
  def initialize(hash)
    @hash = hash
  end
 
  def do_something(some_key)
    return @hash[some_key]
  end
end

(ignore the fact that this class is stupid)

Now suppose I do the following:

1.8.7 :029 > my_wrapper = HelpfulHashWrapper.new(nil)
 => #<HelpfulHashWrapper:0x7f12d7bd60a0 @hash=nil> 
1.8.7 :032 > my_wrapper.do_something "hi"
NoMethodError: undefined method `[]' for nil:NilClass
	from (irb):26:in `do_something'
	from (irb):32

Where is the error here?

Hint: It’s not the point where the exception was raised.

I constructed the HelpfulHashWrapper with an argument that was never going to work. My HelfpulHashWrapper unhelpfully didn’t tell me that I had put it into an invalid state.

Why is this important?

Remember when I said that the first question I needed to be able to answer was “Where has it gone wrong?”

If I get an error when I try to use an object in an invalid state, I’m not really able to answer that question. Instead what I need to do is back track to the point where the object got put into an invalid state. This is hard work. The following version of the class will make my life much easier:

class HelpfulHashWrapper
  def initialize(hash)
    raise "I can only helpfully wrap hashes. #{hash.inspect} is not a hash" unless hash.is_a? Hash
    @hash = hash
  end
 
  def do_something(some_key)
    return @hash[some_key]
  end
end

I will now discover very early on when I’ve done something wrong, rather than waiting to find it at a mysterious later date.

Basically: The closer to the original error you report the problem, the easier it is for me to identify and fix the problem.

In summary

  1. Above all else, give me helpful error messages
  2. Helpful error messages contain any invalid values and a reason as to why they’re invalid.
  3. Throw exceptions if something goes wrong.
  4. Your application should record all exceptions it receives.
  5. Do not fuck with the stack trace
  6. Do not attempt to help me by not throwing an exception. If something maybe should throw an exception, it should throw an exception.
  7. Validate your internal state, and throw an exception when your state becomes invalid, not when I try to use it in an invalid state.

Doing these things will significantly reduce my blood pressure, will make your ops guys love you (or at least resent you slightly less bitterly), and will reduce your chances of spontaneous combustion by at least 43%.

This entry was posted in programming on by .

16 thoughts on “A manifesto for error reporting

  1. Henry

    Amen.

    What especailly drives me crazy is when people cover-up errors in configuration. That is always a What Where They Thinking moment. Or nightmare.

    Fail fast is your friend. Even, nay- especially, if you are trying to build robust systems.

    Reply
  2. cavetroll

    Great post! I agree with alot of the points. A comment on C++ exceptions: default STL exceptions don’t carry stack traces. However, at least in UNIX, you can use the system backtrace() utility to get a stack trace, compile the program with debug symbols, and use abi::__cxa_demangle() from the cxxabi.h to display the symbols nicely. All this can be rolled into a custom exception class. So – possible, but requires a bit of overhead :)

    Reply
  3. Pingback: An Error Reporting Manifesto | ebeab

  4. Adrian Smith

    Agree with all these points. Further: if code is logging something, then of the two options:

    1. Log in advance, “Will write XML file to $path…”

    2. Log afterwards, “Successfully written XML file to $path.”

    Then one should definitely do the first. Although it feels nice to log after the event, because you might have more information such as IDs assigned, if the process crashes, it’s important to know what it was trying to do when it crashed. If there’s information after the event that must be printed then do a second log.

    Reply
  5. Daenney

    I can agree with you on everything, except for one thing:

    It’s OK if this log file is really the screen of a console

    In my experience, if a developer doesn’t care enough to set up a proper logging facility with the option to write to file, syslog, whatever and instead just expects you to redirect shell output to something else than /dev/null they haven’t really thought about logging which will probably bite you later on. It’s code that’s usually a mess of log and print statements in exception blocks that go “You should never get to see this”. Oh djeesch, thanks.

    In most languages it’s not difficult at all to set up a logger and make it user-configurable, so just do it already.

    Reply
    1. david Post author

      So the use case I had in mind here was actually in test and development where I think logging to the console is eminently sensible.

      However I will say in its defence that I think logging to STDOUT and STDERR is kindof acceptable behaviour for production systems when you want to decouple your code from your deployment strategy a bit. It’s a legitimate school of thought that applications shouldn’t self-daemonize, and as such shouldn’t know about where their log files live and instead you should use something like libslack’s daemon program do daemonize them.

      Reply
      1. Sheldon Hearn

        Followers of that school of thought pay much for their ideal. If you don’t daemonize yourself, don’t manage your own pidfile, hold a single log file descriptor open for your lifetime, you demand the selection of weak operational strategies.

        Or maybe I just haven’t seen the big silver bullet that came along and made everyone forget how horrible daemon(8), supervise(8) and friends were.

        But to maintain perspective… this is a big gripe with a small comment on a small facet of a largely great post. :-)

        Reply
      2. david Post author

        We’ve actually found daemon (the libslack one) pretty good. Or at least good enough for our limited usage.

        One thing about the “hold a single log file descriptor open” is that the descriptor you’re holding open doesn’t have to be the actual log file you’re writing to – output to it can go to syslog or some other central logging service, and the managing program can handle signals that cause it to close and reopen the descriptor, etc.

        I’m not religiously of the “This is how it should work”, but it has a lot of appeal in terms of allowing you to configure how your daemons behave rather flexibly rather than having to have the logic in every program you run.

        Reply
  6. jrochkind

    I would add: Tell me in docs what exception classes a method might raise, in what circumstances.

    In a language like ruby, in docs is the only place it can be. In a language like Java, _theoretically_ this aspect is self documenting… but in practice, there are popular ways to obfuscate it anyway.

    Reply
  7. Phuong Ng

    I don’t think it is ever a good idea to fuck with the stack trace. I did it once and my dick became infected, I was lucky it didn’t have to be amputated.

    Reply
  8. Anders Hovmöller

    I have another pet peeve that you didn’t mention:

    Never EVER raise an exception with multiple error causes. You see this all the time in code with errors such as “Either the frobnitz isn’t frobbed or the barblang is not blanged”. Obviously the code has checked these two cases separately, so why the hell aren’t you telling me which problem it was?!

    Reply
    1. david Post author

      Oh yeah, good point.

      One thing that I will say in… not exactly defence, but explanation, is that often these are the results of translating low level signals and error codes. e.g. you see this a lot with file system operations. e.g. a file system move will return the same error code if the source file or target directory don’t exist. It’s not valid to then recheck to see which of the two conditions failed because concurrency.

      Reply
  9. Pingback: A (rewritten) manifesto for error reporting | David R. MacIver

  10. Pingback: A case study in bad error messages | David R. MacIver

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>