Yesterday, I explained the different ways that errors can be represented in your code. Today I’ll talk about error handling, which is what makes errors such a tricky subject to begin with.
Why is error handling so difficult?
- There are typically many more ways for a system or operation to fail than to succeed.
- Error handling is often not a “local” event but one that takes place across multiple application layers.
- Errors are context-dependent, meaning that different parts of a system need to interpret the same error in entirely different ways.
- Handling errors across an application boundary forces you to represent the same error in two different ways.
These aspects of error handling mean that you’ll need to write a lot of code to handle everything thoroughly. If you aren’t programming with some foresight or strategy, that code will end up being hard to write, understand, and change. I hope this post will help you write error handling code that’s succinct and predictable.
The Process
You need to consider error handling when writing just about any imperative code. That doesn’t mean you need to implement handling in every single function, but you should be very intentional about including or excluding it.
Here’s a general procedure for error handling. These instructions may not feel very specific, and that’s because they aren’t. The details of how to handle errors vary drastically depending on the technologies you’re using and the goal of your application. This procedure, on the other hand, is meant to apply to just about any part of any program.
1. Which Failures Can Be Detected?
Look at the block of code you’re writing. Which failures can be detected by your code? Remember, failures are predictable errors, which means that there should be a logical answer to this question. If the code is making a network request, you need to detect network errors. If the code is parsing a spreadsheet to insert data into a database, then you need to detect file I/O errors, parsing errors, domain model errors, and database errors.
Failures should be detected as soon as possible because you want to preserve context about how and why the error happened in order to handle it correctly. In other words, if code A calls into code B, and code B fails, then code B should detect the error and tell code A about it, instead of merely returning information that code A has to interpret as a success or failure.
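To make the "code A calls code B" idea concrete, here is a minimal sketch in Python. The names (`parse_quantity`, `QuantityError`, `describe`) are hypothetical and only serve the illustration. The point is that the parsing code reports its own failures explicitly, so the caller never has to interpret a sentinel value like -1.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ParsedQuantity:
    value: int


@dataclass(frozen=True)
class QuantityError:
    raw: str     # the input that failed, kept for context
    reason: str  # why it failed, decided where the failure happened


def parse_quantity(raw: str) -> Union[ParsedQuantity, QuantityError]:
    """'Code B': detects its own failures and reports them explicitly."""
    text = raw.strip()
    if not text:
        return QuantityError(raw, "cell is empty")
    try:
        value = int(text)
    except ValueError:
        return QuantityError(raw, "not a whole number")
    if value < 0:
        return QuantityError(raw, "quantity cannot be negative")
    return ParsedQuantity(value)


def describe(raw: str) -> str:
    """'Code A': is told about the failure instead of guessing at it."""
    result = parse_quantity(raw)
    if isinstance(result, QuantityError):
        return f"rejected {result.raw!r}: {result.reason}"
    return f"accepted {result.value}"
```

Because the error is created at the point of failure, it carries the original input and a specific reason, which is exactly the context that is hard to reconstruct later.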
2. Which Exceptions Should Be Detected?
Next, think about which exceptions should be detected by your code. Exceptions are errors that you don’t expect to happen, which means that you can’t count on detecting them near the “source” like you can for failures. Instead, you should detect exceptions where you can most conveniently react to them.
For example, in a typical web application, your server code should always expect to have a stable connection to the database. Losing that connection would certainly be an exception, but you shouldn’t try to detect that error in every single place that you interact with the database. Your data layer probably throws some kind of exception when that happens. The best place to detect that error might be in your router or controller, where you can simply return a 500 status and not bother with any other operations until the problem is fixed. Or if you need to undo part of a transaction, you may need to detect the error somewhere between the controller and the persistence layer.
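Here is a rough sketch of that database example in Python. `DatabaseUnavailable`, `find_user`, and `get_user_controller` are made-up names, and the `connected` flag merely simulates a lost connection. A real framework would have its own way of registering a handler like this.

```python
from typing import Tuple


class DatabaseUnavailable(Exception):
    """Hypothetical exception raised by the data layer when the connection is lost."""


def find_user(user_id: int, connected: bool = True) -> dict:
    # Data layer: no try/except here. A lost connection is exceptional,
    # so it simply propagates upward.
    if not connected:
        raise DatabaseUnavailable("connection lost")
    return {"id": user_id, "name": "Ada"}


def get_user_controller(user_id: int, connected: bool = True) -> Tuple[int, dict]:
    # Controller: the one convenient place to turn the exception into a response.
    try:
        user = find_user(user_id, connected)
    except DatabaseUnavailable:
        return 500, {"error": "Service temporarily unavailable"}
    return 200, user
```

The data layer stays free of defensive checks, and the decision about what a lost connection means for the client lives in a single place.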
Most languages will throw an error when the runtime encounters something deemed exceptional. That means you’re conveniently limited in the number of ways the error can be detected. It’s usually a good idea to throw your own custom errors so that you’re always using the throw/catch strategy for exceptional errors, but that isn’t strictly necessary. You can always pass error data between functions.
3. Control the Situation
If possible, turn the error into a domain type. Represent the error as some data structure that is defined by your application, not another library or application. If you detected the error by catching or pattern matching a data structure that you didn’t define, then you can always save a reference to that original error inside your own error data.
See Part 2 of this series for some ideas about how you can represent the error. And don’t feel bad if it seems like you’re writing a lot of code.
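As one possible shape for this step, the sketch below catches an error type owned by Python's standard `json` module and converts it into an application-defined type right away. `ConfigError` and `load_config` are hypothetical names. The original exception is kept in a `cause` field so nothing is lost.

```python
import json
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class ConfigError:
    """Domain error defined by the application (hypothetical name)."""
    message: str
    cause: Optional[Exception] = None  # reference to the original error, if any


def load_config(text: str) -> Union[dict, ConfigError]:
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        # Detected via a type we don't own; convert it immediately.
        return ConfigError("config is not valid JSON", cause=exc)
    if not isinstance(data, dict):
        # Detected by our own check, so there is no original error to keep.
        return ConfigError("config must be a JSON object")
    return data
```

Callers of `load_config` only ever need to know about `ConfigError`, even if the parsing library is swapped out later.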
4. Record the Error (Maybe)
The place where you detect an error is where you should log it. This is where you have the most data about what operation was being attempted, what the inputs were, and how the error was detected. If the error is exceptional, you might also want to report it to a third-party service like Sentry. Lean toward including too much data instead of too little. In my experience, annoyingly verbose logs tend to come from “normal” application states, not errors.
By logging the error immediately, you’ll ensure two things. First, once the logging is done, you can ditch any data that won’t be useful for recovery. This will make recovery more straightforward. Second, as you’re passing around data about an error that was already detected, you won’t have to wonder whether or not to log anything. You’ll know that it was already logged right away.
One thing I’ve struggled with is logging an error multiple times as it’s passed around. Doing so is tedious, but it might feel necessary in order to track down how an error gets handled. However, if you log the right information immediately, you should be able to trace through your own code to see how the error is handled. That way you can avoid redundant logs.
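A small sketch of "log once, where the error is detected," using Python's standard `logging` module. The logger name, `RowError`, `validate_row`, and `import_rows` are assumptions made for the example.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional

logger = logging.getLogger("importer")  # hypothetical logger name


@dataclass(frozen=True)
class RowError:
    row: int
    message: str  # the only detail the recovery code needs


def validate_row(row_number: int, cells: List[str]) -> Optional[RowError]:
    if len(cells) != 3:
        # Detection point: log everything we know, exactly once.
        logger.error(
            "row rejected: row=%d expected_cells=3 actual_cells=%d cells=%r",
            row_number, len(cells), cells,
        )
        # Pass along only what recovery needs; the rest is already recorded.
        return RowError(row_number, "wrong number of columns")
    return None


def import_rows(rows: List[List[str]]) -> List[RowError]:
    # Caller: collects errors without logging them a second time.
    errors = []
    for number, cells in enumerate(rows, start=1):
        error = validate_row(number, cells)
        if error is not None:
            errors.append(error)
    return errors
```

The verbose details go to the log at the moment of detection, while the `RowError` that travels through the rest of the program stays small.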
5. Recover or Delegate
Either “recover” from the error or pass it along. Recovery can mean a lot of things, but the important part is that control flow now belongs to code whose sole purpose is to deal with the error data or code that treats the error data as “normal” data. An example of the former would be rendering a 404 page for a client. An example of the latter is rendering errors on a web form while retaining the form’s inputs.
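The web form case can be sketched like this, with errors stored next to the inputs as ordinary data. `SignupForm`, `validate_signup`, and `render` are hypothetical, and the plain-text rendering stands in for whatever templating you actually use.

```python
from dataclasses import dataclass, field
from typing import Dict

FIELDS = ("email", "username")  # assumed form fields for the illustration


@dataclass
class SignupForm:
    values: Dict[str, str]
    errors: Dict[str, str] = field(default_factory=dict)


def validate_signup(values: Dict[str, str]) -> SignupForm:
    # Errors are collected as data; the submitted inputs are retained.
    form = SignupForm(values=dict(values))
    if "@" not in values.get("email", ""):
        form.errors["email"] = "Enter a valid email address."
    if not values.get("username", "").strip():
        form.errors["username"] = "Choose a username."
    return form


def render(form: SignupForm) -> str:
    # The same code path renders the form with or without errors.
    lines = []
    for name in FIELDS:
        lines.append(f"{name}: {form.values.get(name, '')}")
        if name in form.errors:
            lines.append(f"  ! {form.errors[name]}")
    return "\n".join(lines)
```

Nothing here is exceptional from the renderer's point of view. An error message is just another piece of the form to display.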
You may also want to log how the recovery will happen (e.g., “undoing x, y, and z” or “retrying…”). That log entry belongs right before the recovery happens, which is not necessarily where the error was detected.
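Retrying is one simple kind of recovery. The sketch below logs each retry right before it happens and passes the error along once the attempts run out. `TransientError` and `with_retries` are invented for the example, and a real version would probably wait between attempts.

```python
import logging
from typing import Callable

logger = logging.getLogger("recovery")  # hypothetical logger name


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def with_retries(operation: Callable[[], str], attempts: int = 3) -> str:
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: delegate to the caller
            # Log the recovery right before it happens.
            logger.warning("retrying (attempt %d of %d)", attempt + 1, attempts)
    # Only reachable when the loop never ran.
    raise ValueError("attempts must be at least 1")
```

Note that the function does both halves of this step: it recovers while it can, and it delegates when it can't.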
Passing an error along means that the “receiving” block will essentially need to start over at step three (above). The error may need a new representation if it’s moved into a new application layer, but ideally, that’s not the case. Similarly, you may need to log the error again, but hopefully not.
You should recover from every single error one way or another. Once again, “recovery” is extremely situational. Recall from Part 1 that errors can be either internal or external. Generally speaking, recovering from internal errors means letting the user know that something out of their control failed and then doing something to make the error less likely in the future (like reporting a bug). Recovering from external errors means telling the user what went wrong and how they can fix it, or telling them that something failed and that you may or may not be able to fix it.
Where to Go Next?
I hope this series has given you some strategies for defining, capturing, and handling errors in your application. All of this information applies to pretty much any application, which means you might be left with questions about errors in your specific tech stack.
I can’t go over all of those details here (partly because I’ve only had the opportunity to practice these ideas in three or four languages!), so I’ll leave you with a short list of more specific techniques that may or may not be applicable to you.
- Railway Oriented Programming – A strategy for easily treating error data like any other data
- Monadic Error Handling – Another strategy from the world of functional programming
- Discriminated Unions – Can be implemented with a nominal type system or structural type system
- Correlation IDs – Tying real instances of error events to a universally unique ID for traceability
- Transactions – Making sure a collection of operations all succeed, fail, or get reverted
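To give a taste of the first item on that list, here is a very small Python sketch of the railway-oriented style. `Ok`, `Err`, `bind`, and the two steps are my own illustrative names, not a standard library API. Each step returns either a success or an error, and `bind` skips the remaining steps once an error appears.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Ok:
    value: object


@dataclass(frozen=True)
class Err:
    reason: str


Result = Union[Ok, Err]


def bind(result: Result, step: Callable[[object], Result]) -> Result:
    # Stay on the failure track once an Err appears; otherwise run the next step.
    if isinstance(result, Err):
        return result
    return step(result.value)


def parse_age(raw: object) -> Result:
    try:
        return Ok(int(str(raw)))
    except ValueError:
        return Err(f"{raw!r} is not a number")


def check_adult(age: object) -> Result:
    if isinstance(age, int) and age >= 18:
        return Ok(age)
    return Err("must be 18 or older")


def register(raw: str) -> Result:
    # The pipeline reads top to bottom with no error checks in between.
    return bind(bind(Ok(raw), parse_age), check_adult)
```

The calling code treats the error like any other return value and decides only at the end of the pipeline what to do with it.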