Building resilient applications: Exceptions are not that exceptional
Sometimes, things go wrong in an application. Most popular languages use exceptions to handle these alternative flows, where things do not go the normal way. The name already gives away that exceptions don’t represent the regular flow of an application; they represent an exceptional flow. In his article “Vexing Exceptions”, Eric Lippert explains that using exceptions to control a common flow is an anti-pattern (he calls these vexing exceptions). However, just because something is exceptional doesn’t mean it doesn’t happen on a regular basis.
This sounds like a contradiction, but it isn’t once we take scale into account. If 1 out of every 100,000 requests to an external web service fails, failure is exceptional (after all, the request succeeds in the vast majority of cases). At the same time, if your application consists of multiple microservices that together fire off a million requests a day (around 12 requests per second for all services combined), you’ll have a failing request every 2.4 hours on average.
This means that a resilient application, especially one that deals with external dependencies like other web services or a database, needs to be able to:
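The arithmetic behind these numbers can be checked with a few lines of Python. This is only a worked example of the figures quoted above; the variable names are made up for illustration.

```python
# Figures taken from the paragraph above.
requests_per_failure = 100_000   # 1 failure per 100,000 requests
requests_per_day = 1_000_000     # all services combined

seconds_per_day = 24 * 60 * 60
requests_per_second = requests_per_day / seconds_per_day
failures_per_day = requests_per_day / requests_per_failure
hours_between_failures = 24 / failures_per_day

print(f"{requests_per_second:.1f} requests per second")
print(f"{failures_per_day:.0f} failures per day")
print(f"one failure every {hours_between_failures:.1f} hours on average")
```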
- Identify early that a problem exists
- Properly handle the situation
- Avoid making the situation worse
Identify a problem early
The “best” worst-case scenario is when, for example, your application tries to access an external service and fails directly due to a problem with the network or the service. In this case, an exception is thrown that makes your application immediately aware of the problem and allows you to handle it.
But there’s also a category of problems that the application is not aware of. For example, in networking, when a TCP connection is established and something happens to the connection below the TCP level, neither end of the connection is notified that the connection is broken. This can cause your application to wait indefinitely and even become unresponsive to new requests itself, thereby cascading the problem.
That’s why it’s good practice to use timeouts in your code when dealing with external dependencies. With a timeout, you explicitly state the maximum amount of time you are willing to wait for a response when firing a request. If there’s no response in time, an exception is thrown that allows you to handle it.
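As a minimal sketch of this idea, the Python snippet below uses the standard library socket module to put an upper bound on how long a call can block. The function name read_with_timeout, the "ping" message and the two-second default are assumptions made for illustration; they do not come from any particular service.

```python
import socket


def read_with_timeout(host, port, timeout_seconds=2.0):
    """Send a request and wait at most timeout_seconds for each network operation.

    Hypothetical helper: returns the reply bytes, or None if the other side
    did not answer in time.
    """
    try:
        # The timeout applies to connecting and to every later send/recv
        # on this socket.
        with socket.create_connection((host, port), timeout=timeout_seconds) as conn:
            conn.sendall(b"ping\n")
            return conn.recv(1024)
    except socket.timeout:
        # Without a timeout, recv() could block forever on a dead connection.
        # Here the caller gets control back and can fall back or report.
        return None
```

Returning None is just one way to hand the problem back to the caller; letting the exception propagate to a place that knows about fallbacks would work equally well.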
Properly handle the situation
By default, an application crashes when an exception is not handled, meaning the request or even the whole application terminates. This is good behavior for exceptions that can’t be handled in a sensible way, like out-of-memory exceptions. The best approach here is to fail fast and minimize uncontrollable behavior.
But exceptions that can be handled shouldn’t merely be handled; they should be treated as a normal flow that can be expected. Ask yourself whether you can implement a fallback in case something goes wrong. When you want to retrieve some information from an external service but it is unavailable, you could, for example:
- Retry it for a number of times
- Try to use an alternative service
- Use a (perhaps slightly outdated) cached version of the information requested
- Use a default value
- Give the user the option to try again later
The thing to remember is that a failure is just as valid an outcome as a real result, and that you can temporarily degrade part of your system instead of failing completely.
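A few of the fallbacks listed above can be chained together. The sketch below retries a call a number of times, then falls back to a possibly outdated cached value, and finally to a default. The function fetch_with_fallbacks, its parameters and the choice of which errors count as expected are all assumptions made for illustration.

```python
import time


def fetch_with_fallbacks(fetch, cache, key, default,
                         retries=3, delay_seconds=0.1,
                         expected_errors=(ConnectionError, TimeoutError)):
    """Try fetch(key) a few times, then fall back to the cache, then to a default."""
    for attempt in range(retries):
        try:
            value = fetch(key)
            cache[key] = value  # remember the last good answer
            return value
        except expected_errors:
            # Only errors we expect are treated as a normal flow;
            # anything else still propagates and fails fast.
            if attempt < retries - 1:
                time.sleep(delay_seconds)
    if key in cache:
        return cache[key]  # perhaps slightly outdated, but usable
    return default
```

A caller passes in its own fetch function and a dictionary that serves as the cache, for example fetch_with_fallbacks(get_exchange_rate, rate_cache, "EUR/USD", default=None), where get_exchange_rate and rate_cache are hypothetical names.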
Avoid making the problem worse
If one component in your system fails, you should take care that this problem does not cause a chain reaction and isn’t cascaded throughout the entire system.
When a service is under heavy load and is struggling with it, it doesn’t help if it keeps getting flooded with requests. The service then stays under heavy load and the situation only gets worse. What helps is if the application making the requests gives the service some rest when it’s failing. You can do this by implementing the Circuit Breaker pattern. Like a real circuit breaker, it trips immediately when a problem arises, preventing additional load on the system. After a certain amount of time, it starts probing the service to check whether it is up again.
This pattern helps you isolate a faulty component, stop stressing it further, and prevent dependent components from becoming increasingly busy themselves and causing failure throughout the entire system.
Keep in mind that failures will happen, so treat them as such. Make sure you control failures as much as possible and think of good fallback scenarios to deal with them.
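To make the pattern concrete, here is a minimal, single-threaded sketch in Python. It is not taken from any library: the class and parameter names are hypothetical, and the failure threshold and reset timeout are assumed values you would tune for your own system.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the failing service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds to leave the service alone
        self.clock = clock                          # injectable, which makes testing easy
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail immediately without adding load to the service.
                raise CircuitOpenError("circuit open; service is being given a rest")
            # Timeout elapsed: let this one call through as a probe.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-open after a failed probe
            raise
        # Success: close the circuit and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each request, for example breaker.call(fetch_profile, user_id) with a hypothetical fetch_profile function, and treat CircuitOpenError like any other failure that deserves a fallback. A production version would also need locking if the breaker is shared between threads.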
If you want to know more about this topic and want to build resilient applications, I highly recommend reading the book “Release It!” by Michael T. Nygard. This book is full of patterns for making your application more robust and production-ready.
This blog was written by Harm Pauw.