Graceful Degradation — Why It Is Important And How To Achieve It
High availability is the norm for today’s cloud-enabled applications. They are expected to be always up, ready to serve customers no matter what happens.
Whether that is warranted or not is a topic for a separate discussion, but we know that, unlike our stable development environment, where load and availability are hardly an issue, production is the Wild West. Things fail constantly, and we are one social media post away from going viral and receiving unprecedented traffic.
While there are things you can, and likely should, do to reduce the chances of failure, here I would like to continue the theme of “embracing” failures and planning to handle them more gracefully.
How Reliable We Are
With the exception of very simple use cases, your application likely relies on more than one dependency.
Take an application that has three dependencies (Services A, B, and C) and needs all of them to serve a request. Assuming they fail independently of each other, we can consider a simplified model where:
Reliability (R) = R_A * R_B * R_C
And the failure rate is
Failure Rate (λ) = 1 - R
If we have in our example:
Service A (R_A): 0.95 (95% reliable)
Service B (R_B): 0.90 (90% reliable)
Service C (R_C): 0.98 (98% reliable)
Our service would be
R = 0.95 * 0.90 * 0.98 = 0.8379
λ = 1 - 0.8379 = 0.1621, or ~16%
As we can see, the combined reliability is lower than that of our least reliable dependency, and any drop in a single dependency drags the whole service down with it.
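To get a feel for this sensitivity, the simplified model can be sketched in a few lines of Python. This is only an illustration under the same assumptions as above (every dependency is required and failures are independent); the function names and the "what if B drops to 0.80" scenario are made up for the example.

```python
import math

def combined_reliability(reliabilities):
    """Reliability of a service that needs every dependency to succeed."""
    return math.prod(reliabilities)

def failure_rate(reliabilities):
    """Probability that at least one dependency fails."""
    return 1 - combined_reliability(reliabilities)

# Values from the example above.
dependencies = {"A": 0.95, "B": 0.90, "C": 0.98}
baseline = combined_reliability(dependencies.values())
print(f"R = {baseline:.4f}, failure rate = {failure_rate(dependencies.values()):.4f}")

# Hypothetical scenario: only Service B gets worse, from 0.90 to 0.80.
degraded = {**dependencies, "B": 0.80}
print(f"R with B at 0.80 = {combined_reliability(degraded.values()):.4f}")
```

Because the values are multiplied, every extra dependency and every drop in a single one lowers the overall figure.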
If things break all the time, does it mean we have no chance of improving our situation?
The World Is Not Binary
As developers, we tend to approach things as binary: they either succeed or they fail. As presented in “Your Coffee Shop Doesn’t Use Two-Phase Commit”, reality is both more complex and, to a certain extent, more forgiving than our atomic mindset would suggest at first.
Imagine that you stop by your favorite store to purchase a book as a gift. You pick up the book, get in line, and when you reach the cashier, he tells you they are out of gift wrap.
Does the store cancel your purchase? Or does it offer you alternatives, such as “Do you want to purchase without the wrapping?” or “We can wrap it for you and notify you once it is ready to be picked up”.
In another scenario, your application contacts a carrier to obtain the shipping cost for a package. It is a high-traffic season and the service is too slow or unresponsive.
Do you inform the customer and cancel the process? Or is it possible to assign, in this exceptional case, a pre-calculated cost based on the historical data you have?
What do those two examples have in common? We identified, at design time, that these dependencies could go wrong, and we devised a way to still fulfill the intent (allowing the customer to purchase) even when not all dependencies are available.
Graceful Degradation
As we have seen in the previous examples, graceful degradation consists of providing an experience that is still acceptable, to both the business and the end user, when issues take place.
It is degraded, as it does not behave or offer the same capabilities as it would under normal circumstances, and it is graceful, as it has been planned so that the user is not disrupted by the failure and may not even notice that something went wrong.
But how do we incorporate this behavior in our development lifecycle? The easiest way would be to add it as part of your design and iterate over it every time a new dependency is captured.
Design: after the project’s kick-off and initial spikes, you are bound to have an understanding of the feature.
Identify dependencies: at this stage, you are likely focused on developing the happy path, but that should not prevent you from highlighting each dependency to your Product/Business team as a point of failure.
Determine criticality: discuss with the Product/Business team whether a given dependency is critical or not. In this discussion, you play a key role in sharing the technical limitations and risks of those dependencies.
Product/Business should rate the criticality to help in the next step of this discussion.
Determine the fallback: assuming an item is critical, the discussion moves on to the potential ways to continue operating despite the lack of that dependency.
Let’s discuss the options in the next section.
What is important at this stage is that the discussion can trigger a reevaluation of the feature and even of the design.
Add the fallback: whatever has been determined as the graceful degradation outcome, we incorporate it into the code, including the tests that exercise the failures.
Repeat: being true to the incremental approach, it is recommended to go over the process at each iteration, as new dependencies may be highlighted, or the knowledge of the actual feature may evolve enough to require further changes in how we handle the failures.
Determining the Actions
Now that we know our dependencies and have determined how critical each one is, how do we handle their failures?
Just Fail
There are cases where there is really nothing else to do but fail. But it should be the last resort.
Many situations may look like this at first, so do not give up just yet!
Skip Failed Dependency
While sometimes dependencies are critical, other times it is okay if you skip a dependency under exceptional circumstances.
Imagine you have a service that needs to calculate a fraud score before requesting payment information. This prevents you from paying the transaction cost to authorize a payment that shouldn’t even be processed.
If the fraud provider is down or slow, the business can decide to continue anyway and assume the risk and cost associated with this purchase. They may even define additional rules, such as: if the total amount is < X, it is acceptable to continue despite the failure.
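As a rough sketch of the skip rule, assuming a fraud client that raises an exception when the provider is down or too slow: `FraudServiceUnavailable`, `SKIP_FRAUD_CHECK_BELOW`, `max_score`, and the `Order` shape are all hypothetical names chosen for this illustration, and the threshold stands in for the "X" the business would define.

```python
from dataclasses import dataclass

class FraudServiceUnavailable(Exception):
    """Raised by the (hypothetical) fraud client on timeout or outage."""

# Hypothetical business rule: below this total, a missing fraud score is tolerated.
SKIP_FRAUD_CHECK_BELOW = 50.00

@dataclass
class Order:
    order_id: str
    total: float

def may_request_payment(order, get_fraud_score, max_score=0.8):
    """Return True when the order may move on to the payment step."""
    try:
        score = get_fraud_score(order)
    except FraudServiceUnavailable:
        # Degraded path: the business accepts the risk for small orders only.
        return order.total < SKIP_FRAUD_CHECK_BELOW
    # Normal path: decide based on the actual score.
    return score <= max_score

def fraud_service_down(order):
    raise FraudServiceUnavailable("timeout")

print(may_request_payment(Order("o-1", 20.00), fraud_service_down))   # small order
print(may_request_payment(Order("o-2", 900.00), fraud_service_down))  # large order
```

Keeping the threshold as a named value makes it easy for the business to revisit the rule without touching the flow itself.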
Postpone the Dependency
Skipping applies when you can live with the failure and move on. Postponing applies when you can continue for now, but must retry that dependency before the normal process is allowed to finish.
Using the same example, imagine that you skipped the fraud verification and proceeded to authorize the payment. You can mark this order as still pending the fraud check to prevent any further processing and then, as an asynchronous task, retry the fraud check.
If the order is considered fraudulent, you cancel it prior to fulfillment. If not, you remove the blocker, allowing it to continue to be processed.
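One possible shape for this flow, as a minimal in-memory sketch: the order carries a fraud status that blocks fulfillment while it is pending, and a background job re-runs the check. The names (`FraudStatus`, `check_fraud`, `retry_pending`) are assumptions for the example; a real system would persist the status and schedule the retry with whatever job mechanism it already has.

```python
from dataclasses import dataclass
from enum import Enum

class FraudServiceUnavailable(Exception):
    """Raised by the (hypothetical) fraud client on timeout or outage."""

class FraudStatus(Enum):
    CLEARED = "cleared"
    PENDING = "pending"
    REJECTED = "rejected"

@dataclass
class Order:
    order_id: str
    fraud_status: FraudStatus = FraudStatus.PENDING
    cancelled: bool = False

    def can_be_fulfilled(self):
        # A pending fraud check acts as a blocker for further processing.
        return self.fraud_status is FraudStatus.CLEARED and not self.cancelled

def check_fraud(order, is_fraudulent):
    """Run the fraud check; leave the order pending if the provider is down."""
    try:
        fraudulent = is_fraudulent(order)
    except FraudServiceUnavailable:
        order.fraud_status = FraudStatus.PENDING
        return
    if fraudulent:
        order.fraud_status = FraudStatus.REJECTED
        order.cancelled = True  # cancel before fulfillment
    else:
        order.fraud_status = FraudStatus.CLEARED  # remove the blocker

def retry_pending(orders, is_fraudulent):
    """Body of the asynchronous task: re-check every order still pending."""
    for order in orders:
        if order.fraud_status is FraudStatus.PENDING:
            check_fraud(order, is_fraudulent)
```

The checkout path calls `check_fraud` once and moves on; only fulfillment consults `can_be_fulfilled`, so the customer is not held up by the outage.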
Use a Good Enough Answer
In this case, the dependency would return a piece of information needed at that point in time. You can’t skip it but it is possible to replace the real value with an alternative that is “good enough”.
Still consistent with our online example, during checkout you need to show the customer the delivery promise. Under normal conditions, this would mean calling a dependency that calculates it based on the current volume of orders being processed.
When that dependency is not available, you can use the historical average instead, which is refreshed once a day.
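A small sketch of the "good enough" substitution, assuming the live estimator raises an exception when it cannot answer. `DeliveryServiceUnavailable`, `HISTORICAL_AVERAGE_DAYS`, and the `estimated` flag are hypothetical; the flag is there so downstream code (or the UI) can tell a live value from a fallback one.

```python
from dataclasses import dataclass

class DeliveryServiceUnavailable(Exception):
    """Raised by the (hypothetical) delivery estimator on timeout or outage."""

# Hypothetical value, refreshed once a day from historical data.
HISTORICAL_AVERAGE_DAYS = 5

@dataclass
class DeliveryPromise:
    days: int
    estimated: bool  # True when the value came from the fallback

def delivery_promise(order_id, live_estimate):
    """Return the live delivery promise, or the historical average if unavailable."""
    try:
        return DeliveryPromise(days=live_estimate(order_id), estimated=False)
    except DeliveryServiceUnavailable:
        return DeliveryPromise(days=HISTORICAL_AVERAGE_DAYS, estimated=True)
```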
Now that we have a framework and template for the solution, how does the implementation go?
Circuit Breakers to the Rescue
A circuit breaker is a well-known pattern for when you want to avoid sending requests to an overwhelmed service.
It is a perfect fit because your graceful degradation will kick in for the exact same reasons: a dependency that is slow, failing, or unavailable.
With that in mind, either implemented by yourself or via some sort of library, you should handle an open circuit with the fallback mechanism you chose.
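For illustration only, here is a deliberately minimal hand-rolled breaker that takes the fallback as a parameter. It is a sketch, not a production implementation: it counts consecutive failures, treats any exception as a failure, is not thread-safe, and the parameter names (`failure_threshold`, `reset_timeout`, `clock`) are assumptions for the example. A library would typically give you more (metrics, per-exception policies, proper half-open handling).

```python
import time

class CircuitBreaker:
    """Minimal breaker: opens after consecutive failures, retries after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the cooldown can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def is_open(self):
        if self.opened_at is None:
            return False
        # Once the cooldown has elapsed, calls are let through again as a trial.
        return self.clock() - self.opened_at < self.reset_timeout

    def call(self, operation, fallback):
        if self.is_open():
            # Open circuit: do not touch the struggling dependency at all.
            return fallback()
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The `fallback` callable is where the chosen action plugs in: return a good enough value, flag the work as postponed, skip the step, or raise if failing is the only option.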
This is the End
We have established that failures are part of the reality of your application, and as the number of dependencies grows, it becomes harder to maintain a high reliability score.
While a high number of dependencies is a sign to reassess your architecture, you should look at those dependencies and assess what you should do in case some of them fail.
There is no one-size-fits-all answer here, but if you follow the discovery->discussion->solution cycle, you will be able to define what makes sense in your context.
Don’t forget that this is largely a business-driven decision, supported by the technical team. You will provide feedback and help with the risk assessment and solution design.
When you have the answers, the implementation may leverage circuit breakers with a fallback strategy, depending on what you should do: fail, skip, postpone, or provide a good enough response.
If you follow this approach you will enhance your solution with meaningful support artifacts, instead of trying to solve all issues from a purely technical standpoint.
In the next articles, let’s focus on increasing each service’s individual availability and discuss scalability aspects.