Over the past year, I have been working on a business-to-business automated integration that, under ideal circumstances, should broker and communicate messages with as few manual touchpoints as we can reasonably achieve. When the system operates as expected, it works great: little manual intervention is needed, and updates flow between our integration and other businesses. However, this project includes many internal and external integrations that expose the automation to many points of failure, from which we try to recover as much as possible. Here, we’ll explore strategies to strengthen this automation’s API clients by implementing retry and circuit breaker patterns.
The Importance of a Retry Mechanism
Our automated process often relies on HTTP calls to both external and internal integrations to manage inbound and outbound messages. Over the past year, we’ve experienced the frustration of a single failed request disrupting otherwise smooth operation and requiring manual intervention to resolve a simple issue. These failures weren’t due to bad data but rather to transient network issues that were entirely recoverable from our end. Our applications need to be smart enough to handle these transient errors gracefully. Implementing a retry mechanism allows your API client to automatically retry failed requests, increasing the chances of success without manual intervention. When we implemented a retry policy, we considered the following:
- Retry Count: Determine how many times the operation should be retried before moving on to other messages.
- Error Handling: Decide which exceptions or error codes should trigger a retry. For instance, should you retry all 500-level server errors? What about request timeouts?
- Delay Strategy: Establish how long the system should wait before each retry.
- Exponential Backoff: Instead of a fixed delay, use exponential backoff to gradually increase the wait time between retries, giving a struggling service more time to recover and a greater chance that the request succeeds.
- Jitter: Add randomness to the delay between retries to prevent synchronized retries from multiple clients from overwhelming systems that may already be degraded.
While these guidelines are valuable for implementing retries, don’t reinvent the wheel. Chances are, the programming languages you’re using have well-established packages for this purpose. You don’t want to find yourself implementing your own decorrelated exponential backoff jitter algorithm — yikes!
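That said, it helps to see what those packages are doing under the hood. Here’s a minimal sketch of a retry helper with exponential backoff and full jitter, using only Python’s standard library; the retry count, delays, and set of status codes treated as transient are illustrative assumptions rather than our production configuration.

```python
import random
import time
import urllib.error
import urllib.request

MAX_RETRIES = 4                               # illustrative retry count
BASE_DELAY_SECONDS = 0.5                      # delay window before the first retry
MAX_DELAY_SECONDS = 8.0                       # cap on the backoff window
RETRYABLE_STATUS = {429, 500, 502, 503, 504}  # assumed set of transient status codes


def get_with_retries(url: str) -> bytes:
    """Fetch a URL, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Only retry status codes we consider transient; re-raise everything else.
            if err.code not in RETRYABLE_STATUS or attempt == MAX_RETRIES:
                raise
        except (urllib.error.URLError, TimeoutError):
            # Network-level failures and timeouts are treated as transient.
            if attempt == MAX_RETRIES:
                raise
        # Exponential backoff with full jitter: sleep a random amount of time
        # between zero and the (capped) exponentially growing window.
        window = min(MAX_DELAY_SECONDS, BASE_DELAY_SECONDS * (2 ** attempt))
        time.sleep(random.uniform(0, window))
    raise RuntimeError("unreachable: the loop either returns or re-raises")
```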
Introducing the Circuit Breaker Pattern
While retries effectively handle short-term transient issues, repeatedly attempting operations on a consistently failing service can exacerbate an already degraded system. This is where the circuit breaker pattern comes into play. A circuit breaker typically operates in three states. The first is closed, where requests proceed normally. The second is open, where, after the breaker determines the service has degraded, requests fail immediately for a set time without ever being sent. The final state is half-open, a testing phase in which a limited number of requests are allowed through to check whether the service has recovered.
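To illustrate those three states, here’s a tiny circuit breaker sketch in Python. The thresholds, timeout, and class shape are assumptions made for the example; a mature resilience library provides the same behavior with far more care.

```python
import time


class CircuitBreaker:
    """Minimal illustrative circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before testing
        self.failure_count = 0
        self.opened_at = 0.0
        self.state = "closed"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # While open, fail fast without sending the request, unless the
            # reset timeout has elapsed; then let one trial request through.
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit is open; request not sent")
            self.state = "half-open"
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # A success while half-open (or closed) closes the circuit again.
        self.state = "closed"
        self.failure_count = 0
        return result

    def _record_failure(self):
        self.failure_count += 1
        if self.state == "half-open" or self.failure_count >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
```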
In our case, we also used the circuit breaker to help us recover from longer periods of service degradation by pairing it with our message broker. When the circuit was open, we paused processing new messages so that none were lost. Once the circuit breaker transitioned to half-open and requests started to succeed again, we resumed processing messages.
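The sketch below shows roughly how that pairing works: the consumer requeues the in-flight message and backs off whenever the breaker refuses or fails a call, and normal consumption resumes once a trial request succeeds in the half-open state. The `receive`, `requeue`, and `ack` methods are hypothetical stand-ins for whatever client your message broker provides.

```python
import time


def consume_messages(queue, breaker, send_downstream):
    """Consume messages, pausing work while the circuit breaker reports trouble."""
    while True:
        message = queue.receive()      # hypothetical broker call
        if message is None:
            time.sleep(1)              # nothing to process right now
            continue
        try:
            breaker.call(send_downstream, message)
        except Exception:
            # Either the circuit is open and the request was never sent, or the
            # request itself failed. Requeue the message so it isn't lost, then
            # back off before checking again.
            queue.requeue(message)     # hypothetical broker call
            time.sleep(5)
            continue
        queue.ack(message)             # hypothetical broker call
```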
A Warning About Circuit Breaker and Error Codes
When implementing our circuit breaker, we ran into a scenario where the state for particular messages was corrupted in downstream services. Retrying the requests would never resolve the issue; only manual intervention would. What HTTP response code were those services returning when we fetched or patched this data? You guessed it: 400 Bad Request.
Since we’d configured the system to retry everything and to stop processing new messages until it no longer received bad requests, we inadvertently took down part of our automated processing. Luckily, we caught this issue in our non-prod environment. The experience taught us that it’s essential to understand every scenario in which other services might respond and to collaborate with the teams that own those services on a standard set of responses.
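One way to avoid that trap is to be explicit, in one place, about which responses are worth retrying and which ones should count toward opening the circuit. The split below is only an illustrative sketch; the exact codes you treat as transient should come out of that conversation with the owning teams.

```python
# Status codes we assume are transient and therefore safe to retry.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def should_retry(status_code: int) -> bool:
    # Other 4xx codes mean the request itself is the problem; retrying a
    # 400 Bad Request forever will never repair corrupted downstream state.
    return status_code in RETRYABLE_STATUS


def should_trip_breaker(status_code: int) -> bool:
    # Only server-side failures count toward opening the circuit, so a handful
    # of bad messages cannot halt processing for everything else.
    return status_code >= 500
```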
Dead Letter Queue
In light of our experience with the circuit breaker and error codes, implementing a Dead Letter Queue (DLQ) offers a solution. A dead letter queue is a holding area for messages that fail processing after multiple retry attempts. Using this queue, we create a safety net that prevents individual message failures from cascading into system-wide issues.
These messages are held in a side queue, which allows them to be reprocessed after a longer period or resolved through manual intervention. While manual intervention is not ideal for an automated system, we sometimes encounter scenarios we didn’t or couldn’t anticipate while developing the automation. In a system where losing data is unacceptable, the DLQ prevents data loss while you investigate an automated way to recover from the newly discovered issue.
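Here’s a rough sketch of how a message can end up in the DLQ: once the retry budget is exhausted, or a non-retryable error such as a 400 appears, the message is parked on a side queue instead of blocking the main flow. The queue objects, their `publish` and `ack` methods, and the `NonRetryableError` type are hypothetical names for the example.

```python
class NonRetryableError(Exception):
    """Raised by the handler when retrying will never succeed (e.g. 400 Bad Request)."""


def process_with_dlq(message, main_queue, dead_letter_queue, handler, max_attempts=3):
    """Try to handle a message; park it on the dead letter queue if we give up."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            main_queue.ack(message)            # hypothetical broker call
            return
        except NonRetryableError:
            break                              # retrying will not help; park it now
        except Exception:
            if attempt == max_attempts:
                break                          # retry budget exhausted
    # Park the message for later reprocessing or manual review instead of
    # blocking the rest of the pipeline or dropping data.
    dead_letter_queue.publish(message)         # hypothetical broker call
    main_queue.ack(message)
```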
Alerting and Logging
Automated alerting and logging have been crucial for gaining insight into what is happening in production. We have developed alerts and added them to our automated system to ensure it’s operating as expected. Additionally, any unexpected or exceptional behavior is logged for inspection and resolution on our path to building a robust automated integration service.
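The events worth alerting on fall straight out of the patterns above: a circuit opening, a message landing in the DLQ, a retry budget being exhausted. Here is a minimal sketch with Python’s standard logging module; the logger name and messages are illustrative, not our actual log lines.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("b2b-integration")  # illustrative logger name

# Emitted where the breaker flips to open; an alert rule can key off this event.
logger.warning("circuit opened for downstream service %s", "partner-api")

# Emitted when a message is parked; these are the cases that may need a human.
logger.error("message %s moved to dead letter queue after %d attempts", "msg-123", 3)
```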