There is a question to which we, as developers, typically turn a blind eye.
When was the last time your product had downtime? How did it affect your customers?
And why wouldn’t we?
We follow all the Agile processes, we go through multiple debates during the code review phase, we have automated and unit tests, we have QA teams that catch nasty bugs early, we write types on both Frontend and Backend, we use state-of-the-art tools, we do everything by the book. We trust that our code represents our best collective work. We expect it to simply work!
Until it doesn’t.
API resiliency refers to the idea that we build APIs which are able to recover from failure. Failures can be caused by either our own or third-party service problems, server outages, DDoS attacks, network issues, and so much more. Frankly, there are innumerable reasons for which these failures occur. What’s more important is how you recover from them and ensuring they do no lasting damage. Let’s talk about an example.
Case: An artist wants to upload their song to our App
We have created an App that has a Client and an API Server.
In our case, let’s say that our user is an artist who is trying to submit their song to our platform, so we can collect royalties on their behalf.
We wrote a POST HTTP method that creates the song in the Server’s Database. This endpoint has many ways to respond back. It could return 200 for success, 401 for unauthorized access, 500 for server errors, or, even worse, it might never answer back.
In cases of failure, clients either handle the error and give our artist some information about what went wrong or, if we don’t want to interrupt the artist’s journey, schedule the request to be retried. But beware: we must be careful in this case not to overload the system with requests, which is why it is recommended to implement exponential backoff.
Exponential backoff is a standard error-handling strategy for network applications. In this approach, a client periodically retries a failed request with increasing delays. Clients should use exponential backoff for all requests that return HTTP 5xx and 429 response codes, as well as for disconnections from the server. Eventually, the client should reach either a limit of maximum retries or time and stop attempting to communicate with the server. The great thing about exponential backoff is that it ensures that, when the Server is amidst an incident, it is not flooded with requests.
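As a rough sketch, exponential backoff with jitter might look like the following. Here `send_request` is a caller-supplied function standing in for a real HTTP call; the names and defaults are illustrative, not a fixed API:

```python
import random
import time

def request_with_backoff(send_request, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry a request with exponential backoff and jitter.

    `send_request` returns an HTTP status code (substitute your own
    HTTP client call). Retries on 5xx and 429 responses and gives up
    after `max_retries` attempts.
    """
    for attempt in range(max_retries):
        status = send_request()
        if status < 500 and status != 429:
            return status  # success, or a non-retryable client error
        # The delay doubles each attempt, capped at max_delay, with a
        # small random jitter so many clients don't retry in lockstep.
        delay = min(base_delay * (2 ** attempt), max_delay)
        time.sleep(delay + random.uniform(0, 0.1))
    return status  # out of retries: report the last status we saw
```

The jitter matters in practice: without it, all clients that failed at the same moment would also retry at the same moments, recreating the flood the backoff was meant to prevent.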
But, how do we know if the first request actually failed?
As we said before, there can be many reasons why an API could not respond back to the client. So, how are we sure that the server’s database hasn’t already saved the song? If one of our retries succeeds, we could end up having submitted the same song twice. This is a big problem, because what we actually wanted to do is to make sure our whole operation is idempotent.
Idempotence means that, if an identical request has been made once or several times in a row, it results in the same effect while leaving the server in the same state.
A great everyday example of idempotence is a dual-button ON/OFF setup. Pressing ON once or multiple times has only one outcome: the system is on. The same goes for the OFF button.
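The button example can be captured in a few lines of Python. The function names here are purely illustrative:

```python
def press_on(state: dict) -> dict:
    """Idempotent operation: pressing ON any number of times yields
    the same state as pressing it once."""
    return {**state, "power": "on"}

system = {"power": "off"}
once = press_on(system)
many = press_on(press_on(press_on(system)))
assert once == many == {"power": "on"}
```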
When talking about idempotence in the context of HTTP, another term that pops up is data safety. In that case, safety means that the request doesn’t mutate data on invocation. The table below shows commonly used HTTP methods, their safety and idempotence.
| HTTP Method | Safety | Idempotency |
| --- | --- | --- |
| GET | Yes | Yes |
| PUT | No | Yes |
| POST | No | No |
| DELETE | No | Yes |
| PATCH | No | No |
So as we can see, our POST method fails at both. Great.
Let’s go back to our whole operation. What do we have?
We have a client that tells the Server that it needs to save the song. A possible JSON body for that request could look like this:

```json
{
  "songTitle": "Comfortably Numb",
  "songArtist": "Pink Floyd"
}
```
We could ask the client to perform this request as an idempotent request, by providing an additional Idempotency-Key: <key> header to the request.
An idempotency key is a unique value generated by the Client, which the server uses to recognize subsequent attempts of the same request. How you create unique keys is up to you, but it’s suggested to use V4 UUIDs, or another random string with enough entropy to avoid collisions.
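On the client side, generating such a key and attaching it to the request could look like this. The function name and request shape are illustrative; the important parts are the V4 UUID and reusing the same key on every retry of the same operation:

```python
import uuid

def build_submit_request(song_title: str, song_artist: str) -> dict:
    """Build a song-submission request carrying an Idempotency-Key header.

    The key is generated once per logical operation; if the request is
    retried, the client must resend this same key, not a fresh one.
    """
    return {
        "headers": {
            "Content-Type": "application/json",
            "Idempotency-Key": str(uuid.uuid4()),  # V4 UUID: enough entropy to avoid collisions
        },
        "body": {"songTitle": song_title, "songArtist": song_artist},
    }
```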
When the Server receives this Idempotency-Key, it should save to the database the body of the first request made with that key, along with the resulting status code, regardless of whether the request succeeded or failed.
Now, for every request that comes in, the Server can verify whether it has already mutated the data just by checking the status in its DB. The idempotency layer should compare the incoming parameters to those of the original request and return an error unless they’re the same, to prevent accidental misuse.
With this solution, we no longer have to worry about duplication of data or conflicts. It ensures that, no matter how many times we repeat this process, our operation is idempotent, and the artist will receive the success message once the system is able to save their song.
These Idempotency-Keys should be eligible for removal automatically after they’re at least 24 hours old.
P.S. For the simplicity of the example, we saved our songs and the keys to the same database. Ideally, you would save these keys on a cache server (e.g. Redis) with a TTL of 24 hours, so old keys are removed by default.
An application that uses an API which implements idempotence can follow the steps below to ensure proper usage:
- Create an idempotency key and attach it to the request header.
- When a request is unsuccessful, follow a retry policy such as exponential backoff.
- Save the request body and idempotency key in a cache server.
- Mutate the data.
- Update the idempotency key entry with the result of the mutation.
- After a 5xx response, a 429 response, or no response from the server, the client retries the request.
- The server checks whether the idempotency key exists in the cache server.
- The server validates that the body of the request is the same as the one in the cache server.
- If everything matches, the server does not mutate the data again but returns the previously saved result of the mutation.
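The server-side steps above can be sketched as follows. This is a minimal illustration, not a production implementation: a plain dict stands in for the cache server (in production the entries would carry a TTL, e.g. 24 hours in Redis), and the endpoint class and its names are assumptions for the example:

```python
class IdempotentSongEndpoint:
    """Sketch of a POST endpoint guarded by idempotency keys."""

    def __init__(self):
        self.cache = {}   # idempotency key -> (original request body, saved result)
        self.songs = []   # stand-in for the songs database

    def handle_post(self, idempotency_key: str, body: dict) -> tuple:
        cached = self.cache.get(idempotency_key)
        if cached is not None:
            original_body, result = cached
            if original_body != body:
                # Same key, different payload: reject to prevent accidental misuse
                return 422, {"error": "idempotency key reused with a different body"}
            # Replay of the same request: return the stored result, no second mutation
            return result
        # First time we see this key: mutate, then record the outcome
        self.songs.append(body)
        result = (200, {"status": "created", "song": body})
        self.cache[idempotency_key] = (body, result)
        return result
```

Running the same request through `handle_post` twice with the same key mutates the data only once and returns the identical result both times, which is exactly the idempotence guarantee the artist’s retries rely on.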
Elektra Bilali Simou
Engineering Manager, ORFIUM