On the morning of January 28th, we encountered an issue with the Redis cluster that our Activity API uses for caching. The API is built to detect Redis outages and health issues and to continue operating normally when they occur. Unfortunately, this incident exposed a failover scenario that we hadn't accounted for: Redis itself was healthy and available, but an improper configuration setting prevented the API from writing to it. Once we identified the root cause, we deployed a configuration change and restored full functionality to the Activity API.
During the incident, some incoming “send” events to the Activity API received an error response, and the corresponding activity records were not saved to our database.
Fortunately, through our backup access log, we were able to recover incoming requests that encountered an error. We will be able to replay each of those failed requests and create the missing activity record retroactively.
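As a rough illustration of what such a replay could look like, here is a minimal sketch. It assumes a hypothetical log format (one JSON object per line with `request_id`, `status`, and `body` fields) and a hypothetical `send` callable that resubmits a request body and returns whether it succeeded. None of these names come from our actual tooling.

```python
import json
from typing import Callable, Dict, Iterable


def replay_failed_requests(
    log_lines: Iterable[str],
    send: Callable[[dict], bool],
) -> Dict[str, int]:
    """Resubmit logged requests that originally received a 5xx response.

    Assumed (hypothetical) log format: one JSON object per line with the
    keys "request_id", "status", and "body".
    """
    counts = {"replayed": 0, "still_failing": 0, "skipped": 0}
    seen_ids = set()

    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines

        entry = json.loads(line)

        # Requests that succeeded the first time must not be replayed.
        if entry["status"] < 500:
            counts["skipped"] += 1
            continue

        # Replay each failed request at most once, even if it was logged twice.
        request_id = entry["request_id"]
        if request_id in seen_ids:
            counts["skipped"] += 1
            continue
        seen_ids.add(request_id)

        if send(entry["body"]):
            counts["replayed"] += 1
        else:
            counts["still_failing"] += 1

    return counts
```

Deduplicating on the request ID keeps the replay from creating the same activity record twice if a client retried a failed request.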
The incident caused intermittent errors from roughly 11:58 PM EST on January 27th to 7:36 AM EST on January 28th.
Immediately after restoring normal operation to the API, we took measures to protect against this and similar issues going forward. We resolved the underlying configuration issue and conducted a comprehensive review of the configuration as a whole. We have also hardened the failover logic within our Activity API so that it handles this type of scenario without interrupting operation.
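To make the idea concrete, here is a minimal sketch of failover logic that tolerates a cache which is reachable but rejects writes. It is an illustration under stated assumptions, not our production code: `FailoverCache`, `save_activity`, and the list-backed `db` are hypothetical names, and the wrapped cache client is assumed only to expose `get` and `set` methods that raise on failure.

```python
from typing import Any, List, Optional


class FailoverCache:
    """Wraps a cache client so that no cache failure reaches the caller.

    The wrapped client is assumed (hypothetically) to expose get(key) and
    set(key, value) and to raise an exception on failure.
    """

    def __init__(self, client: Any) -> None:
        self._client = client
        self.degraded = False  # set once any cache operation has failed

    def get(self, key: str) -> Optional[Any]:
        try:
            return self._client.get(key)
        except Exception:
            # Treat any read failure as a cache miss.
            self.degraded = True
            return None

    def set(self, key: str, value: Any) -> bool:
        try:
            self._client.set(key, value)
            return True
        except Exception:
            # A rejected write is handled like an outage, even if the
            # cache still answers health checks and reads.
            self.degraded = True
            return False


def save_activity(record: dict, db: List[dict], cache: FailoverCache) -> dict:
    """Persist the record first; caching is best-effort."""
    db.append(record)                # stand-in for the real database write
    cache.set(record["id"], record)  # a failure here must not fail the request
    return record
```

The point of the sketch is that the database write does not depend on the cache write succeeding, so a cache that accepts connections but refuses writes degrades caching only, rather than causing an error response.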