On Tuesday, June 2nd at 9:42am EDT, PactSafe’s Redis cluster became unresponsive, resulting in a sustained outage of the Activity API and REST API. Upon investigation, the team found that the primary Redis node had hit a memory limit, which prevented it from communicating with the rest of the cluster. Each time the cluster attempted to recover with a new primary, the new primary ran into the same issue. This kept the cluster in an unhealthy state in which it was “up and reachable” but no primary could be communicated with.
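As an illustration of how this failure mode could be caught earlier, here is a minimal sketch that reads the output of the Redis `INFO memory` command and reports how close a node is to its limit. It assumes the limit in question is Redis's `maxmemory` setting, which the report does not confirm. The 80% threshold and the sample values are hypothetical, and this is not a description of the monitoring PactSafe actually runs.

```python
from typing import Dict, Optional

# Hypothetical alert threshold: warn once a node uses 80% of its limit.
ALERT_THRESHOLD = 0.8


def parse_info_memory(info_text: str) -> Dict[str, str]:
    """Parse the `key:value` lines returned by `INFO memory` into a dict."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_usage_ratio(info_text: str) -> Optional[float]:
    """Return used_memory / maxmemory, or None when no limit is configured."""
    fields = parse_info_memory(info_text)
    used = int(fields["used_memory"])
    limit = int(fields.get("maxmemory", "0"))
    if limit == 0:
        # In Redis, maxmemory 0 means no explicit limit is set.
        return None
    return used / limit


def should_alert(info_text: str, threshold: float = ALERT_THRESHOLD) -> bool:
    """True when the node is at or above the given share of its limit."""
    ratio = memory_usage_ratio(info_text)
    return ratio is not None and ratio >= threshold


# Made-up sample in the format Redis uses for INFO output.
SAMPLE_INFO = "# Memory\r\nused_memory:900\r\nmaxmemory:1000\r\n"

if __name__ == "__main__":
    print(memory_usage_ratio(SAMPLE_INFO), should_alert(SAMPLE_INFO))
```

In practice the text would come from a Redis client rather than a constant. The parsing is kept separate from the threshold check so that each part can be tested without a live cluster.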
Once the team identified and confirmed the root cause of the issue, they were able to scale up a new, larger primary and secondary node and reset the cluster state. As soon as the new primary finished syncing all of the existing cluster’s data, the Activity API and REST API recovered fully.
While the Activity API and REST API were unresponsive, users would have seen API requests hang and eventually time out. As a result, they could not submit an acceptance and receive a valid response, which likely blocked clickthrough workflows. If a clickwrap acceptance was impacted during this window, that signer will be prompted to accept again upon their next visit. Beyond the unresponsiveness of the API, no existing data was lost and there was no ongoing impact to the system.
The incident began at 9:42am EDT when the Redis cluster became unresponsive, and the system (including both the Activity API and REST API) had fully recovered by 11:22am EDT.
Total incident duration: 1 hour, 40 minutes
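The duration follows directly from the two timestamps above. As a quick check, here is a small sketch of the arithmetic; the function name is hypothetical, and both times are treated as offsets from midnight on the same day in EDT.

```python
from datetime import timedelta


def incident_duration(start_hm, end_hm):
    """Return (hours, minutes) between two same-day (hour, minute) times."""
    start = timedelta(hours=start_hm[0], minutes=start_hm[1])
    end = timedelta(hours=end_hm[0], minutes=end_hm[1])
    total_seconds = int((end - start).total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    return hours, remainder // 60


if __name__ == "__main__":
    # 9:42am EDT to 11:22am EDT, as stated in the timeline.
    hours, minutes = incident_duration((9, 42), (11, 22))
    print(f"{hours} hour, {minutes} minutes")  # prints "1 hour, 40 minutes"
```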
During the investigation and post-incident discussions, the team came away with a few insights and lessons that it will address going forward.
The team has already taken steps to prevent this issue in the future, but there is more work to be done.