Experiencing an issue with our Activity API and signing requests in PactSafe
Incident Report for PactSafe
Postmortem

What happened?

On Tuesday, June 2nd at 9:42am EDT, PactSafe’s Redis cluster became unresponsive, resulting in a sustained outage of the Activity API and REST API. Upon investigation, the team discovered that the primary Redis node had hit its memory limit, which prevented it from communicating with the rest of the cluster. As the cluster attempted to recover by promoting new primaries, each new primary hit the same limit. This left the cluster in an unhealthy state in which it was “up and reachable,” but no primary could accept commands.
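As a hedged illustration of the failure mode above: a Redis node at its configured memory limit can still answer a connectivity check (PING) while rejecting writes, so a memory-aware check is needed to catch it. The sketch below evaluates a parsed Redis INFO snapshot (the `used_memory` and `maxmemory` field names are real INFO fields; the 90% threshold is a hypothetical choice, not PactSafe's actual configuration):

```python
def primary_memory_healthy(info: dict, threshold: float = 0.9) -> bool:
    """Return False when used_memory is within `threshold` of maxmemory.

    `info` is a parsed Redis INFO snapshot (e.g. the dict returned by
    redis-py's client.info()). In Redis, maxmemory of 0 means "no limit".
    """
    maxmemory = info.get("maxmemory", 0)
    if maxmemory == 0:  # no memory limit configured
        return True
    return info.get("used_memory", 0) < threshold * maxmemory

# A node at its limit may still answer PING, but this check flags it:
print(primary_memory_healthy({"used_memory": 950, "maxmemory": 1000}))
```

A check like this, run against each primary, would surface the "reachable but unwritable" state described above before failover logic trusts the node.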

Once the team identified and confirmed the root cause of the issue, they were able to scale up a new, larger primary and secondary node and reset the cluster state. As soon as the new primary finished syncing all of the existing cluster’s data, the Activity API and REST API recovered fully.

Potential Impact

During the window in which the Activity API and REST API were unresponsive, users would have seen API requests hang and eventually time out. This prevented acceptances from being submitted and receiving a valid response, likely blocking clickthrough workflows. If a clickwrap acceptance was impacted during this window, that signer will be prompted to accept again on their next visit. Beyond the unresponsiveness of the APIs, no existing data was lost and there was no ongoing impact to the system.

Incident Duration

The incident began at 9:42am EDT when the Redis cluster became unresponsive, and the system (including both the Activity API and REST API) had fully recovered by 11:22am EDT.

Total incident duration: 1 hour, 40 minutes

Lessons Learned

Through the investigation and post-incident discussions, the team came away with a few insights and lessons that we will address going forward.

  • Our Redis failover solutions in the Activity API and REST API do not account for every type of unhealthy state. The health checks they rely on in order to fail over need to be broadened.
  • We need to improve the isolation between services and dependencies to prevent this type of widespread outage of multiple parts of the system.
  • We need to ensure that all infrastructure components are instrumented with monitoring and alerts on key resource utilization.
  • Our current Redis implementation is a key area that we need to upgrade to support auto-scaling as data growth occurs.
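To illustrate the first lesson above: a failover health check that verifies only connectivity misses the "up and reachable, but no usable primary" state this incident produced. A minimal sketch of a broadened check, assuming fields from real Redis INFO output (`role`, `loading`, `maxmemory`, `used_memory`); the specific combination of conditions is hypothetical, not PactSafe's actual failover logic:

```python
def node_usable_as_primary(info: dict) -> bool:
    """Broadened health check: being reachable is not enough.

    Field names come from Redis INFO. A node is treated as a usable
    primary only if it reports the master role, has finished loading
    its dataset, and is not at its memory limit (where writes would
    fail with OOM errors even though the node still answers PING).
    """
    if info.get("role") != "master":
        return False
    if info.get("loading", 0) == 1:  # still loading the dataset from disk
        return False
    maxmemory = info.get("maxmemory", 0)
    if maxmemory and info.get("used_memory", 0) >= maxmemory:
        return False  # at the memory limit: reachable but unwritable
    return True
```

A failover routine that gates promotion on a check like this would have refused each newly promoted primary in this incident, rather than reporting the cluster healthy.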

How this will be prevented going forward

The team has already taken steps to prevent this issue in the future, but there is more work to be done.

  • Instrument every production server (including Redis) with advanced infrastructure monitoring.
  • Create alerts on all key infrastructure metrics that have the potential to cause a disruption of service.
  • Replace the current, legacy Redis implementation with a managed service that supports auto-scaling and improved health monitoring.
  • Separate the existing Redis-reliant services (caching, pub-sub, sessions, etc.) into independent clusters that can be isolated from one another.
  • Enhance and test the failover solutions within the Activity API and REST API to encompass additional types of unhealthy cluster states and errors.
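One way to picture the isolation item above: rather than one shared cluster, each Redis-reliant workload gets its own endpoint, so memory pressure in one (e.g. caching) cannot take down sessions or pub-sub. A minimal configuration sketch; the hostnames and the three workload names are hypothetical placeholders, not PactSafe's actual topology:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RedisEndpoint:
    host: str
    port: int = 6379  # default Redis port

# Hypothetical mapping of each workload to an independent cluster,
# so a failure in one workload's cluster is isolated from the others.
WORKLOAD_CLUSTERS = {
    "cache":    RedisEndpoint("cache.redis.internal"),
    "sessions": RedisEndpoint("sessions.redis.internal"),
    "pubsub":   RedisEndpoint("pubsub.redis.internal"),
}
```

Application code would then construct a separate client per workload from this mapping instead of sharing one connection to a single cluster.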
Posted Jun 16, 2020 - 13:17 EDT

Resolved
We have been monitoring and are confident that this issue is resolved. Within the next 48 hours, we will provide a post-mortem outlining what happened in more detail.
Posted Jun 02, 2020 - 11:22 EDT
Update
We are fully operational but are still monitoring for ongoing issues. Our engineering and architecture team is working to shore up any remaining issues. At this time, we do not believe there was any data loss when processing acceptances during the outage. We'll be writing up a full post-mortem summarizing what the issue was, why it happened, and how we'll be improving our system architecture going forward.
Posted Jun 02, 2020 - 11:21 EDT
Monitoring
Our application, REST APIs, and Activity APIs are starting to come back online and we are monitoring their performance.
Posted Jun 02, 2020 - 11:07 EDT
Update
We are continuing to investigate and isolate the nature of the issue.
Posted Jun 02, 2020 - 10:50 EDT
Update
We have restarted a number of servers and are seeing a significant decrease in errors in our Activity API but are still seeing issues in our REST API.
Posted Jun 02, 2020 - 10:20 EDT
Update
We are continuing to see timeouts on both our Activity and REST APIs and are still investigating the issue.
Posted Jun 02, 2020 - 10:07 EDT
Update
We are now seeing an increased volume of errors in our REST API as well.
Posted Jun 02, 2020 - 09:53 EDT
Investigating
We are seeing an increased number of errors (504 Gateway Timeout) on our Activity API, which is causing an outage in capturing acceptance of contracts. We are investigating with the highest priority.
Posted Jun 02, 2020 - 09:49 EDT
This incident affected: REST API (Service Availability), Activity API (Service Availability, Acceptance Tracking via Send), Standardized Contracts (Capturing Acceptance), Personalized Contracts (Capturing Acceptance), Embedded Clickwrap Groups (Capturing Acceptance), and Application User Interface (Service Availability, Authentication, Editor (HTML), Editor (PDF)).