Intermittent issues with Activity API

Incident Report for Ironclad Clickwrap

Postmortem

What happened?

On the morning of January 28th, we encountered an issue with the Redis cluster that we use for caching within our Activity API. The API is built to detect Redis outages or health issues and continue operating as normal. Unfortunately, the issue we encountered that morning uncovered a failover scenario that we hadn't accounted for: although Redis was healthy and available, the API was unable to make writes due to an improper configuration setting. Once we identified the root cause, we deployed a configuration change and restored full functionality to the Activity API.

Potential Impact

During this time, some incoming “send” events to the Activity API resulted in an error response, preventing the activity record from being saved to our database.

Fortunately, through our backup access log, we were able to recover incoming requests that encountered an error. We will be able to replay each of those failed requests and create the missing activity record retroactively.
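As an illustration only, the replay described above could be sketched as follows. The log format (one JSON object per line with `status` and `payload` fields) and the `create_activity` callback are assumptions made for this example; they are not the actual access log format or Activity API interface.

```python
import json


def failed_requests(log_lines):
    """Yield the payload of every logged request that received a 5xx response.

    Assumes a hypothetical log format: one JSON object per line with
    "status" and "payload" keys.
    """
    for line in log_lines:
        entry = json.loads(line)
        if entry["status"] >= 500:
            yield entry["payload"]


def replay(log_lines, create_activity):
    """Re-submit each failed payload and return how many were replayed.

    create_activity is a hypothetical callable that saves one activity record.
    """
    count = 0
    for payload in failed_requests(log_lines):
        create_activity(payload)
        count += 1
    return count


# Usage: one failed and one successful request in the log.
log = [
    '{"status": 500, "payload": {"id": "a1", "type": "send"}}',
    '{"status": 200, "payload": {"id": "a2", "type": "send"}}',
]
recovered = []
replayed = replay(log, recovered.append)
```

Only the request that originally failed is re-submitted; requests that succeeded the first time are skipped, so they are not recorded twice.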

Outage Duration

The incident resulted in intermittent errors from roughly 11:58 PM EST on January 27th to 7:36 AM EST on January 28th.

What We’re Doing to Prevent This Going Forward

Immediately upon restoring normal operation to the API, we took measures to protect against this issue and similar issues moving forward. The underlying configuration issue was resolved, and we conducted a comprehensive review of the configuration as a whole. Additionally, we’ve hardened the failover logic within our Activity API to handle this type of scenario without an interruption to its operation.
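A minimal sketch of the kind of hardening described above, assuming the goal is that a rejected cache write is handled the same way as a cache outage. The class and function names here are hypothetical stand-ins for illustration, not the Activity API's actual code or a real Redis client.

```python
class CacheWriteError(Exception):
    """Raised by the hypothetical cache client when a write is rejected."""


class WriteRejectingCache:
    """Stand-in for a cache that passes health checks but rejects writes."""

    def ping(self):
        return True

    def set(self, key, value):
        raise CacheWriteError("write rejected by configuration")


def save_activity(event, database, cache):
    """Persist the event first, then treat any cache failure as non-fatal.

    database is a plain list standing in for the real data store.
    """
    database.append(event)
    try:
        cache.set(event["id"], event)
        return "cached"
    except Exception:
        # A healthy-looking cache can still reject writes, so the write
        # itself is guarded rather than relying on a health check alone.
        return "cache_skipped"


# Usage: the cache reports healthy, yet the record is still saved.
db = []
cache = WriteRejectingCache()
result = save_activity({"id": "a1", "type": "send"}, db, cache)
```

The design point is that the health check (`ping`) succeeding says nothing about whether writes will succeed, so the fallback is triggered by the failed operation rather than by the cache's reported status.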

Posted Jan 29, 2019 - 14:18 EST

Resolved

This issue has been resolved. We are investigating and will provide a full postmortem covering what happened and how we will prevent it from happening again.

Posted Jan 28, 2019 - 11:48 EST

Monitoring

Due to increased activity over the weekend, we identified a problem with how our logs were written that interfered with writing new values to our database. We have temporarily resolved the issue and will provide a postmortem with full outage timings and the steps we will take to prevent a recurrence.

Posted Jan 28, 2019 - 08:37 EST

This incident affected: Activity API (Service Availability).