Temporary API Outage
Incident Report for PactSafe
Postmortem

What happened?

On Saturday, November 12, at approximately midnight GMT (7pm EST), the SSL certificate used by our database instances expired, causing an outage across our API stack.

What was affected?

During this brief outage, all systems connected to our database (the Activity API and our REST API) were unavailable until we cycled through our database instances with the updated certificates and brought them back online.

Why did it happen?

This brief outage was mainly the result of human error. When we updated our SSL certificates in MongoDB over a month ago, it was very difficult to run a full end-to-end test of the updated certificates without taking the system offline, and we believed we had executed the migration properly.

Unfortunately, we missed a key item in our migration process, which caused API calls to fail and our system to go down for a brief time.

What are we doing to prevent this from happening again?

We've added multiple new alert thresholds so that we can detect and resolve an outage like this more quickly in the future. Additionally, we are going to track the expiration dates of key items such as SSL certificates and have key staff on call ahead of those dates to ensure proper testing takes place at the right time.
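
A minimal sketch, in Python, of the kind of expiry check this implies. It is an illustration only, not the monitoring actually deployed: the function names and the 30-day warning threshold are assumptions made for the example. The expiry string is expected in the format that Python's `ssl.SSLSocket.getpeercert()` reports under the `notAfter` key.

```python
import ssl
import time
from typing import Optional

WARN_DAYS = 30  # hypothetical threshold, chosen only for this example


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the number of days until a certificate expires.

    `not_after` uses the certificate time format, for example
    "Nov 13 00:00:00 2016 GMT". `now` is seconds since the epoch and
    defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400


def needs_renewal(not_after: str, now: Optional[float] = None,
                  warn_days: int = WARN_DAYS) -> bool:
    """Return True when the certificate expires within `warn_days` days
    (or has already expired)."""
    return days_until_expiry(not_after, now) <= warn_days
```

A check like this could run on a schedule and raise an alert well before the expiry date, leaving time to test the replacement certificates.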

Posted Nov 14, 2016 - 14:55 EST

Resolved
We're back online after resolving the issue with our SSL certificates in our various database configurations.
Posted Nov 12, 2016 - 20:01 EST
Identified
We experienced an issue with the SSL certificate on our MongoDB instance. We're currently addressing it and should have a resolution in place shortly.
Posted Nov 12, 2016 - 19:50 EST
Investigating
We're currently seeing an excessive number of 400 errors on calls to our Activity API and are investigating. We believe this to be an issue with an SSL certificate on our database.
Posted Nov 12, 2016 - 19:43 EST