The Art of the Meltdown Postmortem

There have been two high-profile technical meltdowns in recent weeks: one at GitLab and another at Instapaper.

In both cases the underlying problem was database-related, and in both cases there were significant shortcomings in the backup regime and in the emergency response routine.

Fortunately for the rest of us, both companies have responded, once the dust cleared, with detailed postmortems.

GitLab posted Postmortem of database outage of January 31:

On January 31st 2017, we experienced a major service outage for one of our products, the online service GitLab.com. The outage was caused by an accidental removal of data from our primary database server.

And Instapaper posted Instapaper Outage Cause & Recovery:

The critical system that failed was our MySQL database, which we run as a hosted solution on Amazon’s Relational Database Service (RDS). Here we’ll cover what went wrong, how we resolved the issue and what we’re doing to improve reliability moving forward.

This is a laudable trend, and one that’s of tremendous utility to the broader digital community. All digital systems fail. All disks fail. All DNS setups go wrong. Having procedures in place to deal with this is all you can do; and because it’s hard to imagine, in advance, how and why things will go wrong, gaining insight from the real-world failures of others with a similar setup is some of the best education you can get.

I learned actionable things from both GitLab’s issue and Instapaper’s.

For example, I’ve added more redundancy to my regular MySQL backups and, following Marco Arment’s example, set up an automatic process that, every day, launches a new EC2 instance, installs MySQL, imports the data dumped from Amazon RDS, and runs tests to check data integrity.
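In broad strokes, a daily restore-and-verify job like that can be sketched with boto3 along the following lines. This is only an illustration of the shape of the process, not the actual script: the AMI, the S3 bucket, the database and table names, and the row-count threshold are all placeholders, credential handling is glossed over, and MariaDB stands in for MySQL because it’s in the stock Amazon Linux repositories.

```python
# Illustrative only: a daily "restore the backup somewhere fresh and test it"
# job. Every name and threshold below is a placeholder.
import boto3

# Commands the throwaway instance runs on first boot: install a MySQL-compatible
# server (package name depends on the AMI), load last night's dump from S3,
# run basic integrity checks, then shut itself down.
USER_DATA = """#!/bin/bash
set -e
yum install -y mariadb-server awscli
systemctl start mariadb
aws s3 cp s3://example-backup-bucket/latest.sql.gz /tmp/latest.sql.gz
gunzip -c /tmp/latest.sql.gz | mysql
# mysqlcheck verifies table integrity; a row count on a key table guards
# against an empty or truncated dump.
mysqlcheck --all-databases --check
rows=$(mysql -N -e "SELECT COUNT(*) FROM appdb.articles")
[ "$rows" -gt 1000000 ] || { echo "row count suspiciously low: $rows" >&2; exit 1; }
shutdown -h now
"""

def launch_restore_test():
    """Launch a short-lived EC2 instance that restores and verifies the dump."""
    ec2 = boto3.client("ec2", region_name="us-east-1")
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder Amazon Linux AMI
        InstanceType="m5.large",
        MinCount=1,
        MaxCount=1,
        UserData=USER_DATA,
        # Let the instance terminate itself when the user-data script shuts down.
        InstanceInitiatedShutdownBehavior="terminate",
    )
    return response["Instances"][0]["InstanceId"]

if __name__ == "__main__":
    # Run from cron (or a scheduled Lambda) once a day.
    print("launched restore-test instance", launch_restore_test())
```

The real version also needs to report its results somewhere visible; a verification job that fails silently is little better than no verification at all.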

And, from Instapaper’s issue, I’ve learned about a hard limit of 2TB for older Amazon RDS instances, and about some of the challenges of migrating from legacy MySQL setups to Amazon’s Aurora.

Being open about failure, especially when the failures are as much human as technical, isn’t easy, and it goes against our natural impulse to contain and control information. So GitLab’s and Instapaper’s engineering teams and management deserve our thanks for realizing that being open is ultimately not only good for their businesses but good for the wider web.
