At 1:10PM PST on October 25th we began encountering increased error rates on our main database cluster. At 12:30PM we had deployed a new version of our main application, which included a database schema migration. While the migration was being applied, our database CPU usage rose above 100% and stayed there for a sustained period.
After roughly 10 minutes of sustained high CPU usage, the AWS Aurora failover process was initiated automatically. This failover process is configured to start whenever access to the cluster becomes limited for more than 5 minutes.
By the time we identified the cause of the error rates, the migration had already timed out, so we immediately initiated a second failover to return the cluster to its normal state. During this second failover, AWS automatically initiated a master credentials change action. This action reset our master cluster password, which led to further connection issues. Once we identified this, we immediately issued a second credentials change request to restore the credentials. The entire process, from the initial increase in error rates to completion of the second credentials change request, took approximately 45 minutes.
The obvious questions in this postmortem are: one, how could we have avoided the initial failover, and two, how could we have avoided the first credentials change request? We were unable to find any meaningful documentation regarding the automatic credentials change and have reached out to AWS for clarification. Regarding the initial failover, we have implemented a new deployment policy that restricts deployments that include database migrations to our scheduled maintenance window (Thursday at 10PM-12AM PST).
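
As an illustration only, the maintenance-window rule above could be enforced by a small check in a deployment script. This is a minimal sketch and not our actual tooling: the function names `in_maintenance_window` and `may_deploy` are hypothetical, and "PST" is assumed to mean a fixed UTC-8 offset.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "PST" is treated as a fixed UTC-8 offset. A real gate might
# prefer a named zone (e.g. America/Los_Angeles) to follow daylight saving.
PST = timezone(timedelta(hours=-8), name="PST")

THURSDAY = 3            # datetime.weekday(): Monday is 0
WINDOW_START_HOUR = 22  # 10PM; the window closes at midnight


def in_maintenance_window(now: datetime) -> bool:
    """Return True if `now` falls within Thursday 10PM-12AM PST.

    `now` is expected to be timezone-aware; it is converted to PST first.
    """
    local = now.astimezone(PST)
    return local.weekday() == THURSDAY and local.hour >= WINDOW_START_HOUR


def may_deploy(has_migration: bool, now: datetime) -> bool:
    """Hypothetical gate: only deployments with migrations are restricted."""
    return (not has_migration) or in_maintenance_window(now)
```

A deployment script could call `may_deploy(has_migration, datetime.now(timezone.utc))` and refuse to proceed when it returns `False`. Deployments without a migration are never blocked by this check.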