Event Start Time: 12:24 pm
Event End Time: 2:18 pm
Issue: Two database servers experienced high load.
Timeline:
At 12:24 pm a database server was experiencing heavy load. Due to an issue in AWS Aurora, the server started to refuse new database connections and became non-responsive.
At 12:24 pm the issue was reported to the Eduphoria Infrastructure team to investigate. The team found the server was under extreme CPU load for an extended period of time.
At 12:28 pm the first server appeared to recover, then its load escalated again. The server was pulled from service. The decision was made to disable all customers on that server and wait for the server to become responsive again.
At 12:56 pm the database became available again. The Eduphoria Infrastructure team began inspecting the server to verify it was operational.
At 12:58 pm the Eduphoria Infrastructure team re-enabled the customers who were experiencing downtime.
At 1:00 pm another database server entered the same state due to a similar query. The decision was made to wait the server out instead of rebooting and risking an even longer downtime.
At 2:18 pm the Eduphoria Infrastructure team re-enabled the customers who were experiencing downtime.
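For reference, the downtime windows implied by the timeline can be worked out from the timestamps. The short Python sketch below is only illustrative: it assumes each server's window runs from the moment it entered the overloaded state to the moment its customers were re-enabled, and the labels are ours, not official server names.

```python
from datetime import datetime

TIME_FORMAT = "%I:%M %p"  # 12-hour clock, e.g. "12:24 PM"

def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end (both on the same day)."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds() // 60)

# Assumed windows: overloaded state begins -> customers re-enabled.
windows = {
    "first server": ("12:24 PM", "12:58 PM"),
    "second server": ("1:00 PM", "2:18 PM"),
    "whole event": ("12:24 PM", "2:18 PM"),
}

durations = {name: minutes_between(start, end) for name, (start, end) in windows.items()}

for name, minutes in durations.items():
    print(f"{name}: {minutes} minutes")
```

Under those assumptions, the second server's customers were affected for more than twice as long as the first server's, which reflects the long recovery time described under Further Action.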
Further Action:
Eduphoria has already raised the long recovery time issue with AWS. Amazon is aware of the issue but does not expect a resolution for several months.
Amazon has a patch available for the issue that caused the database to refuse connections. This patch is currently in testing on some of the Eduphoria database servers and will be rolled out to all servers when we have assessed its stability.
Two customers (one on each of the two downed servers) are missing scripts that were recently rolled out to all other customers. These scripts improve performance in Forethought. The development team is still working on a way to apply these changes for these two customers without additional downtime. Once the scripts are applied, the load on these servers will be reduced.
We will be moving some of the larger customers off of these two servers to distribute load across our databases more evenly.
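As a rough illustration of what distributing load more evenly can look like, the Python sketch below repeatedly moves the largest customer that still narrows the gap from the busiest server to the least busy one. It is a minimal greedy heuristic under invented assumptions, not a description of Eduphoria's actual migration plan: the server names, customer names, and load figures are all hypothetical.

```python
def rebalance(servers, max_moves=10):
    """Greedily even out load across servers.

    servers maps server name -> {customer name: load}. The mapping is
    modified in place; the return value is a list of (customer, source,
    destination) moves in the order they were made.
    """
    moves = []
    for _ in range(max_moves):
        totals = {name: sum(customers.values()) for name, customers in servers.items()}
        src = max(totals, key=totals.get)  # busiest server
        dst = min(totals, key=totals.get)  # least busy server
        gap = totals[src] - totals[dst]
        # A move only narrows the gap if the customer's load is below it.
        candidates = {c: load for c, load in servers[src].items() if load < gap}
        if not candidates:
            break
        customer = max(candidates, key=candidates.get)  # prefer larger customers
        servers[dst][customer] = servers[src].pop(customer)
        moves.append((customer, src, dst))
    return moves

# Hypothetical load figures (arbitrary units) for four database servers.
servers = {
    "db-a": {"c1": 50, "c2": 30, "c3": 10},
    "db-b": {"c4": 45, "c5": 35},
    "db-c": {"c6": 20},
    "db-d": {"c7": 15},
}

moves = rebalance(servers)
totals = {name: sum(customers.values()) for name, customers in servers.items()}
print(moves)
print(totals)
```

In practice each move would also have to account for migration cost and customer downtime, which this sketch ignores.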