Event Start Time: 7:15 am
Event End Time: 9:32 am
Issue: A single database server experienced an extended recovery time after a failure.
At 7:15 am a database server was experiencing heavy load. Due to an issue in AWS Aurora, the server started to refuse new database connections.
At 7:30 am the issue was reported to the Eduphoria Infrastructure team for investigation. The team found that the server had been under extreme CPU load for an extended period of time.
At 7:38 am the decision was made to restart the database server. When a server is under extreme load, database recovery after a reboot can take an extended period if binary logging is turned on.
At 9:25 am the database became available again. The infrastructure team began inspecting the server to verify it was operational.
At 9:31 am the Eduphoria Infrastructure team re-enabled the customers who were experiencing downtime.
Eduphoria has already raised the long recovery time issue with AWS. AWS is aware of the issue but does not expect a resolution for several months.
The query that caused the database to experience heavy load will not be run again during the normal school week. Additional monitoring will be in place when it is run again.
Eduphoria is updating its database server maintenance procedures to list server reboots as a last resort. Other actions will be attempted before a reboot is performed.
Amazon has a patch available for the issue that caused the database to refuse connections. This patch is currently in testing on some of the Eduphoria database servers and will be rolled out to all servers when we have assessed its stability.