Event Start Time: 1:40 pm
Event End Time: 2:10 pm
Issue: Access interruption due to load balancer failure
1:40 pm The infrastructure team determined that servers were being cycled prematurely by automation tied to health checks.
1:45 pm The pool was changed to use Server Health Checks instead of Load Balancer Checks in an attempt to prevent the premature cycling. The change inadvertently caused every server in the pool to be terminated and restarted.
1:55 pm The new servers came online one at a time. As each server began passing health checks and became active, it was inundated with connections from active users. The overloaded servers then slowed and failed, which removed them from the load balancer.
2:08 pm The health check was temporarily disabled entirely to allow the servers to come online.
2:10 pm The majority of the servers came online and remained in the load balancer.
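The pattern seen between 1:55 pm and 2:08 pm can be illustrated with a toy model. This is only a sketch, not a reconstruction of the production system: the function name, the user count, the per-server capacity, and the pool size are all hypothetical values chosen for illustration. It assumes one server comes online per tick, that all active users are spread evenly over the in-service servers, and that an overloaded server is removed whenever the health check is enabled.

```python
def simulate_recovery(users, capacity, pool_size, health_check_enabled, ticks):
    """Toy model: return the number of in-service servers at the end of each tick.

    All parameters are hypothetical; none are taken from the incident itself.
    """
    in_service = 0
    history = []
    for _ in range(ticks):
        # One new server finishes starting up and joins the load balancer.
        if in_service < pool_size:
            in_service += 1
        # Every active user is spread evenly across the servers in service.
        load_per_server = users / in_service
        # With the health check enabled, overloaded servers fail and are removed.
        if health_check_enabled and load_per_server > capacity:
            in_service = 0
        history.append(in_service)
    return history


# Hypothetical numbers: 1000 active users, 200 users per server, 8 servers.
with_check = simulate_recovery(1000, 200, 8, health_check_enabled=True, ticks=10)
without_check = simulate_recovery(1000, 200, 8, health_check_enabled=False, ticks=10)
```

Under these assumptions, the pool never recovers while the health check is enabled, because each new server takes the full load alone and is removed before the next one arrives. With the check disabled, the servers accumulate until the pool is full.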
Policy changes: Going forward, health checks will no longer be changed during business hours. Health check changes will be tested on the beta stack before going to production.
Technology changes: In the event of an outage, traffic will be restricted until the server pool has scaled up enough to handle the load from current requests. This will prevent servers from being prematurely removed from the pool because of high load.
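One way to think about the traffic restriction is as an admission fraction tied to the capacity currently in service. The following is a minimal sketch under stated assumptions, not the mechanism actually deployed: the function name and the capacity figures are hypothetical, and it assumes per-server capacity and current demand can be estimated in the same units (for example, concurrent connections).

```python
def admitted_fraction(healthy_servers, per_server_capacity, demand):
    """Return the fraction of incoming traffic to admit (0.0 to 1.0).

    Hypothetical helper: admits only as much traffic as the servers
    currently in service can handle, so that no server is overloaded.
    """
    if demand <= 0:
        return 1.0  # nothing to restrict
    usable_capacity = healthy_servers * per_server_capacity
    return min(1.0, usable_capacity / demand)


# Hypothetical numbers: 1000 connections of demand, 200 per server.
early = admitted_fraction(2, 200, 1000)      # 2 servers up: admit 40% of traffic
recovered = admitted_fraction(8, 200, 1000)  # full pool: admit everything
```

As more servers join the pool, the admitted fraction rises toward 1.0, so the restriction lifts on its own once the pool can carry the full load.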