Updating root zone file
A simple example of the kind of issue throttling protects against is a runaway application that naively retries a request as fast as possible when it fails to get a positive result. Our systems are scaled to handle these sorts of client errors, but during a large operational event, it is not uncommon for many users to inadvertently increase load on the system.
We’d like to share more about the service event that occurred on Monday, October 22nd in the US-East Region. We have now completed the analysis of the events that affected AWS customers, and we want to describe what happened, our understanding of how customers were affected, and what we are doing to prevent a similar issue from occurring in the future.
However, because many of the servers became memory-exhausted at the same time, the system was unable to find enough healthy servers to failover to, and more volumes became stuck.
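
The runaway-retry pattern described above (retrying a failed request as fast as possible) is usually avoided on the client side by waiting longer after each failure. The following is a minimal sketch of one common approach, exponential backoff with random jitter. It is not taken from the event write-up; the function name `call_with_backoff` and its parameters (`attempts`, `base`, `cap`) are hypothetical and chosen only for illustration.

```python
import random
import time


def call_with_backoff(request, attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random.random):
    """Call request() until it succeeds or `attempts` tries are used up.

    All names and defaults here are illustrative assumptions.
    After each failure the client waits a random time between 0 and
    min(cap, base * 2**attempt) seconds before trying again, so that
    many clients failing together do not retry in lockstep.
    """
    for attempt in range(attempts):
        try:
            return request()
        except Exception:
            # Out of tries: surface the last error to the caller.
            if attempt == attempts - 1:
                raise
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng() * ceiling)
```

A caller would wrap the failing operation, for example `call_with_backoff(lambda: client.describe_volumes())`, where `client.describe_volumes` stands in for whatever request is being retried. The `sleep` and `rng` parameters exist so the waiting behaviour can be replaced in tests.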
This caused the system to begin to failover from the degraded servers to healthy servers. As part of replacing that server, a DNS record was updated to remove the failed server and add the replacement server. While not noticed at the time, the DNS update did not successfully propagate to all of the internal DNS servers, and as a result, a fraction of the storage servers did not get the updated server address and continued to attempt to contact the failed data collection server. By PM PDT, about 60% of the affected volumes had recovered.
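
The incomplete DNS propagation described above is the kind of condition a consistency check can surface: compare the answer each DNS server gives for a record against the record set that was intended. The sketch below only shows the comparison step and assumes the per-server answers have already been collected by some other means. The function `find_stale_servers`, the server names, and the addresses are all hypothetical and do not describe AWS's internal tooling.

```python
def find_stale_servers(answers, expected):
    """Return the names of DNS servers whose answer differs from `expected`.

    answers:  dict mapping a DNS server name to the addresses it returned
              for the record being checked (hypothetical input format).
    expected: the addresses the record should contain after the update.
    """
    expected = frozenset(expected)
    return sorted(
        server for server, addrs in answers.items()
        if frozenset(addrs) != expected
    )


# Made-up example data: one server still returns the replaced address.
answers = {
    "dns-a": ["10.0.0.7"],
    "dns-b": ["10.0.0.7"],
    "dns-c": ["10.0.0.3"],  # old address, update not propagated
}
stale = find_stale_servers(answers, expected=["10.0.0.7"])
```

In this made-up example `stale` contains only `"dns-c"`, the server that did not receive the update.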
The team continued to work to understand the issue and restore performance for the remaining volumes. We believe we can make adjustments to reduce the impact of any similar correlated failure or degradation of EBS servers within an Availability Zone.
Impact on the EC2 and EBS APIs
The primary event only affected EBS volumes in a single Availability Zone, so those customers running with adequate capacity in other Availability Zones in the US East Region were able to tolerate the event with limited impact to their applications.
The Primary Event and the Impact to Amazon Elastic Block Store (EBS) and Amazon Elastic Compute Cloud (EC2)
At AM PDT Monday, a small number of Amazon Elastic Block Store (EBS) volumes in one of our five Availability Zones in the US-East Region began seeing degraded performance, and in some cases, became “stuck” (i.e.
The root cause of the problem was a latent bug in an operational data collection agent that runs on the EBS storage servers.