Summary of the AWS Service Event in the US East Region
In the single datacenter that did not successfully transfer to the generator backup, all servers continued to operate normally on Uninterruptible Power Supply (“UPS”) power. As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT. Ten minutes later, the backup generator power was stabilized, the UPSs were restarted, and power started to be restored by 8:14pm PDT. At 8:24pm PDT, the full facility had power to all racks.
… Time for the completion of this recovery was extended by a bottleneck in our server booting process. Removing this bottleneck is one of the actions we’ll take to improve recovery times in the face of power failure.
… As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before. The bug caused the ELB control plane to attempt to scale these ELBs to larger ELB instance sizes. This resulted in a sudden flood of requests which began to backlog the control plane. At the same time, customers began launching new EC2 instances to replace capacity lost in the impacted Availability Zone, requesting the instances be added to existing load balancers in the other zones. These requests further increased the ELB control plane backlog. Because the ELB control plane currently manages requests for the US East-1 Region through a shared queue, it fell increasingly behind in processing these requests; and pretty soon, these requests started taking a very long time to complete.
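The effect of a shared queue described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not a description of how the ELB control plane is built: the request kinds ("scale", "register"), the counts, and the round-robin alternative are all hypothetical. It shows how a burst of one kind of request in a single FIFO queue delays everything behind it, and how partitioning the queue by kind would bound that delay.

```python
from collections import deque


def drain_shared(arrivals):
    """Serve one shared FIFO queue; return requests in the order served."""
    queue = deque(arrivals)
    served = []
    while queue:
        served.append(queue.popleft())
    return served


def drain_partitioned(arrivals):
    """Partition requests by kind, then serve the partitions round-robin."""
    partitions = {}
    for kind, req_id in arrivals:
        partitions.setdefault(kind, deque()).append((kind, req_id))
    served = []
    while partitions:
        # Iterate over a snapshot of the keys so partitions can be removed.
        for kind in list(partitions):
            served.append(partitions[kind].popleft())
            if not partitions[kind]:
                del partitions[kind]
    return served


# Hypothetical workload: a flood of bug-triggered scaling requests arrives
# first, followed by a few customer requests to register new instances.
arrivals = [("scale", i) for i in range(1000)] + [("register", i) for i in range(3)]

# Number of requests served before the first customer registration.
shared_wait = drain_shared(arrivals).index(("register", 0))
partitioned_wait = drain_partitioned(arrivals).index(("register", 0))
```

In this toy workload the first registration waits behind all 1000 scaling requests in the shared queue, but behind only one request when the queue is partitioned.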
Heroku HTTP Routing Errors
The routing outage was the result of three root causes.
The first root cause is related to the streaming data API which connects the dyno manifold to the routing mesh. On the dyno management side, an engineer was performing a manual garbage collection process which created an unusual record in the data stream. On the routing side, the subprocess of the router which handles the incoming stream could not parse this record.
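One common defense against this class of failure is for the stream consumer to set aside records it cannot parse instead of letting a single unusual record disrupt processing. The sketch below is a hedged illustration only: the excerpt does not describe the stream format or the fix Heroku applied, so the JSON-lines encoding, the `app` and `backend` field names, and the `apply_stream` function are all assumptions made for the example.

```python
import json


def apply_stream(lines, routes):
    """Apply route updates from a stream, setting aside unparseable records.

    `lines` is an iterable of JSON strings (an assumed format); `routes` is a
    dict mapping app name to backend address, updated in place.
    Returns the list of raw records that were rejected.
    """
    rejected = []
    for line in lines:
        try:
            record = json.loads(line)
            app, backend = record["app"], record["backend"]
        except (json.JSONDecodeError, KeyError, TypeError):
            # Malformed JSON, a missing field, or a non-object record:
            # keep the raw line for later inspection and keep going.
            rejected.append(line)
            continue
        routes[app] = backend
    return rejected


routes = {}
rejected = apply_stream(
    [
        '{"app": "a", "backend": "10.0.0.1"}',
        "not json at all",
        "[1, 2]",
        '{"app": "b", "backend": "10.0.0.2"}',
    ],
    routes,
)
```

Here the two well-formed records update `routes`, while the two unusual records end up in `rejected` where they can be logged or alerted on.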
Heroku Dyno Outage
On Tuesday (Feb 21), we deployed a code change to the dyno management agent which runs on the machines which comprise our dyno manifold. This change introduced a latent problem with creating extra (“phantom”) dynos, which only became visible under unusual load conditions that would occur that weekend.
Heroku Elevated Error Rates
On Saturday we had a rough night. There were problems routing HTTP traffic to customer applications, to varying degrees, for more than four hours. … Upon further investigation, we discovered that we were being attacked by a very large scale distributed denial of service (DDoS) attack. This attack was sending multiples of our normal production traffic levels to our HTTP routing system which was unable to handle the load.
Summary of the Amazon EC2, Amazon EBS, and Amazon RDS Service Event in the EU West Region
First, our EC2 management service (which handles API requests to RunInstances, CreateVolume, etc.) has servers in each Availability Zone. The management servers which receive requests continued to route requests to management servers in the affected Availability Zone. Because the management servers in the affected Availability Zone were inaccessible, requests routed to those servers failed. Second, the EC2 management servers receiving requests were continuing to accept RunInstances requests targeted at the impacted Availability Zone. Rather than failing these requests immediately, they were queued and our management servers attempted to process them. Fairly quickly, a large number of these requests began to queue up and we overloaded the management servers receiving requests, which were waiting for these queued requests to complete. The combination of these two factors caused long delays in launching instances and higher error rates for the EU West EC2 APIs.
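The second factor, queueing requests for an impaired zone rather than failing them immediately, can be contrasted with a fail-fast admission check. The following is a minimal sketch under stated assumptions and not AWS's implementation: the `Admission` class, its exceptions, the `max_pending` limit, and the zone names are hypothetical names introduced for illustration.

```python
from collections import deque


class ZoneUnavailable(Exception):
    """Raised when a request targets a zone marked as impaired."""


class QueueFull(Exception):
    """Raised when the pending queue has reached its limit."""


class Admission:
    """Reject doomed or excess requests up front instead of queueing them."""

    def __init__(self, max_pending):
        self.max_pending = max_pending
        self.pending = deque()
        self.impaired_zones = set()

    def submit(self, request_id, zone):
        # Fail fast: a request aimed at an impaired zone cannot complete,
        # so it should not occupy queue space or a waiting server.
        if zone in self.impaired_zones:
            raise ZoneUnavailable(zone)
        # Bound the queue so a surge cannot grow the backlog without limit.
        if len(self.pending) >= self.max_pending:
            raise QueueFull(request_id)
        self.pending.append((request_id, zone))


admission = Admission(max_pending=2)
admission.impaired_zones.add("zone-a")  # hypothetical zone name
admission.submit("req-1", "zone-b")
try:
    admission.submit("req-2", "zone-a")
    failed_fast = False
except ZoneUnavailable:
    failed_fast = True
```

With this check, the request aimed at the impaired zone is rejected immediately and the caller can retry in another zone, while the healthy-zone request is queued as usual.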
Heroku: HTTP and Git Connectivity Problems
At approximately 03:30 Pacific time, it was determined that Heroku was the target of a distributed denial of service (DDoS) attack. The method of the DDoS was a SYN flood [1]. Our network engineers applied a variety of host-based techniques to reduce the effect of the attacks. A combination of tweaked networking configuration and some quickly built tools for maintaining system firewall rules helped us get the affected servers to the point where they were no longer crashing, but we were still seeing very high levels of network packet loss to the nodes — users often had to reload multiple times in order to get a web page to load.