<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><atom:link rel="hub" href="http://tumblr.superfeedr.com/" xmlns:atom="http://www.w3.org/2005/Atom"/><description>A collection of postmortems from server outages.</description><title>Nine Fives</title><generator>Tumblr (3.0; @ninefives)</generator><link>http://ninefives.tumblr.com/</link><item><title>GitHub Downtime last Saturday</title><description>&lt;a href="https://github.com/blog/1364-downtime-last-saturday"&gt;GitHub Downtime last Saturday&lt;/a&gt;: &lt;blockquote&gt;
  &lt;p&gt;When this happened it caused a great deal of churn within the network as all of our aggregated links had to be re-established, leader election for spanning-tree had to take place, and all of the links in the network had to go through a spanning-tree reconvergence. This effectively caused all traffic between access switches to be blocked for roughly a minute and a half.&lt;/p&gt;
  
  &lt;p&gt;[…] When the network froze, many of our fileservers which are intentionally located in different racks for redundancy, exceeded their heartbeat timeouts and decided that they needed to take control of the fileserver resources. They issued STONITH commands to their partner nodes and attempted to take control of resources, however some of those commands were not delivered due to the compromised network. When the network recovered and the cluster messaging between nodes came back, a number of pairs were in a state where both nodes expected to be active for the same resource. This resulted in a race where the nodes terminated one another and we wound up with both nodes stopped for a number of our fileserver pairs.&lt;/p&gt;
&lt;/blockquote&gt;</description><link>http://ninefives.tumblr.com/post/39567454080</link><guid>http://ninefives.tumblr.com/post/39567454080</guid><pubDate>Wed, 26 Dec 2012 00:00:00 -0500</pubDate><category>github</category></item><item><title>GitHub's network problems last Friday</title><description>&lt;a href="https://github.com/blog/1346-network-problems-last-friday"&gt;GitHub's network problems last Friday&lt;/a&gt;: &lt;blockquote&gt;
  &lt;p&gt;In the course of our troubleshooting we discovered that our aggregation switches were missing a number of MAC addresses from their tables, and thus were flooding any traffic that was sent to those devices across all of their ports. Because of these missing addresses, a large percentage of our traffic was being sent to every access switch and not just the switch that the destination device was connected to. During normal operation, the switch should “learn” which port each MAC address is connected through as it processes traffic. For some reason, our switches were unable to learn a significant percentage of our MAC addresses and this aggregate traffic was enough to saturate all of the links between the access and aggregation switches, causing the poor performance we saw throughout the day.&lt;/p&gt;
  
  &lt;p&gt;[…]&lt;/p&gt;
  
  &lt;p&gt;We need to be more mindful of tunnel-vision during incident response. We fixated for a very long time on the idea of a bridge loop and it blinded us to other possible causes. We hope to begin doing more scheduled incident response exercises in the coming months and will build scenarios that reinforce this.&lt;/p&gt;
&lt;/blockquote&gt;</description><link>http://ninefives.tumblr.com/post/37281028854</link><guid>http://ninefives.tumblr.com/post/37281028854</guid><pubDate>Wed, 05 Dec 2012 16:19:51 -0500</pubDate><category>github</category></item><item><title>Asana: On Last Week’s Downtime</title><description>&lt;a href="http://blog.asana.com/2012/09/on-last-weeks-downtime/"&gt;Asana: On Last Week’s Downtime&lt;/a&gt;: &lt;blockquote&gt;
  &lt;p&gt;Though we cannot be absolutely certain, we believe that the combination of our growing number of users and the loss of the database cache that resulted when we resized the database caused a sudden, sharp increase in lock contention within mysql. The issues were compounded by the fact that Amazon uses proprietary technology to power the file system that RDS runs on, but doesn’t offer documentation about how this technology works during a resize operation. Further, the lack of root access to our database made the cause of the problems more difficult for us to understand.&lt;/p&gt;
&lt;/blockquote&gt;</description><link>http://ninefives.tumblr.com/post/32467185218</link><guid>http://ninefives.tumblr.com/post/32467185218</guid><pubDate>Fri, 28 Sep 2012 14:34:37 -0400</pubDate></item><item><title>GitHub availability this week</title><description>&lt;a href="https://github.com/blog/1261-github-availability-this-week"&gt;GitHub availability this week&lt;/a&gt;: &lt;blockquote&gt;
  &lt;p&gt;Monday’s migration caused higher load on the database than our operations team has previously seen during these sorts of migrations. So high, in fact, that it caused Percona Replication Manager’s health checks to fail on the master. In response to the failed master health check, Percona Replication Manager moved the ‘active’ role and the master database to another server in the cluster and stopped MySQL on the node it perceived as failed.&lt;/p&gt;
  
  &lt;p&gt;At the time of this failover, the new database selected for the ‘active’ role had a cold InnoDB buffer pool and performed rather poorly. The system load generated by the site’s query load on a cold cache soon caused Percona Replication Manager’s health checks to fail again, and the ‘active’ role failed back to the server it was on originally.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;[…]&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Upon attempting to disable maintenance-mode, a Pacemaker segfault occurred that resulted in a cluster state partition. … In the second, single-node cluster, node ‘c’ was elected at 8:19 AM, and any subsequent messages from the other two-node cluster were discarded. As luck would have it, the ‘c’ node was the node that our operations team previously determined to be out of date. … As a result of this data drift, inconsistencies between MySQL and other data stores in our infrastructure were possible. … Consequentially, some events created during this window appeared on the wrong users’ dashboards. Also, some repositories created during this window were incorrectly routed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;[…]&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In summary, three primary events contributed to the downtime of the past few days. First, several failovers of the ‘active’ database role happened when they shouldn’t have. Second, a cluster partition occurred that resulted in incorrect actions being performed by our cluster management software. Finally, the failovers triggered by these first two events impacted performance and availability more than they should have.&lt;/p&gt;
&lt;/blockquote&gt;</description><link>http://ninefives.tumblr.com/post/31749768173</link><guid>http://ninefives.tumblr.com/post/31749768173</guid><pubDate>Mon, 17 Sep 2012 17:09:27 -0400</pubDate></item><item><title>Asana Outage – 09/10/12</title><description>&lt;a href="http://help.asana.com/customer/portal/articles/729045-outage-%E2%80%93-09-10-12-"&gt;Asana Outage – 09/10/12&lt;/a&gt;: &lt;blockquote&gt;
  &lt;p&gt;Shortly after this, our DNS provider, GoDaddy, experienced a totally unrelated outage.&lt;/p&gt;
&lt;/blockquote&gt;</description><link>http://ninefives.tumblr.com/post/31337483243</link><guid>http://ninefives.tumblr.com/post/31337483243</guid><pubDate>Tue, 11 Sep 2012 11:22:17 -0400</pubDate><category>asana</category></item><item><title>Heroku Widespread Application Outage</title><description>&lt;a href="https://status.heroku.com/incidents/386"&gt;Heroku Widespread Application Outage&lt;/a&gt;: &lt;blockquote&gt;
  &lt;p&gt;The disruption was precipitated by an Amazon Web Services outage which affected the US East region, beginning at 8:04PM PDT. … Approximately 30% of our EC2 instances, which were responsible for running applications, databases and supporting infrastructure (including some components specific to the Bamboo stack), went offline. …&lt;/p&gt;
  
  &lt;ul&gt;&lt;li&gt;The management API for the AWS US East region became unavailable
  
  &lt;ul&gt;&lt;li&gt;In order to restore sufficient capacity quickly, we needed to bring additional instances online in the same region. Without the API, we couldn’t start any new instances.&lt;/li&gt;
  &lt;li&gt;We regularly rely on this API in order to collect information about the state of our infrastructure, and to diagnose problems. This lack of visibility slowed the recovery process.&lt;/li&gt;
  &lt;/ul&gt;&lt;/li&gt;
  &lt;li&gt;Elastic Load Balancer instances and Elastic IP addresses failed to respond promptly to configuration changes
  
  &lt;ul&gt;&lt;li&gt;We use ELBs and EIPs to redirect traffic away from failed instances to redundant and secondary systems, so when these mechanisms responded slowly or malfunctioned, traffic continued to be routed to systems which were down.&lt;/li&gt;
  &lt;/ul&gt;&lt;/li&gt;
  &lt;/ul&gt;&lt;p&gt;…A large number of EBS volumes, which stored data for Heroku Postgres services, went offline and their data was potentially corrupted. As a result, customer databases remained down until they could be recovered…&lt;/p&gt;
  
  &lt;p&gt;In past EBS incidents, it was very rare for volumes to be damaged even if they sustained downtime, so they were very fast to recover. We were therefore not fully prepared for a recovery effort of this magnitude, and it took several hours to automatically restore the affected databases.&lt;/p&gt;
&lt;/blockquote&gt;</description><link>http://ninefives.tumblr.com/post/27052114847</link><guid>http://ninefives.tumblr.com/post/27052114847</guid><pubDate>Wed, 11 Jul 2012 00:00:00 -0400</pubDate><category>heroku</category></item><item><title>Tarsnap outage</title><description>&lt;a href="http://www.daemonology.net/blog/2012-07-04-tarsnap-outage.html"&gt;Tarsnap outage&lt;/a&gt;: &lt;blockquote&gt;
  &lt;p&gt;2012-06-30 05:25 UTC: I finish configuring the replacement Tarsnap server and start the process of regenerating its local state from S3. The first phase of this process involves reading millions of stored S3 objects; unfortunately, these reads were performed in sequential order, triggering a worst-case performance behaviour in S3. As a result, this phase of recovery took much longer than I had anticipated; unfortunately, the design of the code meant that changing the order in which objects were read was not something I could do “on the fly”.&lt;/p&gt;
&lt;/blockquote&gt;</description><link>http://ninefives.tumblr.com/post/26529310994</link><guid>http://ninefives.tumblr.com/post/26529310994</guid><pubDate>Wed, 04 Jul 2012 21:40:33 -0400</pubDate><category>tarsnap</category></item><item><title>Summary of the AWS Service Event in the US East Region  </title><description>&lt;a href="https://aws.amazon.com/message/67457/"&gt;Summary of the AWS Service Event in the US East Region  &lt;/a&gt;: &lt;p&gt;In the single datacenter that did not successfully transfer to the generator backup, all servers continued to operate normally on Uninterruptable Power Supply (“UPS”) power. As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT. Ten minutes later, the backup generator power was stabilized, the UPSs were restarted, and power started to be restored by 8:14pm PDT. At 8:24pm PDT, the full facility had power to all racks.&lt;/p&gt;

&lt;p&gt;… Time for the completion of this recovery was extended by a bottleneck in our server booting process. Removing this bottleneck is one of the actions we’ll take to improve recovery times in the face of power failure.&lt;/p&gt;

&lt;p&gt;… As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before. The bug caused the ELB control plane to attempt to scale these ELBs to larger ELB instance sizes. This resulted in a sudden flood of requests which began to backlog the control plane. At the same time, customers began launching new EC2 instances to replace capacity lost in the impacted Availability Zone, requesting the instances be added to existing load balancers in the other zones. These requests further increased the ELB control plane backlog. Because the ELB control plane currently manages requests for the US East-1 Region through a shared queue, it fell increasingly behind in processing these requests; and pretty soon, these requests started taking a very long time to complete.&lt;/p&gt;</description><link>http://ninefives.tumblr.com/post/26417608847</link><guid>http://ninefives.tumblr.com/post/26417608847</guid><pubDate>Mon, 02 Jul 2012 00:00:00 -0400</pubDate><category>aws</category></item><item><title>Heroku HTTP Routing Errors</title><description>&lt;a href="https://status.heroku.com/incidents/372/"&gt;Heroku HTTP Routing Errors&lt;/a&gt;: &lt;p&gt;The routing outage was the result of three root causes.&lt;/p&gt;

&lt;p&gt;The first root cause is related to the streaming data API which connects the dyno manifold to the routing mesh. On the dyno management side, an engineer was performing a manual garbage collection process which created an unusual record in the data stream. On the routing side, the subprocess of the router which handles the incoming stream could not parse this record.&lt;/p&gt;</description><link>http://ninefives.tumblr.com/post/26349031150</link><guid>http://ninefives.tumblr.com/post/26349031150</guid><pubDate>Thu, 07 Jun 2012 00:00:00 -0400</pubDate><category>heroku</category></item><item><title>Heroku Dyno Outage</title><description>&lt;a href="https://status.heroku.com/incidents/308"&gt;Heroku Dyno Outage&lt;/a&gt;: &lt;p&gt;On Tuesday (Feb 21), we deployed a code change to the dyno management agent which runs on the machines which comprise our dyno manifold. This change introduced a latent problem with creating extra (“phantom”) dynos, which only became visible under unusual load conditions that would occur that weekend.&lt;/p&gt;</description><link>http://ninefives.tumblr.com/post/26349105793</link><guid>http://ninefives.tumblr.com/post/26349105793</guid><pubDate>Mon, 05 Mar 2012 00:00:00 -0500</pubDate><category>heroku</category></item><item><title>Summary of Windows Azure Service Disruption on Feb 29th, 2012</title><description>&lt;a href="https://blogs.msdn.com/b/windowsazure/archive/2012/03/09/summary-of-windows-azure-service-disruption-on-feb-29th-2012.aspx?Redirected=true"&gt;Summary of Windows Azure Service Disruption on Feb 29th, 2012&lt;/a&gt;: &lt;blockquote&gt;
  &lt;p&gt;When the GA creates the transfer certificate, it gives it a one year validity range. It uses midnight UTC of the current day as the valid-from date and one year from that date as the valid-to date. The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail.&lt;/p&gt;
  
  &lt;p&gt;As mentioned, transfer certificate creation is the first step of the GA initialization and is required before it will connect to the HA. When a GA fails to create its certificates, it terminates. The HA has a 25-minute timeout for hearing from the GA. When a GA doesn’t connect within that timeout, the HA reinitializes the VM’s OS and restarts it.&lt;/p&gt;
  
  &lt;p&gt;If a clean VM (one in which no customer code has executed) times out its GA connection three times in a row, the HA decides that a hardware problem must be the cause since the GA would otherwise have reported an error. The HA then reports to the FC that the server is faulty and the FC moves it to a state called Human Investigate (HI).&lt;/p&gt;
&lt;/blockquote&gt;</description><link>http://ninefives.tumblr.com/post/26423037442</link><guid>http://ninefives.tumblr.com/post/26423037442</guid><pubDate>Wed, 29 Feb 2012 00:00:00 -0500</pubDate><category>azure</category></item><item><title>Heroku Elevated Error Rates</title><description>&lt;a href="https://status.heroku.com/incidents/245"&gt;Heroku Elevated Error Rates&lt;/a&gt;: &lt;p&gt;On Saturday we had a rough night. There were problems routing HTTP traffic to customer applications, to varying degrees, for more than four hours. … Upon further investigation, we discovered that we were being attacked by a very large scale distributed denial of service (DDoS) attack. This attack was sending multiples of our normal production traffic levels to our HTTP routing system which was unable to handle the load.&lt;/p&gt;</description><link>http://ninefives.tumblr.com/post/26349155122</link><guid>http://ninefives.tumblr.com/post/26349155122</guid><pubDate>Tue, 06 Dec 2011 00:00:00 -0500</pubDate><category>heroku</category></item><item><title>Heroku: Intermittent push and unidling errors</title><description>&lt;a href="https://status.heroku.com/incidents/216"&gt;Heroku: Intermittent push and unidling errors&lt;/a&gt;: &lt;blockquote&gt;
  &lt;p&gt;The hot spare for the database was not known to be in a usable state due to missing information in the management UI. We are improving our monitoring to monitor the status of the hot spare to ensure that it is in a usable state. We are also improving the documentation around failing over to the hot spare since it was also not immediately obvious to the responding engineers what the procedure was.&lt;/p&gt;
&lt;/blockquote&gt;</description><link>http://ninefives.tumblr.com/post/26585400496</link><guid>http://ninefives.tumblr.com/post/26585400496</guid><pubDate>Tue, 11 Oct 2011 00:00:00 -0400</pubDate></item><item><title>Summary of the Amazon EC2, Amazon EBS, and Amazon RDS Service Event in the EU West Region  </title><description>&lt;a href="https://aws.amazon.com/message/2329B7/"&gt;Summary of the Amazon EC2, Amazon EBS, and Amazon RDS Service Event in the EU West Region  &lt;/a&gt;: &lt;p&gt;First, our EC2 management service (which handles API requests to RunInstance, CreateVolume, etc.), has servers in each Availability Zone. The management servers which receive requests continued to route requests to management servers in the affected Availability Zone. Because the management servers in the affected Availability Zone were inaccessible, requests routed to those servers failed. Second, the EC2 management servers receiving requests were continuing to accept RunInstances requests targeted at the impacted Availability Zone. Rather than failing these requests immediately, they were queued and our management servers attempted to process them. Fairly quickly, a large number of these requests began to queue up and we overloaded the management servers receiving requests, which were waiting for these queued requests to complete. 
The combination of these two factors caused long delays in launching instances and higher error rates for the EU West EC2 APIs.&lt;/p&gt;</description><link>http://ninefives.tumblr.com/post/26418365337</link><guid>http://ninefives.tumblr.com/post/26418365337</guid><pubDate>Sat, 13 Aug 2011 00:00:00 -0400</pubDate><category>aws</category></item><item><title>Heroku: HTTP and Git Connectivity Problems</title><description>&lt;a href="https://status.heroku.com/incidents/156"&gt;Heroku: HTTP and Git Connectivity Problems&lt;/a&gt;: &lt;p&gt;At approximately 03:30 Pacific time, it was determined that Heroku was the target of a distributed denial of service (DDoS) attack. The method of the DDoS was a SYN flood [1]. Our network engineers applied a variety of host-based techniques to reduce the effect of the attacks. A combination of tweaked networking configuration and some quickly built tools for maintaining system firewall rules helped us get the affected servers to the point where they were no longer crashing, but we were still seeing very high levels of network packet loss to the nodes — users often had to reload multiple times in order to get a web page to load.&lt;/p&gt;</description><link>http://ninefives.tumblr.com/post/26419967667</link><guid>http://ninefives.tumblr.com/post/26419967667</guid><pubDate>Wed, 18 May 2011 00:00:00 -0400</pubDate><category>heroku</category></item><item><title>Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region </title><description>&lt;a href="https://aws.amazon.com/message/65648/"&gt;Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region &lt;/a&gt;: &lt;p&gt;In this case, because the issue affected such a large number of volumes concurrently, the free capacity of the EBS cluster was quickly exhausted, leaving many of the nodes “stuck” in a loop, continuously searching the cluster for free space. 
This quickly led to a “re-mirroring storm,” where a large number of volumes were effectively “stuck” while the nodes searched the cluster for the storage space it needed for its new replica. At this point, about 13% of the volumes in the affected Availability Zone were in this “stuck” state.&lt;/p&gt;

&lt;p&gt;After the initial sequence of events described above, the degraded EBS cluster had an immediate impact on the EBS control plane. When the EBS cluster in the affected Availability Zone entered the re-mirroring storm and exhausted its available capacity, the cluster became unable to service “create volume” API requests. Because the EBS control plane (and the create volume API in particular) was configured with a long time-out period, these slow API calls began to back up and resulted in thread starvation in the EBS control plane. The EBS control plane has a regional pool of available threads it can use to service requests. When these threads were completely filled up by the large number of queued requests, the EBS control plane had no ability to service API requests and began to fail API requests for other Availability Zones in that Region as well.&lt;/p&gt;</description><link>http://ninefives.tumblr.com/post/26418153089</link><guid>http://ninefives.tumblr.com/post/26418153089</guid><pubDate>Fri, 29 Apr 2011 00:00:00 -0400</pubDate><category>aws</category></item><item><title>Heroku: Widespread Application Outage</title><description>&lt;a href="https://status.heroku.com/incidents/151"&gt;Heroku: Widespread Application Outage&lt;/a&gt;: &lt;p&gt;Historically, the best move for us in these incidents is to do our best to keep things running (killing unhealthy instances, etc.) and wait for AWS to resolve things. Rarely has that taken more than an hour or two.&lt;/p&gt;

&lt;p&gt;In this case, the EC2 outage lasted a total of about 12 hours. In the afternoon on Thursday, we were able to begin starting new instances en masse and we believed we’d be just an hour or two away from recovery. The majority of applications were back up on Thursday afternoon, but it took us much longer to recover the remaining ones.&lt;/p&gt;</description><link>http://ninefives.tumblr.com/post/26419904462</link><guid>http://ninefives.tumblr.com/post/26419904462</guid><pubDate>Wed, 27 Apr 2011 00:00:00 -0400</pubDate><category>heroku</category></item><item><title>Heroku Shared Database Offline</title><description>&lt;a href="https://status.heroku.com/incidents/144"&gt;Heroku Shared Database Offline&lt;/a&gt;: &lt;blockquote&gt;
  &lt;p&gt;We have had these types of disk attachment problems in the past and have always successfully recovered from them with the help of our service provider. After escalating this issue to our provider, we worked with them for roughly five hours to get the affected disk attached. At 07:15 AM PDT we were informed that the data on the disk had been corrupted and that we would need to restore from a backup.&lt;/p&gt;
  
  &lt;p&gt;At this time, we provisioned additional database capacity and began the process of restoring the affected databases. This process was hampered by some problems with our tools and with our backup accounting. In some cases, our internal database did not reflect the most recent backups that we had available, and we were forced to develop custom tools to restore the correct versions of these databases.&lt;/p&gt;
&lt;/blockquote&gt;</description><link>http://ninefives.tumblr.com/post/26585365487</link><guid>http://ninefives.tumblr.com/post/26585365487</guid><pubDate>Tue, 29 Mar 2011 00:00:00 -0400</pubDate><category>heroku</category></item><item><title>Heroku Tuesday Postmortem</title><description>&lt;a href="https://blog.heroku.com/archives/2010/10/27/tuesday_postmortem/"&gt;Heroku Tuesday Postmortem&lt;/a&gt;: &lt;p&gt;A slowdown in our internal messaging systems caused a previously unknown bug in our distributed routing mesh to be triggered. This bug caused the routing mesh to fail. After isolating the bug, we attempted to roll back to a previous version of the routing mesh code. While the rollback solved the initial problem, there was an unexpected incompatibility between the routing mesh and our caching service. This incompatibility forced us to move back to the newer routing mesh code, which required us to perform a “hot patch” of the production system to fix the initial bug. This patch was successful and all applications were returned to service.&lt;/p&gt;</description><link>http://ninefives.tumblr.com/post/26349189853</link><guid>http://ninefives.tumblr.com/post/26349189853</guid><pubDate>Wed, 27 Oct 2010 00:00:00 -0400</pubDate><category>heroku</category></item><item><title>Amazon S3 Availability Event</title><description>&lt;a href="http://status.aws.amazon.com/s3-20080720.html"&gt;Amazon S3 Availability Event&lt;/a&gt;: &lt;p&gt;…when the corruption occurred, we didn’t detect it and it spread throughout the system causing the symptoms described above. We hadn’t encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.&lt;/p&gt;</description><link>http://ninefives.tumblr.com/post/26348913929</link><guid>http://ninefives.tumblr.com/post/26348913929</guid><pubDate>Sun, 20 Jul 2008 00:00:00 -0400</pubDate><category>heroku</category></item></channel></rss>
