Lessons From The Heroku Amazon OutagePosted By: Nati Shalom on June 18, 2012
Earlier this week we experienced the third significant AWS outage in the past 14 months, as reported by Justin Lee in his report Heroku, Pinterest Among Sites Knocked Offline in Amazon Data Center Outage on theWhirr magazine:
An Amazon data center in Ashburn, Virginia suffered a power outage at 9:45 p.m. PDT on Thursday, causing some websites using AWS cloud technology to go offline. High-profile websites like Heroku, Pinterest, Quora and HootSuite saw downtime, as well as many smaller sites.
In this post I'd like to briefly address the lessons from this experience, and more importantly, focus on the lessons from this sort of failure and their effect on public PaaS offerings, such as Heroku.
Lesson from the Heroku Outage: Choose the Right PaaS for the Job
One of the promises of PaaS is productivity. PaaS providers like Heroku claim to increase productivity by abstracting the user from the details of the underlying infrastructure, and even go so far as to claim that PaaS makes application operations redundant.
Lesson 1: Choose the Right PaaS for the Job
The main lesson from this outage is that relying on the PaaS provider to carry all your operations isn't always a safe bet. When we move to PaaS we still need to understand how they run their disaster recovery, high-availability, and scaling procedures. Heroku-like PaaS also forces you to a lowest common denominator approach to dealing with continuous availability and scalability. In reality, however, there are many tradeoffs between scalability, performance, and high availability. The best fit between those tradeoffs tends to be application specific, so compromising on a lowest common denominator could be less productive and more costly at the end of the day.
Which brings me to the last point in this section -- PaaS was meant to provide a higher productivity for running our apps on the cloud by abstracting the details of how we run our application (the operation) from the application developer. The black-box approach of many of the public PaaS offerings, such as Heroku, is often an extreme measure in this regard. There is often a close coupling between what the application does and the way we run it. A new class of Open PaaS platforms such as Cloudify, Cloud Foundry, and OpenShift offer a different open source alternative that gives you more control of the underlying PaaS platform. Cloudify takes it even further, providing an open recipe model that integrates with the likes of Chef, enabling you to easily customize and control your operations without affecting developer productivity.
Lesson 2: Database Availability Must Address Data Center Failure
The other area that Heroku, and to be honest, most other PaaS offering don't adequately address is database high-availability, which is obviously a tough area. Specificaly, in the event of data center failure or availability zone failure, as in the present case. To deal with database availability, it is necessary to ensure real-time synchronization of the database across sites. The example at the bottom of this post refers to a specific way this can be done, between two mySql instances running on Amazon and Rackspace with Cloudify.
Lesson 3: Coping with Failure, Avoiding a Single Point of Failure
The general lesson from this and previous failures is actually not new. To be fair, this lesson is not specific to AWS or to any cloud service. Failures are inevitable, and often happen when and where we least expect them to. Instead of trying to prevent failure from happening we should design our systems to cope with failure.
The method of dealing with failures is also not that new -- use redundancy, don't rely on a single point failure (including a data center or even a data center provider). Automate the fail-over process, etc...
Haven't Learned from Past Lessons?
The question that comes out of this experience IMO is not necessarily how to deal with failures (those lessons are as old as the mainframe or even older), but rather -- why are we failing to implement the lessons? Assuming that the people running these systems are among the best in the industry makes this question even more intresting. Here is my take on it:
- We give up responsibility when we move to the cloud: When we move our operations to the cloud, we often assume that we're out-sourcing our data center operation completely, including our disaster recovery procedures. The truth is that when we move to the cloud we're only outsourcing the infrastructure, not our operations, and the responsibility of how to use this infrastructure remain ours.
- Complexity: The current DR processes and tools were designed for a pre-cloud world, and do not work well in a dynamic environment, such as the cloud. Many of the tools that are provided by the cloud vendors (Amazon in this specific case) are still fairly complex to use.
Implementing Past Lessons in a Cloud World
The first point is is easy -- we need to assume full responsibility for our applications' disaster recovery procedures, in the cloud world just as if we were running our own data center. The hard part in the cloud world is that we often have less visibility, control, and knowledge of the infrastructure, which affects our ability to protect our applications -- and each sub-component of our application -- from failure. On the other hand, the cloud enables us to spawn new instances easily on various data center locations, a.k.a Availability Zones.
And so, most failures can be addressed by moving from the failed system to a completely different system regardless of the root cause of the failure. Therefore, the first lesson is that in the cloud world it is easier to implement disaster recovery plans, by moving our application traffic to a completely different redundant system in a snap, rather than trying to protect every component of our application from a failure. If we're willing to tolerate a short window of downtime, we can even use an on-demand backup site rather than pay the consistent cost and overhead of maintaining a hot backup site.
Which brings me to the next point: What do we need to build such a solution?
A consistent redundant environment that is ready to take over in case of failure needs to include the following elements:
- Workload Migration: Specifically, the ability to clone your application environment and configuration in a consistent way accorss sites, and on demand.
- Data Synchronization: The ability to maintain a real-time copy of the data between two sites.
- Network Connectivity: Enabling the flow of network traffic between two sites.
Which leads to the second challenge: Complexity. Here, I would use an example of a simple web-app and show how we can easily create two sites on demand. I would even go so far as to set this environment on two separate clouds to show how we can ensure an even higher degree of redundancy by running our application across two different cloud providers.
A Step by Step Example: Fail-Over from AWS to Rackspace
In this example, we picked Amazon and Rackspace as the two target sites. The same solution would also work between two availability zones in Amazon or data centers. We've also tried the same example with a combination of HP Cloud Services and a flavor of a private cloud.
The example demonstrates a very simple web application with global load-balancer (Rackspace), and a Web application (Pet Clinic) based on Tomcat as the Web front end and MySQL as the database.
The Goal: Fail-over with no change to the target application
The goals that we set for ourselves were seamlessness and no change to the target application or database. We achieved this by plugging the replication service into the existing instances of MySQL. The replicating service listened to the MySQL events and replicated every change to its peer MySQL instance. Cloudify enabled us to clone the same application in both Amazon and Racksapce while maintaining a consistent configuration setup as well as consistent scaling and fail-over SLAs. Cloudify does this by abstracting all the information through portable recipe definitions. Cloudify wraps the application instances with its management and control services based on the definition provided in the recipe. This enabled us to clone the environment as well as add elastic scaling without changing the target application (Pet Clinic in this case).
You can read the full details, including code references on Github, in Dotan Horvits' blog post: AWS Outage Thoughts on Disaster Recovery Policies.