Jaap Haagmans The all-round IT guy

19 Aug 2013

Plan for downtime in the cloud

Really, any system administrator can build a scalable environment in the cloud. If you have the manpower available, you could consider migrating to AWS, which will probably save you lots of money. However, scalable != highly available, and cloud != always online. That last statement is especially true for Amazon. Some people even argue that AWS is not a "real" cloud service and that, by their definition, Windows Azure would be a better fit.

AWS has had many problems over the past few years. The one everyone remembers is the outage of April 2011 that took down big websites like Reddit and Quora. But there are many, many more examples, like the storm in Virginia that took down an entire Availability Zone and the US-East ELB problems on Christmas Eve 2012. Usually, these incidents are isolated to specific services in a single Availability Zone, but they still had a serious impact on people worldwide (users of Instagram and Netflix, for example). AWS has also seen very long recovery times for some of these outages, frustrating many of its clients, some of whom even left AWS.

So, why am I still advocating the use of AWS? Because AWS provides many tools to actually plan for these kinds of outages. Netflix has posted extensive blog posts on what they could have done to prevent these service outages from taking them down. Reddit even more so. People engineering for the cloud can take these experiences to heart and learn from them.

To plan for downtime, I tend not to rely too heavily on the managed services Amazon provides. EBS, for instance, is sensitive to trouble in a single Availability Zone. So, if you (need to) use EBS, make sure you don't depend on a single volume, because even though EBS volumes are replicated, an AZ can always go down. In fact, the outage of April '11 was an EBS outage, but it took down the entire AZ. Build your own replication cluster across Availability Zones (or even regions) and make sure it can fail over when needed.
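
To give an idea of what such a cross-AZ failover can look like in its simplest form, here is a minimal sketch in Python. It assumes a PostgreSQL primary in AZ 1 with a streaming replica in AZ 2 running on the host that executes the script; the hostname, data directory and thresholds are made up, and a production setup would use something more robust (Heartbeat, Pacemaker or the database's own tooling).

```python
# Minimal failover watchdog sketch. Assumes a PostgreSQL primary in AZ 1 and a
# streaming replica in AZ 2 on the host running this script.
# The hostname, port and data directory are hypothetical placeholders.
import socket
import subprocess
import time

PRIMARY_HOST = "db-primary.az1.internal"          # hypothetical hostname
PRIMARY_PORT = 5432
REPLICA_DATA_DIR = "/var/lib/postgresql/9/main"   # hypothetical path
FAILED_CHECKS_BEFORE_PROMOTION = 3

def primary_is_up(timeout=2):
    """Treat the primary as healthy if its PostgreSQL port accepts TCP connections."""
    try:
        socket.create_connection((PRIMARY_HOST, PRIMARY_PORT), timeout=timeout).close()
        return True
    except OSError:
        return False

failed_checks = 0
while True:
    if primary_is_up():
        failed_checks = 0
    else:
        failed_checks += 1
        if failed_checks >= FAILED_CHECKS_BEFORE_PROMOTION:
            # Promote the local replica to become the new primary. The application
            # still needs to be pointed at it afterwards (via DNS, a virtual IP or
            # an updated connection string).
            subprocess.check_call(["pg_ctl", "promote", "-D", REPLICA_DATA_DIR])
            break
    time.sleep(10)
```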

When I say "plan for downtime", I don't mean downtime of your application, I mean service disruptions within AWS. Of course your application may experience downtime (failing over also takes time), but you will want to make sure you can recover from an AZ outage (or maybe even a region outage) as fast as possible.

Identifying single points of failure

To make sure your setup actually is highly available, you will have to eliminate single points of failure. To find them, draw your entire setup and start crossing things out. What happens if your database server fails? What happens if the webservers in AZ 2 can't connect to the database servers in AZ 1? What happens if EBS in AZ 1 fails? What happens when an AZ goes down entirely? Once you've eliminated the single points of failure you find, you can easily test whether your plan works by simply shutting down instances (preferably in a copy of your environment set up for testing). For example, restart every instance one by one, or stop all instances in AZ 1 and see what happens. And if this all works, try doing the same on your live environment. Monitor your application(s) using tools like New Relic to see whether error rates increase for end users.
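
If your environment runs on EC2, this kind of test can be scripted against the AWS API. A minimal sketch using Python and boto3, where the region, the AZ and the "environment: staging" tag are assumptions, stopping every running instance in one AZ of a test copy:

```python
# Sketch: stop every running EC2 instance in one Availability Zone of a *test*
# copy of the environment, to see how the rest of the setup behaves.
# Assumes boto3 and credentials are configured; the "environment: staging" tag,
# the region and the AZ name are hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

reservations = ec2.describe_instances(
    Filters=[
        {"Name": "availability-zone", "Values": ["us-east-1a"]},
        {"Name": "tag:environment", "Values": ["staging"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instance_ids = [
    instance["InstanceId"]
    for reservation in reservations
    for instance in reservation["Instances"]
]

if instance_ids:
    print("Stopping instances:", instance_ids)
    ec2.stop_instances(InstanceIds=instance_ids)
```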

Real world example

A good example I've seen recently was a company with a well-engineered cluster built on AWS. The entire cluster ran in a private VPC, the only exception being a NAT instance that was responsible for outgoing traffic and incoming SSH connections. That instance ran into a problem with its EBS volume getting "stuck", which meant all API connections to AWS and to the outside world (which the cluster heavily relied on) failed, and the instance itself became unreachable over SSH. Once the problem was identified, it was easily fixed by firing up a new NAT instance, changing the route table and reallocating the public IP address, but some applications were still in trouble because their API requests to services outside the VPC had failed. Those applications had to be restored manually by the developers.
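
For reference, the route-table and Elastic IP part of that fix roughly comes down to a few API calls. A sketch using boto3, with all IDs as hypothetical placeholders:

```python
# Rough sketch of the recovery described above: point the private route table at
# a freshly launched NAT instance and move the public (Elastic) IP over to it.
# All IDs are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

NEW_NAT_INSTANCE_ID = "i-0123456789abcdef0"
PRIVATE_ROUTE_TABLE_ID = "rtb-0123456789abcdef0"
ELASTIC_IP_ALLOCATION_ID = "eipalloc-0123456789abcdef0"

# A NAT instance needs its source/destination check disabled.
ec2.modify_instance_attribute(
    InstanceId=NEW_NAT_INSTANCE_ID,
    SourceDestCheck={"Value": False},
)

# Send the private subnets' default route through the new NAT instance.
ec2.replace_route(
    RouteTableId=PRIVATE_ROUTE_TABLE_ID,
    DestinationCidrBlock="0.0.0.0/0",
    InstanceId=NEW_NAT_INSTANCE_ID,
)

# Re-associate the public Elastic IP with the new NAT instance.
ec2.associate_address(
    AllocationId=ELASTIC_IP_ALLOCATION_ID,
    InstanceId=NEW_NAT_INSTANCE_ID,
    AllowReassociation=True,
)
```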

The company quickly identified two single points of failure: the NAT instance was one, the outgoing API requests were another. I'd also like to point out that if this had been caused by an AZ outage, it would have been much harder to recover from, because the AWS API tends to get overloaded during these outages, making it impossible to launch new instances or reassign IP addresses, for example.

To address the NAT issue, the company launched two NAT instances in every AZ and created a routing table for every subnet, making sure every subnet connected to the Internet Gateway through the network interface attached to the active NAT instance. Heartbeat was used to reassign this network interface when an instance became unresponsive, and from the outside world the public IP attached to this interface was used to connect to the NAT instances. The API problem was something that needed fixing in the application itself. For Ruby on Rails, I can recommend delayed_job for executing API requests to the outside world, because it can retry failed requests and can be used to log those failures. It's easy to implement functionality that stops a worker when (for example) the API you're trying to reach is down and restarts it when it comes back.
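
For illustration, the takeover that Heartbeat triggers could look roughly like the boto3 sketch below, with hypothetical IDs. Note that it still depends on the AWS API being reachable, which is exactly the caveat in the list further down.

```python
# Rough sketch of the takeover a Heartbeat resource script could perform on the
# standby NAT instance: detach the shared network interface from the failed
# instance and attach it locally. All IDs are hypothetical placeholders.
# Note: an ENI lives in one subnet, so this only works between instances in the
# same AZ, and it depends on the AWS API being reachable.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

SHARED_ENI_ID = "eni-0123456789abcdef0"       # the ENI the route tables point at
STANDBY_INSTANCE_ID = "i-0fedcba9876543210"   # this (healthy) NAT instance

eni = ec2.describe_network_interfaces(
    NetworkInterfaceIds=[SHARED_ENI_ID]
)["NetworkInterfaces"][0]

# Detach the interface from the unresponsive NAT instance, if it's still attached.
attachment = eni.get("Attachment")
if attachment:
    if attachment.get("InstanceId") == STANDBY_INSTANCE_ID:
        raise SystemExit("Interface is already attached to this instance")
    ec2.detach_network_interface(AttachmentId=attachment["AttachmentId"], Force=True)
    ec2.get_waiter("network_interface_available").wait(
        NetworkInterfaceIds=[SHARED_ENI_ID]
    )

# Attach the interface to the standby instance as a secondary interface.
ec2.attach_network_interface(
    NetworkInterfaceId=SHARED_ENI_ID,
    InstanceId=STANDBY_INSTANCE_ID,
    DeviceIndex=1,
)
```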

What I've learned

For people who have been building highly available setups outside of AWS, most of this is common sense. However, AWS opens up a few new possibilities you can take advantage of. Hosting your applications in multiple datacenters isn't always feasible, but AWS might just provide the middle ground you've been looking for. Here are a few things I've learned since I started designing for AWS:

  • Instances fail. It's that simple. When planning your AWS setup, think of your EC2 instances as disposable. It's often much harder to recover a failing instance than to simply fire up a new one. Automate as much as you can using launch scripts (or, if you can, auto scaling); see the sketch after this list.
  • Don't rely on the API. In the NAT example above, the API is used to fail over to another NAT instance by reassigning the network interface. During a major outage, this might fail because everyone is hitting the API at once. If you rely on the API to be able to fail over, rethink that procedure.
  • EBS volumes are not simply hard disks. They're quite reliable, but performance varies a lot. In fact, latency might be a big issue for some applications.
  • AWS services can also fail. What happens if your Elastic Load Balancer fails? Will you be at the mercy of Amazon, or do you have a backup plan?
  • Don't lock yourself in. Because of that last point, you might want to consider using as few AWS services as possible. You can set up your own highly available load balancer using HAProxy, for example. ElastiCache can be replaced by your own memcached cluster. And DynamoDB is just a NoSQL document store; you could also take a look at CouchDB or MongoDB. Even S3 can be replaced if needed (by something like Riak). If you don't lock yourself in, you can easily have disaster recovery in place at any other hosting provider outside of AWS.
  • Employ N+1 redundancy at every level. A hot spare in the same AZ isn't going to help you if the AZ completely loses power. This is something AWS is quite unique in: you can actually "plan" for failover in case an atom bomb drops on the US, without having to move to another provider. It might be a bit of a stretch, but there are companies and governments that require this level of planning. Although, if this is a requirement, you will probably also need a disaster recovery strategy outside of AWS.
  • You're responsible. I know this will sound crude, but it's a mindset I've adopted and it works for me. If your setup goes down because an AWS service fails, don't take it out on Amazon. If you've created an AZ dependency and an AZ goes down, that's on you. If you've created a region dependency and the entire region goes down, that's on you. If AWS as a whole goes down, that's on you. Not because it's your fault that an AWS service fails, but because you've created a dependency. Agreed, I often depend on at least one AZ staying up in a region, which is mostly a cost-versus-benefit decision. If the entire region went down, I would have to accept that my setup won't fail over. But if you were hosted in a single datacenter and that datacenter ran into problems for any reason, you'd have the same issues. That's not specific to AWS.
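
To illustrate the first point in the list above (treating instances as disposable and automating their launch), here is a minimal boto3 sketch. The AMI, instance type, key pair, security group and the bootstrap commands in the user-data script are all hypothetical placeholders.

```python
# Sketch: fire up a replacement web server that configures itself at boot via a
# user-data script, instead of trying to nurse a failing instance back to life.
# ImageId, instance type, key name, security group and the bootstrap commands
# are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

USER_DATA = """#!/bin/bash
# Hypothetical bootstrap: install the web server and pull the application code.
yum install -y nginx
service nginx start
"""

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="my-keypair",
    SecurityGroupIds=["sg-0123456789abcdef0"],
    UserData=USER_DATA,
)
```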

There's no final best-practices guide for AWS, because defining these best practices is an ongoing process. Some issues are better resolved at the software level; others can be (relatively) safely left to AWS. Try to read the postmortems Amazon has published over the years to get some insight into how these kinds of problems occur, and try to translate them to your own setup. Continue to identify weak links as your setup grows and adjust accordingly.