Jaap Haagmans The all-round IT guy


Plan for downtime in the cloud

Really, any system administrator can build a scalable environment in the cloud. If you have the manpower available, you could consider migrating to AWS, which will probably save you lots of money. However, scalable != highly available. And cloud != always online. The last statement is especially true for Amazon. Some people even argue AWS is not a "real" cloud service. And by their definition, Windows Azure will probably be a better fit.

AWS has had many problems over the past few years. The one that everyone remembers is the outage of April 2011 that took down big websites like Netflix and Reddit. But there are many, many more examples, like the storm in Virginia that took down an entire Availability Zone and the US-East ELB problems on Christmas Eve 2012. Usually, these incidents are isolated to specific services in a single Availability Zone, but they managed to have a serious impact on many people worldwide (using Instagram and Netflix, for example). AWS has experienced very long recovery times for some of these outages, frustrating many of its clients, some of whom even left AWS.

So, why am I still advocating the use of AWS? That's because AWS provides many tools to actually plan for these kinds of outages. Netflix has published many extensive blog posts on what they could have done to prevent these service outages from occurring. Reddit even more so. People engineering for the cloud can take these experiences to heart and learn from them.

To plan for downtime, I tend not to rely too much on the managed services Amazon provides. EBS, for instance, is sensitive to trouble in a single Availability Zone. So, if you (need to) use EBS, make sure you don't depend on a single volume, because even though EBS volumes are replicated, an AZ can always go down. In fact, the outage of April 2011 was an EBS outage, but it took down the entire AZ. Make sure you build your own replication cluster across Availability Zones (or even regions) and make sure it can fail over if needed.

When I say "plan for downtime", I don't mean downtime of your application, I mean service disruptions within AWS. Of course your application may experience downtime (failing over also takes time), but you will want to make sure you can recover from an AZ outage (or maybe even a region outage) as fast as possible.

Identifying single points of failure

To make sure your setup actually -is- highly available, you will have to eliminate single points of failure. To find them, try to draw your entire setup and start crossing out some things. What would happen if your database server fails? What would happen if the webservers in AZ 2 can't connect to the database servers in AZ 1? What would happen if EBS in AZ 1 fails? What happens when an AZ goes down entirely? If you've managed to eliminate single points of failure, you can easily test whether your plan works, by simply shutting down instances (preferably using a copy of your environment for testing purposes). For example, restart every instance one by one or stop all instances in AZ 1 and see what happens. And if this all works, even try doing this on your live environment. Monitor your application(s) using tools like New Relic to see if there are any increased error rates for end users.
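As a rough illustration of the crossing-out exercise, the Python sketch below models a setup as a list of instances and "fails" every AZ and every instance in turn. The inventory, the tier names and the function names are all hypothetical; it only checks whether each tier still has at least one instance left, not whether the survivors can actually carry the load or reach each other.

```python
from collections import defaultdict

# Hypothetical inventory: (tier, instance name, Availability Zone).
INVENTORY = [
    ("web", "web-1", "az1"),
    ("web", "web-2", "az2"),
    ("db", "db-1", "az1"),
    ("db", "db-2", "az2"),
    ("nat", "nat-1", "az1"),
]


def tiers_lost(inventory, failed_az=None, failed_instance=None):
    """Return the tiers left with no running instance after a failure."""
    survivors = defaultdict(int)
    tiers = set()
    for tier, name, az in inventory:
        tiers.add(tier)
        if az != failed_az and name != failed_instance:
            survivors[tier] += 1
    return sorted(t for t in tiers if survivors[t] == 0)


def single_points_of_failure(inventory):
    """Cross out every AZ and every instance in turn; report what breaks."""
    findings = {}
    for az in sorted({az for _, _, az in inventory}):
        lost = tiers_lost(inventory, failed_az=az)
        if lost:
            findings["AZ " + az] = lost
    for _, name, _ in inventory:
        lost = tiers_lost(inventory, failed_instance=name)
        if lost:
            findings["instance " + name] = lost
    return findings


if __name__ == "__main__":
    for failure, lost in single_points_of_failure(INVENTORY).items():
        print(f"{failure} down -> no capacity left for: {', '.join(lost)}")
```

In this made-up inventory the lone NAT instance shows up twice: once when its AZ is crossed out and once when the instance itself is.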

Real world example

A good example I've seen recently was a company that had a well-engineered cluster built on AWS. Its entire cluster ran in a private VPC, with the only exception being a NAT instance that was responsible for outgoing traffic and incoming SSH connections. The instance ran into a problem with its EBS volume getting "stuck", which meant all API connections to AWS and the outside world (which it heavily relied on) failed. It was unreachable through SSH. When the problem was identified, it was easily fixed by firing up a new NAT instance, changing the route table and reallocating the public IP address, but some of the applications were still having problems due to the API requests to other applications (outside the VPC) failing. Some of the applications had to be restored manually by the developers.

The company then quickly identified two single points of failure. The NAT instance was one, the API requests were another. I'd like to also point out that if this had been caused by an AZ outage, it would have been very hard to recover from, because the AWS API tends to get overloaded during these outages, making it impossible to launch new instances or reassign IP addresses, for example.

To address the NAT issue, the company launched two NAT instances in every AZ and created a routing table for every subnet, making sure every subnet connected to the Internet Gateway through the network interface that was attached to the active NAT instance. Heartbeat was used to reassign this network interface when an instance became unresponsive. From the outside internet, the public IP attached to this network interface was used to connect to the NAT instances. The API problem was something that needed fixing in the application itself. If you're using Ruby on Rails, I can recommend delayed_job for executing API requests to the outside world, because it's able to retry failed requests and can be used to log these failures. It's easy to implement functionality that can stop a worker when (for example) the API you're trying to connect to can't be reached and restart it when it can.
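The retry-and-log idea is not specific to Rails. The sketch below is a minimal Python version of it, not delayed_job itself: the exception class, the function name and the backoff schedule are assumptions made for illustration.

```python
import time


class ApiUnreachable(Exception):
    """Raised by the (hypothetical) API call when the remote side is down."""


def run_with_retries(job, max_attempts=5, base_delay=1.0,
                     sleep=time.sleep, log=print):
    """Run job(); on ApiUnreachable, log the failure, wait, and retry.

    The wait doubles after every failed attempt (1s, 2s, 4s, ...).
    Returns the job's result, or re-raises after max_attempts failures.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except ApiUnreachable as exc:
            log(f"attempt {attempt}/{max_attempts} failed: {exc}")
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` and `log` in as parameters keeps the function easy to test; a real worker would more likely persist the job and its attempt count so that retries survive a restart.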

What I've learned

For people who have been building highly available setups outside of AWS, most of this is common sense. However, AWS opens up a few new possibilities that can be taken advantage of. Hosting your applications in multiple datacenters is not always possible, but AWS might just provide the middle ground you've been looking for. I've learned a few things since I started designing for AWS:

  • Instances fail. It's that simple. When planning your AWS setup, think of your EC2 instances as being disposable. It's often much harder to recover a failing instance than simply firing up a new one. Make sure you automate as much as you can by using launch scripts (or, if you can, use auto scaling).
  • Don't rely on the API. In the NAT example above, the API is used to fail over to another NAT instance by reassigning the network interface. When a major outage occurs, this might fail because everyone tries to hit the API. If you rely on the API to be able to fail over, rethink this procedure.
  • EBS volumes are not simply hard disks. They're quite reliable, but performance varies a lot. In fact, latency might be a big issue for some applications.
  • AWS services can also fail. What happens if your Elastic Load Balancer fails? Will you be at the mercy of Amazon or do you have a backup plan?
  • Don't lock yourself in. Because of that last point, you might want to consider using as few AWS services as possible. You can set up your own highly available load balancer using HAProxy, for example. ElastiCache can be replaced by your own memcached cluster. And DynamoDB is just a document-based NoSQL store; you could also take a look at CouchDB or MongoDB. Even S3 can be replaced if needed (by something like Riak). If you don't lock yourself in, you can easily have disaster recovery in place at any other hosting provider outside of AWS.
  • Employ N+1 redundancy on every level. A hot spare in the same AZ isn't going to help you if the AZ completely loses power. This is something that AWS is quite unique in: you can actually "plan" for failover in case an atom bomb drops on the US, without having to move over to another provider. It might be a bit of a stretch, but there are companies or governments that require this level of planning. Although, if this is a requirement, you will probably also have to have a disaster recovery strategy available outside of AWS.
  • You're responsible. I know this will sound crude, but it's a mindset I've adopted which works for me. If your setup goes down because an AWS service fails, don't take it out on Amazon. If you've created an AZ-dependency and an AZ goes down, that's on you. If you've created a region-dependency and the entire region goes down, that's on you. If AWS as a whole goes down, that's on you. Not because it's your fault that an AWS service fails, but because you've created a dependency. Agreed, I often depend on at least one AZ staying up in a region, which is mostly a cost-vs-benefit kind of decision. If the entire region would go down, I would have to accept that it won't failover. But if you're hosted in a single datacenter and that datacenter would, for any reason, run into problems, you'd have the same issues. That's not specific to AWS.

There's no final best practices guide for AWS, because defining these best practices is an ongoing process. Some issues can better be resolved on a software level, some issues can be (relatively) safely left to AWS processes. Try to read all postmortems Amazon has published over the years to get some insight in how these kinds of problems occur and try to translate them to your own setup. Continue to identify weak links as your setup grows and adjust accordingly.


Databases in a multi region setup

When it comes to hosting your server cluster in multiple datacenters, one of the things that most people have difficulty with is the database setup. I'll do a more generic "do's and don'ts" article on multi-region setups later on, but this is something I've run into myself on numerous occasions (also outside of AWS). Do mind that this will be focused on MySQL-based use cases, but I'll also tell you why you should consider NoSQL in some situations.

Before talking about complicated forms of master-master replication, I'll provide you with some alternatives that might fit your own use case.

What Amazon does

For their own e-commerce solution, Amazon has geo-specific websites. In the US, the website runs on Amazon.com, in the EU it runs on Amazon.co.uk and Amazon.de. This in part eliminates the need for having a 1-on-1 copy of their database available for every region. It does some endpoint caching using CloudFront to speed up the websites worldwide (which is why amazon.com feels fast from the EU until you log in or order something). However, if placing the order takes too long, you're shown a thank you page that will tell you to wait for confirmation by e-mail (meaning your order has been queued).

Of course, this may not be the scenario you were looking for. Most medium-sized e-commerce companies don't have a website for every region. They just want to manage their website from one central location, but they also want someone from the other end of the world to experience a fast website.

Database slave nodes across the globe

If your application is built so that database writes only occur from the backend which is managed by the company (e.g. a purely informational website with a CMS), you can set up slave nodes in multiple regions. Writes to the master database are passed to its slaves and the new data is available to the world.
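A minimal sketch of what that looks like from the application side, in Python: writes go to the single master, reads go to the replica in the visitor's region. The host names, the region list and the verb-based check are all hypothetical simplifications; a real application would usually let its database layer or a proxy make this decision.

```python
# Hypothetical host names; only the master accepts writes.
MASTER = "db-master.eu-west-1.example.internal"
REPLICAS = {
    "eu-west-1": "db-replica.eu-west-1.example.internal",
    "us-east-1": "db-replica.us-east-1.example.internal",
    "ap-southeast-1": "db-replica.ap-southeast-1.example.internal",
}

WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP")


def pick_host(sql, local_region):
    """Send writes to the master and reads to the local replica.

    Falls back to the master when the region has no replica.
    """
    parts = sql.split(None, 1)
    verb = parts[0].upper() if parts else ""
    if verb in WRITE_VERBS:
        return MASTER
    return REPLICAS.get(local_region, MASTER)
```

Keep in mind that replicas lag behind the master, so a page rendered right after a CMS save may briefly show the old content.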

Like before, this is only a scenario that works in specific use cases, but it can be useful for some. In this specific use case, you could also consider using Cloudfront for your entire website, but that may be more expensive (especially in the case of (multiple) websites running over SSL) and if your website is updated irregularly, it might require you to alter your change workflow to protect consistency.

Optimize your application for multiple regions

You can build an application that is optimized for running in multiple regions. You will have to dig a little outside the scope of AWS for this, but there are tremendous advantages to this. Take the use case of Twitter for example. A tweet can't be edited, only deleted. This means Twitter can (if necessary) queue the creation of tweets. Using AJAX, the tweet is immediately displayed, but the database write can take just a little longer. This is useful to ease and spread the load on database servers, but because you don't have to wait for the actual transaction to happen, tweeting feels very fast for your end user (even though the process might take up to minutes on occasion).
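The queueing pattern can be sketched in a few lines of Python. This is only an illustration of "accept now, write later", not a description of how Twitter implements it; the class name and the list standing in for the database are assumptions.

```python
from collections import deque


class WriteBehindQueue:
    """Accept writes immediately and persist them later in batches."""

    def __init__(self, store):
        self.store = store        # stand-in for the database: a plain list
        self.pending = deque()

    def submit(self, record):
        """Queue the write and return at once, so the UI can show it."""
        self.pending.append(record)
        return {"status": "accepted", "record": record}

    def drain(self, batch_size=100):
        """Called by a background worker: persist up to batch_size records."""
        written = 0
        while self.pending and written < batch_size:
            self.store.append(self.pending.popleft())
            written += 1
        return written
```

This only works because the records are immutable once created. If users could edit a record before it is persisted, the queue would also have to deal with ordering and conflicts.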

Yeah but, you might say, Twitter must have some kind of super-database behind it that can easily replicate across the globe. You might be surprised to hear that Twitter actually uses MySQL to store tweets. Granted, it does use some very fine-grained, sustainable replication and distribution techniques (named Gizzard) and it aggregates and caches a lot of data using other techniques (like FlockDB and memcached), but MySQL is still used as the main data store.

Also, take a look at the other end of the spectrum, where you have the fully distributed database model of Bitcoin transactions. There's no master database, but every wallet has a copy of the transaction database. All connected clients are notified of any transaction, of which they notify their peers. This is how the database is synchronized across the globe.

Global replication with multiple masters

However, if you must use global replication, it's not impossible, although it might take some time to architect and set up. I'd like to take PayPal as an example. PayPal uses a transaction-based consistency model, meaning that everything can be rebuilt using transactions. PayPal also uses replication across four AWS regions. Let's say it stores your current credit level in a table. If two people send you money at the same time, the transactions may be processed in two different regions. In region 1 your credit is set to 30, in region 2 it's set to 20. When those two updates meet, an inconsistency is detected and the real credit can be calculated using the underlying transactions (that are always unique). This is how PayPal can use a multi-master setup using MySQL.
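The credit example can be worked through in Python. The sketch assumes a starting credit of 10 (recorded as transaction t0) and two incoming payments of 20 and 10, which yields the 30 and 20 seen in the two regions; the transaction IDs and function names are made up for illustration.

```python
def merge_ledgers(*regional_ledgers):
    """Union the transactions of all regions, keyed by unique transaction ID.

    A transaction seen in several regions is counted only once.
    """
    merged = {}
    for ledger in regional_ledgers:
        merged.update(ledger)
    return merged


def balance(ledger):
    """The credit level is never trusted as stored; it is recomputed."""
    return sum(ledger.values())


if __name__ == "__main__":
    region1 = {"t0": 10, "t1": 20}   # region 1 believes the credit is 30
    region2 = {"t0": 10, "t2": 10}   # region 2 believes the credit is 20
    print(balance(merge_ledgers(region1, region2)))
```

Neither 30 nor 20 is the right answer once both payments are known; replaying the merged set of unique transactions gives the real credit.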

Master-master replication in MySQL is usually set up so that each node is a slave and a master at the same time. PayPal uses replication in a circular motion. The MySQL node in region 1 is master to the node in region 2 (its slave), which is master to the node in region 3, which is master to the node in region 4, which is master to the node in region 1. This won't sound appealing to every database administrator, because the round-trip time for a write could be quite high. In the case of PayPal, this is fine because every transaction is unique (and, like tweets on Twitter, can't be edited), meaning an eventually-consistent model suits them. But what if you have a typical MySQL PHP application that lets MySQL handle ID generation using auto_increment? Two inserts into the same table on different nodes can be assigned the same ID, which -will- cause a replication error. If the round-trip time for write consistency is 2 seconds (which could be quite conservative), you can imagine this happening all the time on a fairly busy website.

Now, AWS multi-AZ RDS instances don't have this problem. Yes, the replication is done using a master-master setup, but the second DB instance is a hot spare, meaning writes to it will only happen if the other database is down. But a multi-region setup with four active masters will have this problem. This is where MySQL's auto_increment_increment comes in. Set this in your my.cnf on the first node:

auto_increment_increment = 4
auto_increment_offset = 1

On the second node:

auto_increment_increment = 4
auto_increment_offset = 2

Et cetera. This will ensure MySQL increases every ID by the number of nodes available (4 in this case), using a starting offset. This means that the IDs on node 1 will be 1,5,9,13,17,... and on node 2 the range will be 2,6,10,14,18,... This also ensures that when a node can't replicate for some reason, it can "catch up" later without generating additional replication errors.
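To see why the ranges never collide, the arithmetic can be reproduced in a few lines of Python. This only mimics the numbering scheme described above; the function names are illustrative and nothing here talks to MySQL.

```python
from itertools import count, islice


def id_sequence(offset, increment):
    """IDs a node would hand out: offset, offset + increment, ..."""
    return count(start=offset, step=increment)


def first_ids(offset, increment, n):
    """The first n IDs for a node with the given offset and increment."""
    return list(islice(id_sequence(offset, increment), n))


if __name__ == "__main__":
    nodes = 4
    for node in range(1, nodes + 1):
        print(f"node {node}: {first_ids(node, nodes, 5)}")
```

Every ID equals its node's offset modulo the increment, so two nodes can only produce the same ID if they share an offset.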

However, this will not prevent other replication errors. Say that your application has to create invoices and the IDs need to increase by 1 for every invoice (tax regulations in many countries stipulate this). If you use multiple databases, you can't simply take the last invoice ID and add one, because another node might do the same before the changes on the other node have reached it. You also can't use auto_increment, because the invoice numbers won't be consecutive. You will have to design your application around these kinds of issues. One way to go is to set up some kind of invoice queue (SQS can be used for this) that will create these invoices one at a time, making sure invoice IDs are consecutive. This is an asynchronous process though, meaning you can't immediately send a response to the client (although you could use Node.js to simulate it).

Another way to go would be to set up an ID distribution agent, to which every node can send a request for a new ID. It makes sure it distributes every ID only once and can also be set up to support multiple nodes (checking the new ID with the other nodes before giving it out). You will have to take into account that a MySQL insert can also fail on certain occasions (omission of a value that can't be NULL for instance) while the ID has already been granted. So your transaction validation has to be very thorough and you should incorporate a rollback scenario (meaning, for instance, the ID can be reallocated by the distribution agent).
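A single-node, in-memory version of such an agent might look like the Python sketch below. The class and function names are hypothetical, and the hard parts of a real deployment (persisting the counter, running more than one agent, surviving a crash between granting an ID and the insert) are deliberately left out.

```python
import heapq
import threading


class InvoiceIdAgent:
    """Hands out consecutive invoice IDs, reusing IDs that were given back."""

    def __init__(self, last_issued=0):
        self._lock = threading.Lock()
        self._next = last_issued + 1
        self._released = []   # min-heap of IDs returned after a failed insert

    def acquire(self):
        with self._lock:
            if self._released:
                return heapq.heappop(self._released)
            issued = self._next
            self._next += 1
            return issued

    def release(self, invoice_id):
        """Rollback scenario: make the ID available for reallocation."""
        with self._lock:
            heapq.heappush(self._released, invoice_id)


def create_invoice(agent, insert, invoice):
    """Get an ID, attempt the insert, and give the ID back if it fails."""
    invoice_id = agent.acquire()
    try:
        insert(invoice_id, invoice)
    except Exception:
        agent.release(invoice_id)
        raise
    return invoice_id
```

Here `insert` is whatever function performs the actual database write. Because a released ID is handed out again before any new one, the numbering stays gap-free as long as every failed insert is reported back.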

If your MySQL replication setup mainly targets failover scenarios, you might not run into these problems, but it's still something to think about.


Increasing EBS performance and fault tolerance using RAID

Even though I will normally say you should consider your EC2 instances and EBS data as being disposable, this is not always possible. There are setups imaginable that simply cannot make use of S3 for their "dynamic" file storage (e.g. due to use of legacy software packages that highly depend on file system storage). In these situations, only making snapshots might not be sufficient, as the downtime might be quite high.

Increasing performance of EBS

EBS performance is often increased using RAID0, also called striping. Data is distributed over multiple volumes, increasing I/O capabilities. In fact, you can scale your RAID0 setup to up to 16 drives on Windows or even more on Linux. Many AWS users are employing this technique and are reporting it to be quite performant.

What should worry you if the first part of this post applies to you, is that if one EBS drive somehow fails, your entire RAID0 volume will fail, effectively corrupting all data on it. If this doesn't worry you (it might not, many setups on AWS aren't filesystem-dependent), you're now free to go. The rest of this post doesn't apply to you. However, I know there are people out there who will be -very- worried by this.
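How worried you should be depends on the number of volumes in the stripe. The Python sketch below estimates the chance that at least one volume in a RAID0 set fails within a year, using an assumed per-volume annual failure rate of 0.5% and assuming failures are independent. Both are simplifications: EBS failures tend to be correlated within an AZ, so treat the output as an order-of-magnitude estimate.

```python
def stripe_loss_probability(volume_afr, volumes):
    """Chance that at least one of `volumes` striped volumes fails in a year.

    With RAID0, one failed volume takes the whole array with it, so this
    is also the chance of losing the array. Assumes independent failures.
    """
    if not 0 <= volume_afr <= 1:
        raise ValueError("volume_afr must be a probability between 0 and 1")
    return 1 - (1 - volume_afr) ** volumes


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} volumes: {stripe_loss_probability(0.005, n):.2%} per year")
```

The risk grows almost linearly with the number of volumes while it is small, which is why a wide stripe without mirroring or replication deserves a second look.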

Before I go on, I'd like to note that Adrian Cockcroft mentions they only use 1TB EBS volumes to reduce (or maybe even eliminate) multi-tenancy, which will generate more consistent I/O results.

Increasing fault tolerance of EBS volumes

Amazon states that EBS volumes are 99.5-99.9% reliable over any given year (an annual failure rate of 0.1-0.5%). Compared to a regular physical drive, that's an impressive number. However, it might not be enough for you. You'd probably think that RAID1 can solve that. According to Amazon, you're wrong. EBS volumes are replicated within an Availability Zone, meaning that if the physical hardware behind your EBS volume goes down, your EBS volume will persist somewhere else in the AZ. So RAID1 will not reduce the chance that you lose your data (technically, this isn't true, but let's not go into that).

However, there's something Amazon seems to overlook. An EBS volume might underperform from time to time. If you don't use RAID1, you will have to just wait it out (or build a new volume from a snapshot). If you do use RAID1, you can quickly swap the EBS volume for a brand new one and rebuild the RAID1 array. That gives you complete control!

I myself am using RAID10 to make use of the advantages of both RAID1 and RAID0. But it's something you'll have to figure out for yourself. In fact, in some cases RAID1 might outperform RAID0 (especially when looking at random reads). However, RAID1 writes are always slower than RAID0 writes.

Resilient filesystems

I will get back to this after we're done setting it up, but we're working on moving to Gluster for centralized file storage. We're currently using a robust NFS solution to mount a webroot volume to our webservers, but it's still a single point of failure. Gluster provides us with a way to set up a resilient cluster for file storage, that can scale endlessly. Our plan is to build it on top of RAID10 EBS volumes and replicate across Availability Zones.

In any case, EBS performance shouldn't be too big of an issue. Yes, the latency might not be ideal for every use case, but if that forms a real issue, you're probably better off renting a dedicated server solution anyway.


Setting up your own dynamic CDN with edge locations using Varnish and SSL

As I mentioned earlier in my post about the new SSL functionality for Amazon Cloudfront, there's a possibility to set up your own CDN with "edge" locations. I prefer calling them edgy though, because we're using Amazon regions and not the real Amazon Edge locations that are available. But it will provide us with some flexibility. You will only serve from regions you think you need (thus saving costs) and you can always add your own edge location hosted at a datacenter outside of AWS (for instance, somewhere in Amsterdam).

Please mind that I haven't built the POC environment for this yet. I am fairly confident that the below will work, but please comment if you don't agree.

Basically, what we want is to send visitors to the content location nearest to them. On these locations, we will cache static content and try to cache dynamic content as much as possible, while being able to serve content through SSL. Take a look at this sketch for a visual image of what we'll try to do:



For the DNS, we will of course use Amazon's Route 53. Not only does Route 53 answer DNS queries from the nearest possible location, it can also do health checks and route visitors to the endpoint with the lowest latency. Read more about latency-based routing in the AWS docs. Set it up to include your edge locations and monitor the health of these locations.
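Route 53 makes this decision on its own side, so there is nothing to code for it. Purely to illustrate the selection logic, the Python sketch below picks the healthy endpoint with the lowest latency for one visitor. The region names, the latency figures and the function name are invented; this is a model of the behaviour, not the Route 53 API.

```python
def choose_endpoint(latencies_ms, healthy):
    """Return the healthy region with the lowest latency, or None.

    latencies_ms: region -> measured latency from the visitor's network.
    healthy: region -> result of the most recent health check.
    """
    candidates = {
        region: ms
        for region, ms in latencies_ms.items()
        if healthy.get(region, False)
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    latencies = {"eu-west-1": 25, "us-east-1": 95, "ap-southeast-1": 240}
    health = {"eu-west-1": False, "us-east-1": True, "ap-southeast-1": True}
    print(choose_endpoint(latencies, health))
```

With the nearest edge marked unhealthy, the visitor is sent to the next-best region instead of to a dead endpoint.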

The Edge locations

This is where it gets interesting. There are a few possible ways to go. You can set up a simple Apache/nginx server to host your content, but you will have to worry about keeping copies of your content on every server. It's possible, but it might not be as easy to use. Besides, it will not provide a way to easily serve dynamic content.

I've chosen a Varnish based caching solution for this specific use case, because it's very flexible and provides a lot of tweaking options. Besides, Varnish will perform well on a relatively "light" server. Varnish will not be able to handle SSL termination though, so we will use nginx as a proxy to offload SSL. You can read how to Offload SSL using nginx in their wiki.

Setting up your specific Varnish environment is outside the scope of this article, because there are too many use cases to put into one single article. I will provide a few things to consider though.

Let nginx only handle SSL traffic

Varnish is perfectly able to handle unencrypted traffic. So nginx should only listen to port 443. Varnish can listen to port 80.

Use auto-scaling

For some, this is a no-brainer. I think it's best practice to always use auto-scaling. You will probably not want to scale out your edge location, but you do want to automatically terminate unhealthy EC2 instances and fire up new ones. Something to consider here is that you will have to work around the fact that normally, you will not be able to keep the same IP address for your replacement instance. A possible workaround is using ELB, but you will need an ELB for every region you're using and that will cost you more than the instance itself. There's a possibility to "detach" an Elastic IP on termination and attach it again in the launch sequence of your new EC2 instance, but I don't have a ready-to-go script for that solution (yet).
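A possible starting point for the launch-sequence half of that idea is sketched below in Python. It is untested against a live account and makes several assumptions: the instance can reach the metadata service without a session token, the Elastic IP lives in a VPC (so it has an allocation ID), and `ec2_client` is a boto3 EC2 client created by the caller with credentials that permit `associate_address`. The function names are made up.

```python
import urllib.request

# Instance metadata endpoint; assumes token-less (IMDSv1-style) access works.
METADATA_URL = "http://169.254.169.254/latest/meta-data/instance-id"


def current_instance_id(timeout=2):
    """Ask the metadata service which instance this script is running on."""
    with urllib.request.urlopen(METADATA_URL, timeout=timeout) as resp:
        return resp.read().decode("ascii")


def claim_elastic_ip(ec2_client, allocation_id, instance_id):
    """Attach the Elastic IP to this instance, taking it over if needed.

    AllowReassociation=True lets the new instance claim an address that
    is still associated with the instance it replaces.
    """
    response = ec2_client.associate_address(
        AllocationId=allocation_id,
        InstanceId=instance_id,
        AllowReassociation=True,
    )
    return response.get("AssociationId")


# Example call from a launch script (the allocation ID is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="eu-west-1")
#   claim_elastic_ip(ec2, "eipalloc-EXAMPLE", current_instance_id())
```

Note that this relies on the EC2 API being reachable at launch time, so it shares the weakness of any API-based failover during a large outage.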

Consider whether dynamic content can be cached

If you have a session-based website with lots of changing data, it might not pay off to try and cache the dynamic data. If so, use your CDN purely for the static content on another domain. The CDN step will add latency on cache misses, so if the miss rate is very high, you might be better off querying your server directly for dynamic content. If, for instance, you use content.example.com as your CDN URL and point Varnish to www.example.com as your origin, you can set your application to use content.example.com as domain for all static file references (images, JavaScript files, stylesheets) and www.example.com for all other URLs.
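In the application, that split usually ends up in a small URL helper. A minimal Python sketch, using the two example domains from the paragraph above and an assumed list of static file extensions:

```python
CDN_HOST = "content.example.com"     # served through Varnish at the edge
ORIGIN_HOST = "www.example.com"      # dynamic pages, straight to the origin

# Assumed list; extend it to match the static files your site really serves.
STATIC_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico", ".css", ".js")


def url_for(path, scheme="https"):
    """Build an absolute URL, sending static files to the CDN domain."""
    path_without_query = path.split("?", 1)[0].lower()
    is_static = path_without_query.endswith(STATIC_EXTENSIONS)
    host = CDN_HOST if is_static else ORIGIN_HOST
    return f"{scheme}://{host}/{path.lstrip('/')}"
```

Deciding by file extension is the simplest rule that works for most sites; deciding by path prefix (for example everything under /static/) is an equally valid choice.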

Distributing the SSL certificate

Your servers can run solely on ephemeral storage, thanks to the auto-scaling setup. However, one thing that needs to be consistently spread across your endpoints is the SSL certificate itself. I suggest using S3 for this. Give your instances an IAM role (instance profile) that is allowed to read from the bucket where you store your certificates and have them pull the necessary files from S3 on launch. This can also be done for the nginx and Varnish config files if you like.

The origin

The origin can be anything you like, it doesn't even have to be an AWS-hosted setup. But it can also be an S3 bucket or simply your website on a loadbalanced EC2 setup. If, for instance, your origin serves a very heavy PHP-website using Apache (like a Magento webshop), you will reduce the load on Apache tremendously by not having to serve all those small static files, but only do the heavy lifting. I've seen examples of heavy loadbalanced setups that could be reduced to half their size by simply using a CDN.


Amazon AWS Cloudfront now supports custom SSL domains, will we use it?

As you may know, Amazon Cloudfront is a great service that provides you with the possibility to serve both static and dynamic bits of content from an edge location near the end user. I'll do an article on how to optimize Cloudfront for dynamic content later, but I'd like to talk about a new feature that Amazon presented a month ago.

If you're using CloudFront as a traditional CDN, you'll probably have a CNAME configured at content.yourdomain.com or static.yourdomain.com or similar, pointing to CloudFront. For websites running on HTTP, that's perfectly fine. However, if you're using HTTPS, up until a month ago, this would not have been possible. Amazon didn't provide customers with a possibility to upload an SSL certificate for their CDN domain.

However, that has changed. As of mid-June 2013, Amazon supports what they call "Custom SSL certificates", basically enabling you to upload your own SSL certificate that will be distributed across all edge locations.

There is a downside though, which is the cost of this feature. It amounts to a whopping $600.- per certificate per month (pro-rated by the hour, of course). For us, this would mean a 40% increase in cost for our entire AWS infrastructure, which is why we opted not to implement it. We're continuing use of our nginx-based EC2 server as our CDN. We'd love to serve our static content from edge locations, but not at a 40% cost increase.

If you don't mind using a .cloudfront.net subdomain for your static content, you can of course use Amazon's wildcard SSL certificate at a slightly higher rate per 10,000 requests. For many companies, this will do fine.

Update: Amazon has updated its announcement to explain the high cost of this feature. They state the following:

Some of you have expressed surprise at the price tag for the use of SSL certificates with CloudFront. With this custom SSL certificate feature, each certificate requires one or more dedicated IP addresses at each of our 40 CloudFront locations. This lets us give customers great performance (using all of our edge locations), great security (using a dedicated cert that isn’t shared with anyone else) and great availability (no limitations as to which clients or browsers are supported). As with any other CloudFront feature, there are no up-front fees or professional services needed for setup. Plus, we aren’t charging anything extra for dynamic content, which makes it a great choice for customers who want to deliver an entire website (both static and dynamic content) in a secure manner.

The thing is, for $600.- per month, I could rent more than 40 on-demand micro instances, each with its own elastic (dedicated) IP. If you'd spread 8 heavy-reserved small instances over all major regions, you'd be able to use Route53's latency-based routing and it would probably cost you less than $150.- per month (traffic not included). Latency might not be as low as with CloudFront, but I think it's definitely something I'd consider if a client wants to lower its global latency.

I'll do a post about this as well in the near future.
