Jaap Haagmans The all-round IT guy


Finally, SSD-backed EBS drives

Last week, Amazon Web Services announced that EBS now supports SSD-backed volumes. Some say this isn't news, since provisioned IOPS volumes already gave you many more I/O operations per second, but you can now choose between magnetic EBS volumes (at 5 cents / GB) and SSD-based ones (starting at 10 cents / GB).

The "regular" SSD volumes get you a guaranteed baseline of 3 IOPS per provisioned GB, but are able to burst to 3000 IOPS in total. All in all, that is much better than traditional EBS volumes, at a marginally lower price (since you don't pay for the IOPS). There are, of course, also SSD volumes with provisioned IOPS, but it's suggested that these are actually the same as the provisioned IOPS volumes we previously had. Which makes sense, because magnetic storage could never provide 4,000 IOPS.

One little thing to consider: if you've previously always provisioned 1 TB for all your EBS volumes to ensure consistent performance, that's still the most cost-effective strategy. A "regular" 1 TB SSD volume will cost you $100.- per month and get you a baseline of 3000 IOPS, while getting 3000 provisioned IOPS for a 100 GB volume will set you back a little over $200.-. This might need some thought, but it's really worth noting.
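To make that comparison concrete, here is a minimal sketch of the arithmetic. Only the 10 cents / GB and the 3 IOPS per GB come from the announcement as described above; the two provisioned IOPS prices are my assumption of the list prices at the time, so check the current price list before relying on them. I'm treating 1 TB as 1000 GB to match the $100.- figure.

```python
# Prices per month in USD. GP_SSD_PER_GB is taken from the text above;
# the two PIOPS_* prices are ASSUMED list prices, not from the post.
GP_SSD_PER_GB = 0.10
PIOPS_PER_GB = 0.125
PIOPS_PER_IOPS = 0.065


def general_purpose_ssd(size_gb):
    """Return (baseline IOPS, monthly cost) for a "regular" SSD volume."""
    return 3 * size_gb, size_gb * GP_SSD_PER_GB


def provisioned_iops_ssd(size_gb, iops):
    """Return (IOPS, monthly cost) for a provisioned IOPS volume."""
    return iops, size_gb * PIOPS_PER_GB + iops * PIOPS_PER_IOPS


if __name__ == "__main__":
    print("1000 GB regular SSD:", general_purpose_ssd(1000))
    print("100 GB with 3000 provisioned IOPS:", provisioned_iops_ssd(100, 3000))
```

Under these assumptions the oversized "regular" volume delivers the same 3000 IOPS for roughly half the monthly price.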

VPC now supports EIP management on instance launch

Before, it was impossible to specify whether you wanted to use a public (or elastic) IP address when launching an instance in a VPC subnet. That was one of the main reasons I've been unable to create resilient NAT instances without relying on other instances. Either you use a default VPC (I never do) and always have an EIP assigned, or you have a custom VPC and no EIP assigned.

NAT instances are very important in nearly all my setups. But when one fails, the chain of events that follows always depends on other instances. That means that when a NAT instance in AZ 1a fails, the NAT instance in AZ 1b takes over by changing the route table. It also changes the route table for AZ 1a, making the formerly public subnet a private subnet which routes through AZ 1b. When the NAT instance in AZ 1a relaunches through auto scaling, it requests an EIP, attaches it, then changes the route table for AZ 1a back to a public subnet and it will stand on its own feet again. It works, but on failure, the NAT instance in AZ 1b suddenly becomes a single point of failure until someone manually restores service. I've had problems with this setup on some occasions, so I was eagerly waiting for Amazon to come up with a way to simplify this process.

A few days ago, Amazon announced that it now supports EIP management on instance launch. That means that when launching an instance through the console, you can choose whether you want to assign an EIP. They also announced that they're working to support this for the auto scaling service and that's what I've been waiting for. It means that I can relaunch my NAT instance automatically through auto scaling (with a --min-size and --max-size of 1), attach an EIP and change the route table, without relying on a second instance.
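What a self-sufficient NAT instance has to do on boot can be sketched as follows. This is not my production script, just an illustration. It assumes `ec2` is an object that behaves like a boto3 EC2 client, and all IDs are hypothetical placeholders that you would pass in from your launch configuration.

```python
def claim_nat_role(ec2, instance_id, allocation_id, route_table_id):
    """Make a freshly launched instance the NAT for one route table.

    `ec2` is expected to behave like a boto3 EC2 client. The IDs are
    hypothetical placeholders supplied by the caller.
    """
    # A NAT instance forwards traffic that is not addressed to itself,
    # so the source/destination check has to be switched off.
    ec2.modify_instance_attribute(
        InstanceId=instance_id, SourceDestCheck={"Value": False})
    # Attach the Elastic IP so outgoing traffic keeps a stable address.
    ec2.associate_address(
        InstanceId=instance_id, AllocationId=allocation_id)
    # Point the default route of the subnet's route table at this instance.
    ec2.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        InstanceId=instance_id)
```

Because the instance does all three steps itself, nothing in the other AZ has to notice the failure or touch the route table.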

Why I don’t use custom AMIs for EC2 instances

When I started using AWS, most of the documentation on auto scaling and EC2 in general advised creating AMIs to launch a copy of your instance when, for example, scaling up. And it sounds convenient: you don't need to install packages and configure your server. You simply copy the running instance and you're done. When I started auto scaling, I quickly decided this method was not for me.

I found that every time I changed something on one of my servers, I had to create a new AMI from that instance, create a new launch config and terminate running instances that would then launch from the newly created AMI. While this works, I've discovered that using user-data files can be much cleaner than using AMIs. You can easily switch to a new Amazon AMI when it's released and a user-data file will ensure you only install packages you actually need (while an AMI can build up lots of redundant packages and files over the years).

But the most important reason for me to do this was simplifying the release of new application versions. When using an AMI, I'd have to update the application code on one instance, create the AMI, create my launch config, update the auto scaling group and terminate all other running instances. Using a launch script, I can simply push my code to the "stable" branch on git and start terminating instances one at a time. The launch script will ensure all instances have the right packages installed and pull the latest code from git. All our webservers are in a private subnet, connecting through a NAT instance, so the git repository can be set up to only allow access from our public NAT IP addresses. In fact, you can have a git repo within the private subnet that isn't password protected for this purpose.

An example launch config script for a Passenger/nginx server that would host a Rails application could be something like this:

#!/bin/bash
# User-data runs unattended as root, so yum needs -y to skip its prompts
yum -y update
yum -y install git ruby19 httpd gcc gcc-c++ ruby19-devel curl-devel openssl-devel zlib-devel
gem1.9 install passenger bundler --no-ri --no-rdoc
passenger-install-nginx-module --auto --auto-download --prefix=/opt/nginx
adduser webapp
# Fetch the application (the clone creates the yourapp directory) and install its gems
cd /home/webapp && git clone YOUR_GIT_REPO:yourapp.git && cd yourapp && bundle install
# The nginx configuration and init.d script are shipped with the application
cat /home/webapp/yourapp/config/custom-configs/nginx.conf > /opt/nginx/conf/nginx.conf
cat /home/webapp/yourapp/config/custom-configs/initd-nginx > /etc/init.d/nginx
chown webapp:webapp -R /home/webapp/*
chmod 755 /home/webapp
chmod +x /etc/init.d/nginx && chkconfig nginx on && service nginx start

The lines that copy nginx.conf and the init.d script might need a bit of explanation. I've chosen to include the nginx configuration and an init.d script with the app. I could easily put those on something like S3, but I felt that since nginx is installed automatically with every deploy, this was just as easy. However, if you make regular changes to your nginx.conf file, you might want to do this differently.

If you combine this with a Capistrano script that iterates through your running instances (tag your auto scaling group to easily and automatically find the right instances) and shuts them down, you have fully automated deployment in a clustered environment, without having to use your own AMIs. It's as simple as git push && cap deploy!
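The "terminate one at a time" part is where things can go wrong, so here is a minimal sketch of that loop. It is written in Python rather than as a Capistrano task, and `terminate` and `in_service_ids` are hypothetical callbacks that would wrap the EC2 and auto scaling API calls in a real deployment.

```python
import time


def rolling_replace(instance_ids, terminate, in_service_ids,
                    poll_seconds=15, sleep=time.sleep):
    """Terminate instances one at a time, waiting for each replacement.

    terminate(instance_id) and in_service_ids() are hypothetical callbacks;
    in_service_ids() returns the set of instance IDs currently in service.
    """
    wanted = len(in_service_ids())
    for instance_id in instance_ids:
        terminate(instance_id)
        # Auto scaling launches a replacement, which runs the launch script
        # and pulls the latest code. Only move on once the old instance is
        # gone and the group is back at full strength.
        while True:
            current = in_service_ids()
            if instance_id not in current and len(current) >= wanted:
                break
            sleep(poll_seconds)
```

Waiting for the group to be back at full strength before touching the next instance is what keeps the cluster serving traffic during the deploy.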


Plan for downtime in the cloud

Really, any system administrator can build a scalable environment in the cloud. If you have the manpower available, you could consider migrating to AWS, which will probably save you lots of money. However, scalable != highly available. And cloud != always online. The last statement is especially true for Amazon. Some people even argue AWS is not a "real" cloud service. And by their definition, Windows Azure will probably be a better fit.

AWS has had many problems over the past few years. The one that everyone remembers is the outage of April 2011 that took down big websites like Netflix and Reddit. But there are many, many more examples, like the storm in Virginia that took down an entire Availability Zone and the US-East ELB problems on Christmas Eve 2012. Usually, these incidents are isolated to specific services in a single Availability Zone, but they managed to have a serious impact on many people worldwide (using Instagram and Netflix, for example). AWS has experienced very long recovery times for some of these outages, frustrating many of its clients, some of whom even left AWS.

So, why am I still advocating the use of AWS? That's because AWS provides many tools to actually plan for these kinds of outages. Netflix has posted many extensive blog posts on what they could have done to prevent these service outages from occurring. Reddit even more so. People engineering for the cloud can take these experiences to heart and learn from them.

To plan for downtime, I tend not to rely too much on the PaaS services Amazon provides. EBS, for instance, is sensitive to trouble in a single Availability Zone. So, if you (need to) use EBS, make sure you don't depend on a single volume, because even though EBS volumes are replicated volumes, an AZ can always go down. In fact, the outage of April 2011 was an EBS outage, but it took down the entire AZ. Make sure you build your own replication cluster across Availability Zones (or even regions) and make sure it can fail over if needed.

When I say "plan for downtime", I don't mean downtime of your application, I mean service disruptions within AWS. Of course your application may experience downtime (failing over also takes time), but you will want to make sure you can recover from an AZ outage (or maybe even a region outage) as fast as possible.

Identifying single points of failure

To make sure your setup actually -is- highly available, you will have to eliminate single points of failure. To find them, try to draw your entire setup and start crossing out some things. What would happen if your database server fails? What would happen if the webservers in AZ 2 can't connect to the database servers in AZ 1? What would happen if EBS in AZ 1 fails? What happens when an AZ goes down entirely? If you've managed to eliminate single points of failure, you can easily test whether your plan works, by simply shutting down instances (preferably using a copy of your environment for testing purposes). For example, restart every instance one by one or stop all instances in AZ 1 and see what happens. And if this all works, even try doing this on your live environment. Monitor your application(s) using tools like New Relic to see if there are any increased error rates for end users.
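The "draw it and start crossing things out" exercise can also be done mechanically. The sketch below is a toy model, not a monitoring tool: the setup is a directed graph of hypothetical component names, and a component counts as a single point of failure when removing it cuts every path from your users to your data.

```python
def reachable(graph, start, goal, removed):
    """Depth-first search from start to goal, skipping the removed node."""
    if removed in (start, goal):
        return False
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return False


def single_points_of_failure(graph, start, goal):
    """Return the components whose loss cuts every path from start to goal."""
    components = set(graph) | {n for targets in graph.values() for n in targets}
    candidates = components - {start, goal}
    return sorted(c for c in candidates if not reachable(graph, start, goal, c))


if __name__ == "__main__":
    # Hypothetical setup: two web tiers, but one load balancer and one database.
    setup = {
        "users": ["elb"],
        "elb": ["web-az1", "web-az2"],
        "web-az1": ["db-az1"],
        "web-az2": ["db-az1"],
        "db-az1": ["data"],
    }
    print(single_points_of_failure(setup, "users", "data"))
```

In this example the two webservers cover for each other, while the load balancer and the database server in AZ 1 do not have anything to fall back on.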

Real world example

A good example I've seen recently was a company that had a well-engineered cluster built on AWS. Its entire cluster ran in a private VPC, with the only exception being a NAT instance that was responsible for outgoing traffic and incoming SSH connections. The instance ran into a problem with its EBS volume getting "stuck", which meant all API connections to AWS and the outside world (which it heavily relied on) failed. It was unreachable through SSH. When the problem was identified, it was easily fixed by firing up a new NAT instance, changing the route table and reallocating the public IP address, but some of the applications were still having problems due to the API requests to other applications (outside the VPC) failing. Some of the applications had to be restored manually by the developers.

The company then quickly identified two single points of failure. The NAT instance was one, the API requests were another. I'd like to also point out that if this had been caused by an AZ outage, it would have been very hard to recover from, because the AWS API tends to get overloaded during these outages, making it impossible to launch new instances or reassign IP addresses, for example.

To address the NAT issue, the company launched two NAT instances in every AZ and created a routing table for every subnet, making sure every subnet connected to the Internet Gateway through the network interface that was attached to the active NAT instance. Heartbeat was used to reassign this network interface when an instance became unresponsive. From the outside internet, the public IP attached to this network interface was used to connect to the NAT instances. The API problem was something that needed fixing in the application itself. Using Ruby on Rails, I can recommend using delayed_job to execute API requests to the outside world, because it's able to retry failed requests and can be used to log these failures. It's easy to implement functionality that can stop a worker when (for example) the API you're trying to connect to can't be reached and restart it when it can.
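The retry-and-log behaviour is the part that matters here, whatever library provides it. As a language-neutral illustration (this is a plain Python sketch, not delayed_job's API), a call to an outside service could be wrapped like this. The function name, the attempt count and the doubling delay are all my own choices.

```python
import time


def call_with_retries(request, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep, log=print):
    """Run request() until it succeeds, backing off between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except Exception as error:  # a real worker would catch narrower errors
            log("attempt %d failed: %s" % (attempt, error))
            if attempt == max_attempts:
                raise
            # Wait 1, 2, 4, ... times base_delay before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
```

The log lines are what tell you afterwards that the outside API was unreachable, instead of developers having to restore applications by hand.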

What I've learned

For people who have been building highly available setups outside of AWS, most of this is common sense. However, AWS opens up a few new possibilities that can be taken advantage of. Hosting your applications in multiple datacenters is not always possible, but AWS might just provide the middle ground you've been looking for. I've learned a few things since I started designing for AWS:

  • Instances fail. It's that simple. When planning your AWS setup, think of your EC2 instances as being disposable. It's often much harder to recover a failing instance than simply firing up a new one. Make sure you automate as much as you can by using launch scripts (or, if you can, use auto scaling).
  • Don't rely on the API. In the NAT example above, the API is used to fail over to another NAT instance by reassigning the network interface. When a major outage occurs, this might fail because everyone tries to hit the API. If you rely on the API to be able to fail over, rethink this procedure.
  • EBS volumes are not simply hard disks. They're quite reliable, but performance varies a lot. In fact, latency might be a big issue for some applications.
  • AWS services can also fail. What happens if your Elastic Loadbalancer fails? Will you be at the mercy of Amazon or do you have a backup plan?
  • Don't lock yourself in. Because of that last point, you might want to consider using as few AWS services as possible. You can set up your own highly available loadbalancer using HAProxy for example. ElastiCache can be replaced by your own memcached cluster. And DynamoDB is just a document-based NoSQL server, you could also take a look at CouchDB or MongoDB. Even S3 can be replaced if needed (by something like Riak). If you don't lock yourself in, you can easily have disaster recovery in place at any other hosting provider outside of AWS.
  • Employ N+1 redundancy on every level. A hot spare in the same AZ isn't going to help you if the AZ completely loses power. This is something that AWS is quite unique in: you can actually "plan" for failover in case an atom bomb drops on the US, without having to move over to another provider. It might be a bit of a stretch, but there are companies or governments that require this level of planning. Although, if this is a requirement, you will probably also have to have a disaster recovery strategy available outside of AWS.
  • You're responsible. I know this will sound crude, but it's a mindset I've adopted which works for me. If your setup goes down because an AWS service fails, don't take it out on Amazon. If you've created an AZ-dependency and an AZ goes down, that's on you. If you've created a region-dependency and the entire region goes down, that's on you. If AWS as a whole goes down, that's on you. Not because it's your fault that an AWS service fails, but because you've created a dependency. Agreed, I often depend on at least one AZ staying up in a region, which is mostly a cost-vs-benefit kind of decision. If the entire region went down, I would have to accept that it won't fail over. But if you're hosted in a single datacenter and that datacenter would, for any reason, run into problems, you'd have the same issues. That's not specific to AWS.

There's no final best practices guide for AWS, because defining these best practices is an ongoing process. Some issues can better be resolved on a software level, some issues can be (relatively) safely left to AWS processes. Try to read all postmortems Amazon has published over the years to get some insight into how these kinds of problems occur and try to translate them to your own setup. Continue to identify weak links as your setup grows and adjust accordingly.


EC2 performance and cost vs dedicated servers and in-house solutions

When it comes to performance and cost/benefit analysis, AWS has had to endure quite some criticism over the years. The pricing structure of AWS, though transparent, is often labelled as steep and unfavourable to steady, long-term infrastructures. I agree to some point, but infrastructures are rarely steady. I've seen companies splashing cash on hardware that was utilized at 10% during their lifetime. I've also seen companies that grew faster than their infrastructure allowed, requiring them to do a second investment and concede a big write-off on their recently bought hardware. If you want to avoid these situations, you need to plan ahead and hope you don't catch up with the future too soon. Or you'll have to go out for a crystal ball.

For people who simply can't plan that far ahead, virtualisation provides middle ground. Given that your contracts are flexible, you can, for instance, tune up your Exchange server at a moment's notice, with minimal downtime. AWS goes a little step further, enabling you to control the resources yourself, thus providing you with the possibility to plan around your own schedule.

Many people argue that the services other than bare EC2 are expensive. This is mainly due to the fact that AWS provides an extra level of service. With EC2, you're responsible for everything that happens on your server (no matter what kind of support level agreement you have). If you rent an RDS instance though, AWS also takes responsibility for the software layer. When you compare a large EC2 instance with a large RDS instance, you'll see that the resources provided are comparable, but the price of an RDS instance is 8 cents per hour higher (in the EU region). Now, if you're comfortable managing your own MySQL instance, you're probably better off running MySQL on an EC2 instance. And that goes for almost every service AWS provides. You can even set up your own loadbalancers if you'd like. Or, like I argued before, it's possible to set up your own distributed CDN.


So, let's take a look only at the real building blocks: EC2 instances. How do they perform? And how does that compare to our in-house solutions?

For this comparison, I'm taking a look at some benchmarks taken on an m1.large instance. It's said to have 7.5 GiB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of local instance storage, 64-bit platform. Does that mean anything to you? Well, it doesn't to me. How does an EC2 Compute Unit (ECU) relate to a real CPU for example? And 7.5 GiB of memory sounds great, but if all memory buses on the server are stuffed with slow 8 GB RAM slices (for a total of 64 GB RAM) it probably doesn't relate to a dedicated server with 4x 2 GB DDR4 RAM. We all know that slow RAM can be deadly for general performance. So, let's do a benchmark!

Yes, I know that some of you will say that benchmarks are the root of all evil. They can't be trusted. A benchmark today says nothing about a benchmark tomorrow. And you're probably right. But I just want to know the ballpark we're in. So to do that, I'm using sysbench on a large EC2 instance running the 64 bit Linux AMI.


[ec2-user@ip-10-0-0-17 ~]$ sysbench --test=cpu --cpu-max-prime=20000 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing CPU performance benchmark
Threads started!
Maximum prime number checked in CPU test: 20000
Test execution summary:
    total time:                          36.1470s
    total number of events:              10000
    total time taken by event execution: 36.1343
    per-request statistics:
         min:                                  3.57ms
         avg:                                  3.61ms
         max:                                  4.72ms
         approx.  95 percentile:               3.71ms
Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   36.1343/0.00


[ec2-user@ip-10-0-0-17 ~]$ sysbench --test=fileio --file-total-size=1G --file-test-mode=rndrw --init-rng=on --max-time=300 --max-requests=0 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Initializing random number generator from timer.
Extra file open flags: 0
128 files, 8Mb each
1Gb total file size
Block size 16Kb
Number of random requests for random IO: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Threads started!
Time limit exceeded, exiting...
Operations performed:  259620 Read, 173080 Write, 553828 Other = 986528 Total
Read 3.9615Gb  Written 2.641Gb  Total transferred 6.6025Gb  (22.536Mb/sec)
 1442.33 Requests/sec executed
Test execution summary:
    total time:                          300.0010s
    total number of events:              432700
    total time taken by event execution: 5.9789
    per-request statistics:
         min:                                  0.01ms
         avg:                                  0.01ms
         max:                                  0.16ms
         approx.  95 percentile:               0.02ms
Threads fairness:
    events (avg/stddev):           432700.0000/0.00
    execution time (avg/stddev):   5.9789/0.00

Do mind that this is network attached storage (EBS), thus not comparable to a physical disk in a server when it comes to response times. And yes, I know that that's outside the EC2 scope, but Amazon actually recommends against using ephemeral drives for almost anything, so EBS performance is probably what anyone will be looking for anyway. And I'm all about my readers (yes, all 3 of them. Hi mom!).


[ec2-user@ip-10-0-0-17 ~]$ sysbench --test=memory --memory-block-size=1M --memory-total-size=7G run
sysbench 0.4.12:  multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing memory operations speed test
Memory block size: 1024K
Memory transfer size: 7168M
Memory operations type: write
Memory scope type: global
Threads started!
Operations performed: 7168 ( 3649.63 ops/sec)
7168.00 MB transferred (3649.63 MB/sec)
Test execution summary:
    total time:                          1.9640s
    total number of events:              7168
    total time taken by event execution: 1.9552
    per-request statistics:
         min:                                  0.27ms
         avg:                                  0.27ms
         max:                                  0.47ms
         approx.  95 percentile:               0.29ms
Threads fairness:
    events (avg/stddev):           7168.0000/0.00
    execution time (avg/stddev):   1.9552/0.00

Interpreting the results

Well, this is nice and all, but what does it mean? Well, first of all it tells me that the CPU doesn't disappoint. It's a little slower than the quad core 2.53 GHz processor I've tested locally, which on average does 2.85ms per request, but the 4 ECUs are built up out of 2 "virtual cores", so I expected performance to be somewhere near a dual core processor. I just don't have any available.

I was actually amazed by the I/O results. I've calculated 1400 IOPS and a throughput of 22.5 MB/s! Compare that to my 7200RPM SATA disk, which struggles to get 200 IOPS and a throughput of 3.2 MB/s.
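For those who want to check my math, both numbers follow directly from the fileio output above: events divided by total time, and total data transferred divided by total time. A minimal sketch, with the figures copied from the sysbench run:

```python
def iops(events, seconds):
    """I/O operations per second: completed read/write events over wall time."""
    return float(events) / seconds


def throughput_mb_per_s(total_gb, seconds):
    """Throughput in MB/s from the total amount of data transferred in GB."""
    return total_gb * 1024 / seconds


# Figures taken from the fileio run above.
EVENTS = 432700          # "total number of events"
SECONDS = 300.0010       # "total time"
TRANSFERRED_GB = 6.6025  # "Total transferred"

print("IOPS: %.2f" % iops(EVENTS, SECONDS))
print("Throughput: %.3f MB/s" % throughput_mb_per_s(TRANSFERRED_GB, SECONDS))
```

Both results line up with the "Requests/sec executed" and "Mb/sec" values that sysbench prints itself.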

The memory let me down a little. The system I've tested with has 4GB DDR3 RAM and manages to get 4700 ops/s, while the m1.large instance gets about 3650 ops/s. It's still quite good though, considering it's shared memory.

Cost comparison

The m1.large instance isn't cheap. If you want an on-demand instance running for a month, it will set you back $190.- in the EU region. However, going for a heavy reserved instance might be a good choice for many and then it can go as low as $60.- per month (one time fee included). Buying a dual core server with 8 GB RAM will, at the moment, set you back around $700.-. Calculating conservatively, power will cost you about $350.- per year (at 15 cents per kWh, which is nowhere near consumer prices in the EU), meaning running this server for 3 years will cost you $1750.-, not counting maintenance, cooling and possible breakdown. The EC2 instance will have cost you $2160.-. And if it breaks down, you can have a new one running in under 2 minutes.

Now, tell me. If someone told you it costs $400.- to install and maintain a physical server for 3 years, would you go for it? I would.
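The three-year totals above are simple enough to reproduce. A minimal sketch using only the prices mentioned in this post (the function names are just for illustration):

```python
def on_premises_cost(hardware, power_per_year, years):
    """Purchase price plus power, ignoring maintenance, cooling and breakdowns."""
    return hardware + power_per_year * years


def ec2_cost(per_month, years):
    """Total EC2 cost for a fixed monthly price."""
    return per_month * 12 * years


YEARS = 3
server = on_premises_cost(hardware=700, power_per_year=350, years=YEARS)
reserved = ec2_cost(per_month=60, years=YEARS)
on_demand = ec2_cost(per_month=190, years=YEARS)

print("Own server:        $%d" % server)
print("Reserved m1.large: $%d" % reserved)
print("On-demand:         $%d" % on_demand)
print("Reserved premium:  $%d" % (reserved - server))
```

The reserved instance ends up roughly $400.- more expensive over three years, which is the price tag I'm putting on installation and maintenance above. On-demand pricing for three full years is a different story, so this reasoning only holds if you actually reserve.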


Databases in a multi region setup

When it comes to hosting your server cluster in multiple datacenters, one of the things that most people have difficulty with is the database setup. I'll do a more generic "do's and don'ts" article on multi-region setups later on, but this is something I've run into myself on numerous occasions (also outside of AWS). Do mind that this will be focussed on MySQL based use cases, but I'll also tell you why you should consider NoSQL in some situations.

Before talking about complicated forms of master-master replication, I'll provide you with some alternatives that might fit your own use case.

What Amazon does

For their own e-commerce solution, Amazon has geo-specific websites. In the US, the website runs on Amazon.com, in the EU it runs on Amazon.co.uk and Amazon.de. This in part eliminates the need for having a one-to-one copy of their database available for every region. It does some endpoint caching using CloudFront to speed up the websites worldwide (which is why amazon.com feels fast from the EU until you log in or order something). However, if placing the order takes too long, you're shown a thank you page that will tell you to wait for confirmation by e-mail (meaning your order has been queued).

Of course, this may not be the scenario you were looking for. Most medium-sized e-commerce companies don't have a website for every region. They just want to manage their website from one central location, but they also want someone from the other end of the world to experience a fast website.

Database slave nodes across the globe

If your application is built so that database writes only occur from the backend which is managed by the company (e.g. a purely informational website with a CMS), you can set up slave nodes in multiple regions. Writes to the master database are passed to its slaves and the new data is available to the world.

Like before, this is only a scenario that works in specific use cases, but it can be useful for some. In this specific use case, you could also consider using CloudFront for your entire website, but that may be more expensive (especially in the case of (multiple) websites running over SSL) and if your website is updated irregularly, it might require you to alter your change workflow to protect consistency.

Optimize your application for multiple regions

You can build an application that is optimized for running in multiple regions. You will have to dig a little outside the scope of AWS for this, but there are tremendous advantages to this. Take the use case of Twitter for example. A tweet can't be edited, only deleted. This means Twitter can (if necessary) queue the creation of tweets. Using AJAX, the tweet is immediately displayed, but the database write can take just a little longer. This is useful to ease and spread the load on database servers, but because you don't have to wait for the actual transaction to happen, tweeting feels very fast for your end user (even though the process might take up to minutes on occasion).

Yeah but, you might say, Twitter must have some kind of super-database behind it that can easily replicate across the globe. You might be surprised to hear that Twitter actually uses MySQL to store tweets. Granted, it does use some very fine-grained, sustainable replication and distribution techniques (named Gizzard) and it aggregates and caches a lot of data using other techniques (like FlockDB and memcached), but MySQL is still used as the main data store.

Also, take a look at the other end of the spectrum, where you have the fully distributed database model of Bitcoin transactions. There's no master database, but every wallet has a copy of the transaction database. All connected clients are notified of any transaction, of which they notify their peers. This is how the database is synchronized across the globe.

Global replication with multiple masters

However, if you must use global replication, it's not impossible, although it might take some time to architect and set up. I'd like to take PayPal as an example. PayPal uses a transaction based consistency model, meaning that everything can be rebuilt using transactions. PayPal also uses replication across four AWS regions. Let's say it stores your current credit level in a table. If two people send you money at the same time, the transactions may be processed in two different regions. In region 1 your credit is set to 30, in region 2 it's set to 20. When those two updates meet, an inconsistency is detected and the real credit can be calculated using the underlying transactions (that are always unique). This is how PayPal can use a multi-master setup using MySQL.
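To illustrate the idea of rebuilding a balance from unique transactions, here's a toy sketch. It says nothing about how any payment provider actually implements this; the transaction IDs, the starting credit of 10 and the function names are all made up for the example.

```python
def merge_transactions(*regional_logs):
    """Union transaction logs from several regions, keyed by unique ID."""
    merged = {}
    for log in regional_logs:
        for transaction_id, amount in log:
            # The same transaction seen in two regions is only counted once.
            merged[transaction_id] = amount
    return merged


def balance(transactions):
    """The credit level is whatever the underlying transactions add up to."""
    return sum(transactions.values())


# Hypothetical logs: both regions know the opening deposit of 10.
region_1 = [("tx-0", 10), ("tx-1", 20)]   # local credit: 30
region_2 = [("tx-0", 10), ("tx-2", 10)]   # local credit: 20

print(balance(merge_transactions(region_1, region_2)))
```

Neither 30 nor 20 is the right credit level. Only the merged set of transactions gives the real one, which is why the stored credit can be treated as a cache that is recalculated on conflict.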

Master-master replication in MySQL is usually set up so that each node is a slave and a master at the same time. PayPal uses replication in a circular motion. The MySQL node in region 1 is master to the node in region 2 (its slave), which is master to the node in region 3, which is master to the node in region 4, which is master to the node in region 1. This won't sound appealing to every database administrator, because the round-trip time for a write could be quite high. In the case of PayPal, this is fine because every transaction is unique (and, like tweets on Twitter, can't be edited), meaning an eventually-consistent model suits them. But what if you have a typical MySQL PHP application that lets MySQL handle ID generation using auto_increment? Two inserts in the same table on multiple nodes -will- cause a replication error. If the round-trip time for write consistency is 2 seconds (which could be quite conservative), you can imagine this happening all the time on a fairly busy website.

Now, AWS multi-AZ RDS instances don't have this problem. Yes, the replication is done using a master-master setup, but the second DB instance is a hot spare, meaning writes to it will only happen if the other database is down. But a multi-region setup with four active masters will have this problem. This is where MySQL's auto_increment_increment comes in. Set this in your my.cnf on the first node:

auto_increment_increment = 4
auto_increment_offset = 1

On the second node:

auto_increment_increment = 4
auto_increment_offset = 2

Et cetera. This will ensure MySQL increases every ID by the number of nodes available (4 in this case), using a starting offset. This means that the IDs on node 1 will be 1,5,9,13,17,... and on node 2 the range will be 2,6,10,14,18,... This also ensures that when a node can't replicate for some reason, it can "catch up" later without generating additional replication errors.
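If you want to convince yourself that the ranges never overlap, the sequences are easy to simulate. This sketch only mimics the arithmetic of the two settings for an empty table; it doesn't talk to MySQL.

```python
from itertools import count, islice


def auto_increment_ids(offset, increment):
    """Yield IDs as offset, offset + increment, offset + 2 * increment, ..."""
    return count(offset, increment)


NODES = 4
for node in range(1, NODES + 1):
    ids = list(islice(auto_increment_ids(offset=node, increment=NODES), 5))
    print("node %d: %s" % (node, ids))
```

Every node stays within its own remainder class modulo 4, so two nodes can never hand out the same ID, no matter how far replication is lagging.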

However, this will not prevent other replication errors. Say that your application has to create invoices and the IDs need to increase by 1 for every invoice (tax regulations in many countries stipulate this). If you use multiple databases, you can't simply take the last invoice ID and add one, because another node might do the same before the changes on the other node have reached it. You also can't use auto_increment, because the invoice numbers won't be consecutive. You will have to design your application around these kinds of issues. One way to go is to set up some kind of invoice queue (SQS can be used for this) that will create these invoices one at a time, making sure invoice IDs are consecutive. This is an asynchronous process though, meaning you can't immediately send a response to the client (although you could use Node.js to simulate it).
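The essence of the queue approach is that exactly one consumer hands out the numbers. Here's a minimal sketch of that, using Python's in-process `queue.Queue` as a stand-in for SQS. The class name and the order strings are hypothetical.

```python
import queue


class InvoiceNumberer:
    """Single consumer that hands out consecutive invoice numbers."""

    def __init__(self, last_number=0):
        self.last_number = last_number
        self.issued = []

    def drain(self, pending):
        """Process every queued order, one at a time, in arrival order."""
        while True:
            try:
                order = pending.get_nowait()
            except queue.Empty:
                return self.issued
            # Only this consumer ever increments the counter, so there
            # can be no gaps and no duplicates.
            self.last_number += 1
            self.issued.append((self.last_number, order))


if __name__ == "__main__":
    pending = queue.Queue()
    for order in ["order from eu-west", "order from us-east"]:
        pending.put(order)
    print(InvoiceNumberer(last_number=41).drain(pending))
```

All regions can add to the queue at their own pace. The price you pay is the one mentioned above: the invoice number isn't known yet at the moment the order is placed.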

Another way to go would be to set up an ID distribution agent, to which every node can send a request for a new ID. It makes sure it distributes every ID only once and can also be set up to support multiple nodes (checking the new ID with the other nodes before giving it out). You will have to take into account that a MySQL insert can also fail on certain occasions (omission of a value that can't be NULL for instance) while the ID has already been granted. So your transaction validation has to be very thorough and you should incorporate a rollback scenario (meaning, for instance, the ID can be reallocated by the distribution agent).
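As a rough illustration of the distribution agent idea, the Python sketch below hands out every ID once and takes released IDs back so they can be reallocated after a failed insert. The class name IdAgent and its methods are invented for this example. It is a single-process, in-memory sketch: a real agent would need persistent state and the multi-node coordination mentioned above.

```python
import heapq
import threading


class IdAgent:
    """Hands out each ID once; IDs released after a rollback are reused first."""

    def __init__(self, start=1):
        self._next = start
        self._released = []            # min-heap of IDs returned by rollbacks
        self._lock = threading.Lock()  # serialize concurrent requests

    def allocate(self):
        with self._lock:
            if self._released:
                # Fill the gap left by a failed insert before moving on
                return heapq.heappop(self._released)
            value = self._next
            self._next += 1
            return value

    def release(self, value):
        """Give an ID back, e.g. when the MySQL insert failed."""
        with self._lock:
            heapq.heappush(self._released, value)


agent = IdAgent()
first = agent.allocate()    # 1
second = agent.allocate()   # 2
agent.release(first)        # the insert for ID 1 failed, roll it back
reused = agent.allocate()   # 1 again, so the sequence keeps no gap
```

Reusing the lowest released ID first keeps the sequence gap-free, but invoices may then be stored out of chronological order. Whether that is acceptable depends on the rules you have to follow.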

If your MySQL replication setup mainly targets failover scenarios, you might not run into these problems, but it's still something to think about.


Increasing EBS performance and fault tolerance using RAID

Even though I will normally say you should consider your EC2 instances and EBS data as being disposable, this is not always possible. There are setups imaginable that simply cannot make use of S3 for their "dynamic" file storage (e.g. due to use of legacy software packages that highly depend on file system storage). In these situations, only making snapshots might not be sufficient, as the downtime might be quite high.

Increasing performance of EBS

EBS performance is often increased using RAID0, also called striping. Data is distributed over multiple volumes, increasing I/O capabilities. In fact, you can scale your RAID0 setup up to 16 drives on Windows or even more on Linux. Many AWS users are employing this technique and report it to be quite performant.

What should worry you if the first part of this post applies to you, is that if one EBS drive somehow fails, your entire RAID0 volume will fail, effectively corrupting all data on it. If this doesn't worry you (it might not, many setups on AWS aren't filesystem-dependent), you're now free to go. The rest of this post doesn't apply to you. However, I know there are people out there who will be -very- worried by this.

Before I go on, I'd like to note that Adrian Cockcroft mentions they only use 1TB EBS volumes to reduce (or maybe even eliminate) multi-tenancy, which will generate more consistent I/O results.

Increasing fault tolerance of EBS volumes

Amazon states that EBS volumes are 99.5-99.9% reliable over any given year. Compared to a regular physical drive, that's an impressive number. However, it might not be enough for you. You'd probably think that RAID1 can solve that. According to Amazon, you're wrong. EBS volumes are replicated within an Availability Zone, meaning that if the physical hardware behind your EBS volume goes down, your EBS volume will persist somewhere else in the AZ. So RAID1 will not reduce the chance that you lose your data (technically, this isn't true, but let's not go into that).

However, there's something Amazon seems to overlook. An EBS volume might underperform from time to time. If you don't use RAID1, you will have to just wait it out (or build a new volume from a snapshot). If you do use RAID1, you can quickly swap the EBS volume for a brand new one and rebuild the RAID1 array. That gives you complete control!

I myself am using RAID10 to make use of the advantages of both RAID1 and RAID0. But it's something you'll have to figure out for yourself. In fact, in some cases RAID1 might outperform RAID0 (especially when looking at random reads). However, RAID1 writes are always slower than RAID0 writes.
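To put a rough number on the RAID0 risk mentioned earlier: a striped array is lost as soon as any one of its volumes fails. The Python sketch below turns the reliability figures quoted above into an annual failure rate of 0.1% to 0.5% per volume and assumes that volumes fail independently of each other. That is a simplification and not something Amazon states.

```python
def raid0_annual_failure(volume_failure_rate, volumes):
    """Chance that at least one of `volumes` independent volumes fails in a year."""
    return 1.0 - (1.0 - volume_failure_rate) ** volumes


for rate in (0.001, 0.005):          # 99.9% and 99.5% reliable per volume
    for volumes in (1, 4, 8, 16):
        risk = raid0_annual_failure(rate, volumes)
        print("rate %.1f%%, %2d volumes: %.2f%% chance of losing the array"
              % (rate * 100, volumes, risk * 100))
```

Under these assumptions the risk grows almost linearly with the number of volumes in the stripe, which is worth knowing before you scale a RAID0 array out for performance.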

Resilient filesystems

I will get back to this after we're done setting it up, but we're working on moving to Gluster for centralized file storage. We're currently using a robust NFS solution to mount a webroot volume to our webservers, but it's still a single point of failure. Gluster provides us with a way to set up a resilient cluster for file storage, that can scale endlessly. Our plan is to build it on top of RAID10 EBS volumes and replicate across Availability Zones.

In any case, EBS performance shouldn't be too big of an issue. Yes, the latency might not be ideal for every use case, but if that forms a real issue, you're probably better off renting a dedicated server solution anyway.


Setting up your own dynamic CDN with edge locations using Varnish and SSL

As I mentioned earlier in my post about the new SSL functionality for Amazon Cloudfront, there's a possibility to set up your own CDN with "edge" locations. I prefer calling them edgy though, because we're using Amazon regions and not the real Amazon Edge locations that are available. But it will provide us with some flexibility. You will only serve from regions you think you need (thus saving costs) and you can always add your own edge location hosted at a datacenter outside of AWS (for instance, somewhere in Amsterdam).

Please mind that I haven't built the POC environment for this yet. I am fairly confident that the below will work, but please comment if you don't agree.

Basically, what we want is to send visitors to the content location nearest to them. On these locations, we will cache static content and try to cache dynamic content as much as possible, while being able to serve content through SSL. Take a look at this sketch for a visual image of what we'll try to do:



For the DNS, we will of course use Amazon's Route53. Not only does Route53 serve clients from the nearest possible location (read: location with lowest latency), it can also do health checks and route to the endpoint with the lowest possible latency. Read more about latency based routing in the AWS docs. Set it up to include your edge locations and monitor the health of these locations.
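The selection we rely on Route53 for boils down to "the healthy endpoint with the lowest latency". The Python sketch below only mimics that decision to show what we expect from the DNS layer. The endpoint names and latency figures are made up, and this is not how Route53 is implemented or configured.

```python
def pick_endpoint(endpoints):
    """Return the name of the healthy endpoint with the lowest latency, or None."""
    healthy = [e for e in endpoints if e["healthy"]]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e["latency_ms"])["name"]


# Hypothetical view for a visitor in the Netherlands
edges = [
    {"name": "edge-eu-west-1", "latency_ms": 25, "healthy": True},
    {"name": "edge-us-east-1", "latency_ms": 95, "healthy": True},
    {"name": "edge-amsterdam", "latency_ms": 8, "healthy": False},
]

chosen = pick_endpoint(edges)
print(chosen)
```

The Amsterdam location would be the fastest, but because its health check fails the visitor is sent to the next best location instead.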

The Edge locations

This is where it gets interesting. There are a few possible ways to go. You can set up a simple Apache/nginx server to host your content, but you will have to worry about keeping copies of your content on every server. It's possible, but it might not be as easy to use. Besides, it will not provide a way to easily serve dynamic content.

I've chosen a Varnish based caching solution for this specific use case, because it's very flexible and provides a lot of tweaking options. Besides, Varnish will perform well on a relatively "light" server. Varnish will not be able to handle SSL termination though, so we will use nginx as a proxy to offload SSL. You can read how to Offload SSL using nginx in their wiki.

Setting up your specific Varnish environment is outside the scope of this article, because there are too many use cases to put into one single article. I will provide a few things to consider though.

Let nginx only handle SSL traffic

Varnish is perfectly able to handle unencrypted traffic. So nginx should only listen to port 443. Varnish can listen to port 80.

Use auto-scaling

For some, this is a no-brainer. I think it's best practice to always use auto-scaling. You will probably not want to scale out your edge location, but you do want to automatically terminate unhealthy EC2 instances and fire up a new one. Something to consider here is that you will have to work around the fact that normally, you will not be able to keep the same IP address for your replacement instance. A possible workaround is using ELB, but you will need an ELB for every region you're using and that will cost you more than the instance itself. There's a possibility to "detach" an Elastic IP on termination and attach it again in the launch sequence of your new EC2 instance, but I don't have a ready-to-go script for that solution (yet).
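For the Elastic IP approach, the launch-time step could look roughly like the Python sketch below. It is untested against a real account and makes a few assumptions: the Elastic IP is a VPC address that was allocated beforehand, the function receives a boto3 EC2 client, and the instance ID and allocation ID shown in the comment are placeholders. The function name claim_elastic_ip is made up for this example.

```python
def claim_elastic_ip(ec2_client, instance_id, allocation_id):
    """Attach a pre-allocated Elastic IP to the instance that is starting up.

    `ec2_client` is expected to behave like boto3.client("ec2").
    """
    response = ec2_client.associate_address(
        InstanceId=instance_id,
        AllocationId=allocation_id,
        # Take the address over even if it is still attached to the
        # instance that is being terminated.
        AllowReassociation=True,
    )
    return response.get("AssociationId")


# Usage on a freshly launched instance (placeholder IDs, needs boto3 and
# an IAM role that allows ec2:AssociateAddress):
#
#   import boto3
#   ec2 = boto3.client("ec2", region_name="eu-west-1")
#   claim_elastic_ip(ec2, "i-0123456789abcdef0", "eipalloc-0123456789abcdef0")
```

With AllowReassociation set, there is no need for a separate "detach" step on termination: the new instance simply takes the address over.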

Consider whether dynamic content can be cached

If you have a session-based website with lots of changing data, it might not pay off to try and cache the dynamic data. If so, use your CDN purely for the static content on another domain. The CDN step will add latency on cache-misses, so if the miss rate is very high, you might be better off querying your server directly for dynamic content. If, for instance, you use content.example.com as your CDN URL and point Varnish to www.example.com as your origin, you can set your application to use content.example.com as domain for all static file references (images, javascripts, stylesheets) and www.example.com for all other URLs.
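In the application, the split between the two domains can be as simple as a helper that picks the host based on the file extension. The Python sketch below is only an illustration: the helper name url_for, the list of extensions and the use of https are assumptions you would adapt to your own application.

```python
import posixpath
from urllib.parse import urlsplit

CDN_HOST = "content.example.com"   # served through Varnish
ORIGIN_HOST = "www.example.com"    # dynamic pages, straight to the origin
STATIC_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".gif", ".ico"}


def url_for(path):
    """Build an absolute URL, sending static files to the CDN domain."""
    # Look only at the path part, so query strings like ?v=3 are ignored
    extension = posixpath.splitext(urlsplit(path).path)[1].lower()
    host = CDN_HOST if extension in STATIC_EXTENSIONS else ORIGIN_HOST
    return "https://" + host + path


print(url_for("/img/logo.png"))     # https://content.example.com/img/logo.png
print(url_for("/css/site.css?v=3")) # https://content.example.com/css/site.css?v=3
print(url_for("/cart?step=2"))      # https://www.example.com/cart?step=2
```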

Distributing the SSL certificate

Your servers can run solely on ephemeral storage, thanks to the auto-scaling setup. However, one thing that needs to be consistently spread across your endpoints is the SSL certificate itself. I suggest using S3 for this. Give your instances an IAM role (instance profile) that is allowed to read from the bucket where you store your certificates and have them pull the necessary files from S3 on launch. This can also be done for the nginx and Varnish config files if you like.
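Pulling the files on launch could look something like the Python sketch below. It assumes the instance already has read access to the bucket and that the function is handed a boto3 S3 client. The bucket name, object keys and target paths are placeholders, and the function name fetch_certificates is made up for this example.

```python
import os

# Hypothetical object keys in the bucket, mapped to their local destinations
CERT_FILES = {
    "certs/example.com.crt": "/etc/nginx/ssl/example.com.crt",
    "certs/example.com.key": "/etc/nginx/ssl/example.com.key",
}


def fetch_certificates(s3_client, bucket, files=CERT_FILES):
    """Download every file and restrict access to the owner.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    for key, destination in files.items():
        os.makedirs(os.path.dirname(destination), exist_ok=True)
        s3_client.download_file(bucket, key, destination)
        os.chmod(destination, 0o600)  # the private key must not be world-readable
    return sorted(files.values())


# Usage in the launch script (placeholder bucket name, needs boto3):
#
#   import boto3
#   fetch_certificates(boto3.client("s3"), "my-certificate-bucket")
```

Run it before nginx starts, so the certificate is in place when the server first reads its configuration.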

The origin

The origin can be anything you like, it doesn't even have to be an AWS-hosted setup. But it can also be an S3 bucket or simply your website on a loadbalanced EC2 setup. If, for instance, your origin serves a very heavy PHP-website using Apache (like a Magento webshop), you will reduce the load on Apache tremendously by not having to serve all those small static files, but only do the heavy lifting. I've seen examples of heavy loadbalanced setups that could be reduced to half their size by simply using a CDN.


Serving dynamic content using Cloudfront

As I mentioned earlier, it's possible to serve dynamic content using Cloudfront. Which is wonderful, because this means Cloudfront emerged from being a "simple" CDN to being an actual caching solution for your entire website. There are a few things to keep in mind though.

Misses and hits

Cloudfront is a caching mechanism. In fact, I wouldn't be surprised if it's based on something like the proven Varnish. So it works with misses and hits. If it doesn't find the content the visitor is looking for, it will count as a miss and put in a request to the origin server. After this, it will cache the missed fragment. If your miss rate is very high, your website will in fact be slower for most of your visitors. If you can't properly cache your website, you're probably better off not using Cloudfront. However, a hit will be very fast. Some websites can produce a hit rate of over 99%, which means they serve almost every visitor from an edge location, while the origin can remain at rest.
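A quick back-of-the-envelope calculation shows why the miss rate matters so much. The Python sketch below assumes that a hit costs one round trip to the edge, and that a miss costs that same round trip plus the time the edge needs to fetch the page from the origin. The latency figures are invented for illustration.

```python
def average_latency_ms(hit_rate, edge_ms, origin_ms):
    """Expected response time when a miss costs an extra trip to the origin."""
    miss_rate = 1.0 - hit_rate
    return hit_rate * edge_ms + miss_rate * (edge_ms + origin_ms)


EDGE_MS = 20      # visitor to edge location (made-up figure)
ORIGIN_MS = 200   # edge location to origin, including page generation
DIRECT_MS = 150   # visitor straight to the origin, without Cloudfront

for hit_rate in (0.99, 0.80, 0.20):
    print("hit rate %2.0f%%: %.0f ms on average (direct: %d ms)"
          % (hit_rate * 100, average_latency_ms(hit_rate, EDGE_MS, ORIGIN_MS), DIRECT_MS))
```

With these numbers a 99% hit rate gives an average of 22 ms, while a 20% hit rate gives 180 ms, which is slower than skipping Cloudfront altogether.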

Cookies and sessions

If your website serves content that is visitor-specific (like shopping carts or an account page), you will have to specify the cookies that are used to identify the session. If you don't, the cart of the first user that visited a page will be cached and displayed to every other user. If Cloudfront knows about these session cookies, it will be able to store a version of the page for each individual visitor. If you display the shopping cart on every page though, this might add overhead you'll want to avoid. If that's the case, loading the cart in a separate request using AJAX might be a better way to go, so that the majority of the page can be cached once for all users while retaining the website's dynamic nature.
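Conceptually, the cookies you specify become part of the cache key. The Python sketch below illustrates that idea. It is not Cloudfront's actual implementation, and the cookie names are invented.

```python
def cache_key(path, cookies, forwarded_cookies=("session_id",)):
    """Build a cache key from the path plus only the cookies that matter."""
    relevant = tuple(sorted(
        (name, value) for name, value in cookies.items()
        if name in forwarded_cookies
    ))
    return (path, relevant)


alice = {"session_id": "abc", "tracking": "111"}
bob = {"session_id": "xyz", "tracking": "222"}

# With the session cookie in the key, every visitor gets their own copy
per_visitor = cache_key("/cart", alice) != cache_key("/cart", bob)

# Without it, one cached copy is shared by everyone: fine for a product
# page, disastrous for a shopping cart
shared = cache_key("/product/1", alice, ()) == cache_key("/product/1", bob, ())
```

The same sketch also shows the overhead: every distinct session value creates a separate cached copy, which is why moving the cart to its own AJAX request keeps the rest of the page cacheable for everyone.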

Page expiry

You can handle expiration of pages entirely within the origin of your website. By default, Cloudfront will assume your objects or pages expire after 24 hours and will check for updates once a day. If your website is pretty much static, this could be fine for you. Every page will have one "slow hit" per day and that's it. However, many websites will require a much lower setting because the underlying data changes. You can set the max-age on your cache control header value on every page and Cloudfront will respect that. So if you have a very busy blog with commenting functionality, you could set the max-age to 600 seconds (10 minutes) on your homepage and to 10 seconds on your post pages for instance. If every post is visited every second, that will reduce load time for 90% of your visitors. But in this case, you could also consider loading comments through AJAX, reducing the expiration time needed (in fact, the expiration time could be very high for posts).
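The 90% figure follows from simple arithmetic: within every max-age window, one request has to go to the origin and the rest are served from the cache. The Python sketch below builds the header and estimates that fraction. It assumes evenly spread traffic and a single edge location, so real numbers will be lower once visitors are spread over several edge locations. Both function names are made up for this example.

```python
def cache_control(max_age_seconds):
    """Header to send from the origin for a page."""
    return {"Cache-Control": "max-age=%d" % max_age_seconds}


def cached_fraction(requests_per_second, max_age_seconds):
    """Share of requests served from cache, with one origin fetch per window."""
    requests_per_window = requests_per_second * max_age_seconds
    if requests_per_window <= 1:
        return 0.0  # every request finds an expired (or missing) copy
    return 1.0 - 1.0 / requests_per_window


print(cache_control(600))        # homepage: {'Cache-Control': 'max-age=600'}
print(cache_control(10))         # post pages: {'Cache-Control': 'max-age=10'}
print(cached_fraction(1, 10))    # one visit per second, 10 s max-age: 0.9
```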


If you have an e-commerce website, you probably handle (parts of) user requests through SSL. Custom SSL domains with Cloudfront come at a price though, so you will want to think this through. An option might be to use a URL like secure.example.com for your encrypted pages and send those requests directly to your origin, while serving "unsecure" pages through Cloudfront.


Amazon AWS Cloudfront now supports custom SSL domains, will we use it?

As you may know, Amazon Cloudfront is a great service that provides you with the possibility to serve both static and dynamic bits of content from an edge location near the end user. I'll do an article on how to optimize Cloudfront for dynamic content later, but I'd like to talk about a new feature that Amazon presented a month ago.

If you're using Cloudfront as a traditional CDN, you'll probably have a CNAME configured at content.yourdomain.com or static.yourdomain.com or similar, pointing to Cloudfront. For websites running on HTTP, that's perfectly fine. However, if you're using HTTPS, up until a month ago, this would not have been possible. Amazon didn't provide customers with a possibility to upload an SSL certificate for their CDN domain.

However, that has changed. As of mid-June 2013, Amazon supports what they call "Custom SSL certificates", basically enabling you to upload your own SSL certificate that will be distributed across all edge locations.

There is a downside though, which is the cost of this feature. It amounts to a whopping $600.- per certificate per month (pro-rated by the hour, of course). For us, this would mean a 40% increase in cost for our entire AWS infrastructure, which is why we opted not to implement it. We're continuing use of our nginx-based EC2 server as our CDN. We'd love to serve our static content from edge locations, but not at a 40% cost increase.

If you don't mind using a .cloudfront.net subdomain for your static content, you can of course use Amazon's wildcard SSL certificate at a slightly higher rate per 10.000 requests. For many companies, this will do fine.

Update: Amazon has updated its announcement to explain the high cost of this feature. They state the following:

Some of you have expressed surprise at the price tag for the use of SSL certificates with CloudFront. With this custom SSL certificate feature, each certificate requires one or more dedicated IP addresses at each of our 40 CloudFront locations. This lets us give customers great performance (using all of our edge locations), great security (using a dedicated cert that isn’t shared with anyone else) and great availability (no limitations as to which clients or browsers are supported). As with any other CloudFront feature, there are no up-front fees or professional services needed for setup. Plus, we aren’t charging anything extra for dynamic content, which makes it a great choice for customers who want to deliver an entire website (both static and dynamic content) in a secure manner.

The thing is, for $600.- per month, I could rent more than 40 on-demand micro instances, each with its own elastic (dedicated) IP. If you'd spread 8 heavy-reserved small instances over all major regions, you'd be able to use Route53's latency-based routing and it would probably cost you less than $150.- per month (traffic not included). Latency might not be as low as with CloudFront, but I think it's definitely something I'd consider if a client wants to lower its global latency.

I'll do a post about this as well in the near future.
