Jaap Haagmans The all-round IT guy

30 Jul 2013

Increasing EBS performance and fault tolerance using RAID

I normally say you should treat your EC2 instances and EBS data as disposable, but this is not always possible. Some setups simply cannot use S3 for their "dynamic" file storage (e.g. because legacy software packages depend heavily on file system storage). In these situations, relying on snapshots alone might not be sufficient, as the downtime while restoring from one can be quite high.

Increasing performance of EBS

EBS performance is often increased using RAID0, also called striping. Data is distributed over multiple volumes, increasing I/O capabilities. You can scale a RAID0 setup up to 16 drives on Windows, or even more on Linux. Many AWS users employ this technique and report that it performs well.
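To make the striping idea concrete, here is a minimal sketch of how a RAID0 layout could map a logical block to a volume. It is only a model of the concept: the function name `locate_block` and the `chunk_blocks` parameter are made up for illustration, and real implementations (such as Linux software RAID) have their own chunk sizes and layouts.

```python
from typing import Tuple


def locate_block(logical_block: int, num_volumes: int, chunk_blocks: int = 1) -> Tuple[int, int]:
    """Map a logical block to (volume index, block on that volume) in a RAID0 stripe.

    Hypothetical helper for illustration only; chunk_blocks is the number of
    consecutive blocks written to one volume before moving on to the next.
    """
    if logical_block < 0 or num_volumes < 1 or chunk_blocks < 1:
        raise ValueError("block must be >= 0, volumes and chunk size must be >= 1")
    # Which chunk does the block fall in, and where inside that chunk?
    chunk_index, offset_in_chunk = divmod(logical_block, chunk_blocks)
    # Chunks are dealt out to the volumes round-robin, one full stripe at a time.
    stripe_index, volume = divmod(chunk_index, num_volumes)
    return volume, stripe_index * chunk_blocks + offset_in_chunk


if __name__ == "__main__":
    # Eight consecutive blocks on a four-volume array land on volumes 0..3 twice over.
    print([locate_block(block, num_volumes=4)[0] for block in range(8)])
```

Because consecutive chunks land on different volumes, a large sequential read or write keeps all volumes busy at once, which is where the extra I/O capacity comes from.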

If the first part of this post applies to you, what should worry you is that if a single EBS volume somehow fails, your entire RAID0 array fails with it, effectively corrupting all data on it. If this doesn't worry you (it might not, as many setups on AWS aren't filesystem-dependent), you're free to go; the rest of this post doesn't apply to you. However, I know there are people out there who will be -very- worried by this.
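A rough way to see how this risk grows with the number of volumes: if each volume fails independently with some annual probability, the array is lost as soon as any one of them fails. The sketch below assumes independent failures (my assumption, not something Amazon states) and uses the 0.1% to 0.5% annual failure rate implied by the reliability range quoted further down. The function name is made up for illustration.

```python
def raid0_annual_failure(per_volume_afr: float, volumes: int) -> float:
    """Probability that at least one of `volumes` members fails within a year.

    Assumes failures are independent, which is a simplification: volumes in
    the same Availability Zone may well fail together.
    """
    if not 0.0 <= per_volume_afr <= 1.0 or volumes < 1:
        raise ValueError("afr must be in [0, 1] and volumes must be >= 1")
    # The array survives only if every single member survives.
    return 1.0 - (1.0 - per_volume_afr) ** volumes


if __name__ == "__main__":
    for count in (1, 2, 4, 8, 16):
        best = raid0_annual_failure(0.001, count)
        worst = raid0_annual_failure(0.005, count)
        print(f"{count:2d} volumes: {best:.2%} to {worst:.2%} chance of losing the array per year")
```

Under these assumptions the risk scales almost linearly with the number of volumes, so a wide stripe is noticeably more fragile than a single volume.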

Before I go on, I'd like to note that Adrian Cockcroft mentions they only use 1TB EBS volumes to reduce (or maybe even eliminate) multi-tenancy, which will generate more consistent I/O results.

Increasing fault tolerance of EBS volumes

Amazon states that EBS volumes are 99.5-99.9% reliable over any given year (an annual failure rate of 0.1-0.5%). Compared to a regular physical drive, that's an impressive number. However, it might not be enough for you. You'd probably think that RAID1 can solve that. According to Amazon, you're wrong. EBS volumes are replicated within an Availability Zone, meaning that if the physical hardware behind your EBS volume goes down, your EBS volume will persist somewhere else in the AZ. So RAID1 will not reduce the chance that you lose your data (technically, this isn't true, but let's not go into that).

However, there's something Amazon seems to overlook. An EBS volume might underperform from time to time. If you don't use RAID1, you will have to just wait it out (or build a new volume from a snapshot). If you do use RAID1, you can quickly swap the EBS volume for a brand new one and rebuild the RAID1 array. That gives you complete control!
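As a sketch of what such a swap could look like, the snippet below builds the command lines for replacing a member of a RAID1 array. It assumes Linux software RAID managed with `mdadm`, which this post doesn't prescribe, and the array and device names (`/dev/md0`, `/dev/xvdf`, `/dev/xvdh`) are placeholders. It only prints the commands rather than running them, and it assumes the new EBS volume has already been created and attached to the instance.

```python
from typing import List


def replacement_commands(array: str, slow_device: str, new_device: str) -> List[List[str]]:
    """Return the mdadm invocations to swap one RAID1 member for a fresh volume.

    Illustrative helper: device names are whatever your instance exposes.
    """
    return [
        ["mdadm", array, "--fail", slow_device],    # mark the underperforming member as faulty
        ["mdadm", array, "--remove", slow_device],  # detach it from the array
        ["mdadm", array, "--add", new_device],      # add the new volume; the mirror resyncs onto it
    ]


if __name__ == "__main__":
    # Placeholder names; review the commands before running anything as root.
    for command in replacement_commands("/dev/md0", "/dev/xvdf", "/dev/xvdh"):
        print(" ".join(command))
```

While the mirror rebuilds, the array keeps serving reads and writes from the remaining healthy volume, which is the control the paragraph above is talking about.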

I myself am using RAID10 to make use of the advantages of both RAID1 and RAID0. But it's something you'll have to figure out for yourself. In fact, in some cases RAID1 might outperform RAID0 (especially when looking at random reads). However, RAID1 writes are always slower than RAID0 writes.
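For reference, a RAID10 array over four EBS volumes could be described along these lines. This is a sketch under assumptions, not my exact setup: it assumes `mdadm` on Linux, the device names are placeholders, and the helper functions are made up for illustration. The sketch insists on an even number of at least four devices (the classic mirrored-stripes layout); `mdadm` itself is more permissive.

```python
from typing import List


def raid10_create_command(array: str, devices: List[str]) -> List[str]:
    """Build the mdadm command that would create a RAID10 array from `devices`."""
    if len(devices) < 4 or len(devices) % 2 != 0:
        raise ValueError("this sketch expects an even number of devices, at least four")
    return ["mdadm", "--create", array, "--level=10",
            f"--raid-devices={len(devices)}", *devices]


def raid10_usable_gib(volume_gib: int, device_count: int) -> int:
    """Usable capacity with two-way mirroring: half of the raw total."""
    return volume_gib * device_count // 2


if __name__ == "__main__":
    devices = ["/dev/xvdf", "/dev/xvdg", "/dev/xvdh", "/dev/xvdi"]  # placeholder names
    print(" ".join(raid10_create_command("/dev/md0", devices)))
    print(raid10_usable_gib(100, len(devices)), "GiB usable out of", 100 * len(devices))
```

The trade-off is visible in the capacity figure: you pay for twice the storage you can use, in exchange for striped performance plus the ability to swap out a misbehaving volume.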

Resilient filesystems

I will get back to this after we're done setting it up, but we're working on moving to Gluster for centralized file storage. We're currently using a robust NFS solution to mount a webroot volume on our webservers, but it's still a single point of failure. Gluster gives us a way to set up a resilient file storage cluster that can scale out by adding nodes. Our plan is to build it on top of RAID10 EBS volumes and replicate across Availability Zones.

In any case, EBS performance shouldn't be too big of an issue. Yes, the latency might not be ideal for every use case, but if that is a real issue for you, you're probably better off renting a dedicated server anyway.
