Jaap Haagmans The all-round IT guy


Databases in a multi region setup

When it comes to hosting your server cluster in multiple datacenters, one of the things most people have difficulty with is the database setup. I'll do a more generic "do's and don'ts" article on multi-region setups later on, but this is something I've run into myself on numerous occasions (also outside of AWS). Note that this article focuses on MySQL-based use cases, but I'll also tell you why you should consider NoSQL in some situations.

Before talking about complicated forms of master-master replication, I'll provide you with some alternatives that might fit your own use case.

What Amazon does

For its own e-commerce platform, Amazon runs geo-specific websites. In the US, the website runs on Amazon.com; in the EU it runs on Amazon.co.uk and Amazon.de. This in part eliminates the need to have a one-on-one copy of their database available in every region. Amazon also does edge caching using CloudFront to speed up the websites worldwide (which is why Amazon.com feels fast from the EU until you log in or order something). However, if placing an order takes too long, you're shown a thank-you page telling you to wait for confirmation by e-mail (meaning your order has been queued).

Of course, this may not be the scenario you were looking for. Most medium-sized e-commerce companies don't have a website for every region. They just want to manage their website from one central location, but they also want someone from the other end of the world to experience a fast website.

Database slave nodes across the globe

If your application is built so that database writes only occur from the backend, which is managed by the company (e.g. a purely informational website with a CMS), you can set up slave nodes in multiple regions. Writes to the master database are propagated to its slaves and the new data is available worldwide.

As before, this scenario only works in specific use cases, but it can be useful for some. In this particular case you could also consider serving your entire website through CloudFront, but that may be more expensive (especially with (multiple) websites running over SSL), and if your website is updated irregularly, it might require you to alter your change workflow to protect consistency.
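A minimal sketch of this read/write split, assuming hypothetical endpoint names (the routing logic would normally live in your application's database layer):

```python
# Route all writes to the single master; route reads to the replica in the
# client's region, falling back to the master if no local replica exists.
# All endpoint names below are made up for illustration.

ENDPOINTS = {
    "master": "master.eu-west-1.example.com",
    "replicas": {
        "eu-west-1": "replica.eu-west-1.example.com",
        "us-east-1": "replica.us-east-1.example.com",
        "ap-southeast-1": "replica.ap-southeast-1.example.com",
    },
}

def pick_endpoint(operation: str, client_region: str) -> str:
    """Writes always go to the master; reads prefer the nearest replica."""
    if operation == "write":
        return ENDPOINTS["master"]
    return ENDPOINTS["replicas"].get(client_region, ENDPOINTS["master"])
```

Because all writes funnel through one master, slaves never conflict; the trade-off is that a visitor far from the master region still pays the latency cost on any write.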

Optimize your application for multiple regions

You can build an application that is optimized for running in multiple regions. You will have to dig a little outside the scope of AWS for this, but the advantages are tremendous. Take Twitter as an example. A tweet can't be edited, only deleted. This means Twitter can (if necessary) queue the creation of tweets. Using AJAX, the tweet is displayed immediately, but the database write can take a little longer. This eases and spreads the load on the database servers, and because the user doesn't have to wait for the actual transaction, tweeting feels very fast (even though the process might occasionally take up to minutes).

Yeah but, you might say, Twitter must have some kind of super-database behind it that can easily replicate across the globe. You might be surprised to hear that Twitter actually uses MySQL to store tweets. Granted, it uses some very fine-grained, sustainable replication and distribution techniques (named Gizzard) and it aggregates and caches a lot of data using other tools (like FlockDB and memcached), but MySQL is still the main data store.
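The queueing idea described above can be sketched in a few lines. This is an in-process toy, not Twitter's actual pipeline: the user-facing call returns immediately, while a background worker performs the (possibly slow) database write later:

```python
import queue
import threading

write_queue = queue.Queue()
stored = []  # stands in for the real tweets table

def post_tweet(text: str) -> str:
    write_queue.put(text)   # enqueue the write...
    return text             # ...and render the tweet to the user right away

def db_worker():
    while True:
        text = write_queue.get()
        if text is None:            # shutdown sentinel
            break
        stored.append(text)         # the actual INSERT would happen here
        write_queue.task_done()

worker = threading.Thread(target=db_worker)
worker.start()
post_tweet("hello world")
write_queue.join()                  # only the demo waits; production code wouldn't
write_queue.put(None)
worker.join()
```

The key property is that `post_tweet` never blocks on the database, so user-perceived latency is decoupled from write latency.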

Also, take a look at the other end of the spectrum, where you have the fully distributed database model of Bitcoin transactions. There's no master database; every wallet has a copy of the transaction database. All connected clients are notified of any transaction, and they in turn notify their peers. This is how the database is synchronized across the globe.

Global replication with multiple masters

However, if you must use global replication, it's not impossible, although it might take some time to architect and set up. I'd like to take PayPal as an example. PayPal uses a transaction-based consistency model, meaning that everything can be rebuilt from transactions. PayPal also uses replication across four AWS regions. Let's say it stores your current credit level in a table. If two people send you money at the same time, the transactions may be processed in two different regions. In region 1 your credit is set to 30, in region 2 it's set to 20. When those two updates meet, an inconsistency is detected and the real credit can be recalculated from the underlying transactions (which are always unique). This is how PayPal can run a multi-master setup on MySQL.
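The reconciliation step can be sketched as follows. Assume a starting credit of 10, with one sender adding 20 (processed first in region 1, which derives a credit of 30) and another adding 10 (processed first in region 2, which derives 20). The numbers and transaction IDs are illustrative, not PayPal's actual schema:

```python
STARTING_CREDIT = 10

# Each transaction is unique and immutable; the derived credit is disposable.
transactions = [
    {"id": "tx-A", "amount": 20},  # applied first in region 1 -> credit 30
    {"id": "tx-B", "amount": 10},  # applied first in region 2 -> credit 20
]

def rebuild_credit(start, txs):
    # Deduplicate by transaction id, then sum: the result is the same no
    # matter which region's copy of the log we merge from.
    unique = {t["id"]: t["amount"] for t in txs}
    return start + sum(unique.values())

# Merging both regions' logs yields the true credit: 10 + 20 + 10 = 40,
# even though neither region's derived value (30 or 20) was correct.
```

The design choice here is that derived state (the credit column) is treated as a cache that can always be thrown away and rebuilt from the append-only transaction log.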

Master-master replication in MySQL is usually set up so that each node is a slave and a master at the same time. PayPal uses replication in a circular topology: the MySQL node in region 1 is master to the node in region 2 (its slave), which is master to the node in region 3, which is master to the node in region 4, which in turn is master to the node in region 1. This won't sound appealing to every database administrator, because the round-trip time for a write can be quite high. In PayPal's case this is fine, because every transaction is unique (and, like tweets on Twitter, can't be edited), meaning an eventually consistent model suits them. But what if you have a typical MySQL/PHP application that lets MySQL handle ID generation using auto_increment? Two inserts into the same table on different nodes *will* cause a replication error. If the round-trip time for write consistency is 2 seconds (which could be quite conservative), you can imagine this happening all the time on a fairly busy website.

Now, AWS multi-AZ RDS instances don't have this problem. Yes, the replication is done using a master-master setup, but the second DB instance is a hot spare, meaning writes to it only happen when the other database is down. A multi-region setup with four active masters, however, will have this problem. This is where MySQL's auto_increment_increment comes in. Set this in my.cnf on the first node:

auto_increment_increment = 4
auto_increment_offset = 1

On the second node:

auto_increment_increment = 4
auto_increment_offset = 2

Et cetera. This ensures MySQL increases every ID by the number of nodes available (4 in this case), starting from each node's offset. The IDs on node 1 will be 1, 5, 9, 13, 17, ... and on node 2 the range will be 2, 6, 10, 14, 18, ... It also ensures that when a node can't replicate for some reason, it can "catch up" later without generating additional replication errors.
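A quick sketch of how these two settings interleave the ID space (the arithmetic below mirrors what MySQL does internally):

```python
# Each node hands out: offset, offset + increment, offset + 2*increment, ...
# mirroring auto_increment_offset and auto_increment_increment.

def node_ids(offset: int, increment: int, count: int) -> list[int]:
    return [offset + i * increment for i in range(count)]

node1 = node_ids(1, 4, 5)  # node 1 of 4: [1, 5, 9, 13, 17]
node2 = node_ids(2, 4, 5)  # node 2 of 4: [2, 6, 10, 14, 18]

# The ranges never overlap, so concurrent inserts on different nodes
# cannot collide on the primary key.
assert not set(node1) & set(node2)
```

Note that this caps your topology at the chosen increment: adding a fifth active master later means reconfiguring (and carefully re-offsetting) every node.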

However, this will not prevent every replication error. Say your application has to create invoices whose IDs must increase by exactly 1 for every invoice (tax regulations in many countries stipulate this). With multiple databases, you can't simply take the last invoice ID and add one, because another node might do the same before the changes have reached it. You also can't use auto_increment, because the invoice numbers won't follow up. You will have to design your application around these kinds of issues. One way to go is to set up some kind of invoice queue (SQS can be used for this) that creates these invoices one at a time, making sure invoice IDs follow up. This is an asynchronous process though, meaning you can't immediately send a response to the client (although you could use NodeJS to simulate it).

Another way to go would be to set up an ID distribution agent, which every node can ask for a new ID. It makes sure it hands out every ID only once and can also be set up to support multiple nodes (checking the new ID with the other nodes before giving it out). You will have to take into account that a MySQL insert can still fail (through omission of a value that can't be NULL, for instance) after the ID has already been granted. So your transaction validation has to be very thorough and you should incorporate a rollback scenario (meaning, for instance, that the ID can be reallocated by the distribution agent).
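A bare-bones sketch of such an agent, including the rollback path mentioned above (a single-node toy; a real agent would persist its state and coordinate with peers):

```python
class IdAgent:
    """Hands out each ID exactly once; failed inserts can return theirs."""

    def __init__(self) -> None:
        self.next_id = 1
        self.returned: list[int] = []  # IDs freed by failed inserts

    def allocate(self) -> int:
        # Reallocate returned IDs first so the sequence stays gap-free.
        if self.returned:
            return self.returned.pop(0)
        allocated = self.next_id
        self.next_id += 1
        return allocated

    def rollback(self, failed_id: int) -> None:
        # The insert failed (e.g. a NOT NULL violation) after the ID was
        # granted; make the ID available again instead of leaving a gap.
        self.returned.append(failed_id)

agent = IdAgent()
a = agent.allocate()   # 1
b = agent.allocate()   # 2
agent.rollback(a)      # the insert that claimed ID 1 failed
c = agent.allocate()   # 1 again, reallocated
```

One caveat with reallocation: if strict chronological ordering of IDs matters (as it can for invoices), a reused ID will be out of order, so you may need to hold back later allocations until the gap is resolved.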

If your MySQL replication setup mainly targets failover scenarios, you might not run into these problems, but it's still something to think about.