Setting up a fault-tolerant site using Amazon’s Availability Zones
Amazon’s Availability Zones are a fabulous new feature that allows users to assign instances to locations that are highly failure-isolated from one another yet have very high bandwidth between them. I wish I could have done something like that as easily when I was responsible for operations at Citrix Online and we had 5 datacenters worldwide. As I’ll explain in this post, what Amazon actually provides is much better than just putting servers into multiple datacenters.
The most confusing thing about availability zones is the name: in the cloud, what exactly is an “availability zone”? The easiest way to think about it is that a zone equals a datacenter. If power goes out in one datacenter and the generators fail to start (naah, that never happens…) then it doesn’t affect the other datacenter. Or if there’s a fire, one datacenter may burn out or be otherwise incapacitated, but others are unaffected. In reality zones don’t necessarily correspond to datacenters. Given careful engineering, it’s possible to have multiple “rooms” in a datacenter that are highly failure-isolated while technically still being part of the same datacenter (imagine rooms the size of football fields here).
The point of availability zones is the following: if I launch a server in zone A and a second server in zone B, then the probability that both go down at the same time due to an external event is extremely small. This simple property allows us to construct highly reliable web services by placing servers into multiple zones such that the failure of one zone doesn’t disrupt the service or at the very least, allows us to rapidly reconstruct the service in the second zone.
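To put a rough number on “extremely small,” here is a back-of-the-envelope sketch in Python. It assumes that zone failures are statistically independent and it uses a made-up per-zone outage probability; neither the independence assumption nor the figure comes from Amazon.

```python
def joint_failure_probability(p_zone_a: float, p_zone_b: float) -> float:
    """Probability that both zones are down at once, assuming independent failures."""
    for p in (p_zone_a, p_zone_b):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    return p_zone_a * p_zone_b


# Hypothetical figure for illustration only: each zone is down 0.1% of the time.
p_single = 0.001
p_both = joint_failure_probability(p_single, p_single)
print(f"one zone down: {p_single:.4%}, both zones down: {p_both:.6%}")
```

Events that hit a whole region at once would break the independence assumption, so treat the result as a best case rather than a guarantee.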
The one caveat to consider when using multiple zones is that there is no free lunch (you knew there was a catch, didn’t you?). First of all there’s the speed of light. The zones Amazon is exposing are all on the East coast (indicated by the names, such as “us-east-1a”). I don’t have inside information about the location of their facilities, but I imagine some may be in New York and others may be in Virginia, so the distance between zones may be considerable, which translates into some network latency. And even if the actual facilities used by EC2 today are not that far apart, they may be someday in the future.
The second “gotcha” is that bandwidth across zone boundaries is not free. Amazon is charging $0.01/GB for what they call “regional” traffic. This is less than 1/10th the cost of Internet traffic, which seems perfectly reasonable to me. In the days when I was managing multiple datacenters the cost of traffic between them was essentially the same as the cost of random Internet traffic. Actually, I take that back, it cost twice as much: once to exit one datacenter and once to enter the other. (Granted, at high volume one can do interesting things to save some money, but it doesn’t become free by a long shot.)
An example
Enough talk, let’s show a diagram of what a simple redundant web site looks like with Availability Zones and Elastic IPs. At the core we’ll have two web servers (e.g. with Apache and PHP) running the web application and accessing the master database. All this occurs in one zone. We’ll allocate two Elastic IP addresses, assign them to the two web servers, and then create a DNS entry for our web site that maps the domain name to both IP addresses (this is commonly called “round-robin DNS”).
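For readers who haven’t met round-robin DNS before, the effect can be sketched in a few lines of Python. This is a simplification: a real DNS server returns all the records and rotates their order, and clients pick from the list. The two addresses are hypothetical stand-ins for the Elastic IPs, taken from a documentation-only address range.

```python
from itertools import cycle

# Hypothetical Elastic IPs (from the documentation-only 203.0.113.0/24 range).
ELASTIC_IPS = ["203.0.113.10", "203.0.113.11"]


def round_robin(addresses):
    """Hand out addresses in rotation, so successive clients land on different servers."""
    return cycle(addresses)


resolver = round_robin(ELASTIC_IPS)
first_four = [next(resolver) for _ in range(4)]
print(first_four)
```

Successive lookups alternate between the two web servers, which spreads the load without a dedicated load balancer.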

In order to ensure the survival of the data in the case of a massive failure, we start a slave database in a second availability zone and replicate the data in real time. This is how we’ve set up all our customers to date, except that up until now we haven’t been able to specify the placement of the slave with respect to the master. In the RightScale Dashboard the zone of each server is shown, and at server launch time one can select the desired zone.
Now suppose the zone with the web servers and database fails due to a fire! After receiving an alert, we first promote the slave in the second zone to master using the RightScale Manager for MySQL automation. We then launch fresh web/app servers in the same zone as the slave database. Once the promotion completes and the two new servers are up, it is a simple matter of reassigning the Elastic IPs to the two new servers to redirect all the users to the new servers and we’re up and running again.

The next step is to recreate the redundancy and for this the third availability zone that each account has access to comes into play. We start a fresh database slave in the third zone again using the automation in the Manager for MySQL. Once that comes up and starts replicating we are back to having a redundant setup!
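The failover and the rebuilding step can be written down as a small state model. The Python sketch below only tracks which zone hosts what; it does not call EC2 or RightScale, and the zone names, server names and IP addresses are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical names for the three zones an account can use.
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]


@dataclass
class Deployment:
    master_zone: str
    slave_zone: Optional[str]
    web_zone: str
    elastic_ips: Dict[str, str]  # Elastic IP -> name of the web server holding it


def fail_over(dep: Deployment, failed_zone: str) -> Deployment:
    """Promote the slave, relaunch web servers next to it, and move the Elastic IPs."""
    if dep.master_zone != failed_zone:
        return dep  # the primary zone is fine, nothing to do
    if dep.slave_zone is None or dep.slave_zone == failed_zone:
        raise RuntimeError("no surviving slave to promote")
    new_zone = dep.slave_zone
    dep.master_zone = new_zone  # the slave becomes the master
    dep.slave_zone = None       # redundancy is lost until a new slave is started
    dep.web_zone = new_zone     # fresh web/app servers go next to the new master
    dep.elastic_ips = {
        ip: f"web-{i}-{new_zone}" for i, ip in enumerate(sorted(dep.elastic_ips), 1)
    }
    return dep


def restore_redundancy(dep: Deployment, failed_zone: str) -> Deployment:
    """Start a fresh slave in a zone that is neither failed nor hosting the master."""
    candidates = [z for z in ZONES if z not in (failed_zone, dep.master_zone)]
    if not candidates:
        raise RuntimeError("no healthy zone left for a new slave")
    dep.slave_zone = candidates[0]
    return dep


dep = Deployment(
    master_zone="us-east-1a",
    slave_zone="us-east-1b",
    web_zone="us-east-1a",
    elastic_ips={"203.0.113.10": "web-1-us-east-1a", "203.0.113.11": "web-2-us-east-1a"},
)
fail_over(dep, failed_zone="us-east-1a")
restore_redundancy(dep, failed_zone="us-east-1a")
print(dep)
```

The Elastic IPs themselves never change, which is why users are redirected without any DNS update.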

If you have never tried to set something like this up yourself, from renting colo space and purchasing bandwidth to buying and installing servers, you really can’t appreciate the amount of capital expense, time, headache, and ongoing expense saved by EC2’s features! And best of all, using RightScale it’s just a couple of clicks away :-).
Beyond the simple redundant setup
As an astute reader you probably noticed that the site described above goes down if there is a failure in the primary zone, and that bringing it back up requires manually launching new servers. Some of this can be easily remedied by placing one or more web servers into the secondary zone and having them talk to the master DB across the zone boundary. The performance of these servers may be slightly lower due to the inter-zone latency, and there is some cost to the database access traffic. How these play out is somewhat application-dependent.
A more sophisticated setup uses load balancers to reduce the impact of the cross-site traffic. The idea is to place one load balancer instance in each zone and route the requests primarily to a set of redundant web/app servers in the primary zone, as shown in the figure below. A third app server can be running in the secondary zone and perhaps get a trickle of traffic from the load balancers just to keep it “warm.” Keeping it warm makes it easy to monitor and ensure that it’s operating properly.
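One way to give the warm server its trickle is weighted round-robin. The Python sketch below uses the “smooth” weighted round-robin scheme; the server names and the 49/49/2 split are assumptions made for illustration, not settings from any particular load balancer.

```python
from collections import Counter

# Hypothetical weights: two primary-zone servers carry the load, and the
# secondary-zone server gets 2% of the requests to stay warm.
WEIGHTS = {"app1-primary": 49, "app2-primary": 49, "app3-secondary": 2}


def weighted_schedule(weights, n):
    """Return n picks using smooth weighted round-robin."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        # Every server earns credit in proportion to its weight...
        for name, weight in weights.items():
            current[name] += weight
        # ...and the server with the most credit serves the request and pays for it.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


picks = weighted_schedule(WEIGHTS, 100)
print(Counter(picks))
```

Over one full cycle of 100 requests each server is picked exactly as often as its weight says, so the warm server sees a steady 2 requests in every 100, enough to show up in monitoring.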

The good thing about this setup is that the traffic shipped across the zone boundary is exactly the traffic that comes into the second load balancer. This means that half the total Internet traffic incurs a $0.01/GB surcharge, which results in less than 5% extra cost overall. (This is not counting the DB replication traffic.) Also, the extra latency from one zone to the other is negligible when compared to the already incurred Internet latency.
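The 5% figure follows from the prices mentioned earlier. A quick check in Python, where the $0.01/GB regional price is the one quoted above and the Internet prices passed in are assumptions chosen to match “more than ten times the regional price”:

```python
REGIONAL_PRICE_PER_GB = 0.01  # regional (cross-zone) price quoted in this post


def cross_zone_surcharge(internet_price_per_gb, share_crossing=0.5):
    """Extra bandwidth cost as a fraction of the Internet bandwidth bill.

    share_crossing is the fraction of Internet traffic that is also shipped
    across the zone boundary (half of it in the setup described above).
    """
    return share_crossing * REGIONAL_PRICE_PER_GB / internet_price_per_gb


# Assumed Internet prices in $/GB; $0.10 is the boundary case of "ten times".
for price in (0.10, 0.17):
    print(f"Internet at ${price:.2f}/GB -> surcharge {cross_zone_surcharge(price):.1%}")
```

At exactly ten times the regional price the surcharge is 5%, and any higher Internet price pushes it below that.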
In the case of a primary zone failure, browsers will fail over to the load balancer in the remaining zone (this is a feature built into web browsers related to the round-robin DNS set-up). The load balancer will direct all traffic to the third web/app server. At that point the secondary database needs to be promoted to master and the third app server repointed to that database and everything will be back up and running. With automation the DB promotion could be done automatically, but it’s better to be conservative: a promotion due to a false alert could cause a lot of harm.
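One conservative policy is to alert only after several consecutive failed health checks and to promote only once a human confirms. The Python sketch below illustrates that idea; the class name and the threshold of three checks are hypothetical, and this is not how the RightScale automation is implemented.

```python
class PromotionGuard:
    """Gate a slave promotion behind repeated failures plus operator confirmation."""

    def __init__(self, failures_required=3):
        self.failures_required = failures_required
        self.consecutive_failures = 0

    def record_check(self, master_healthy):
        """Record one health check; return True when an operator should be alerted."""
        if master_healthy:
            self.consecutive_failures = 0  # a single good check clears the streak
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failures_required

    def should_promote(self, operator_confirmed):
        """Promote only if the failure streak is long enough AND a human agrees."""
        return operator_confirmed and self.consecutive_failures >= self.failures_required


guard = PromotionGuard(failures_required=3)
checks = [False, False, True, False, False, False]  # True means the master answered
alerts = [guard.record_check(healthy) for healthy in checks]
print(alerts)
print(guard.should_promote(operator_confirmed=False))
print(guard.should_promote(operator_confirmed=True))
```

A single blip never triggers anything, and even a sustained failure only produces an alert until someone confirms the promotion.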
This second setup is a bit more complicated than the previous one, but it requires less machinery and no server launches in the case of a failure. It requires only one extra machine if one assumes that each load balancer can run on the same instance as a web/app server (typically not a problem). Many more variants on this basic setup are clearly possible and should be considered on a case-by-case basis.
Wow, it’s mind-boggling how much power Amazon is giving us in designing sophisticated distributed redundant Internet services! In combination, the availability zones, the elastic IPs and the overall programmatic control over all the resources make the cloud a superior environment for deploying sophisticated Internet services. At RightScale we’re extremely excited and are hard at work incorporating the new features into our standard deployment templates so that all our customers can easily take advantage of them in their deployments. We’re also automating a number of the failure scenarios so that you don’t need to have an alert wake you up if there is a fire at Amazon in the middle of the night!
March 27, 2008 at 6:37 am
Amazon improves EC2 (by embracing failure)
Amazon just announced two big improvements to EC2: Multiple Locations: Amazon EC2 now provides the ability to place instances in multiple locations. Amazon EC2 locations are composed of regions and Availability Zones. Regions are geographically dispersed…
March 27, 2008 at 8:22 am
Using mysql-proxy http://jan.kneschke.de/projects/mysql/mysql-proxy (in conjunction with rw-splitting) and a configuration like this one http://capttofu.livejournal.com/1752.html, you could just be notified when the web node and/or the db node go down. No manual promotion from slave to master would be necessary. Just rebuild the missing node in the third data center and wait for the next failure.
March 27, 2008 at 10:27 am
Frederic: thanks for pointing out the mysql proxy. To be honest, I haven’t really dug into all its possibilities. Our goal with the replication is to use a simple, tried and tested setup. The replication is really solid with known gotchas and that’s always valuable. The main difficulty with the failover isn’t actually promoting the slave to master and repointing the clients. We have all that automated and it’s literally one click in our interface. The issue is deciding when to fail over, and I don’t think the proxy makes that easier. If the master stands up and says “I’m dead” then it’s easy, but often there are partial failures or lock-ups, and having some automatic decision trigger a failover can cause more havoc than good. But in any case I need to read up on the proxy stuff: thanks for the reminder!
March 27, 2008 at 3:39 pm
Great Article! Thank you for laying it all out so clearly. You got a nice plug for these blog entries on the AWS CTO’s blog too.
Kent
March 27, 2008 at 9:19 pm
Slight correction: The $0.01/GB charge seems to be for bandwidth crossing /region/ boundaries, not /zone/ boundaries. And currently everything’s in the us-east region, so you won’t actually hit the $0.01/GB charge.
March 28, 2008 at 9:06 am
Kent: thanks for the nice words! I hadn’t noticed Werner’s blog entry, thanks for pointing it out.
bd_: Ahhhh, very good point, I had completely misread that. I’d better update the blog post, although I think I’m still confused about the pricing. Sounds like a good AWS forum topic.
March 28, 2008 at 9:15 am
bd_: I just looked at the EC2 pricing at http://aws.amazon.com/ec and I can’t follow your argument. It says “Regional Data Transfer — $0.01 per GB in/out - all data transferred between instances in different Availability Zones in the same region.”
March 31, 2008 at 10:10 am
Thank you for this blog posting. Very informative.