Archive for March, 2008

RightScale supports the new Amazon EC2 Elastic IP addresses and availability zones

The cloud is accelerating past terrestrial hosting!

Today, Amazon unveiled some major upgrades to their service: Elastic IPs, Availability Zones, and Selectable Kernels. The first two are particularly important because they not only eliminate one of the remaining deficiencies of the service, but now provide better functionality than what is currently available in common hosting services. We’ve been busy at RightScale keeping pace and we’re happy to announce that RightScale supports the new features concurrently with Amazon’s launch. Not only that, but we’ve added the new features to the free developer edition as well, which means that RightScale is hands-down the easiest way to experiment with the new Amazon EC2 features. Here’s a quick review of what’s available now through RightScale:

Create “Static IPs” on EC2 with Amazon’s Elastic IPs
The Elastic IPs provide persistent IP addresses that can be assigned to instances. Many people have been asking us for “static IPs” — well this is it. Through the RightScale interface you can allocate a static IP and give it a nickname, say “web server 1.” When you launch your web server instance, you can then associate the IP with the instance once you have everything configured and running, and shortly, packets sent to that IP address get routed to the new instance.

If you then launch a fresh instance for an upgrade of your web site, you can bring the new instance up separately from your production site, and set up the new version of your site and test it thoroughly. When you’re happy with everything, you can reassign the IP address from the old instance to the new one, and within seconds your users will be accessing the new version of the site. (For a more in-depth description, see the earlier blog post.)

Should there be a problem, you can always reassign the IP back to the old instance while you fix the problem. This, by the way, is the power of cloud computing: you don’t upgrade your server in place, you grab a new one, and you leave the old one running until you’re sure the new one is stable and ready for production.
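To make the cut-over and rollback concrete, here is a minimal sketch of the idea as a toy in-memory routing table. It is not the EC2 or RightScale API: the class, the method names, the instance names, and the 203.0.113.10 address (from a documentation-only range) are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ElasticIpTable:
    """Toy model: maps each Elastic IP to the instance that receives its traffic."""
    routes: Dict[str, str] = field(default_factory=dict)

    def associate(self, ip: str, instance_id: str) -> Optional[str]:
        """Point `ip` at `instance_id` and return the previous target, if any."""
        previous = self.routes.get(ip)
        self.routes[ip] = instance_id
        return previous

    def target(self, ip: str) -> Optional[str]:
        """Return the instance currently receiving traffic for `ip`."""
        return self.routes.get(ip)


table = ElasticIpTable()
site_ip = "203.0.113.10"                        # hypothetical Elastic IP

table.associate(site_ip, "www_old")             # production runs on the old server
previous = table.associate(site_ip, "www_new")  # cut over to the tested new server
after_cutover = table.target(site_ip)

# Something is wrong with the new release: point the IP back at the old server,
# which was left running for exactly this reason.
table.associate(site_ip, previous)
after_rollback = table.target(site_ip)
```

The point of the sketch is that the old server is never modified, so rolling back is the same single operation as cutting over.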

The elastic IPs also provide a solution for numerous other situations where a fixed IP address is required. One example is when interfacing with 3rd party services such as data feeds where it is impractical to change the destination IP address of the feed whenever there is an instance change. Now using an elastic IP the external feed can be configured once and for all. Another example is SMTP (email) traffic which pretty much requires a static IP address. For inbound SMTP traffic the static IP address ensures that mailers around the world can safely deliver the traffic across instance changes. For outbound SMTP traffic the static IP ensures that spam filters don’t discard the email because the IP is listed as dynamic or because of a prior user that sent out spam.

Better fault tolerance with availability zones
The availability zones are a terrific new feature that gives you control over the fault tolerance of your server deployment. Each availability zone is a data center or a portion of a data center engineered such that the probability that more than one zone fails at a time is extremely small. Basically, if you have two servers in two different zones then they will not go down at the same time, barring a major catastrophe of regional extent. So power going out and generators failing, or a data center fire, or border router screw-up will not affect multiple Availability Zones.

Users can now use the Availability Zones to engineer their deployments for an amazing degree of reliability. The most typical usage we foresee is to place the master database with all the app servers into one zone, and the slave (replica) database in a different zone. Should the primary zone fail, it becomes quite straightforward to promote the slave to master and relaunch the app servers in the slave’s zone. (For a more in-depth example, see the earlier blog post.) Of course we want to automate that in RightScale so our users don’t have to worry about promoting these databases manually!

What’s really exciting is that the combination of Elastic IPs and Availability Zones brings cloud computing to a different level. In the above example, when the app servers get relaunched in a new zone, EC2 allows the elastic IPs that were associated with the app servers to be reassigned from the old servers in the failed zone to the new ones. So now traffic doesn’t just get routed to new instances, it actually gets routed to a different datacenter. From the outside this may seem straightforward, but in reality the degree of engineering that is necessary to support this type of technical feature is quite staggering. Even if you had servers in multiple co-location facilities it is not easy to make the colos independent of one another. If they are close-by physically they may well share regional internet routes. Or worse, your own basic setup may introduce dependencies, for example at the DNS level, where one colo going down has an impact on DNS used by the other colo. Being able to then move IP addresses from one colo to another one without residual dependencies requires very sophisticated network engineering. With Amazon and RightScale, it’s now just a drop-down menu away!

Support for multiple Linux kernels
The selectable kernels feature introduced by Amazon allows users to choose from more than the single Linux kernel that has been available thus far. It’s one more step towards making EC2 a more flexible platform and keeping up with the Linux evolution.


Setting up a fault-tolerant site using Amazon’s Availability Zones

Amazon’s Availability Zones are a fabulous new feature that allows users to assign instances to locations that are highly failure-isolated from one another yet have very high bandwidth between each other. I wish I could have done something like that as easily when I was responsible for operations at Citrix Online and we had 5 datacenters worldwide. As I’ll explain in this post, what Amazon actually provides us is much better than just putting servers into multiple datacenters.

The most confusing thing about availability zones is the name: In the cloud, what exactly is an “availability zone”? The easiest way to think about it is that a zone equals a datacenter. If power goes out in one datacenter and the generators fail to start (naah, that never happens…) then it doesn’t affect the other datacenter. Or if there’s a fire, one datacenter may burn out or be otherwise incapacitated, but others are unaffected. In reality zones don’t necessarily correspond to datacenters. Given careful engineering, it’s possible to have multiple “rooms” in a datacenter that are highly failure isolated while technically still being part of the same datacenter (imagine rooms the size of football fields here).

The point of availability zones is the following: if I launch a server in zone A and a second server in zone B, then the probability that both go down at the same time due to an external event is extremely small. This simple property allows us to construct highly reliable web services by placing servers into multiple zones such that the failure of one zone doesn’t disrupt the service or at the very least, allows us to rapidly reconstruct the service in the second zone.
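A quick way to see why this matters is to put numbers on it. The sketch below assumes that the two zones fail independently, which is my simplification of "failure isolated", and the 0.1% outage probability is an invented figure for illustration, not anything Amazon has published.

```python
def joint_outage_probability(p_zone_a: float, p_zone_b: float) -> float:
    """Probability that both zones are down at the same time.

    Assumes the two zones fail independently of each other.
    """
    for p in (p_zone_a, p_zone_b):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    return p_zone_a * p_zone_b


p_single = 0.001  # assumed chance that any one zone is down at a given moment
p_both = joint_outage_probability(p_single, p_single)

print(f"one zone down:   {p_single:.4%}")
print(f"both zones down: {p_both:.4%}")
```

Under these assumptions a second zone turns a one-in-a-thousand event into roughly a one-in-a-million event. A regional catastrophe that takes out both zones at once breaks the independence assumption, so the real number is somewhat worse.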

The one caveat to consider when using multiple zones is that there is no free lunch (you knew there was a catch, didn’t you?). First of all there’s the speed of light. The zones Amazon is exposing are all on the East coast (indicated by the names, such as “us-east-1a”). I don’t have inside information about the location of their facilities, but I imagine some may be in New York and others may be in Virginia, so the distance between zones may be considerable, thus translating into some network latency. And even if the actual facilities used by EC2 today are not that far apart, they may be someday in the future.

The second “gotcha” is that bandwidth across zone boundaries is not free. Amazon is charging $0.01/GB for what they call “regional” traffic. This is less than 1/10th the cost of Internet traffic, which seems perfectly reasonable to me. In the days when I was managing multiple datacenters the cost of traffic between them was essentially the same as the cost of random Internet traffic. Actually, I take that back, it cost twice as much: once to exit one datacenter and once to enter the other. (Granted, at high volume one can do interesting things to save some money, but it doesn’t become free by a long shot.)

An example

Enough talk, let’s show a diagram of what a simple redundant web site looks like with Availability Zones and Elastic IPs. At the core we’ll have two web servers (e.g. with Apache and PHP) running the web application and accessing the master database. All this occurs in one zone. We’ll allocate two Elastic IP addresses that we assign to the two web servers and then we create a DNS entry for our web site that maps the domain name to both IP addresses (this is commonly called “round-robin DNS”).

Fault Tolerance with availability zones img1
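For readers who haven’t used round-robin DNS, here is a minimal sketch of the behavior: one name maps to several addresses, and each answer lists them in a rotated order so that clients spread across the servers. The class, the www.example.com name, and the 203.0.113.x addresses are hypothetical, and real DNS servers differ in exactly how they order their answers.

```python
from collections import deque
from typing import Iterable, List


class RoundRobinRecord:
    """Toy model of a DNS name that maps to several IP addresses."""

    def __init__(self, name: str, addresses: Iterable[str]) -> None:
        self.name = name
        self._addresses = deque(addresses)

    def resolve(self) -> List[str]:
        """Return every address, then rotate so the next answer leads with another."""
        answer = list(self._addresses)
        self._addresses.rotate(-1)  # move the first address to the end
        return answer


# Two hypothetical Elastic IPs, one per web server.
record = RoundRobinRecord("www.example.com", ["203.0.113.10", "203.0.113.11"])

first_answer = record.resolve()
second_answer = record.resolve()
print(first_answer)
print(second_answer)
```

A client that simply connects to the first address in each answer ends up alternating between the two web servers.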

In order to ensure the survival of the data in the case of a massive failure, we start a slave database in a second availability zone and replicate the data in real-time. This is how we’ve set up all our customers to date, except that up until now we haven’t been able to specify the placement of the slave with respect to the master. In the RightScale Dashboard the zone of each server is shown and at server launch time one can select the desired zone.

Now suppose the zone with the web servers and database fails due to a fire! After receiving an alert, we first promote the slave in the second zone to master using the RightScale Manager for MySQL automation. We then launch fresh web/app servers in the same zone as the slave database. Once the promotion completes and the two new servers are up, it is a simple matter of reassigning the Elastic IPs to the two new servers to redirect all the users to the new servers and we’re up and running again.

Fault Tolerance with availability zones img2

The next step is to recreate the redundancy and for this the third availability zone that each account has access to comes into play. We start a fresh database slave in the third zone again using the automation in the Manager for MySQL. Once that comes up and starts replicating we are back to having a redundant setup!

Fault Tolerance with availability zones img4
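The whole recovery sequence can be written down as an ordered checklist. The sketch below only tracks state in memory and does not call EC2 or the Manager for MySQL; the class, the function, the server names, the 203.0.113.x addresses, and the zone names other than “us-east-1a” are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Deployment:
    master_zone: str
    slave_zone: str
    web_zone: str
    elastic_ips: Dict[str, str]  # Elastic IP -> web server currently holding it


def recover(dep: Deployment, failed_zone: str, spare_zone: str) -> List[str]:
    """Update `dep` for the loss of its primary zone and return the steps taken."""
    if dep.master_zone != failed_zone:
        raise ValueError("this sketch only covers the loss of the primary zone")
    survivor = dep.slave_zone
    steps = [f"promote slave in {survivor} to master"]
    dep.master_zone = survivor

    steps.append(f"launch {len(dep.elastic_ips)} web servers in {survivor}")
    dep.web_zone = survivor

    for number, ip in enumerate(sorted(dep.elastic_ips), start=1):
        server = f"web{number}-{survivor}"
        dep.elastic_ips[ip] = server
        steps.append(f"reassign {ip} to {server}")

    # Recreate the redundancy in the third zone.
    steps.append(f"start new slave in {spare_zone}")
    dep.slave_zone = spare_zone
    return steps


site = Deployment(
    master_zone="us-east-1a",
    slave_zone="us-east-1b",
    web_zone="us-east-1a",
    elastic_ips={"203.0.113.10": "web1-us-east-1a", "203.0.113.11": "web2-us-east-1a"},
)
for step in recover(site, failed_zone="us-east-1a", spare_zone="us-east-1c"):
    print(step)
```

Writing the steps in order makes the dependency visible: the Elastic IPs are only moved once the promoted master and the new web servers exist.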

If you have never tried to set something like this up yourself, from renting colo space and purchasing bandwidth to buying and installing servers, you really can’t appreciate the amount of capital expense, time, headache, and ongoing expense saved by EC2’s features! And best of all, using RightScale it’s just a couple of clicks away :-) .

Beyond the simple redundant setup

As an astute reader you probably noticed that the site described above would go down if there was a failure in the primary zone, which would require a manual restarting of new servers in order to bring it back up. Some of this can be easily remedied by placing one or multiple web servers into the secondary zone and having them talk to the master DB across the zone boundary. The performance of these servers may be slightly lower due to the inter-zone latency and there is some cost to the database access traffic. It’s somewhat application-dependent how these play out.

A more sophisticated setup uses load balancers to reduce the impact of the cross-site traffic. The idea is to place one load balancer instance in each zone and route the requests primarily to a set of redundant web/app servers in the primary zone, as shown in the figure below. A third app server can be running in the secondary zone and perhaps get a trickle of traffic from the load balancers just to keep it “warm.” Keeping it warm makes it easy to monitor and ensure that it’s operating properly.

Fault Tolerance with availability zones img3

The good thing about this setup is that the traffic shipped across the zone boundary is exactly the same as comes into the second load balancer. This means that for half the total Internet traffic there is a $0.01/GB surcharge, which results in less than 5% extra cost overall. (This is not counting the DB replication traffic.) Also, the extra latency from one zone to the other is negligible when compared to the already incurred Internet latency.
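To make the back-of-the-envelope math explicit, here is a small sketch. The $0.01/GB regional rate and the half-of-traffic share come from the discussion above. The Internet prices passed in are illustrative assumptions (all I rely on is that the regional rate is less than a tenth of the Internet rate), and the function name is made up for this example.

```python
def cross_zone_surcharge_fraction(internet_price_per_gb: float,
                                  regional_price_per_gb: float = 0.01,
                                  cross_zone_share: float = 0.5) -> float:
    """Cross-zone forwarding cost as a fraction of the Internet bandwidth bill.

    `cross_zone_share` is the share of Internet traffic that arrives at the
    load balancer in the secondary zone and is forwarded across the boundary.
    """
    if internet_price_per_gb <= 0:
        raise ValueError("internet_price_per_gb must be positive")
    return cross_zone_share * regional_price_per_gb / internet_price_per_gb


# Assumed Internet prices in $/GB; $0.10 is the limiting case of the 1/10th claim.
for price in (0.10, 0.13, 0.17):
    fraction = cross_zone_surcharge_fraction(price)
    print(f"Internet at ${price:.2f}/GB: surcharge is {fraction:.1%} of the bandwidth bill")
```

Since the Internet rate is above $0.10/GB, the surcharge stays under 5% of the bandwidth bill, and it is a smaller share still of the total bill once instance hours are included.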

In the case of a primary zone failure, browsers will fail over to the load balancer in the remaining zone (this is a feature built into web browsers related to the round-robin DNS set-up). The load balancer will direct all traffic to the third web/app server. At that point the secondary database needs to be promoted to master and the third app server repointed to that database and everything will be back up and running. With automation the DB promotion could be done automatically, but it’s better to be conservative: a promotion due to a false alert could cause a lot of harm.
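The browser-side failover can be pictured as "try each address from the DNS answer until one responds". The sketch below is a simplified model with a stand-in reachability check instead of real network connections; actual browser behavior varies, and the function name and 203.0.113.x addresses are hypothetical.

```python
from typing import Callable, Iterable, Optional


def first_reachable(addresses: Iterable[str],
                    is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first address that responds, or None if all of them are down."""
    for address in addresses:
        if is_up(address):
            return address
    return None


# DNS answer listing the load balancer in each zone (hypothetical addresses).
dns_answer = ["203.0.113.10", "203.0.113.11"]

# Simulate the primary zone failing: its load balancer stops responding.
down = {"203.0.113.10"}
chosen = first_reachable(dns_answer, lambda address: address not in down)
print(chosen)
```

In this model the user sees nothing worse than a delay while the first attempt fails, which is why keeping a load balancer in each zone is enough to keep the site reachable.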

This second set-up is a bit more complicated than the previous one, but it requires less machinery and no server launches in the case of a failure. It does require one extra machine, assuming that each load balancer can run on the same instance as a web/app server (typically not a problem). Many more variants on this basic setup are clearly possible and should be considered on a case-by-case basis.

Wow, it’s mind-boggling how much power Amazon is giving us in designing sophisticated distributed redundant Internet services! In combination, the availability zones, the elastic IPs and the overall programmatic control over all the resources make the cloud a superior environment for deploying sophisticated Internet services. At RightScale we’re extremely excited and are hard at work to incorporate the new features into our standard deployment templates such that all our customers can easily take advantage of the new features in their deployments. We’re also automating a number of the failure scenarios so that you don’t need to have an alert wake you up if there is a fire at Amazon in the middle of the night!


DNS, Elastic IPs (EIP) and how things fit together when upgrading a server

Amazon’s new Elastic IP (EIP) addresses allow users to allocate an IP address and assign it to an instance of their choice. What’s really cool is that each IP address can be reassigned to a different instance when needed, for example if the first one failed or if a new one is supposed to take its place.

Before going into an example, let’s review how the Elastic IPs work:

  • You can allocate up to 5 Elastic IP addresses per account (default).
  • Each EIP can be assigned to one instance, in which case it replaces the normal dynamic IP address. Remember, by default, each instance starts with a dynamic IP address.
  • Each instance can have only a single external IP address. It starts out with the default dynamic IP address which can be swapped out for an EIP at any time. If the EIP is deassigned (or assigned to a different instance) then a fresh dynamic IP is allocated for the instance. The limitation of designating a single IP at a time is due to the way NAT (Network Address Translation) works. Remember that each instance has an internal IP address and an external (public) one, which is translated to the internal one. If two external IPs were translated to the same internal IP then inbound packets would arrive fine, but sorting out outgoing packets (i.e. determining which external IP address to assign to outgoing packets) would be very difficult. Hence, the limitation of a single external IP address per instance at any given point in time.
  • EIPs are free while they are assigned to an instance, but they cost $0.01/hr if they are not assigned. The reason for this charge is that the number of IP addresses worldwide is very limited. Presumably the charge will help prevent users from hogging unused IP addresses that could be dynamically allocated to other users. Yet, in a weird way there is no additional cost to Amazon for an assigned static IP as opposed to a dynamic IP, because while an EIP is assigned to an instance it actually frees up a dynamic IP.
  • Assigning or reassigning an IP to an instance takes a couple of minutes, which is longer than I would have hoped for, but I can imagine that many network devices need to be updated in the infrastructure to make it all happen.
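The rules above can be captured in a small toy model: one external IP per instance, a fresh dynamic IP whenever an EIP is taken away, and an hourly charge only for unassigned EIPs. This is not the EC2 API. The class and method names are invented, and the 198.51.100.x and 203.0.113.x addresses are documentation-only ranges standing in for dynamic and Elastic IPs.

```python
import itertools
from typing import Dict, Optional


class AddressModel:
    """Toy model of the Elastic IP rules listed above."""

    UNASSIGNED_RATE = 0.01  # dollars per hour for an EIP that is not assigned
    MAX_EIPS = 5            # default per-account limit

    def __init__(self) -> None:
        self._dynamic = (f"198.51.100.{n}" for n in itertools.count(1))
        self.external_ip: Dict[str, str] = {}     # instance -> its one external IP
        self.eips: Dict[str, Optional[str]] = {}  # EIP -> instance, or None

    def launch(self, instance: str) -> None:
        """Every instance starts out with a dynamic external IP."""
        self.external_ip[instance] = next(self._dynamic)

    def allocate(self, eip: str) -> None:
        if len(self.eips) >= self.MAX_EIPS:
            raise RuntimeError("default limit of 5 Elastic IPs reached")
        self.eips[eip] = None

    def associate(self, eip: str, instance: str) -> None:
        """Give `eip` to `instance`, replacing its current external IP."""
        previous = self.eips[eip]
        if previous is not None:
            # The instance losing the EIP gets a fresh dynamic IP.
            self.external_ip[previous] = next(self._dynamic)
        for other, holder in self.eips.items():
            if holder == instance:
                self.eips[other] = None  # an instance holds one external IP only
        self.eips[eip] = instance
        self.external_ip[instance] = eip

    def hourly_charge(self) -> float:
        unassigned = sum(1 for holder in self.eips.values() if holder is None)
        return self.UNASSIGNED_RATE * unassigned


model = AddressModel()
model.launch("www_rel2")
model.launch("www_rel3")
model.allocate("203.0.113.10")
charge_while_idle = model.hourly_charge()
model.associate("203.0.113.10", "www_rel2")
model.associate("203.0.113.10", "www_rel3")
```

After the second `associate` call the EIP belongs to www_rel3, www_rel2 is back on a new dynamic address, and nothing is being charged because the EIP is in use.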

Let’s look at a simple example of an application server running Apache and a PHP app, talking to a back-end MySQL database server, and how Elastic IPs can improve the process of updating the site. First we allocate an Elastic IP. Suppose we get 172.168.5.6 assigned. Then we set up the DNS in our preferred outsourced DNS service and map our web site name to the IP address, e.g. www.rightscale.com -> 172.168.5.6. Having done that, we can launch our web server and database server. Once the web server boots and we have the web site running, we assign it the EIP and can soon thereafter point our browser to www.rightscale.com. Here’s how this looks:

Elastic IP address on EC2

Now suppose we want to update from our current production release of the web site (we called it rel2 in the diagram) to rel3. The power of the cloud is that we don’t need to touch our existing web server and risk causing damage during the upgrade process. Instead we launch a second web server (shown in the diagram below as www_rel3) and install the new release on it. We can point a different DNS entry, such as test.rightscale.com, at the default dynamic IP provided for the instance by EC2 and test the site to make sure everything works properly.

Elastic IP address with additional test server

Once we’re confident in the new test version, we simply reassign the EIP 172.168.5.6 to the www_rel3 instance and shortly thereafter all users accessing the site are now receiving data packets from the new release. Remember, as long as the www_rel2 is available, you can easily swap back and forth between www_rel2 and www_rel3 until you are completely satisfied with the new site. And when you’re ready, you can terminate the old www_rel2 instance. See diagram below.

Elastic IP address switch to new server

Amazon did a very nice job in creating something much more powerful than simply adding “static IPs” to their offering. They are giving us dynamically remappable IP addresses that fit well into the overall cloud computing paradigm that we can use to manage servers better than with traditional hosting solutions.

The RightScale dashboard supports the new Elastic IPs, so all the operations described above are easy to initiate and monitor, even when using the free editions of the RightScale Dashboard. We are now in the process of updating our server templates so our customers can take full advantage of not only the Elastic IPs but also the new Availability Zones.


Amazon’s communication is improving

One of the most often heard complaints about Amazon Web Services is the lack of communication about service status and issues. The community has been pretty vocal about this, and Amazon has certainly heard it and is committed to improving. So it was nice to see the following response to a minor incident the other day posted on the forum:

Following up with more information about this morning’s event. At 1:31am PST, a network engineer made a change to a pair of redundant aggregation routers fronting a portion of EC2. This change caused both to no longer route traffic to a subset of EC2 instances. The change should have been non-intrusive. We are taking steps to prevent this type of failure in the future. The failure affected a portion of EC2 instance connectivity for 2 hours and 5 minutes. In response to the question about RSS feeds, we are planning to provide this functionality in the future.

Sincerely,
The Amazon Web Services Team

Thanks Kathrin for posting this. We all know that this stuff happens and we all want to use a provider who is committed to getting to root causes of incidents and eradicating them as much as possible!


RightScale has a new look!

As you can see, RightScale got a little makeover! Our old home page had been neglected for many months and we finally have a new look thanks to Dean’s hard work. We also moved the blog to WordPress, which will make reading and navigating easier. Getting technical docs online is Dean’s next task, so stay tuned for lots of goodies!

The main change you will notice, other than the new layout, is that the dashboard has moved from www.rightscale.com to my.rightscale.com. I recommend you bookmark https://my.rightscale.com/sessions/new so you have one click less to get there.

As always, we appreciate feedback and suggestions!
