Archive for February, 2008

RightAws Ruby gem now supports Amazon SDB

Today we released version 1.5.0 of our RightAws gem at http://rubyforge.org/projects/rightaws/ !
This release not only provides an interface to EC2, S3, and SQS but also to SDB! The SDB interface has all the nice features of the other interfaces, such as persistent HTTP connections and error retries. Please give it a spin and let us know if you find a problem!This release also has full support for European (EU) buckets in Amazon S3 and it works properly with objects larger than 2GB (it supports the Expect: 100-continue stuff). Note that there is a slight incompatibility with attachment_fu: if you’re using attachment_fu and RightAws you need to load the latter last, otherwise you’ll get an error message and breakage.

Comments (2)

What is cloud computing?

There are many definitions for cloud computing around, and depending on the time of day I’d probably recite a different one. But what’s really defining to me is the availability of quasi endless computers (instances in EC2 terminology). Pricing model, pay as you go, someone else deals with muck: all great stuff. But endless computers, that changes computing. Why? Because launching a new instance becomes the one cure to many problems.

Whatever happens that is uncool, the answer becomes: launch a fresh instance! (Or two while you’re at it.) Whether it’s an instance failing, a hard drive starting to give out, or excessive memory errors. Or the load going up and about to max out the existing servers you have. Or the installation of a new version of your application and you’d rather bring it up and test it before cutting production over. Or the network not behaving cleanly because of some failures or other issues. Or your latest app upgrade causing a slow-down ’cause you overlooked a slow DB query. Etc, etc, etc. In all these cases, if you’ve prepared the ability to launch a new instance quickly and cheaply (labor-wise), you’re out ahead. That’s cloud computing!

So get comfortable and set up your tools to make the launching easy. Then everything else follows. This is of course what we’re enabling at RightScale, but even if you decide not to use RightScale it’s the way to go.

Launch!

Comments (1)

Top reasons Amazon EC2 instances disappear

(Judging by the posting gap, the author of this blog almost disappeared too! Time to lift the head from the day-to-day scramble and write the next entry!)

The fact that Amazon says up-front that computers fail seems to be the number one concern and criticism of EC2, specially from people who have not used it extensively. I don’t actually spend much time thinking about that because in our experience it’s not something to worry about. It’s essential to take into account when designing a system: whenever we set something up on a machine we immediately think “and what do we do when it fails?” That’s good thing, not a bad thing as anyone with production datacenter experience can attest.

Since it’s such a hot topic, I’ve been keeping a close eye on all the “my instance disappeared” threads on the EC2 forum, and it’s not easy to sort them out. I have no doubt that the vast majority has to do with operator error:

  • trying SSH and forgetting to open port 22 in the security group (or similarly with other ports)
  • having difficulties with the SSH keys, or forgetting to set-up a key to begin with
  • using/constructing an AMI that does not have SSH properly set-up
  • using/constructing an AMI that does not boot properly (network and/or sshd issues) and failing to look into console output
  • instance reboot failing, for example disk mounts failing due to mount point changes that were not reflected into init scripts
  • sshd killed by kernel out-of-memory reaper, failing to look into console output for diagnosis
  • … and many more

Some of these are beginners failing to read the getting started guide, some are more subtle and can happen even to veteran EC2 users. Then there are emails from Amazon saying “we have noticed that one or more of your instances are running on a host degraded due to hardware failure” and I wonder how many users don’t get these emails because their AWS account’s email address points into a bit bucket.

No doubt there are real failures as well where a host dies and takes the instances with it, or one of the disks used by an instance gives up which is the end of that instance. The question here is how frequent this is relative to the total number of instances running, and since Amazon is so secretive with their numbers it’s really difficult to make even an educated guess. I tried to go back into our year of logs to see whether I could estimate the failure rate, but I don’t have enough data to distinguish failure from shutdown, sigh.

The failures that concern me the most are actually not instance failures but network failures. Anyone having set-up a large datacenter will know that network issues are the most difficult to get under control. The damn network just keeps changing, and as soon as you try to hold still your service providers change stuff. Some of the instance disappearances are really network issues that cause an instance to be unreachable, or unreachable from certain other instances. These are hard to troubleshoot and on more than one occasion I’ve had to run tcpdump on both ends to see packets departing and never arriving. If I can get to the target instance at all to run tcpdump, that is… I hope Amazon gets a better handle onto this type of failure and provides us with better troubleshooting tools. In the meantime, it’s important to flag issues to them so they can troubleshoot and eliminate the root causes.

The really good news is that the Amazon folks are very dedicated to figuring out what’s going wrong and fixing it. So if you have an issue, be sure to do the troubleshooting you can, then set the instance aside, launch a new one to take its place, and post all the details on the forum. Shut the instance down only after the issue is looked-at. Looking back, there were two big issues causing instance termination, one was the day where some EC2 front-end explicitly terminated a bunch of instances by error. Not good, but from what we saw it wasn’t a massive failure either. They clearly have done their best to ensure this doesn’t recur. The other was an instance reboot bug which caused many instances to die in the reboot process. We learned not to reboot ailing instances but instead to relaunch and rescue any data. This issue also seems to be fixed at this point.

  • To summarize, if you can’t reach an instance, here is what you should do:try to SSH, check the security group
  • distinguish SSH timeout from key issues (timeout vs. permission denied type of errors)
  • use ping to test connectivity (enable ICMP if you have the bad habit of disabling it)
  • check the console output (use the convenient button on the RightScale dashboard), note that it can take a few minutes for stuff to appear
  • look at the RightScale monitoring to see whether the instance is still sending monitoring data
  • hop onto an instance in the same security group and try connecting from there (launch an instance if you don’t have any)
  • post details (instance ID, what you’ve tried, symptoms observed) on the forum and set instance aside

All in all, the number one lesson is “relaunch!” There are thousands of instances waiting to be utilized so use a fresh one if you see trouble with an existing one. If you master this step you can use it in so many situations: to scale up, to scale down, to handle instance failure, to handle software failure, to enable test set-ups, etc. If you use RightScale you will notice that that’s also what we focus on: making it easier to launch fresh instances.

Comments (4)