Archive for April, 2008

RightScale featured on Mashable [podcast]

Hear our CEO, Michael Crandell, in a featured interview with Mark Hopkins on his show, “Mashable Conversations.” Michael gives a great overview of cloud computing and explains the growing interest among start-ups, mid-size, and even enterprise-level businesses. The interview explains why cloud computing is such a disruptive technology, as well as RightScale’s role in this rapidly evolving industry. Michael talks about some hot topics in cloud computing such as data security issues and the overall stability of web infrastructures built on the cloud. He also provides a realistic comparison between hosting websites or applications on the cloud vs. hosting them on traditional hosting infrastructures.

Listen to the podcast at http://mashable.com/2008/04/23/rightscale/.

Tracking changes to your deployment

I’m sure you’ve had this experience: your site worked great yesterday, and now it’s acting up. Something’s clearly broken. “Who changed what when !?!?” [expletive deleted] Of course many of our users have turned to us for help in those situations and we’re listening (oh, and we have shouted the above words ourselves). Yesterday we released part 1 of the answer, which is to track changes to a number of the objects that you can configure and manage via the RightScale dashboard. Here’s how this looks for a deployment where Eugene made a number of changes to quickly test the feature:

More along these lines is forthcoming, especially more rigorous version control for RightScripts and Server Templates. We’re as desperate for these features ourselves as you are. This, by the way, can be a great asset when doing software as a service, especially when one engages with the customer as much as we do: we are the most demanding and advanced users of our own system and as such we feel the pain just as intensely as many of our users. This allows us to close the loop on product priorities much more rapidly than we otherwise could. But I’m sure we’re also failing to see things that we should be adding, so if you have a suggestion, please do take a couple of minutes and send me email or post a comment here!

Animoto’s Facebook scale-up

The Animoto guys did hit the jackpot on Facebook this past week. Jeff Barr mentioned a few of the stats on his blog: Animoto ramped from 25,000 users to 250,000 users in three days, signing up 20,000 new users per hour at peak. The system they run using RightScale is quite complicated: there is the www.animoto.com web site, then a separate site for the Facebook app run by Hungry Machines, and both of these feed into a back-end web services site which then orchestrates uploads and, most importantly, the render farm which creates the cool videos.

The upshot is that there are a lot of moving parts! Each one of the subsystems consists of many servers and everything needs to scale up as the load increases. What Animoto CTO Stevie Clifton did really well was to connect all the operations using queues, many of them in SQS. One queue contains work items that list photo URLs to fetch from other sites, such as Facebook, Flickr, etc., and that queue is processed by one array of worker instances. Another queue has the list of render jobs, and each work item in there points to the set of photos sitting at the ready in S3 and to the music files, also on S3. All of these queues are held in Amazon SQS and the arrays of worker instances are managed by RightScale. This allows the monitoring part of our service to detect when the queue gets too large and more instances need to be launched. What’s nice about using queues is that it decouples the various parts of the site, so if the renderers get backlogged the queue simply builds up and users have to wait a little longer for their video to be produced. Waiting is not good, but dropping requests on the floor is much worse!
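To make the decoupling concrete, here is a minimal sketch in Python. It is not Animoto's code: the `WorkQueue` class is an in-memory stand-in for a hosted queue such as SQS, and the arrival numbers and worker count are made up for illustration. The point is only that a burst of requests shows up as queue depth rather than as dropped work.

```python
from collections import deque


class WorkQueue:
    """In-memory stand-in for a hosted queue such as SQS (illustrative only)."""

    def __init__(self):
        self._items = deque()

    def send(self, item):
        self._items.append(item)

    def receive(self):
        # Returns None when the queue is empty.
        return self._items.popleft() if self._items else None

    def depth(self):
        return len(self._items)


def simulate(arrivals_per_tick, workers):
    """Return the queue depth after each tick.

    Each tick, new jobs are enqueued, then every worker takes at most one job.
    """
    queue = WorkQueue()
    depths = []
    job_id = 0
    for arrivals in arrivals_per_tick:
        for _ in range(arrivals):
            queue.send(job_id)
            job_id += 1
        for _ in range(workers):
            if queue.receive() is None:
                break  # nothing left to do this tick
        depths.append(queue.depth())
    return depths


# A burst of 20 jobs per tick against 10 workers: the backlog grows, then drains.
print(simulate([5, 20, 20, 5, 0, 0], workers=10))  # [0, 10, 20, 15, 5, 0]
```

The depth series is also exactly the signal a monitoring system can watch to decide when more workers are needed.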

Producing the videos takes 8-9 minutes on average and at peak Animoto has pumped over 450 render requests per minute into the queue. Last week we ended up with just under 3500 instances in the various Animoto deployments and tonight it was more than 4000 and it looks like it will not drop under 2000 instances through the night. Yikes! At peak RightScale was launching and configuring 40 new instances per minute pretty much sustained to handle the injection of thousands of render jobs that needed special handling. Mind-boggling stuff…
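A quick back-of-the-envelope check on those numbers, as a sketch only. It assumes that roughly one render runs per instance at a time and that the peak rate is sustained; neither is stated above. Under those assumptions the number of renders in flight is simply the arrival rate times the time per render.

```python
def renders_in_flight(requests_per_minute, minutes_per_render):
    """Steady-state jobs in progress = arrival rate x time per job (Little's law)."""
    return requests_per_minute * minutes_per_render


# Figures quoted above: 450 requests per minute at peak, 8-9 minutes per video.
low = renders_in_flight(450, 8)
high = renders_in_flight(450, 9)
print(low, high)  # 3600 4050
```

That lands in the same range as the instance counts quoted above, which is at least consistent with the render farm accounting for most of the instances.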

Lessons learned… First of all, when you scale 10x and then 10x again to run on thousands of servers, every little problem turns into a large one. At 1,000 requests per second, that insignificant error rate of 0.1% turns into an error every second, and the error rate itself typically rises too because of the added load on the system. So suddenly it’s not something you can ignore anymore. An example of this was having exponential backoff for uploads to S3 when using curl, but forgetting that the 5th retry exceeds the S3 connection timeout. Normally, this happens only once in a blue moon, but when tens of uploader instances are banging hard on one S3 bucket the S3 error rate goes up a bit and suddenly uploads are failing left and right. Once we changed this to a constant retry timeout it all ran smoothly again.
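The backoff trap is easy to see with numbers. The sketch below uses hypothetical values (a 2-second base delay and a 20-second timeout); the actual delays and the actual S3 timeout are not given above. It only shows how a doubling delay eventually crosses any fixed timeout, while a constant delay never does.

```python
def exponential_delays(base_seconds, retries):
    """Delay before each retry when the wait doubles every time."""
    return [base_seconds * 2 ** attempt for attempt in range(retries)]


def constant_delays(delay_seconds, retries):
    """Delay before each retry when the wait stays fixed."""
    return [delay_seconds] * retries


def first_retry_exceeding(delays, timeout_seconds):
    """Return the 1-based retry whose delay exceeds the timeout, or None."""
    for attempt, delay in enumerate(delays, start=1):
        if delay > timeout_seconds:
            return attempt
    return None


BASE_SECONDS = 2      # hypothetical base delay
TIMEOUT_SECONDS = 20  # hypothetical connection timeout
RETRIES = 5

backoff = exponential_delays(BASE_SECONDS, RETRIES)
print(backoff)                                          # [2, 4, 8, 16, 32]
print(first_retry_exceeding(backoff, TIMEOUT_SECONDS))  # 5
print(first_retry_exceeding(constant_delays(BASE_SECONDS, RETRIES), TIMEOUT_SECONDS))  # None
```

A capped exponential backoff would be another way out; the constant delay is simply the fix described above.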

Now does this mean that you should fix all the little issues before going live? Of course not: you can’t! What I’ve found to be most effective is to think about every little problem that you come across for a few minutes. Don’t just brush it aside as being insignificant. It is now, but it *will* trip you up tomorrow or the day after. So spend 5 minutes to troubleshoot and hypothesize as far as you can get. You don’t have to solve it immediately. Think up a work-around or how you would troubleshoot further, or perhaps how you’d fix it. Then move on. Come tomorrow, when and if the issue becomes big, you will have an invaluable head-start. Instead of being caught off-guard you’ll be able to immediately kick into action and solve the issue.

Another lesson learned is not to forget the manual overrides. Yup, I know, we have this super smart auto-scaling algorithm. But we also have manual overrides, and when Animoto went from about 50 instances to 4000 instances we used them. We wanted to make sure that the extra instances didn’t overload the database or the queue and that everything was running smoothly (and, yes, to take a pause and fix some issues before scaling up further). Stevie and the Hungry Machines guys had also put in some overrides to queue up automatically generated videos and let manually requested ones zip through. This was essential in keeping the active users happy when everything first exploded and the system had trouble keeping up with the load. A lot of the queued videos were processed a bit later when the load went back down. Automation is cool for the daily routine events but for something like this you want the overrides.
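How the video override was built is not described here, so the following is only one way such a switch might look. The class name, the `hold_automatic` flag, and the job names are all hypothetical. Manually requested jobs always go first, and an operator can park the automatically generated ones until the load drops.

```python
from collections import deque


class RenderScheduler:
    """Two-lane job scheduler with a manual override (illustrative sketch)."""

    def __init__(self):
        self.manual = deque()        # videos a user explicitly asked for
        self.automatic = deque()     # automatically generated videos
        self.hold_automatic = False  # operator override: park the automatic lane

    def submit(self, job, manual):
        (self.manual if manual else self.automatic).append(job)

    def next_job(self):
        """Return the next job to render, or None if nothing is eligible."""
        if self.manual:
            return self.manual.popleft()
        if self.automatic and not self.hold_automatic:
            return self.automatic.popleft()
        return None


scheduler = RenderScheduler()
scheduler.submit("auto-1", manual=False)
scheduler.submit("user-1", manual=True)

scheduler.hold_automatic = True   # load spike: flip the override
print(scheduler.next_job())       # user-1
print(scheduler.next_job())       # None (auto-1 stays parked)

scheduler.hold_automatic = False  # load is back down
print(scheduler.next_job())       # auto-1
```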

Animoto is a great example of leveraging the cloud for its strengths of instant availability and virtually limitless scope. Of course, most sites don’t need to launch 4000 servers in one go, but it’s nice to know you can if you need to. Whether the number is 4, 40, or 4000, getting the resources you need at the time you need them is a key benefit of “right-scaling” your deployment using the cloud. Looking at our database today I noticed that RightScale has launched, configured, and managed over 200,000 instances to date! That’s an impressive number, but as the Animoto scale-up proves, we’re only just beginning…

Animoto AutoScaling Graphs

Amazon takes EC2 to the next level with persistent storage volumes

The Amazon folks have gone public today with the next EC2 feature: persistent storage. The official information is found in Jeff Barr’s blog entry and in Matt’s forum post. Calling the persistent storage a “feature” is actually quite an understatement: it really revolutionizes EC2 and enables usage patterns that any big-iron SAN user would die for.

The basics

What does this persistent storage look like? We’ve been testing it for a while and are thoroughly impressed. The Amazon folks are clearly still fine-tuning a lot of the details, but basically you can create storage volumes in the cloud next to the server instances you launch in the cloud. Think of having a really big SAN in the cloud in which you can create volumes of up to 1TB each with a single API call, or with a simple click in the RightScale UI (yes, of course we’ll have nice support for the storage volumes on our site coupled with some neat automation and an array of pre-packaged solutions). You can mount one or multiple volumes on an instance and they appear just like the other local drives, so you can format them as you like, set up striping and do other useful things.

The feature that really makes the storage volumes sizzle is the ability to snapshot them to S3 and then create new volumes from the snapshots. The snapshots are great for durability: once a snapshot is taken it is stored in S3 with all the reliability attributes of S3, namely redundant storage in multiple availability zones. This essentially solves the whole backup issue with one simple API call or click in the RightScale UI. You can also easily restore a snapshot by creating a fresh volume from it. This feature is useful beyond just restoring a backup: you may restore to another instance where you now have a clone of the data and can do whatever you want to it. Wow!

The cool stuff

There are so many great uses for the storage volumes that it’s impossible to write them all up in a single blog post, and we obviously haven’t thought of them all (or even close). The first usage scenario we looked into is running a database. Up to today the only setup for a mission critical database we recommend is using two instances with real-time database replication and frequent backups to S3. We’ve now installed our Manager for MySQL replicated set-up for many, many customers and it works very well. In short, we use MySQL replication for redundancy and frequent (like every 10 minutes) backups to S3 on the slave to guard against the unlikely event of simultaneous failure of both instances located in different availability zones.

With the storage volumes the Manager for MySQL set-up works even better. Instead of having to tar-up the database files and upload them to S3 we can just take a snapshot. And in order to initialize a slave we simply create a volume for it from the last snapshot of the master and launch the replication: no more rsync of the data is necessary. It’s really nice to see how all the automation we’ve built stays in place with the new Amazon capabilities and saves just as many headaches as before, it just gets turbocharged by the storage volumes!
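As a rough sketch of that slave-initialization flow, here is what the three steps could look like in Python. This is not the Manager for MySQL code. The method names (`create_snapshot`, `create_volume`, `attach_volume`) and their keyword arguments follow the present-day boto3 EC2 client, which postdates this post, so treat them as an assumption; the `ec2` argument is any object offering those calls, and the device name is a placeholder.

```python
def snapshot_volume(ec2, volume_id, description="master database snapshot"):
    """Snapshot a storage volume and return the snapshot ID."""
    response = ec2.create_snapshot(VolumeId=volume_id, Description=description)
    return response["SnapshotId"]


def volume_from_snapshot(ec2, snapshot_id, availability_zone):
    """Create a fresh volume from a snapshot and return the volume ID."""
    response = ec2.create_volume(
        SnapshotId=snapshot_id, AvailabilityZone=availability_zone
    )
    return response["VolumeId"]


def seed_slave_volume(ec2, master_volume_id, slave_instance_id,
                      availability_zone, device="/dev/sdf"):
    """Clone the master's data volume and attach the clone to the slave.

    Not shown: waiting for the snapshot and the new volume to become ready,
    quiescing the database before the snapshot, and starting replication.
    """
    snapshot_id = snapshot_volume(ec2, master_volume_id)
    volume_id = volume_from_snapshot(ec2, snapshot_id, availability_zone)
    ec2.attach_volume(
        VolumeId=volume_id, InstanceId=slave_instance_id, Device=device
    )
    return volume_id
```

Passing the client in as an argument keeps the sketch free of credentials and makes it easy to exercise against a stub before pointing it at a real account.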

In addition, the storage volumes enable slightly lower-end database offerings. Since the storage volumes are more durable than local instance storage a lot of the risk of losing it all if the instance dies goes away. It is now possible to run a single instance with the database data living on a storage volume and to take frequent snapshots to get backups onto S3. Should the instance die, it is very simple to launch a fresh one using the same storage volume. Typically it would take only a few minutes for the new instance to come up and take off where the old one stopped! Of course this set-up has more downtime when compared to the redundant database set-up, and one has to be really careful in setting everything up to minimize the time it takes to mount the volume and to ensure a successful database recovery.

Just as the storage volumes enable the reliable use of single-instance databases they also enable single-tenant appliances in EC2. It is now possible to host the data for a single-tenant virtual appliance on a storage volume and mount it on an instance. What’s really cool is the decoupling of the data from the instance. It means that you can start a customer on a small instance and if they outgrow it, you can migrate them almost seamlessly to a large and later an x-large instance, all using the same storage volume. Beyond an x-large a couple of interesting options are possible to increase performance further, such as striping multiple storage volumes. EC2 really brings virtual appliances to the next level!

The S3 snapshots enable some completely different and very intriguing usage scenarios. Suppose you’re doing some DNA matching against a Genome data set on 1000 instances. In addition to firing up 1000 instances on a whim you can, also on a whim, clone a nicely prepared snapshot of the data set 1000 times to create 1000 volumes, one for each instance. BANG! This way they can all independently crawl over the data set. This type of massive (essentially read-only) cloning really opens up new possibilities in running such large computations in a cost-effective manner.

Summing it up

I’ll stop here, but clearly the cloud has just squared in size! Two years ago, when I started on EC2, there were only small instances available and the sentiment was that in order to get the horizontal scalability and pricing of the cloud you had to accept inferior features. In the meantime we’ve gotten multiple instance sizes plus recently the remappable IP addresses and availability zones. That already indicated that computing in the cloud would soon surpass computing in traditional colos or in your own datacenter not just in scale and price, but also in feature set. With the addition of the storage volumes with all the cool snapshot features it’s now a fait accompli: the cloud adopters will have much more computing horsepower and flexibility at their fingertips than those who are still racking their own machines. It’s going to be like agile software development: if you want to survive as an internet/web service you will have to compute in the cloud or your competitors will leave you in the dust by being able to deploy faster, better, and cheaper.

Update: Werner Vogels, Amazon’s CTO, also blogs about the storage volumes on All Things Distributed with a little more background perspective. The Amazon folks are getting pretty coordinated with news appearing at the same time on their blogs and the forums. Maybe I missed it, but I don’t think they even press release this stuff…

Meet RightScale at the MySQL conference

We’ll be at the MySQL conference in the expo hall April 15-17. Please drop by so we can show you our Manager for MySQL. We’d also be happy to talk about all the new Amazon Web Services and RightScale features!

How EC2 changes the game in batch grid computing

We’ve been receiving a fair number of inquiries about our RightGrid batch processing framework recently. RightGrid was designed to simplify the task of processing large numbers of jobs enqueued in Amazon SQS with the data to be processed residing in S3. Basically it takes care of all the fetching and pushing and all you need to do is plug in your processing code that takes local files as input and produces local output files.
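To give a feel for that division of labor, here is a small sketch of the pattern in Python. It is not the RightGrid interface: the function names, the work-item layout, and the `fetch`/`store` callables (stand-ins for reading from and writing to a store such as S3) are all invented for illustration. The framework side moves data to and from local files, and the plug-in only ever sees local paths.

```python
import os
import tempfile


def run_work_item(item, fetch, store, process):
    """Fetch inputs to local files, run the plug-in, push its outputs back.

    fetch(key) returns bytes; store(key, data) saves bytes. Both are
    hypothetical stand-ins for remote storage access.
    """
    with tempfile.TemporaryDirectory() as workdir:
        local_inputs = []
        for key in item["inputs"]:
            path = os.path.join(workdir, os.path.basename(key))
            with open(path, "wb") as handle:
                handle.write(fetch(key))
            local_inputs.append(path)

        # The plug-in only deals with local files.
        for path in process(local_inputs, workdir):
            with open(path, "rb") as handle:
                store(item["output_prefix"] + os.path.basename(path), handle.read())


def word_count(input_paths, workdir):
    """Example plug-in: write one line per input file with its word count."""
    out_path = os.path.join(workdir, "counts.txt")
    with open(out_path, "w") as result:
        for path in input_paths:
            with open(path) as source:
                words = len(source.read().split())
            result.write(f"{os.path.basename(path)} {words}\n")
    return [out_path]


# Dictionaries stand in for remote storage in this example.
remote_inputs = {"in/a.txt": b"one two three"}
remote_outputs = {}
work_item = {"inputs": ["in/a.txt"], "output_prefix": "out/"}
run_work_item(work_item, remote_inputs.__getitem__, remote_outputs.__setitem__, word_count)
print(sorted(remote_outputs))  # ['out/counts.txt']
```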

What’s interesting about the inquiries is that people are asking about the complex priority schedulers that come with traditional frameworks such as Condor, Grid Engine, or Platform LSF. And we’re kind of confounded by the purpose of these things in the EC2 world. Traditionally, the problem is that you have a cluster with N nodes, and you have users that enqueue enough jobs to keep those nodes busy until the end of the millennium. So you need to divide the cluster up into partitions and have complex rules to prioritize jobs on each partition.

Enter Amazon EC2. If user A enqueues a job needing 500 nodes for 10 hours and user B a job needing 800 nodes for 5 hours what do you do? Very simple: you check the balance in their account and then start 500 instances for user A and 800 instances for user B. Done. No priorities, no scheduling, just pure compute fun!

One of us (Ed) observed: the resource that is “allocated” in the finite computer center is the use of hardware, but the resource that is “managed” in a Cloud is cost. It is a new mind set that 1 computer for 100 hours has the same cost as 100 computers for 1 hour. Of course there are details such as start up costs for large numbers of nodes and ensuring that each billed instance hour is fully used. But those details are a small leap when compared to the issue of understanding that 1=100.

I’m sure that there is a role for scheduling software to enable things like running a 5 minute job on 500 nodes without having to pay for 500x 1 hour, or starting the next job on the same set of instances that are still finishing up a few laggard computations on a few instances. But assuming that Amazon can cough up enough instances, the game changes dramatically!
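The cost arithmetic is simple enough to write down. The sketch below assumes billing in whole instance hours and a price of 10 cents per instance hour; that price is an assumption, inferred from the roughly 50 dollars for just under 500 instances in our test run described in this post, not an official figure.

```python
import math

PRICE_CENTS_PER_INSTANCE_HOUR = 10  # assumption, see the note above


def billed_cents(instances, minutes_per_instance,
                 price_cents=PRICE_CENTS_PER_INSTANCE_HOUR):
    """Cost in cents when every started instance hour is billed in full."""
    billed_hours = math.ceil(minutes_per_instance / 60)
    return instances * billed_hours * price_cents


# 1 computer for 100 hours costs the same as 100 computers for 1 hour.
print(billed_cents(1, 100 * 60))  # 1000
print(billed_cents(100, 60))      # 1000

# A 5 minute job on 500 nodes is billed as 500 full hours...
print(billed_cents(500, 5))       # 5000

# ...whereas packing the same 2500 minutes of work into full hours
# needs only 42 instances, if the work can be split that way.
print(billed_cents(42, 60))       # 420
```

The gap between the last two numbers is the kind of saving a scheduler could still deliver in the cloud.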

By way of an example, we have been testing some new features in RightGrid and we wanted to ensure that everything goes smoothly when launching large numbers of instances. So we set up a queue with many thousands of test work items and an array of instances set to ramp up to a max of 500 instances. When we had it all ready to go, we looked at each other and exchanged a “should we do it now?” and a “sure, why not?”. About 20 minutes later we had just under 500 instances running (we have the array set up to launch about 20-25 instances per minute as the queue of tasks keeps growing). Everything ran fine and some 30 minutes later the queue was emptied and the instances were sitting idle waiting for either more work items to appear or the billing hour to reach its end. There was no warning to Amazon (“hey, we’re about to launch 500 instances”), no hiccup, and it cost us all of 50 bucks! We repeated this a few times to try a couple of combinations, all on our schedule.

All this being said, I’m sure there are good reasons to have more sophisticated queueing and scheduling machinery than we have in place today, and perhaps one of the traditional packages can be put to good use. What they seem to lack as far as I can tell is the ability to decide to launch more servers (I guess an email to sysadmin “rack 100 more servers” wouldn’t go down very well). What’s missing is something like “the job queue is sufficiently full, let me launch 10% more servers”, and of course the reverse when the job queue empties. The notion of “full” here doesn’t have to mean that jobs are queued for hours, it may simply mean that there is enough work to be able to fill additional full machine hours.
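Such a rule could be sketched as follows. This is an illustration of the idea, not what RightScale or any of the traditional packages implement: the function name, the thresholds, and the default limits are made up. "Full" is taken to mean that the backlog holds at least one hour of work for every instance after growing, and the fleet shrinks by the same step once the queue is empty.

```python
import math


def desired_instances(current, queued_jobs, minutes_per_job,
                      step_percent=10, minimum=1, maximum=500):
    """Decide the next fleet size from the job backlog (illustrative rule)."""
    step = max(1, math.ceil(current * step_percent / 100))
    backlog_hours = queued_jobs * minutes_per_job / 60

    # Grow only if the backlog can fill a full hour on every instance,
    # including the ones about to be launched.
    if backlog_hours >= current + step:
        return min(maximum, current + step)

    # The reverse: shrink once the queue has emptied.
    if queued_jobs == 0:
        return max(minimum, current - step)

    return current


print(desired_instances(100, queued_jobs=1500, minutes_per_job=6))  # 110
print(desired_instances(100, queued_jobs=500, minutes_per_job=6))   # 100
print(desired_instances(100, queued_jobs=0, minutes_per_job=6))     # 90
```

In practice the shrink side would also want to respect the billing hour, so that instances are only released when their paid hour is nearly used up.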

It’s going to be interesting to see how the cloud will change high performance computing usage patterns and whether Amazon can actually keep up with demand. If you have comments about the thinking above or suggestions I’d love to hear them!

RightScale supports Amazon’s new Simple Queue Service interface

For a few weeks now we’ve been supporting the new programming interface for Amazon’s Simple Queue Service. The new interface comes with better pricing, especially for high volume users that enqueue thousands of messages. See Amazon’s site for details on SQS. The Dashboard now shows queues created with the old interface as well as queues created with the new one and there’s a handy “migrate” button that converts an old-style queue to a new one:

rightscale_sqs_migrate1.png

We’ve also recently released version 1.7 of the Ruby RightAws gem with support for the two SQS interfaces. Please see the usual rubyforge location for details.

RightAws release 1.7.0, including ActiveSDB alpha

We released a new version of our Ruby library gem for accessing AWS Services, including EC2, S3, SQS, and SDB: RightAws 1.7.0 is now available off rubyforge.  This version includes enhancements of the EC2 interface to support Elastic IP addresses, selectable kernels, and availability zones.  It also contains the first alpha release of ActiveSDB, which is a new ActiveResource-like interface for Amazon SDB. If you try out ActiveSDB please let us know what you think!
