RightScale Release: Rackspace Linux ServerTemplate Support, Windows 2008R2, PostgreSQL 9

When a systems or software engineer is learning a new language, a plethora of examples to learn from is invaluable. The cloud currently feels like a new software language to many – new constructs, better tools, rewritten rules. RightScale has always provided training to help people jump into this new world, and this release continues the education.

In it, you will find concrete examples of how to maintain advanced database architectures in the cloud, how to auto-scale Windows .NET applications, and even how to move database information between clouds. With your feedback, these examples will become production solutions that you can extend and modify. Before you know it, you’re a sysadmin rock star for your organization – people will wonder how you accomplish such magic.

Don’t hold back your secrets. ;-)

Rackspace ServerTemplate Support for Linux

On the heels of our Windows ServerTemplate Support for Rackspace, we’re elated to also bring you Linux ServerTemplate support – starting with CentOS. We have a few templates available for you to start with: a Base ServerTemplate that works across many clouds, a LAMP ServerTemplate with backups to CloudFiles, and a Database Manager for MySQL that takes snapshot backups to CloudFiles. These templates are all new and part of the public beta – please provide feedback and let your sales rep know if you’re experimenting.

With this release, you now have most of the Management (monitoring, user controls, auto-scaling, etc.) and Configuration (RightImages, ServerTemplates) aspects of our product available for Rackspace. We’re close to wrapping up our API support as well, and will be adding more RightImages, OSes, and ServerTemplates from here on out.

MultiCloud Magic

Let’s point out something important about the new Database Manager for MySQL that I mentioned in the Rackspace section above: it also works on Amazon.

This is one of our newer templates, specifically designed to adapt to different clouds. It is based on another new MultiCloud ServerTemplate, the Storage Toolbox, which allows you to set up an LVM filesystem on instance drives or attachable volumes. It also helps you take snapshot backups of your filesystem and upload them to the cloud’s object storage (in the MySQL case, to AWS S3 or Rackspace CloudFiles).

For the RightScale User Conference this week in New York, I demonstrated a database server running on the Rackspace Cloud, taking backups to Amazon S3, and restoring to a warm EC2 server.  I could have also gone the other way using Rackspace CloudFiles, or moved between Amazon regions using S3.

Magic? Nope. Here’s a tutorial so you can try it yourself.

CloudStack CentOS RightImages

Let’s move from MultiCloud to multi-hypervisor. With this release, CentOS RightImages are now available on all popular hypervisors for Cloud.com CloudStack clouds. We have RightImages for KVM, XenServer, and VMware ESX. Contact your sales rep for access to these private cloud images.

Database Manager for PostgreSQL

Many of you asked for a Database Manager for PostgreSQL 9 since replication issues from previous Postgres versions have been resolved. Well, the team took the structure of our MySQL Manager on Amazon and managed to replace MySQL with PostgreSQL – so you’re in luck! You get full master-slave support, use of EBS volumes, assisted DNS failover, and more. Read the setup guide to get started.

Windows

Earlier this year, we released our first ever Database Manager for Microsoft SQL Server. We received great feedback and, in response, added smart volume configuration with best-practice disk configuration for system and user databases. Now, not only are master, msdb and tempdb on EBS volumes, but we also create default locations so your database data and log files are directed to separate EBS volumes. We coordinate simultaneous volume snapshots so we have sane and consistent backups that can be used for disaster recovery purposes with built-in restore RightScripts. We also optimally configure SQL Server for you, enabling mixed authentication mode and creating as many tempdb data files as there are CPUs on the server. Check out the beta for our Manager for Microsoft SQL Server.
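As a rough illustration of the tempdb step described above, the sketch below generates one ALTER DATABASE tempdb ADD FILE statement for each data file needed to reach one file per CPU. It is not the RightScript itself: the logical file names, the directory T:\tempdb, and the 64 MB initial size are assumptions chosen only for the example.

```python
from pathlib import PureWindowsPath


def tempdb_add_file_statements(cpu_count, data_dir, existing_files=1, size_mb=64):
    """Return T-SQL statements that bring tempdb up to one data file per CPU.

    All names, paths, and sizes are illustrative assumptions,
    not RightScale defaults.
    """
    statements = []
    # tempdb ships with one data file, so start numbering after the existing ones.
    for n in range(existing_files + 1, cpu_count + 1):
        logical_name = f"tempdev{n}"
        file_path = PureWindowsPath(data_dir) / f"{logical_name}.ndf"
        statements.append(
            "ALTER DATABASE tempdb ADD FILE "
            f"(NAME = N'{logical_name}', FILENAME = N'{file_path}', SIZE = {size_mb}MB);"
        )
    return statements


# Hypothetical usage: a 4-CPU server that still has only the default data file.
for statement in tempdb_add_file_statements(cpu_count=4, data_dir=r"T:\tempdb"):
    print(statement)
```

Generating the statements rather than executing them keeps the sketch independent of any particular SQL Server client library.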

We’re also pleased to announce the Microsoft IIS Application Server, which can be used in an Array to serve as an auto-scaling .NET application tier. The ServerTemplate has built-in PowerShell-based RightScripts to register with either the AWS Elastic Load Balancer or HAProxy. In a similar vein to our other Application Servers in the MultiCloud Marketplace, this ServerTemplate will automatically download and deploy your application code and connect to a local or remote database server. Together with the Database Manager for Microsoft SQL Server, you can quickly get a multi-tier .NET app up and running in the cloud – get started with this setup guide.

Of course, both of these ServerTemplates are powered by RightScale RightImages.  We’ve enhanced our RightImages as part of this release to include support for Windows Server 2008 R2, bringing our total number of RightImages on Windows to 70!

There’s More!

We’ve added Nginx-based PHP and Rails Application servers too. To see the rest, please read the May and June release notes for details and starting points.

Enjoy!

Posted in EC2, Rackspace, Releases

Commercial Support for OpenStack on the Horizon

A change that was very palpable at the recent OpenStack conference is that a number of major industry players are readying commercial offerings around implementing OpenStack clouds. Today Citrix officially threw its hat into the ring, announcing “Project Olympus”, which lets any customer build a private or public cloud based on “a Citrix-certified version of OpenStack and a cloud-optimized version of Citrix XenServer”. Citrix is also working closely with Dell and Rackspace in their offering to provide reference architecture and hardware as well as deployment services. The top-level bit here is that the commercial side of OpenStack continues to see healthy growth and there is little doubt that there will be a number of solid commercial offerings around OpenStack soon. Where there’s real cloud usage, of course, you’ll also find RightScale – and we’re working closely with Citrix and other OpenStack providers to share our experience and enable RightScale support for their cloud offerings.

Citrix’s OpenStack announcement is especially notable given that it comes from the company that provides the hypervisor that has the longest history and powers more virtual servers than any other in the cloud today: Xen. So it will be interesting to see what it means to have a “cloud optimized version of XenServer” under the covers of Citrix’s OpenStack. That also brings another question to the forefront: what will it mean to have several flavors of OpenStack? Citrix uses the phrase “version of OpenStack”, Jim Curry (Rackspace) uses the phrase “OpenStack distribution”, and Barton George (Dell) also uses “OpenStack Distro.” It is clear they’re not just talking about a little packaging since, for example, Citrix states “Project Olympus will come pre-integrated with the Citrix Cloud Networking fabric.” In other words, it will have functionality different from ‘stock’ OpenStack.

From a selfish point of view, I’m wondering how many versions or distros of OpenStack we will have to support and how compatible they will be with one another. Of course, to be fair, other private cloud offerings also contain variants, such as multiple networking modes that differ substantially from one another, and we support those. This form of flexibility is a clear need. But for the larger community it will be interesting to see how things play out. At this point, my expectation is that this represents healthy differentiation and innovation in the OpenStack community, and we’ll continue to work with the various vendors to ensure we can support the architectures they’re implementing.

With the advent of these commercial OpenStack offerings, we’re witnessing a new emergence of reference architectures and an ecosystem of major players who can deliver complete IaaS solutions to enterprises and service providers who want to stand up and deliver clouds that can be managed by RightScale. Ultimately, that means more choice for our customers.

Posted in EC2

See you at the RightScale User Conference in NYC on June 8th

As the cloud computing landscape continues to rapidly evolve, RightScale is committed to helping you take advantage of the latest developments. These developments come from RightScale directly, our technology partners and from you – our customers.  Because cloud computing is moving at such a quick pace we decided early on to create an environment to share experiences and learnings.  That’s why we began hosting a RightScale User Conference two years ago and will now be hosting our 4th Conference on June 8th in New York City.

The best thing about working closely with our customers over the past four years is that we’ve been able to learn from your experience and use it to guide us in building an even better cloud management platform. So we thought it made perfect sense to focus this year’s conference around the theme “Real Cloud Experience. Shared.” Our agenda is packed with sessions from the RightScale team and presentations from our customers, partners, and Forrester Research, Inc. Come spend the day with us to find out what new stuff RightScale has cooking, gather insight from our customers and Forrester, and attend in-depth breakout sessions from RightScale and our partners. We’ll then cap the day with a RightScale-hosted cocktail at the hip Gansevoort Hotel.

The final agenda is up and registration is open – check out all the sessions and get your free pass today!  As an added benefit, if you register for our conference you also get a free pass to the Cloud Expo happening that same week in NYC. Hope to see you there!

Posted in EC2

RightScale Dashboard Release: Japanese, More Widgets, and -your idea here-

The RightScale customer community is growing rapidly and we always appreciate your feedback and support. We’re bringing RightScale users together for our 4th RightScale User Conference in New York in June. Customers are flying in from all over the world to share in our vision for Cloud Computing. Please join us!

We’ve just put out a new release of the Dashboard, and with it, we’re expanding our ability to meet the needs of this global community.

First, the initial Japanese translation for the Dashboard is now available. With it, we built a foundation for many more translations to be provided to and by the community in the future. For now, you can switch to Japanese by toggling the language in the lower left-hand corner of the Dashboard (thanks to our team in Japan for your hard work!).

Next, we’ve released a couple of features that customers have been asking for. You can now associate AWS ELB and RDS services with RightScale Deployments so you can view your whole AWS system on one page. We’ve also made it possible to add Cluster Monitoring heat maps and stacked graphs to your Dashboard.

On that note, we’re making it even easier for our community to partner with us to create the best Cloud Management Platform in the world. Today, we launched feedback.rightscale.com, a place where you can submit and vote on ideas to improve the RightScale products. In addition, for any feature you show interest in, we’ll tell you the instant it becomes available.

We’ve seeded it with a few recent requests already, so start voting…

Finally, let’s not forget you can already submit your ServerTemplate ideas… as ServerTemplates! We’ve launched our second ServerTemplate Showdown, where you can win prizes by simply publishing the ServerTemplates you use every day. This spring, the grand prize is a 4-day trip to Santa Barbara… bring your surfboard, not your laptop. Read more details about the ServerTemplate Showdown.

What’s not obvious in the release? We’ve made a number of improvements in the back-end to enable better scaling of our service and to allow us to reduce the impact of releases. This work will be ongoing and we hope to be able to show the benefits soon.

Read the full Dashboard Release Notes for a complete list of new features and changes in this release.

Posted in Releases

AWS outage follow-up: if you wanted details, you got details!

A week after the April 21st, 2011 outage, AWS posted a detailed post mortem explanation of what happened. It’ll be interesting to see how everyone digests the very detailed account. Since AWS did not provide an executive summary, I’ll try my hand at one:

The outage was triggered by an operator error during a router upgrade which funneled very high-volume network traffic into a low-bandwidth control network used by EBS (Elastic Block Store). The resulting flooding of the control network caused a large number of EBS servers to be effectively isolated from one another, which broke the volume replication, and caused these servers to start re-replicating the data to fresh servers. This large-scale re-replication storm in turn had two effects: it failed in many cases causing the volumes to go offline for manual intervention, and it flooded the EBS control plane with re-replication events that affected its operation across the entire us-east region.

The steps taken by AWS to regain control started by stopping the re-replication attempts to quiesce the system and prevent new volumes from being drawn into the outage. AWS then isolated the affected availability zone from the EBS control plane to restore normal operation in other zones. Finally, AWS started to recover volumes by adding storage capacity to allow the re-replication to succeed where possible, by restoring data from snapshots on S3, and finally by manually restoring data. Ultimately 0.07% of the volumes could not be restored to a consistent state.

The Relational Database Service (RDS) was also affected by the outage. 45% of single-availability-zone databases in the affected availability zone went down because each database server stripes data across multiple EBS volumes, with the result that one stuck volume halts the entire database. A number of multi-AZ RDS databases whose master server was in the affected zone failed to fail over because of a bug in the fail-over process.
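To see why striping amplifies the impact of stuck volumes, here is a small back-of-the-envelope sketch. It assumes each volume becomes stuck independently with the same probability, which is a simplification (failures during the outage were clearly correlated), and the 5% figure is a made-up input, not a number from the post mortem.

```python
def striped_outage_probability(p_volume_stuck, volumes):
    """Probability that a database striped across `volumes` EBS volumes halts.

    The database halts if at least one of its volumes is stuck. Assumes
    independent, identically likely volume failures (a simplification).
    """
    if not 0.0 <= p_volume_stuck <= 1.0:
        raise ValueError("p_volume_stuck must be between 0 and 1")
    if volumes < 1:
        raise ValueError("volumes must be at least 1")
    return 1.0 - (1.0 - p_volume_stuck) ** volumes


# Hypothetical input: 5% of volumes in the zone are stuck.
for stripe_width in (1, 2, 4, 8):
    probability = striped_outage_probability(0.05, stripe_width)
    print(f"{stripe_width} volume(s): {probability:.1%} chance the database halts")
```

The wider the stripe, the more likely it is that at least one member volume is among the stuck ones, so the database-level outage rate is always at least as high as the volume-level rate.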

The post mortem lists a number of system improvements that AWS is working on. These primarily target improving the resiliency of EBS when replication fails as well as improving the tools created and used during the outage to recover from the situation. Customer communication improvements, especially regarding the frequency of updates, are also listed, and AWS is crediting affected users a significant fraction of this month’s charges, which is way beyond anything covered in its SLAs.

It is interesting to see how a network configuration error caused such a chain reaction within the EBS system. The outage trigger really is pretty incidental; a similar set of events could probably have been triggered by something else as well. The measures taken by AWS to contain and repair the outage highlight the deep technical expertise and full mastery of the entire software and hardware stack at AWS. Clearly deep code changes were made and sophisticated recovery tools were written 24×7 under the pressure of the outage, without which the situation most likely would have spun completely out of control.

The impact of the outage, the public reaction, and the measures necessary to control it show the scale at which AWS operates. It is pretty clear that this type of outage is part of growing the service to unprecedented scale. I find it amazing that this type of outage, where the sophisticated systems necessary to provide cloud computing at scale fail massively, didn’t happen years ago. This is a testament to AWS’s sophistication.

The outage summary exposes interesting technical details about the architecture of the services that AWS has kept confidential until now. However, more than providing information to competitors, I believe that it provides education to cloud customers. All cloud providers who are planning world-wide cloud roll-outs absolutely must understand the power of and the need for availability zones in a region and isolation between regions (or equivalent constructs to “differentiate” from AWS). Without that redundancy and isolation, the question has now become crystal clear: “how can we sell that to customers?”

An aspect of EBS durability which is not often mentioned is the role of snapshots during recovery. The EBS product description states “the durability of your volume depends both on the size of your volume and the percentage of the data that has changed since your last snapshot.” Here’s what this means. Suppose there are two copies of the volume (i.e. mirroring) and one fails. A fresh mirror can then fetch data contained in snapshots from S3 (which is itself replicated) but must retrieve other data from the single remaining copy, which may itself fail or become unreachable. Sadly, the performance impact of taking a snapshot is such that most of our customers with high-volume databases cannot snapshot the master DB volume. Please fix that, AWS!
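The relationship between snapshots and recovery can be made concrete with a little arithmetic. The sketch below is only a simplified model of the mirroring scenario just described, with a hypothetical volume size and change percentage; it says nothing about how EBS is actually implemented.

```python
def recovery_sources(volume_gb, changed_percent):
    """Split a volume by where a fresh mirror could re-fetch its data.

    Data unchanged since the last snapshot can come from S3; data changed
    since then exists only on the single surviving copy and is at risk.
    """
    if volume_gb < 0 or not 0 <= changed_percent <= 100:
        raise ValueError("volume_gb must be >= 0 and changed_percent in 0..100")
    at_risk_gb = volume_gb * changed_percent / 100
    return {
        "from_snapshot_gb": volume_gb - at_risk_gb,
        "from_surviving_copy_gb": at_risk_gb,
    }


# Hypothetical example: a 100 GB volume with 20% of its data changed
# since the most recent snapshot.
print(recovery_sources(100, 20))
```

Both a larger volume and a staler snapshot increase the amount of data that depends on the one surviving copy, which is why frequent snapshots matter for durability.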

An item missing from the remedies list, in my opinion, is EBS performance improvement. Better performance would have helped in the outage. Specifically, I’d like AWS to reduce the impact of snapshots on volume performance so customers can actually snapshot high-volume servers, and to improve the performance of volumes so customers don’t have to stripe across multiple volumes, which reduces availability (as it did with RDS).

I also am not satisfied with the communication improvements AWS proposes. I was fine with the frequency of status updates because it was clear that the EBS team was on top of it and didn’t have much new to report. I would like to see improved responsiveness so we don’t have to open a ticket before something shows up on the status page. But foremost I would like better content in the status updates. I’d like to be constructive, so I’ll make it concrete. Here is some of what I would have liked to see (I naturally have to make some assumptions about what was concluded when within AWS):

  • explicit mention that the initial network event was contained; status updates kept talking about “increased latencies”, which made it unclear whether there was a general ongoing network issue
  • clear statement that the outage revolved around EBS and noting the impact on launching servers from EBS images, but also stating that there was no impact on servers not using EBS
  • clear statement that certain API calls were disabled instead of vaguely referring to “increased error rates affecting EBS CreateVolume API calls”
  • timely reporting, e.g., the post mortem states “by 5:30 AM PDT, error rates and latencies again increased for EBS API calls across the Region” while the status updates only mentioned this at 7am
  • the fact that the outage was due to failed EBS volumes as opposed to just connectivity or latency issues accessing the volumes was only reported at 8:54am, yet this is a crucial piece of information
  • the status updates never made it clear that EBS volumes continued to fail after the initial event, nor did they mention when this infection was halted
  • the isolation of the other availability zones from the “affected one” was reported several hours after it was put in place
  • it would have been useful to see some relative numbers, such as % of volumes deemed operational, % being recovered automatically soon, % slated for later manual recovery; best would have been emails to users with specific volume IDs

I’m sure that some of the items above weren’t quite as obvious at the time and in the heat of the moment it’s always difficult to determine what to say. But there is no question that the status updates were filled with vague terms, such as “increased latencies”, “moderate increase in error rates”, “affected availability zone”, “a network event”, etc. Perhaps foremost, it was not until 8 hours after the onset of the outage that AWS made it clear that volumes in the affected zone weren’t going to return to normal for hours to come. Up to that point it seemed that everything could return to normal any minute. This lack of clarity made it much harder for users to make the right decisions promptly.

On the public reaction front, while I understand it, I’m still baffled by reporters stating that the unrecoverable loss of 0.07% of volumes is a fundamental problem. This is equivalent to complaining about users losing data because their RAID array failed (which happens all the time, from operator error to a 6ft drop due to an earthquake). Users who lost data and were not aware of the risk they were taking need to seriously reflect on what they’re doing (and get help as appropriate).

This episode provides a key lesson to all cloud companies regarding architecting to withstand failure, and communicating with customers when failures do occur. While RightScale got through the outage relatively unscathed, we are working to improve on both those fronts ourselves. And we intend to continue to work with customers to enable AWS as well as other providers with independent, best-practice solutions that are resilient and highly available.

Posted in AWS, Cloud Computing, EC2

RightScale-Ready Ubuntu 11.04 Amazon AMIs

Ubuntu 11.04 Natty Narwhal is released, and there’s something in it for RightScale users!

We’ve been working with our partners at Canonical to make it possible to use Ubuntu AMIs out of the box with RightScale. This means you can start playing with Natty Narwhal 11.04 in RightScale today!

How does it work?

The method was pioneered by Eric Hammond, who helps maintain Ubuntu and Debian AMIs for EC2 along with Scott Moser. An AMI is set up to fetch the user-data when an EC2 instance launches, and execute any scripts delineated by a special format. Canonical has incorporated this into the Ubuntu releases as their cloud-init software.

In the latest Natty Narwhal 11.04 release, RightScale’s RightLink software is now compatible with the Canonical cloud-init. Furthermore, Canonical’s officially supported AMIs for Natty Narwhal 11.04 are capable of configuring RightLink automatically.
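The boot-time mechanism can be sketched roughly as follows. This is a simplified illustration, not cloud-init's or RightLink's actual implementation: it fetches user-data from the EC2 instance metadata service and runs it only if it begins with a "#!" line. Treating the "special format" as a plain shebang check is an assumption made for the example (cloud-init recognizes several other formats), and the function names are hypothetical.

```python
import os
import subprocess
import tempfile
import urllib.request

# EC2 instance metadata endpoint for user-data; reachable only from inside an instance.
USER_DATA_URL = "http://169.254.169.254/latest/user-data"


def fetch_user_data(url=USER_DATA_URL, timeout=5):
    """Return the raw user-data bytes (raises if the instance has none)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def looks_like_script(user_data):
    """Treat user-data as executable only if it starts with a shebang line."""
    return user_data.startswith(b"#!")


def run_user_data_script(user_data):
    """Write the script to a private temp file, run it, and return its exit code."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=".user-data") as handle:
        handle.write(user_data)
        script_path = handle.name
    try:
        os.chmod(script_path, 0o700)
        return subprocess.run([script_path], check=False).returncode
    finally:
        os.unlink(script_path)


def run_boot_hook():
    """Hypothetical entry point an image would call once at first boot."""
    user_data = fetch_user_data()
    if looks_like_script(user_data):
        return run_user_data_script(user_data)
    return None  # Not a script: leave the user-data for other consumers.
```

Nothing here runs at import time; an image would call run_boot_hook() from an init script early in the boot sequence.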

We’ve created MultiCloud Images supporting all AWS regions.

We’re sure this will benefit Ubuntu customers and partners, and we hope this method becomes more widely adopted by other virtual machine image creators.

Our hats off to the Ubuntu community for your continued advances in the cloud.

Posted in EC2, Releases, RightImage, Ubuntu

Amazon EC2 outage: summary and lessons learned

Last Thursday’s Amazon EC2 outage was the worst in cloud computing’s history. It made the front page of many news sites, including the New York Times, probably because many people were shocked by how many web sites and services rely on EC2. Seeing so much affected was a very graphic illustration of how pervasive cloud computing has become.

I will try to summarize what happened, what worked and didn’t work, and what to learn from it. I’ll do my best to add signal to all the noise out there. In that respect I liked a tweet by Beaker (Christofer Hoff): “Happy with my decision NOT to have written a blog about the misfortune of AWS, stating nothing but the obvious & sounding like a muppet”.

Executive summary

  • The Amazon cloud proved itself in that sufficient resources were available world-wide such that many well-prepared users could continue operating with relatively little downtime. But because Amazon’s reliability has been incredible, many users were not well-prepared, leading to widespread outages. Additionally, some users got caught by unforeseen failure modes rendering their failure plans ineffective.
  • Some ripple effects within EC2 and in particular EBS caused by the initial failure should not have happened. There’s important work Amazon needs to do to prevent such occurrences.
  • Amazon’s communication, while better than during previous outages, still earns an F. This is probably the #1 threat to AWS’s business.
  • The cloud architecture provides ample opportunities to design systems to withstand failures. The material cost of such designs is a fraction of what comparable measures would cost using traditional hosting means. However, designing, building, and testing everything is not cheap. Many of our customers who used our best practices fared well (I’m not claiming we’re perfect or that everything is automatic!) and we got numerous calls from other companies that were wholly unprepared.
  • Overall this is just one of many bumps in the cloud computing road. It reminds us that this is still “day one” of the cloud and that we all have much to learn about building and operating robust systems on a large scale. We are receiving a stream of calls from EC2 users that realize they need help in setting up a more robust architecture for their systems.

Outage analysis

At the time of writing Amazon has not yet posted a root cause analysis. I will update this section when they do. Until then, I have to make some educated guesses.

We got the first alerts at 1:01am on Thursday: the proverbial Christmas lights lit up, indicating I/O issues on a large number of our servers. We started failing servers over and opened a ticket with Amazon. They finally posted a status message at 1:41am containing no useful details. Sadly, this is a typical sequence of events.

It appears that a major network failure was the initial cause of problems but that the real damage happened when EBS (Elastic Block Store) volume replication was disrupted. We did some extrapolations and concluded that there must have been on the order of 500k EBS storage volumes in the affected availability zone. It appears that a significant fraction of the volumes concluded that the replication mirroring was out-of-sync and started re-replicating, causing further havoc, including an overload of the EBS control plane. It is also possible that the EBS replication problem was the root cause and that the network issues were a consequence; hopefully Amazon’s root cause analysis will shed light on this.

The biggest problem, from my point of view, was that more than one availability zone was affected. We didn’t see servers or volumes fail in other zones but we were unable to create fresh volumes elsewhere, which of course makes it difficult to move services. This is “not supposed to happen” and is an indication that the EBS control plane has dependencies across zones. Amazon did manage to contain the problem to one zone approximately 3 hours after the onset.

After Amazon managed to contain the problems to one zone, it took a very long time to get the EBS machinery under control and to recover all the volumes. Given the extrapolated number of volumes it would not be surprising that an event of this scale exceeded the design parameters and was never tested (or able to be tested). I’m not sure there is any system of comparable scale in operation anywhere.

I do want to state that while “something large” clearly failed, namely the EBS system as a whole, the real big failure is that multiple availability zones were affected for ~3 hours. I also want to mention two important things that didn’t fail: we didn’t see capacity constraints in relaunching servers in other zones after the initial cross-zone issues and we didn’t see other regions affected at all. This is clearly good news!

Amazon communication failure

In my opinion the biggest failure in this event was Amazon’s communication, or rather lack thereof. The status updates were far too vague to be of much use and there was no background information whatsoever. Neither the official AWS blog nor Werner Vogels’ blog had any post whatsoever 4 days after the outage! Here is a list of improvements for Amazon:

  • Do not wait 40 minutes to post the first status message!
  • Do not talk about “a small percentage of instances/volumes/…”, give actual percentages! Those of us with many servers/volumes care whether it’s 1% or 25%, we will take different actions.
  • Do not talk about “the impacted availability zone” or “multiple availability zones”, give each zone a name and refer to them by name (I know that zone 1a in each account refers to a different physical zone, so give each zone a second name so I can look it up).
  • Provide individualized status information: use email (or other means) to tell us what the status of our instances and volumes is. I don’t mean things I can get myself like cpu load or such, but information like “the following volumes (by id) are currently recovering and should be available within the next hour, the following volumes will require manual intervention at a later time, …”. That allows users to plan and choose where to put their efforts.
  • Make predictions! We saw volumes in the “impacted availability zone” getting taken out many hours after the initial event. I’m sure you knew that the problem was still spreading and could have warned everyone. Something like: “we recommend you move all servers and volumes that are still operating in the impacted availability zone [sic] to a different zone or region as the problem is still spreading.”
  • Provide an overview! Each status update should list which functions are still affected and which have been repaired, don’t make everyone scan back through the messages and try to infer what the status of each function is.
  • Is it so hard to write a blog post with an apology and some background information, even if it’s preliminary? AWS tweeters that usually send multiple tweets per day remained silent. I’m sure there’s something to talk about 24 hours after the event! Don’t you want to tell everyone what they should be thinking instead of having them make it up???

Coverage from around the web

Since Amazon did not communicate much of substance beyond the rather sparse and obscure status updates everyone else was left to speculate. Most of the blog posts or news articles contained little information. Here’s a list of blog posts that I found interesting:

Lessons learned

Our services team handled 4x the incident volume last Thursday compared to a normal Thursday. A large number of callers needed help in assessing the situation or in bringing their servers back up. A typical request was: “It looks like my db server is down due to the outage, can you help confirm and assist with a migration?” Unfortunately we also heard from a good number of users who were using a single availability zone or didn’t set up redundancy properly. Hindsight is always 20-20.

A clear lesson for everyone is obviously that backup and replication have to be taken seriously (duh). In EC2 this means live replication across multiple availability zones and backups to S3 (and ideally elsewhere also). It has also become clear that a minimum number of replicas must be running and a certain degree of over-provisioning is necessary to handle the load spike after a massive failure. Adrian Cockcroft from Netflix summarized their strategy in a tweet a while ago: “Deploy in three AZ with no extra instances – target autoscale 30-60% util. You have 50% headroom for load spikes. Lose an AZ -> 90% util.” (Also see the discussion around the tweet.) Users that relied on launching fresh servers or on creating fresh volumes from snapshots were not able to do so for several hours. The only previous event that I remember where multiple availability zones were affected was the July 20th, 2008 S3 outage that took down S3 in the US and EU (multiple regions!).
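The arithmetic behind that tweet is easy to check. The sketch below assumes load is spread evenly across zones and is redistributed evenly over the surviving zones when one is lost; the function name is made up for illustration.

```python
def utilization_after_az_loss(util_percent, zones, lost=1):
    """Per-server utilization after `lost` zones disappear.

    Assumes identical capacity in every zone and that the same total load
    is spread evenly over the remaining zones.
    """
    if not 0 < lost < zones:
        raise ValueError("lost must be at least 1 and fewer than zones")
    return util_percent * zones / (zones - lost)


# The tweet's range: 30-60% target utilization across three zones.
for target in (30, 60):
    after = utilization_after_az_loss(target, zones=3)
    print(f"{target}% across 3 AZs -> {after:.0f}% across the remaining 2")
```

At the top of the target range, three zones running at 60% become two zones running at 90%, which matches the “Lose an AZ -> 90% util” figure and shows why the autoscale ceiling cannot be set much higher without extra instances.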

A number of blogs mention NoSQL databases as a solution to the replication and failure difficulties with traditional relational databases. While we’ve started to use Cassandra ourselves, it has become pretty clear to me that this is not a silver bullet by a long shot. When a single node fails, the built-in replication and recovery functions well, although the extra load on remaining nodes is high when the failing node is repaired and resynchronizes. But when large numbers of nodes in the cluster lock up one by one over the course of an hour, I’d be hesitant to make a prediction about the outcome both in terms of the cluster’s availability and its consistency. We have two applications that make very different use of Cassandra and the behavior of the database is very different in both cases. My conclusion from what I have observed thus far is that clusters of replicated eventually-consistent NoSQL stores have pretty complex dynamics that can easily lead to unpleasant surprises. Sometimes it’s nice to have a comparatively simple MySQL master-slave set-up that experiences some downtime during the fail-over but acts very predictably.

I can’t help but feel uncomfortable about the performance of Amazon’s RDS “database-as-a-service” in that some databases that were replicated across multiple availability zones did not fail over properly. It evidently took more than 12 hours to recover a number of the multi-AZ databases. The obvious failure here is compounded by the fact that Amazon has made it difficult for users to back up their databases outside of RDS, leaving them no choice but to wait for someone at Amazon to work on their database. This lock-in is one reason many of our customers prefer to use our MySQL master-slave setup or to architect their own.

The biggest lesson we learned about operating RightScale itself is that we have to continue pushing hard on reducing the load on our central MySQL database and distributing our service. The database has grown too big and failover consequently takes too long because it takes forever to load the working set (over 30GB) into memory. We have some short-term measures we will be implementing to reduce the failover time, but more is needed. We also need to provide our users a choice of RightScale systems located in different regions and clouds: users operating primarily out of one region need to be able to use RightScale in an independent region or cloud. Ironically the first thing every public cloud operator and every company with a private cloud asks us is whether we can run RightScale inside their cloud: that seems pretty misguided to me!

We also were confused by Amazon’s status messages. In hindsight we should have intentionally failed over our master database, which was operating in the “impacted availability zone”, early on at a time when we could minimize downtime. We were lucky that it didn’t get affected until about 12 hours after the start of the outage, but we didn’t put two and two together. A clear message from Amazon that more and more volumes were continuing to fail in the zone would have been really helpful.

What’s next?

With Amazon’s overall stellar operating reliability it is easy to become complacent. This outage was a wake-up call for many of us. What remains to be seen is whether Amazon decides to take a lead and provide more granular descriptions of failure modes and recommended actions or whether they will leave it to everyone else to guess and figure it out. I see this as being one of the main long-term problems of cloud computing, namely that it is extremely difficult for users to list the possible failure modes and even more difficult to actually test any of them.

In the big picture I find Lew Moorman’s analogy in the NYT article very appropriate: “The Amazon interruption was the computing equivalent of an airplane crash. It is a major episode with widespread damage. But airline travel is still safer than traveling in a car — analogous to cloud computing being safer than data centers run by individual companies. Every day, inside companies all over the world, there are technology outages, each episode is smaller, but they add up to far more lost time, money and business.” Most of the articles that predict a run away from cloud computing fail to explain where to run to. Unless you can hire Superman to run your private datacenters, my experience tells me that you’ll be worse off.

Posted in AWS, Cloud Computing, EC2 | 51 Comments

Zend publishes PHP PaaS on RightScale

It looks like 2011 is shaping up as the year of PaaS, and the notion of what a PaaS is has started to stretch out a bit. I used to think of PaaS as being what Heroku or Google App Engine offer: a full compute service based around a language framework (or several) to which developers upload code for deployment. It is becoming clear that the “as a Service” aspect can take on a number of flavors. It may be a 3rd party company that offers the service. Or it may be in-house IT at the corporate level or at the departmental level. Or you may offer the service to yourself, so to speak, simply as a way to make it easier to deploy your own apps. In the end, PaaS to me means two main benefits: a standardized deployment model, which implies a standardized language framework, and resource sharing. The standardization reduces the friction between development and operation. The resource sharing can reduce costs. In the big picture, this is what cloud computing is about: commoditization, for which standardization is an essential prerequisite, and resource sharing. PaaS is one way to standardize; our ServerTemplates are another, at a slightly different level.

We’ve been working with Zend for a long time and it’s gratifying to see the Zend PHP Solution Pack offered on RightScale graduate to a full Zend PHP PaaS offering. Zend PHP is important to us because 37% of our customers use PHP, which is to be expected given that apparently roughly a third of the web runs on PHP and there are 4 million PHP developers! We’re also seeing our customers combine IaaS and PaaS: some portion of their overall system runs within a PaaS framework and then “punches out” to or combines with other services that don’t fit the PaaS mold. We believe this is where RightScale can really shine: provide the flexibility to deploy and operate a wide variety of services in the cloud.

The Zend PHP Solution Pack is a PaaS offering that consists of a multi-server cluster that includes a number of Zend application servers, a Zend cluster manager, load balancers and a MySQL master/slave pair. This configuration – runnable today on Amazon Web Services and in the future on other clouds, including private clouds – provides a production-ready high availability environment that will auto-scale up and down as required. We worked with Zend to make it easy for you to stand up a standardized PaaS environment on whatever cloud you choose, then deploy PHP applications to your heart’s content! The solution pack includes RightScale’s premium onboarding service – a step-by-step coaching program for deploying on the cloud using best practices backed by RightScale and Zend. Find out more on our website or attend the joint webinar next week, Thursday, April 28th, at 11am PT: register here. Also, Andi Gutmans, Zend’s CEO, wrote a nice blog post on our joint offering.

Posted in Cloud Computing, PaaS | 1 Comment

Cloud Foundry Architecture and Auto-Scaling

Yesterday’s blog post mostly covered the benefits of VMware’s Cloud Foundry PaaS and how it fits with RightScale. Today I want to dive a little into the Cloud Foundry architecture and highlight how IaaS and PaaS really are complementary. I’m really hoping that more PaaS options will become available so we can offer our users a choice of PaaS software.

Cloud Foundry Architecture

From a technical point of view I see two main innovations in Cloud Foundry. The first is that the software is released as an open source project with an Apache license, which gives users and third-parties access to make customizations and to operate Cloud Foundry on their own. The second is that Cloud Foundry is very modular and separates the data path from the control plane, i.e. the components that make user applications run from the ones that control Cloud Foundry itself and the deployment and scaling of user applications. The reason the latter innovation is significant is that it really opens up the door to innovate on the management of the PaaS as well as integrate it into existing frameworks such as RightScale’s Dashboard.

Enough prelude. The pieces that make up Cloud Foundry are:

  • At the core the app execution engine is the piece that runs your application. It’s what launches and manages the Rails, Java, and other language app servers. As your app is scaled up more app execution engines will launch an app server with your code. The way the app execution engine is architected is nice in that it is fairly stand-alone. It can be launched on any suitably configured server, then it connects to the other servers in the PaaS and starts running user applications (the app execution engines can be configured to run a single app per server or multiple). This means that to scale up the PaaS infrastructure itself the primary method is to launch more suitably configured app execution engines, something that is easy to do in a RightScale server array!
  • The request router is the front door to the PaaS: it accepts all the HTTP requests for all the applications running in the PaaS and routes them to the best app execution engine that runs the appropriate application code. In essence the request router is a load balancer that knows which app is running where. The request router needs to be told about the hostname used by each application and it keeps track of the available app execution engines for each app. Request routers are generally not scaled frequently, in part because DNS entries point to them and it’s good practice to keep DNS as stable as possible, and also because a small number of request routers go a long way compared to app execution engines. It is possible, however, to place regular load balancers in front of the request routers to make it easy to scale them without DNS changes.
  • The cloud controller implements the external API used by tools to load/unload apps and control their environment, including the number of app execution engines that should run each application. As part of taking in new applications it creates the bundles that app execution engines load to run an application. A nice aspect of the cloud controller is that it is relatively policy-free, meaning that it relies on external input to perform operations such as scaling how many app execution engines each application uses. This allows different management policies to be plugged-in.
  • A set of services provide data storage and other functions that can be leveraged by applications. In analogy with operating systems these are the device drivers. Each service tends to consist of two parts: the application implementing the service itself, such as MySQL, MongoDB, redis, etc., and a Cloud Foundry management layer that establishes the connections between applications and the service itself. For example, in the MySQL case this layer creates a separate logical database for each application and manages the credentials such that each application has access to its database.
  • A health manager is responsible for keeping applications alive and ensuring that if an app execution engine crashes the applications it ran are restarted elsewhere.

All these parts are tied together using a simple message bus, which, among other things allows all the servers to find each other.
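To make the request router's role concrete, here is a minimal sketch of "a load balancer that knows which app is running where". Everything in it is an assumption for illustration: the class and method names are hypothetical, registration is shown as direct method calls rather than messages on the bus, and plain round-robin stands in for whatever "best engine" selection the real router performs.

```python
class RequestRouter:
    """Toy routing table: hostname -> app execution engines running that app."""

    def __init__(self):
        self._engines = {}  # hostname -> list of engine addresses
        self._next = {}     # hostname -> round-robin position

    def register(self, hostname, engine):
        """An engine announces that it has loaded the app for this hostname."""
        engines = self._engines.setdefault(hostname, [])
        if engine not in engines:
            engines.append(engine)

    def unregister(self, hostname, engine):
        """An engine stops serving the app (shutdown, crash, scale-down)."""
        engines = self._engines.get(hostname, [])
        if engine in engines:
            engines.remove(engine)

    def route(self, hostname):
        """Pick an engine for one request, or None if the app runs nowhere."""
        engines = self._engines.get(hostname)
        if not engines:
            return None
        index = self._next.get(hostname, 0) % len(engines)
        self._next[hostname] = index + 1
        return engines[index]


router = RequestRouter()
router.register("blog.example.com", "10.0.0.5:8080")
router.register("blog.example.com", "10.0.0.6:8080")
print([router.route("blog.example.com") for _ in range(3)])
```

Scaling an application up or down then amounts to engines registering or unregistering for its hostname, and the router itself never has to change.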

Auto-scaling Cloud Foundry

“So, does it auto-scale?” seems to be the question everyone asks. (I wonder who started this auto-scaling business? ;-) ) The answer is “no, but trivially so”. There are actually two levels at which Cloud Foundry scales, whether automatically or not. The first is at the Cloud Foundry infrastructure level, e.g. how many app execution engines, how many request routers, how many cloud controllers, and how many services there are. The second level is at the individual application level and is primarily expressed in how many app execution engines are “running” the application (really, how many have the application loaded and are accepting requests from the request router).

The first level of scaling the Cloud Foundry infrastructure is the responsibility of the PaaS operator. The operator needs to monitor the load on the various servers and launch additional or terminate idle ones as appropriate. In particular, there should always be a number of idle app execution engines that can accept the next application or that can be brought to bear on an application that needs more resources. This level of scaling can be performed relatively easily manually or automatically in RightScale. The app execution engines can be placed in a server array and scaled based on their load.

The second level of scaling is the responsibility of each application’s owner. The nice thing about the modularity of Cloud Foundry is that it exposes the necessary hooks for adding external application monitoring and scaling decisions. It is also interesting that Cloud Foundry in effect exposes the resource costs and lets the application owner decide how much to consume, and pay for. This is in contrast to other systems that make it difficult to limit the resources other than by setting quotas at which point an application is suspended as opposed to simply running slower.
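As an illustration of what such an external, application-level scaling decision might look like, here is a minimal sketch. The function, its parameters, and the requests-per-second metric are all hypothetical; the result would be handed to the cloud controller's API, which is not shown here.

```python
import math


def desired_instances(current, load_per_instance, target_load,
                      min_instances=1, max_instances=10):
    """How many app execution engines the app should run so that each one
    sees roughly target_load. The cap is the owner's spending limit."""
    total_load = current * load_per_instance
    wanted = math.ceil(total_load / target_load)
    return max(min_instances, min(max_instances, wanted))


# 4 instances at 90 req/s each, aiming for 60 req/s per instance.
print(desired_instances(current=4, load_per_instance=90, target_load=60))  # 6
```

The `max_instances` cap captures the trade-off described above: once the owner's limit is reached the application keeps running, only slower, instead of being suspended.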

What we envision in working with Cloud Foundry is simple: RightScale will be able to monitor the various servers in the Cloud Foundry cluster and determine, for example, when its “slack pool” of warm, ready-to-go app execution engines has dropped below a given threshold (or exceeded an idle threshold), and either boot new servers to add to the “slack pool” or de-commission unnecessary ones to save on cost, as appropriate.
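The slack-pool rule can be sketched in a few lines. This is only an illustration under assumed names and thresholds; it is not RightScale's server array logic, which would also have to account for servers that are still booting.

```python
def slack_pool_adjustment(idle_engines, min_idle, max_idle):
    """Number of app execution engines to boot (positive) or to
    decommission (negative) so the idle pool stays within bounds."""
    if min_idle > max_idle:
        raise ValueError("min_idle must not exceed max_idle")
    if idle_engines < min_idle:
        return min_idle - idle_engines   # pool too small: boot more
    if idle_engines > max_idle:
        return max_idle - idle_engines   # pool too large: save on cost
    return 0                             # within bounds: do nothing


print(slack_pool_adjustment(idle_engines=1, min_idle=3, max_idle=6))  # 2
print(slack_pool_adjustment(idle_engines=9, min_idle=3, max_idle=6))  # -3
```

Keeping a gap between the two thresholds avoids booting and terminating servers in quick succession when the idle count hovers around a single value.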

PaaS and IaaS Synergy

The benefits of PaaS come from defining a constrained application deployment environment. That makes it necessary for many applications to “punch out” and leverage services outside of the PaaS framework. In some cases this may be a simple service, like a messaging server or a special form of data storage. In other cases it will end up being almost a reversed situation where a large portion of the application runs outside of the PaaS and the portions in the PaaS are really just complements or front-ends for the main system. Cloud Foundry makes it relatively easy to make outside services available to applications in the PaaS, but these outside applications still need to be managed. This is where an IaaS management framework like RightScale is great because it can bring the whole infrastructure under one roof.

Some examples for this punching out:

  • Databases, from the SQL variety to NoSQL and other models: accessing legacy databases as well as leveraging popular DB setups like our MySQL Manager, which provides master-slave replication.
  • Different load balancers in front of the request routers, perhaps with extensive caching features, global load balancing, or other goodies. Examples would be Zeus, Squid and many others.
  • Legacy or licensed software, for example video encoding software or PDF generators.
  • Special back-end services, such as a telephony server.

If there’s one thing I’ve learned about customers at RightScale it’s the incredible variety of needs, architectures, and software packages that are in use. For this reason alone I see PaaS as another very nice tool in the RightScale toolbox.

Can you run Cloud Foundry without RightScale? Of course. It certainly runs on raw servers. They can PXE boot a base image and join the PaaS in one of the above server roles. However in a mixed environment it is much more beneficial to run the Cloud Foundry roles within a managed infrastructure cloud.

It seems obvious from the traditional SaaS/PaaS/IaaS cloud diagrams that these different layers were made to interoperate. And that’s what we’ve already seen our customers doing: combining PaaS and IaaS in ways that meet their needs. There are a number of PaaS solutions in the market with more on the horizon. We will continue to support as many as we can and to the extent that their architectures allow it, because cloud is a heterogeneous world and customers want choice. In the case of Cloud Foundry, we have a particularly open architecture that provides a compelling fit – and we’re excited to see where our joint customers take us together.

Posted in Cloud Computing, EC2, PaaS | 8 Comments

Launch VMware’s CloudFoundry PaaS using RightScale

VMware’s Cloud Foundry release has the potential to be quite a watershed moment for the PaaS world. It provides many of the core pieces that are needed to build a PaaS in an open source form — VMware has put it together in such a way that it is easy to construct PaaS deployments of various sizes and also to plug-in different management strategies. All this dovetails very nicely with RightScale in that we are providing multiple deployment configurations for Cloud Foundry and will add management automation over the coming months.

Advent of private PaaS

Until now the notion of PaaS has lumped together the author of the PaaS software and its operator. For example, Heroku developed its PaaS software and also offers it as a service. If you want to run your application on Heroku your only choice is to sign up to their service and have them run your app. Google AppEngine has the same properties. All this is very nice and has many benefits, but it doesn’t fit all use-cases by a long shot. What if you need to run your app in Brazil but Heroku, or whichever PaaS service you use, doesn’t operate there? Or if you need to run your app within the corporate firewall? Or if you want to add some custom hooks to the PaaS software so you can punch out to custom services that are co-located with your app? All these options become a reality with Cloud Foundry because the PaaS software is developed as an open-source project. You can customize it and you can run it where you want to and how you want.

Of course you can also go to a hosted Cloud Foundry service whenever you don’t want to be bothered running servers. This could be a public Cloud Foundry service that is in effect competing with Heroku, AppEngine and others, but it could also be a private service offered by IT or your friendly devops team mate. This opens the possibilities for departmental PaaS services that may have a relatively small scale and can be tailored for the specific needs of their users.

Benefits of PaaS

PaaS is really about two things: simplicity of deployment and resource sharing. The way a PaaS makes deployment simpler is by defining a standard deployment methodology and software environment. Developers must conform to a number of restrictions on how their software can operate and how it needs to be packaged for deployment. Restrictions is perhaps the wrong word here; a set of standards is a better way to phrase it, because just as some flexibility is lost a lot of benefits are gained out of the box. It’s similar to no longer writing applications that tweak device interfaces directly and instead going through a modern operating system device driver interface. In the PaaS context, instead of having custom deployment and scaling methodologies for each application there is a standard contract. This makes for much simpler and cheaper deployment and reduces the amount of interaction necessary between the teams that produce applications and those that run them.

Resource sharing is a second benefit of PaaS in that many applications can time-share a set of servers. This is similar to virtualization but at a different level. Where this resource sharing becomes interesting is when there are many applications that receive an incredibly low average number of requests per second. For example, a corporate app that is used once a quarter for a few days is likely to receive just a trickle of requests at other times. If virtualization were used then at least some virtual machines would have to be consuming sufficient resources to keep the operating system ticking, the monitoring system happy, log files rotating and a number of other things that are just difficult to turn off completely without shutting the VMs down, which may not be desirable for a number of reasons. In a PaaS the cost of keeping such applications alive drops down significantly.

PaaS running in IaaS — Cloud Foundry with RightScale

PaaS is sometimes believed to be at odds with IaaS, as if you have to choose one or the other. We believe in both models and Cloud Foundry starts to fulfill that vision. RightScale enables Cloud Foundry to be deployed in a number of different configurations that vary in size, in underlying cloud provider, in geographic location, or in who controls the deployment or pays for it.

With RightScale it becomes easy to set up a number of Cloud Foundry configurations for different use-cases. It is possible to set up a large deployment for many applications and really leverage the resource sharing benefits. But as some applications mature, have more stable resource needs, and perhaps need to be separated from others to improve monitoring, resource metering, or allow for customization, this can be easily accomplished by launching appropriate deployments. Finally, some applications may outgrow the capabilities of a PaaS environment and require a more custom deployment architecture.

Try it out!

We’ve created an All-In-One ServerTemplate in RightScale that launches Cloud Foundry in one server on Amazon EC2. If you do not have a RightScale account you can sign up for one free (you will have to pay for the EC2 instance time though). The ServerTemplate is called “Cloud Foundry All-In-One”. Launch it, grab a coffee, and when you come back you’ll be able to load your apps up! (Note that currently a lot of components are compiled at boot from the source repositories, so the server takes ~10-15 minutes to boot; we will be optimizing that as soon as the code base settles down a bit.)

I must say that this is one of the more exciting cloud developments in a while. I’ve been wanting to add good PaaS support to RightScale for a long time and Cloud Foundry is now making it possible. We’ve been talking to Mark Lucovsky about his secret project for many moons and it’s really refreshing to see the nice clean simple architecture he and his team (hi Ezra!) have developed see the light of day. We’re now planning RightScale features around PaaS support so please let us know what you’d like to see from us!

NB: I had wanted to write about the architecture of Cloud Foundry and how it fits with RightScale ServerTemplates, but the timing was too tight. Stay tuned for a follow-on blog post in the next couple of days… Update: I did write the follow-up post, “Cloud Foundry Architecture and Auto-Scaling”.

Posted in Cloud Computing, PaaS | 9 Comments