Top Cloud API Sins

… while I’m at it, I might as well… Here is a short list of the top poor design decisions that I’ve seen in cloud APIs. Let me rephrase that… Here’s a short list of the top API features that got in our way or simply didn’t work for us. There may well be other use-cases where these features make sense.

  • Listing of resources without the details, e.g., a list-servers call that doesn’t return all the details for each server. This makes it very expensive to poll for server state changes because the listing doesn’t have enough information and so one has to do a show-server for each individual server. Imagine polling for an account that has several thousand servers – ouch. It’s fine to have a “with details” flag in the request so one can get the bare list, but we’d always set that flag.
  • Not returning a resource id on creation. Some APIs don’t give you a server id when you request a server to be launched, the response is just “ok, we’ll launch a server”. This means you end up guessing “is that new server that just appeared in the list the one I just launched?”
  • Providing a task queue. Several APIs I’ve seen have a task queue that is supposed to provide updates on tasks that are in progress. E.g., you launch a server and you get a handle onto a task descriptor. For us that’s just overhead. Just include a state field in the resource itself and we’ll just keep track of the state changes on the resource. So if mounting a volume takes a while, create the volume resource and set its state to “attaching” (or whatever is appropriate). Having a separate resource to say “that volume you created is attaching” is just overhead and means that the state of a resource is now in several places.
  • Lacking publishable images or the equivalent of EC2’s user_data (small amount of data that is passed to the launching server via the launch API call). I touched on these in my previous blog post.
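
To make the first and third points concrete, here is a minimal sketch of how a client can track state changes when the listing call returns full details, including a state field on each resource. The record shape (`id`, `state`), the state names, and the `diff_listings` helper are all assumptions made for illustration; they are not taken from any particular cloud API.

```python
from typing import Dict, List, Tuple

# Hypothetical record shape: one dict per server, as a detailed listing might return it.
Server = Dict[str, str]


def diff_listings(previous: List[Server], current: List[Server]) -> List[Tuple[str, str, str]]:
    """Return (server_id, old_state, new_state) for every server whose state changed."""
    old = {s["id"]: s["state"] for s in previous}
    new = {s["id"]: s["state"] for s in current}
    changes = []
    for server_id in sorted(old.keys() | new.keys()):
        before = old.get(server_id, "absent")  # "absent" marks a server missing from a listing
        after = new.get(server_id, "absent")
        if before != after:
            changes.append((server_id, before, after))
    return changes


# Two consecutive polls of a made-up account.
previous = [{"id": "srv-1", "state": "pending"}, {"id": "srv-2", "state": "running"}]
current = [
    {"id": "srv-1", "state": "running"},
    {"id": "srv-2", "state": "running"},
    {"id": "srv-3", "state": "pending"},
]
changes = diff_listings(previous, current)
```

With details in the listing, one call per poll is enough to detect every transition; without them the same loop would need a show-server call per server.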

UPDATE: reviewing YACA (yet another cloud API) reminded me of another two sins:

  • Not returning deleted resources in a “list resource” call. In particular, terminated servers must be returned in a list servers call for a certain duration, probably at least for an hour. The reason is that otherwise the client has to infer that the server self-terminated or failed when it no longer finds it in the result of list servers calls. Well, we have seen multiple completely different clouds fail to list running servers. In the case of EC2, which lists terminated instances for a good amount of time this resulted in error emails alerting us of the situation. In another cloud this resulted in servers marked as terminated, which is an irreversible operation and often triggers alerts and automation. And then the servers “resurrected”. Ouch! Now combine this with the next sin:
  • Pagination that goes page-wise instead of using a marker, e.g. where you get page 1 or the first 100 resources and then issue a query for “page 2” or “from 100 on”. Explain to me how a client can get a consistent resource listing when resources can be added and removed concurrently. This is particularly fun if the client has to infer deletion from the absence of a resource in the listing: was it deleted or did it fall through the crack between pages due to a different resource being deleted concurrently with the listing? The proper way to do pagination is using markers like Amazon does it, but for a cloud API I actually don’t see the value in pagination. We always retrieve the whole list.
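
The pagination problem is easy to reproduce with a toy model. The sketch below is an assumption-laden illustration, not any provider's API: the resource ids, the page size of 2, and the two listing functions are made up. It deletes one resource between two page fetches and compares what each style of pagination ends up seeing.

```python
from typing import List


def list_by_offset(resources: List[str], offset: int, limit: int) -> List[str]:
    """Page-wise listing: 'give me `limit` items starting at position `offset`'."""
    return sorted(resources)[offset:offset + limit]


def list_after_marker(resources: List[str], marker: str, limit: int) -> List[str]:
    """Marker-based listing: 'give me `limit` items whose id sorts after `marker`'."""
    return [r for r in sorted(resources) if r > marker][:limit]


# Page-wise: a resource on page 1 is deleted between the two requests.
resources = ["srv-a", "srv-b", "srv-c", "srv-d"]
page1 = list_by_offset(resources, 0, 2)
resources.remove("srv-a")                 # concurrent deletion
page2 = list_by_offset(resources, 2, 2)   # everything shifted up by one position
seen_by_offset = page1 + page2            # srv-c is still alive but never listed

# Marker-based: same deletion, but the second request continues after the last id seen.
resources = ["srv-a", "srv-b", "srv-c", "srv-d"]
chunk1 = list_after_marker(resources, "", 2)
resources.remove("srv-a")                 # concurrent deletion
chunk2 = list_after_marker(resources, chunk1[-1], 2)
seen_by_marker = chunk1 + chunk2          # no surviving resource is skipped
```

In the page-wise run `srv-c` falls through the crack, so a client that infers deletion from absence would wrongly mark it as gone. The marker-based run can still return a resource that was deleted mid-listing, but it never skips one that still exists.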

If you’re working on a cloud API, please think twice if you’re doing one of the above. Again, I don’t know all the use cases, just ours.

Now here’s what I’d really like to see, and it’s what we’re working on for internal purposes (it’s not easy): an event-based interface instead of a request-reply based interface. Request-reply is fine if you have a system that sends commands to the cloud. It’s a problem when you build a system that reacts to changes in the cloud because you have to keep polling all these resources. We run a good number of machines that do nothing but chew up 100% cpu polling EC2 to detect changes. Fortunately cpu cycles are cheap :-) .

Cloud API requirements

Adrian wrote an interesting post on cloud APIs and whether a standard cloud API makes sense or not. I’m still of the opinion that there is too much diversity out there and that the time is not ripe yet. There are many indispensable features in Amazon EC2 that no-one else has implemented at scale or for which there is no different take. The whole EBS feature set is one example. Image sharing and publishing is another one. Elastic IPs is another example. Eucalyptus has implementations for all this, but I can’t point to someone operating them at scale. In any case, Adrian quotes Tom White with the following “Turing complete set of cloud compute services”:

1. Instance lifecycle. The lifecycle defines the basic commands to provision, start, stop, and terminate instances. The bare-bones of a compute service.
2. Shared images. While it is possible to bootstrap from plain OS images created by the cloud service provider, the ability to build your own customized image, and crucially to share it with others provides a social element that helps drive adoption of a cloud platform.
3. Instance metadata. This feature allows you to inject small amounts of user-specific metadata to each instance at boot time – e.g. secret keys, or custom boot parameters – which allows another level of customization. This feature works well in conjunction with shared images: the common non-user specific code is baked into the shared image, and the user-specific code is supplied at launch time as metadata.
4. Network controls. Cloud providers need to think about the network environment that the user’s instances run in. Offering such services as DNS, firewalling, VPNs (and exposing it via an API) makes it easy for developers to get started quickly without having to build this infrastructure themselves.

This is a great start and I would wholeheartedly agree that these are requirements, but I don’t think they’re enough. In order to operate interesting services (or acquire interesting customers) a cloud must offer more than this. I’m not yet sure what the list exactly is, but the following come to mind:

    • Security groups or vlans: users must be able to control the network boundary around their servers, they must be able to group servers into tiers, and they must be able to create private communication structures. I believe the only two differences between security groups and vlans to be that a network interface can be in many security groups but only one vlan and that vlans can offer layer-2 multicast (and non-IP protocols) while security groups can’t.
    • Private IPs and remappable public IPs: going hand-in-hand with the notion of private communication structures goes the notion of private IP addresses. Of course publicly routable addresses are required as well, and there has to be some way to remap those IPs such that the failure of a server can be masked or a quick fail-over for other purposes can be engineered. I believe that in the end NAT (as used by Amazon’s Elastic IPs) is the only scalable choice, but I’m ready to learn new things and there certainly is room for improvement over EIPs.
    • Mountable storage volumes with snapshot backup: we did operate for a long time without Amazon EBS and at the time the “we need no expensive SAN” feeling was great, but after operating databases in EC2 with EBS for several years now there’s no way I’m going back. I need to be able to mount a storage volume on a server, operate it, take a snapshot backup, and then create a fresh volume from that snapshot on another server. I’m ok for the volume to be a remote filesystem as opposed to a block device, and I’m ok for the snapshot to be another volume as opposed to tertiary storage (S3 in Amazon’s case). Oh, but please don’t make it hard to do this across failure zones so the two above servers are failure-isolated.

      In my opinion we can’t be successful until we can hash out all these features with a reasonable degree of flexibility so providers can differentiate yet at the same time a reasonable degree of uniformity so users (like the RightScale cloud management system and its pre-built ServerTemplates) can write portable systems. I have not heard the discussion reach the level of sophistication needed (and I freely admit that I haven’t listened as hard as I could have) and frankly I also feel like we’re all still learning new things all the time. On the RightScale end we’re in the process of reworking our multi-cloud layer so we can incorporate some of the above feature sets in a standardized manner so I hope I can make more specific contributions in the near future.

      Amazon Consolidated Billing and Reserved Instances

      Amazon added consolidated billing a couple of days ago. It allows you to consolidate multiple accounts onto a single bill so your credit card only gets one hefty charge instead of many smaller ones from all the accounts you might have. What’s actually nice is that you can download a csv file with the details of all the accounts in one place so you get an overview of where the money is going. For someone like me who has over a dozen accounts that’s really, really nice!

      The way it works is that from the account you use for billing you can send invites to others so they can add their account to your billing method. They can still see their bill, it just doesn’t hit their credit card, it hits yours. You get to see what they used too, since you’re paying for it. I created a new account that I’m not using for any service at all and consolidated all my “real” accounts onto it. This way I can pretty freely hand out the credentials in finance & admin so anyone there who wants to see the numbers can log in without having any power over any instances, buckets, or whatnot.

      One of the nice benefits of consolidated billing is that you can share some of the savings of reserved instances across accounts, but the details are pretty weird. The part that makes sense is that if account A doesn’t use a reserved instance “slot” then another account B can make use of it and get the discounted hourly rates. Where it gets weird is that if account A purchased a reserved instance in zone us-east-1a then account B gets the benefit also in zone us-east-1a. While this may sound logical it makes no sense because the zone names are permuted between accounts, so A’s 1a is typically different from B’s! Amazon: so why do we need to buy reserved instances in an availability zone as opposed to a region since it clearly doesn’t really matter???

      I’ve actually always been unhappy with reserved instances. They seem to combine two completely different notions that I believe don’t go together because the use-cases are disjoint. One notion is that of a “revenue commit” or “buying into a discount tier”: you pay some money up-front to get a lower rate. Makes perfect sense, but why does it have to be tied to an availability zone? Or even to an instance type for that matter? The second notion is that of guaranteed availability, i.e., your reserved instance slot is always guaranteed to be available, you won’t get an “insufficient capacity” error. I understand why that has to be for a specific zone and type.

      The reason the two notions don’t mix for me is that in order to get the discount benefits you have to run your instance virtually all the time, it really only is advisable for instances that always run. Well, in that case the whole notion of reservation is moot since you’re running all the time anyway! If you’re looking for the availability guarantee it’s generally because the instance is *not* running all the time and you want to make sure that the day you need it you can get it. Think disaster recovery scenario. Well, in that case the whole “discount” is moot, since you’re likely to actually pay more due to the up-front reservation cost! I wish I could just buy the discount without the availability guarantee and without the tie to a specific zone!
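
The two use-cases can be compared with a little arithmetic. The sketch below uses made-up prices (the hourly rates and the up-front fee are hypothetical, not Amazon’s actual price list) to show how the same reservation saves money for an always-on server and costs extra for a rarely used disaster-recovery standby.

```python
HOURS_PER_YEAR = 24 * 365

# Hypothetical prices, chosen only to illustrate the shape of the trade-off.
ON_DEMAND_RATE = 0.10     # $/hour without a reservation
RESERVED_RATE = 0.04      # $/hour with a reservation
RESERVED_UPFRONT = 250.0  # $ paid once for a one-year reservation


def yearly_cost(hours_run: float, hourly_rate: float, upfront: float = 0.0) -> float:
    """Total cost for one year: optional up-front fee plus hours actually run."""
    return upfront + hours_run * hourly_rate


def reserved_saving(hours_run: float) -> float:
    """Positive when the reservation is cheaper than on-demand, negative when it costs more."""
    on_demand = yearly_cost(hours_run, ON_DEMAND_RATE)
    reserved = yearly_cost(hours_run, RESERVED_RATE, RESERVED_UPFRONT)
    return on_demand - reserved


always_on_saving = reserved_saving(HOURS_PER_YEAR)  # server that runs all year
dr_standby_saving = reserved_saving(100)            # DR server that runs about 100 hours a year

print(f"always-on saving:  ${always_on_saving:.2f}")
print(f"DR standby saving: ${dr_standby_saving:.2f}")
```

With these assumed numbers the always-on server comes out ahead and the standby server pays mostly for the up-front fee, which is the mismatch described above.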

      One thing I noticed is that most of our larger customers are not using reserved instances. I wonder why…

      Bid for Your Instances!

      We’re clearly witnessing a year-end release finale at AWS with another big release tonight: EC2 Spot instance pricing. Spot instance pricing is the third pricing model introduced by Amazon after the original per-hour price (now called “on-demand”), then the “reserved” instance pricing and now a supply and demand driven “spot” pricing. As far as I know, this is the first step on a large scale towards “market pricing” for computing based on supply and demand. I know many people have been dreaming about something like this and a few startups have started to offer a compute market of some sort. But with Amazon’s offering it is now available on a large scale to anyone!

      How it works is simple yet complex. You can read the official product page, Jeff Barr’s blog, and Werner’s blog. Here’s my attempt at explaining it. AWS publishes a spot price for each instance size in each region. The spot price is the per-hour cost of a server and if you launch a spot price server now that’s what you pay for the next hour. So instead of $0.10/hr for a small server you might only pay $0.03/hr if that’s the current spot price. AWS adjusts the spot price periodically based on the idle capacity available, so the price might be low at night or week-ends when many sites auto-scale down and it might be high during the day when everything is busy.

      Now comes the complex part. You don’t just launch a spot instance and forget about it, you actually specify a maximum price you are willing to pay and for each hour you have your server running you pay the spot price current at the start of the hour. As the spot price continues to vary while your instance is running this maximum becomes very important because should the spot price exceed your maximum then your instance will be terminated by AWS! It’s also possible to work the maximum price in reverse: specify a price lower than the current spot price in the evening and your request stays queued until the spot price drops below what you specified and AWS then automatically launches your instances. You can revise your maximum at any time, so if at 4am the spot price has not dropped enough you can raise your max so your instances get to run before sunrise.
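
The rules above can be walked through with a small simulation. This is a simplified model under stated assumptions, not AWS’s actual mechanism: prices are sampled once per hour, a spot price exactly equal to the maximum counts as acceptable, and the simulation simply ends when the instance is terminated. The function name and the price series are made up.

```python
from typing import List, Tuple


def simulate_spot(hourly_spot_prices: List[float], max_price: float) -> Tuple[List[str], float]:
    """Return the per-hour status of one spot request and the total amount charged."""
    statuses = []
    total = 0.0
    running = False
    for price in hourly_spot_prices:
        if price > max_price:
            if running:
                # Spot price rose above the maximum: the instance is terminated.
                statuses.append("terminated")
                break
            # Not launched yet: the request stays queued until the price drops.
            statuses.append("queued")
        else:
            # Charged the spot price that is current at the start of the hour.
            running = True
            statuses.append("running")
            total += price
    return statuses, round(total, 4)


# Evening request with a maximum below the current spot price (hypothetical prices in $/hr).
prices = [0.05, 0.04, 0.03, 0.03, 0.06]
statuses, charged = simulate_spot(prices, max_price=0.04)
print(statuses)
print(charged)
```

The request waits through the first hour, runs for three hours at whatever the spot price is at the start of each hour, and is terminated once the price climbs above the maximum.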

      It should be clear from the way the spot pricing functions that this is intended for transient compute capacity. For your database instances you should carefully stay with the on-demand or reserved instances, but for late night batch jobs where it doesn’t matter whether they run a bit earlier or later the spot pricing can save quite some money.

      One thing that is not obvious at the outset is what would motivate Amazon to keep the price down. Part of the answer lies in the fact that instances whose max bid drops below the current spot price get terminated, thus if the price goes up too much, too many instances get terminated which results in less revenue. So there is a balance between more instances at a lower price and fewer instances at a higher price. But I’m sure it’s a lot more complex than that.

      We will be supporting spot pricing in the RightScale platform over the coming months and we’re curious about the functionality our customers would like to see in that respect. There are a lot of opportunities for automation here!

      Amazon EC2 – A New Chapter Begins

      Tonight Amazon made a milestone release introducing the ability to boot instances from an EBS volume and stop & start instances. In addition, just a few weeks after announcing their plans to expand AWS to the far east, today they’ve moved west and made a US west coast cloud available. (Do they need a compass?) For the AWS view on all this see Werner’s Blog as well as Jeff Barr’s postings. But one thing at a time…

      Amazon introduces US west coast cloud

      Almost exactly a year after the first geographical expansion of EC2 to Europe today is the second big step to the west coast. What is notable about the EC2 architecture is that each one of these expansions constitutes a new cloud or “region” in EC2 speak. This means that now in addition to the US-EAST-1 and EU-WEST-1 regions we have a new US-WEST-1 region. Each region operates autonomously from the others in order to provide failure isolation, which has benefits as well as downsides. A major benefit is obviously the redundancy one can get by operating in more than one region or placing DR in a region other than the one used for one’s primary service. The downside is that sharing across regions is not as easy as one might imagine. For example, machine images (AMIs) are not shared, so for each image you’re using in one region you have to copy and re-register the image in the other, and then it has a different id you need to keep track of and reference. We didn’t plan it this way, but our multi-cloud support turns out to be very helpful in managing operations in multiple EC2 regions. For example, in RightScale you can define ServerTemplates that use different images in different clouds, this means that as you update your ServerTemplate it automatically works across clouds and thus EC2 regions.

      For redundant operations the comparison between the cloud and DIY datacenters is becoming ever more lopsided. Who can really afford to lose the man-hours, the cap-ex, the time-to-market, and incur the headaches it takes to set-up a datacenter from scratch, even if it’s in a traditional colo? And who can afford to go through all that again to set-up a second or DR site? The ease with which it is now possible to set-up a DR site in the cloud that is a faithful replica of the primary site is really remarkable. And the best is that the second site can be extremely low cost because very little needs to be running there: most of it can be fired up on-demand in the case something happens. If you already have your own datacenter/colo set-up then all hope is not lost. Setting up DR in the cloud is one of the common use-cases we see.

      Amazon Instances Boot from EBS

      The real sea change about to occur in EC2 is booting from EBS. Tonight’s release includes a ton of new features which build on the recently introduced ability to publish EBS snapshots. Here’s a quick summary:

      • instances can boot from an EBS snapshot instead of a traditional AMI, EC2 creates an EBS volume from the snapshot and makes it the root partition
      • instances can also boot from an EBS volume, which means that a “boot from EBS” instance can effectively be stopped and restarted later by keeping the volume around and launching a fresh instance from the same volume
      • instances can now be stopped and restarted later, which works almost exactly as described in the bullet above except for the fact that the instance id (the i-12345678 number) remains the same
      • almost all attributes of an instance can change while stopped, including the instance size (naturally the availability zone is one thing that can’t change)
      • EBS snapshots can be registered and published as images, so now we have “traditional images” as well as “EBS images” (I wonder what AWS will call these)
      • images can specify snapshots and volumes to be automatically mounted at boot, and they can specify EIPs to be attached at boot, the run-instances API call can add/override these “image defaults”
      • instances can be “locked”, which prevents their accidental termination
      • instances can be bundled into images using an API call (with shutdown or optionally without)

      That’s a long list of features to digest! What’s going on here is that AWS is responding to the needs of enterprise customers who have many ‘legacy’ applications that are not designed to scale out or to play nice with the operations agility enabled by the cloud. It’s for the apps that sysadmins spend weeks setting up and then do their utmost not to touch again. Now they can be installed on an EBS root volume and servers can be launched and relaunched as needed without having to touch the config. Basically this enables the old-school way of managing servers to be applied to EC2.

      But these new features are also of great benefit to those operating scalable arrays of servers or web 2.0 web sites. It is now much easier to make changes to a clean server image: mount the image as a volume onto an extra server, edit the software/config on the image (e.g. using chroot and the native packaging system), when happy create an image from the volume and boot a server. Test it out and fix any problems in the original volume. Repeat until happy. If done correctly, this results in clean images that are not polluted by repeated boots and other operations, which is one goal we’ve always pursued with the RightImages we publish.

      The stopping and starting of servers can also make development more cost effective. Developers that use dev & test servers can stop them at the end of the day and start them back up when they next need them. In fact, many servers could be set-up to stop by themselves if there has been no activity for a while. (This reminds me that I saw that the three longest running instances visible by RightScale have been running for over 1000 days and that the account they run in has seen no activity since then, except for credit card charges I assume, impressive and scary at the same time!)

      Stopping and starting servers can also be abused. For example, it can be used to implement “dumb auto-scaling”: simply stop some servers when the load drops and start them back up later. The good thing is that you don’t end up with fresh servers on start, so they don’t have to self-configure, the bad thing is, well, that you don’t end up with fresh servers, servers come up believing the world hasn’t changed since they were last stopped. I think of this as abuse because it’s easy to forget to update one of the stopped servers when making changes to the system, whether these are changes to the software installed on each server or changes to the rest of the system each server needs to communicate with. In other words, the danger of having a zombie come back to life and create mayhem is high. Better keep a basic amount of hygiene and start with fresh servers.

      The Cloud Marches On…

      It will be interesting to see how EC2 and its user base continue to evolve. With each release Amazon offers more options. That’s more ways to do interesting stuff, but also more ways to shoot oneself in the foot and more stuff to ‘grok’ to get started. Maybe the most important, though, is that the Boot from EBS features rank very high on the “remove sales objections” scale: not every application is ready for the former EC2 cloud, not every sysadmin is ready for it either, by far not. I have to admit that all this leaves me with mixed feelings. EC2 used to have a simple & clean model, it required some rethinking but that was for the better. It was clear how to deploy highly scalable, highly redundant applications with a high degree of automation. Now that there are 10 ways to skin the proverbial cat it’s much harder to stay on track and to leverage automation. Where early customers needed help figuring out how to operate in the world of EC2’s disposable servers today’s customers need help just navigating through all the options available in EC2 and which to apply to each application or use-case.

      Support for the new features and the new US-WEST region in RightScale will become available with our next release, currently scheduled to go live just before xmas. Full support for booting from EBS will take a little longer as it has far-reaching implications. I’m sure that many of our customers will be operating in the new west coast region and that it may even have some appeal to those in the far east and south pacific as “one step closer” to a local presence. As always, we’d love to hear your thoughts on the new features, how you’re planning to use them, and how you’d like to see us support them.

      Updates:

      • AWS now gives each region a little local character: US-WEST-1 is listed as “N. California”, US-EAST-1 as “N. Virginia”, and EU-WEST-1 as “Ireland”.
      • Nice blog post on some of the mechanics of using Boot from EBS by Shlomo Swidler (but see comment below)
      • Some things you can’t do with traditional AMIs: start & stop instance, create image (new way of bundling)
      • Some things you can’t do with EBS-based AMIs: dev pay, protect the content of public AMIs (someone can mount the content as a data volume and pull files off it)
      • If you plan to create a public EBS-based AMI beware of deleted files: don’t just “delete” files with sensitive data on the volume because they can be “undeleted”, you have to erase the blocks, or better, not put anything sensitive there in the first place

      RightScale ServerTemplate Library and Machine Tags

      Yesterday’s release of the RightScale platform introduced two new features that I’m really excited about: the ServerTemplate Library and the use of Machine Tags on servers. (Ooops, I shouldn’t forget the new features for RackSpace, but I’ll talk about those next week.)

      We’ve had rather sophisticated sharing of ServerTemplates in RightScale for over a year now allowing certain users to share ServerTemplates, RightScripts and other design artifacts with other RightScale users. This enables us to publish free ServerTemplates to all our users, premium ones to our customers and it also lets ISVs on our platform publish ServerTemplates for free or for pay to their users and customers. In addition, each of the design artifacts is versioned such that users who have launched servers with a ServerTemplate last year can still launch new servers with exactly the same version of that ServerTemplate.

      A result of all this publishing, sharing and versioning is that there’s a lot to choose from. So much that drop-down menus have become really unwieldy and this is where the new library comes into play. In the past, when adding a server to a deployment one had to find the correct ServerTemplate from the list of all available templates in the RightScale system. Now this has become a two-step process where you first import the ServerTemplates of interest from the library into your account and then only the imported templates are shown in all the drop-down selection menus. Separating the library import/export step will also allow us to significantly upgrade the experience browsing all the design artifacts in the library over the coming releases, stay tuned…

      We introduced Flickr style machine tags recently and we’re expanding their use with this release. One of the really exciting new features is that servers now have tags and we’ve integrated the tags with the routing of messages between servers, with Chef (via the RightLink agents) and with the UI. All this is still in alpha but it’s starting to take shape. Our first real use-case is the registration of application servers with load balancers. The way it works is that when a load balancer comes up and is ready for operation it adds a “loadbalancer:lb=www” tag to say “I’m a load balancer for the www vhost”. When an app server starts up, it requests all servers in the deployment with a “loadbalancer:lb=www” tag to run a Chef recipe that adds the app server to the load balancer rotation. This way, the app server doesn’t need to know which or how many load balancers there are. The tag matching, communication, and running of the Chef recipe are all done by the RightLink agents.

      In order to let new load balancers come up when app servers are already running we can do the same tag-location in reverse: app servers announce “loadbalancer:app=www” to say “I’m an app server serving vhost www” and load balancers on start-up can add all app servers to their config by querying for all servers with that tag. For overall resiliency it’s a good idea for load balancers to re-query the set of app servers and to update their config accordingly. This catches race conditions as well as issues where portions of the app servers may be temporarily invisible due to network partitions. The theme here is “eventual consistency” and we’re still evaluating what the best primitives are to support high availability.

      You may wonder why the examples above use such long tags and that’s really where machine tags come in. The “loadbalancer:” prefix helps isolate the tags to coordinate the load balancer registration from other tags. Think of “loadbalancer” as being the name of the application or feature that uses these tags, e.g. the load balancer registration. The “lb=www” and “app=www” tag predicate and value can be used to support multiple vhosts. So a load balancer could announce “loadbalancer:lb=www” and “loadbalancer:lb=api” to indicate that it’s load balancing the www and api vhosts. And an api app server then would only query for the “lb=api” tag and it would only announce the “app=api” counterpart.
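
As a rough illustration of the tag structure and the lookup it enables, here is a small sketch. It is not the RightScale or RightLink API: the `MachineTag` type, the `parse_tag` and `servers_with_tag` helpers, and the server names are hypothetical, and only the “namespace:predicate=value” tag format comes from the description above.

```python
from typing import Dict, List, NamedTuple, Set


class MachineTag(NamedTuple):
    namespace: str   # e.g. "loadbalancer", the feature that owns the tag
    predicate: str   # e.g. "lb" or "app"
    value: str       # e.g. "www" or "api", the vhost


def parse_tag(tag: str) -> MachineTag:
    """Split 'namespace:predicate=value' into its three parts."""
    namespace, _, rest = tag.partition(":")
    predicate, _, value = rest.partition("=")
    if not (namespace and predicate and value):
        raise ValueError(f"not a machine tag: {tag!r}")
    return MachineTag(namespace, predicate, value)


def servers_with_tag(deployment: Dict[str, Set[str]], tag: str) -> List[str]:
    """Return the names of all servers in the deployment that announce the given tag."""
    wanted = parse_tag(tag)
    return sorted(
        name
        for name, tags in deployment.items()
        if any(parse_tag(t) == wanted for t in tags)
    )


# Hypothetical deployment: server name -> tags that server has announced.
deployment = {
    "lb-1": {"loadbalancer:lb=www", "loadbalancer:lb=api"},
    "lb-2": {"loadbalancer:lb=www"},
    "app-1": {"loadbalancer:app=api"},
}

www_lbs = servers_with_tag(deployment, "loadbalancer:lb=www")   # what a www app server would query
api_lbs = servers_with_tag(deployment, "loadbalancer:lb=api")   # what an api app server would query
api_apps = servers_with_tag(deployment, "loadbalancer:app=api")  # what a starting api load balancer would query
```

The namespace keeps these tags from colliding with tags used by other features, and the predicate/value pair is what lets one load balancer serve several vhosts.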

      While all this is happening amongst the servers, the RightScale UI provides access to all the tags, so one can see the servers announce the various tags and one can even intervene and manually modify these tags. We might provide a “don’t touch” notion for some tags, but right now it’s much more important to us to be able to expose all this machinery. As an ops guy there are few things I loathe more than hidden automation that I can’t inspect and override when I need to.

      Of course there’s more in the new release than just these two features: more support for RackSpace (monitoring in particular), improved support for Chef, support for new AWS features, and more.

      Amazon launches Relational Database Service and larger server sizes

      Today is another big AWS launch day with two important new features available for EC2: a Relational Database Service (RDS) and larger servers. Plus a 15% price reduction on compute cycles: yay!

      Relational Database Service

      With the Relational Database Service AWS fulfills a long standing request from a large number of its users, namely to provide a full relational database as a service. What Amazon is introducing today is slightly different than what most people might have expected, it’s really MySQL5.1 as a service. The RDS product page has the low-down on how it works, but the short is that with an API call you can launch a MySQL5.1 server that is fully managed by AWS. You can’t SSH into the server, instead you point your MySQL client library (or command line tool) at the database across the network. Almost anything you can do via the MySQL network protocol you can do against your RDS instance. Pretty simple and the bottom line is that businesses that don’t want to manage MySQL themselves can outsource much of that to AWS. For background on RDS I’d also recommend reading Jeff Barr’s write-up and Werner’s blog which recaps the data storage options on AWS.

      What AWS does is keep your RDS instance backed up and running, plus give you automation to up-size (and down-size). You can create snapshot backups on-demand from which you can launch other RDS instances and AWS automatically performs a nightly backup and keeps transaction logs that allow you to do a point-in-time restore.

      The way I think of an RDS instance is as a virtual appliance or a special-purpose server. You really get an EC2 instance with an EBS volume running a specific version of MySQL plus automation for backups and resizing the storage volume. The API is designed such that additional versions of MySQL and other databases can easily be added in the future. Just like a regular server, each RDS instance lives within an availability zone and access is controlled through a security group (plus the MySQL authentication). I haven’t had the opportunity to run some performance tests, but I would surmise that it’s not too different from DIY running MySQL on a regular instance.

      One of the current shortcomings of RDS is the lack of replication. This means you’re dependent on one server and it’s impossible to add slave MySQL servers to an RDS instance in order to increase read performance. It’s also impossible to use MySQL replication to replicate from a MySQL server located in your datacenter to an RDS instance. But replication is in the works according to the RDS product page.

      In terms of cost RDS is priced at 30% above the same raw EC2 instance (after the Nov 1st price reduction) but the comparison is a little tricky because some backup storage is included as well. Of course I quickly compared to the cost of RightScale: if you run three XL RDS instances the extra cost is already more than a RightScale subscription which (just on the DB end) gives you replication, read-scaling, full control, plus real live support. Interesting to see how the per-hour price surcharge compares with a more flat-fee subscription to a broad management service.

      But our core conviction is that we want to offer our customers the broadest choice possible and we’ll support RDS instances in the RightScale dashboard within a day or two when we complete our next release!

      Larger Instance Sizes

      EC2 now sports larger and faster servers: XXL and XXXXL sizes, properly called m2.2XL and m2.4XL. These new server sizes are particularly important for large database users and we’ve been awaiting them ourselves. We haven’t had an opportunity to play with them yet but we’ll update our MySQL ServerTemplates as soon as we have a chance. The fact that the new instance size names start with m2 reflects that the speed of each core is significantly higher than that of the m1 series. With the prices being less than 2x and 4x that of a current m1.xlarge instance there’s no reason not to keep scaling up in machine size!

      Cloud Computing Keeps Getting Better

      Amazon shows it again and again: listen to your customers, implement new features accordingly, and iterate. Tonight’s release adds important new capabilities to the AWS cloud offering and we’re sure many of our customers will rapidly adopt them. I remain a little reserved about the database service because it does not currently support replication, which I wouldn’t want to live without, but Amazon is definitely on the right track.

      The 2XL and 4XL servers will be gobbled up real fast by many of our larger customers. We’ve seen a trend towards more and larger servers over the past year and I’m sure that will continue. By the way, how fast can you launch 10 68GB servers in your datacenter? ;-)

      RightScale User Meetup

      In case you’ve missed it: we’re hosting a RightScale User Meetup next Monday (11/2) in Santa Clara collocated with the Cloud Computing Conference & Expo and we’d love to see you there! We’ll be discussing trends in cloud computing that we see in our user base, our current and future product roadmap, and some “from the trenches” stories from several RightScale customers. It’s easy to register, and free. If you know anyone who might be interested send the link along. Hope to see you there!

      Amazon Usage Estimates

      Two weeks ago Guy Rosen posted a very interesting analysis of the EC2 instance IDs which reveals how many instances (virtual machines) have been launched on EC2 since its beginning in 2006. We’ve also been digging in our records and I can share some interesting findings.

      First of all, Guy’s analysis contains one significant error which is due to the limited data set he had access to. Before May 2009 EC2 issued even and odd instance IDs, not just even ones as he mentions. Since that date EC2 issued only even IDs until it switched to only odd ones in early September. The even/odd switches don’t seem to correlate with ID boundaries; perhaps Amazon switches between two active/standby reservation systems or something else is going on.

      The formula to convert an EC2 ID into a sequential launch number as far as we can tell is:

      Given an aws id as i-11223333
      Assign p1 the 1's, p2 the 2's and p3 the 3's
      Also assign p31 the first two 3's and p32 the last two 3's
      Compute:
        c1 = (p1 ^ p32) ^ 0x69
        c2 = (p2 ^ p31) ^ 0xe5
        c3 =  p3 ^ 0x4000
      And finally concatenate c1-c2-c3. (This does not include the even/odd adjustments)
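
      The steps above can be sketched in Python as follows. This is only an illustration of the formula as written, not a verified decoder: the function name is made up, "concatenate c1-c2-c3" is assumed to mean packing the three parts into one 32-bit integer, and the even/odd adjustments are left out just as in the description.

```python
def ec2_id_to_launch_number(instance_id: str) -> int:
    """Apply the XOR formula above to an ID such as 'i-11223333'.

    Assumption: the result is c1, c2 and c3 packed into one 32-bit integer.
    The even/odd adjustments mentioned in the text are NOT applied.
    """
    hex_digits = instance_id[2:] if instance_id.startswith("i-") else instance_id
    if len(hex_digits) != 8:
        raise ValueError(f"expected 8 hex digits, got {instance_id!r}")
    raw = int(hex_digits, 16)

    p1 = (raw >> 24) & 0xFF  # the 1's (first two hex digits)
    p2 = (raw >> 16) & 0xFF  # the 2's (next two hex digits)
    p3 = raw & 0xFFFF        # the 3's (last four hex digits)
    p31 = p3 >> 8            # first two 3's
    p32 = p3 & 0xFF          # last two 3's

    c1 = (p1 ^ p32) ^ 0x69
    c2 = (p2 ^ p31) ^ 0xE5
    c3 = p3 ^ 0x4000

    # Concatenate c1-c2-c3 into a single number.
    return (c1 << 24) | (c2 << 16) | c3


print(hex(ec2_id_to_launch_number("i-11223333")))  # 0x4bf47333
```

      Taking the difference between the numbers computed for two IDs would then give a rough count of launches between them, subject to the even/odd caveat.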

      The upshot of Guy’s error is that he underestimates the launches by almost 2x! Here is a graph showing the instances launched daily since late 2006 that we would postulate based on his formula for instance IDs and what we’ve observed. We compute a total of 15.5 million instances (!) launched to date:

      [Graph: EC2 instances launched per day since late 2006]

      You can see that EC2 has been growing very steadily, except for dips during the holidays and a spike in activity in April of 2008. That spike was due to Animoto’s scaling to several thousands of servers within a few days. We’re a little puzzled about this spike, however, because the instance ID analysis shows about 2x more servers launched than Animoto actually launched (we launched them so we know). We believe this discrepancy to be temporary, but there remain some mysteries in the instance ID allocation…

      It’s also important to be clear about what an instance launch means: namely, the launch of a virtual server. It says nothing about what size server is launched (and therefore its cost per hour) or how long that server runs (and therefore how many servers are running concurrently). As a result, an “instance launch” might mean as little as 10 cents in EC2 revenue (1 small instance for 1 hr) or, for example, $7008 in EC2 revenue (1 XL instance run for 365 days), or even more. That’s quite a difference, and makes it challenging to calculate revenues based solely on total instance launch statistics.
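
      To make the spread concrete, here is the arithmetic behind those two figures as a small Python sketch. The small-instance rate of 10 cents per hour comes straight from the example above; the XL rate of 80 cents per hour is not stated but is implied by $7008 over 365 days of 24 hours. The function and constant names are made up for illustration.

```python
HOURS_PER_DAY = 24

# Rates in cents per hour. The small rate is quoted in the text; the XL rate
# is inferred from the text's own figure: $7008 / (365 * 24 h) = $0.80/hr.
SMALL_RATE_CENTS = 10
XL_RATE_CENTS = 80


def launch_revenue_cents(rate_cents_per_hour: int, days: int = 0, hours: int = 0) -> int:
    """Revenue in cents for one instance launch that runs for the given time."""
    total_hours = days * HOURS_PER_DAY + hours
    return rate_cents_per_hour * total_hours


low = launch_revenue_cents(SMALL_RATE_CENTS, hours=1)
high = launch_revenue_cents(XL_RATE_CENTS, days=365)
print(f"one launch: ${low / 100:.2f} to ${high / 100:.2f}")
```

      Two launches that look identical in the ID sequence can therefore differ in revenue by a factor of about 70,000.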

      Another interesting fact that we have observed is that during 2009 many of the larger EC2 customers have been migrating to the larger instance sizes. In earlier days the predominant method of scaling was by launching more servers, but we are now seeing a lot more scaling by replacing smaller servers with larger ones. Those XL servers are going like hotcakes! In addition we see a clear rule where the larger the server the longer it runs. A lot of the small servers go as quickly as they came: they’re used for experimentation, development, and testing. Once you launch a large server and fill it up with data chances are you’ll keep it running for a while. Hold onto your wallet!

      Another interesting trend we’ve seen is the improvement in sysadmin-to-server ratio. Our customers who grok the RightScale platform become very effective at managing lots of servers with few people. Hundreds to thousands per sysadmin. As a result they use servers aggressively to solve business needs — whether to keep up with exponential traffic or simply flexibility during dev & test.

      Overall, in terms of all cloud spending, in the last 12 months we’ve observed:

      • Cloud infrastructure spending grew 380% – i.e. $$ spent on cloud provider resources
      • Average cloud costs per customer grew 140% – i.e. cloud users on average are spending 2.5X more than a year ago
      • RightScale’s own cloud infrastructure consumption grew 440%

      That’s phenomenal growth – and testimony to the value of managed cloud computing.

      Meanwhile, the beat goes on, and we’re all consuming more and more cloud resources as each day passes. If you have a story about your own cloud usage, or trends and patterns you’re seeing in cloud usage in general, please post a comment or send it in.

      RightScale Release: RackSpace, RightLink, Chef, Machine Tags, VPC, and more!

      Yesterday’s release included a number of features that I’ve been itching to get into RightScale for a long time. This stuff is fresh off the press in alpha-release form so we’re hoping for your feedback so we can evolve it to suit your needs. Here are the highlights and some background on where we’re headed.

      First off we’re adding RackSpace CloudServers to the set of clouds in RightScale and it’s available to everyone as of today! All you need to do is to get a CloudServers account and enter your credentials into RightScale. Please refer to our tutorial for the details. What we’re releasing today is full support for our ServerTemplate machinery which is the foundation for building cloud portable systems. The ServerTemplates are built using our new RightLink agent and support Chef cookbooks as well as our standard RightScripts (see below for more info on this). While we don’t have a RightImage available for RackSpace quite yet it turns out that we’ve implemented enough magic to make the “Ubuntu 8.10 (intrepid)” image provided by RackSpace work as if it were a RightImage.

      Some of the features we’re missing for RackSpace are a full set of the core RightScale production ServerTemplates and the support for monitoring, alerts and automation, such as auto-scaling arrays. We’re working hard to release all this as soon as we can and that’s one reason the current RackSpace support is still labeled alpha.

      The second major new feature is the RightLink agent which supports not only RightScripts but also Opscode’s Chef cookbooks. The RightLink agent connects each server with the RightScale core as well as other servers around it. Boot scripts and operational scripts are launched via RightLink and we’ll fully support direct server-to-server communication in an upcoming release. RightLink uses Nanite for the communication, includes the Chef-client for running cookbook recipes, and can run RightScripts as well. We’ll be enhancing the whole communication infrastructure so servers can communicate with each other efficiently but in a secure and controlled manner, for example to enable application servers to register with load balancers and to locate the currently active database master.

      I’m also very excited that we are now supporting the Chef server configuration system. When I started RightScale almost three years ago I wanted to include something like Chef but couldn’t get myself to pick among the available options. When I dug into Chef earlier this year and started talking to Jessie and Adam at Opscode it became clear to me that this is the right technology for configuring servers in the cloud. Chef cookbooks are the next level beyond OS distributions like RedHat or Ubuntu: a cookbook leverages the distro for getting the right bits onto the machine and then layers the operational know-how on top: how to configure everything and perform operational tasks. RightScale’s ServerTemplates then combine all the cookbooks needed on a server into a portable package and add the coordination between servers. After all, no server operates alone in the cloud…

      A nice side-effect of using Chef is that we’ve been able to fully embrace git for developing cookbooks (svn is also supported). We publish our cookbooks on github where you can fork and change what we offer to suit your needs. The RightScale web site pulls metadata information about each cookbook directly from github or any accessible git repo and servers also get everything directly from git. This means that all of git’s (or svn’s) software development goodness (branching, merging, tracking, etc) is now fully integrated with RightScale ServerTemplates!

      We still have to put together a getting started tutorial for Chef but we have published a sample ServerTemplate called “Rails all-in-one (EC2 Chef Alpha)”. It launches and comes up running the Rails Mephisto blogging app. You’ll notice that it’s a bit on the slow side to boot — we have a number of things to optimize — but it does pull from the public Opscode and RightScale cookbooks on github. Look into the Server Template under the Repos tab and you’ll see the definitions for the repositories.

      But there’s more! We’ve started to add Flickr style machine tags to RightScale resources. A machine tag is a tag that follows a special triplet syntax of namespace:predicate=value and the purpose of machine tags is to allow anyone or any external application to attach metadata to RightScale resources. Right now tags are only available for Servers, Images, and EBS Snapshots. Rather than start attaching tags everywhere we preferred to start using tags ourselves for something concrete so we can ensure we have a good feature set. We’re using tags now for snapshots to control the rotation of backup snapshots and to organize snapshots of multi-volume stripes. We’ll soon use tags to encode the features provided by images, e.g. whether they’re RightImages, support RightLink, support the freezing of repositories, etc. But most importantly we’ll add API access to tags so you can attach your automation to tags. We’d love to hear from you what exactly we need to provide. But in the meantime you can at least add tags to servers and use that in the UI to filter the list of servers you see.
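
      For anyone planning automation around tags, here is a minimal Python sketch of parsing the namespace:predicate=value triplet and filtering servers by tag. It is not RightScale code or the RightScale API: the function names, the example tag names, and the character rules for namespace and predicate (a letter followed by letters, digits or underscores, borrowed from the Flickr convention) are all assumptions.

```python
import re
from typing import Dict, List, NamedTuple


class MachineTag(NamedTuple):
    namespace: str
    predicate: str
    value: str


# Assumed syntax: namespace and predicate start with a letter and contain only
# letters, digits or underscores; the value is everything after the first '='.
_MACHINE_TAG = re.compile(r"([A-Za-z][A-Za-z0-9_]*):([A-Za-z][A-Za-z0-9_]*)=(.*)")


def parse_machine_tag(tag: str) -> MachineTag:
    """Split 'namespace:predicate=value' into its three parts."""
    match = _MACHINE_TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a machine tag: {tag!r}")
    return MachineTag(*match.groups())


def servers_with_tag(servers: Dict[str, List[str]], wanted: str) -> List[str]:
    """Return the names of servers carrying exactly the wanted machine tag."""
    target = parse_machine_tag(wanted)
    matches = []
    for name, tags in servers.items():
        for tag in tags:
            # Plain (non-machine) tags are skipped rather than treated as errors.
            if _MACHINE_TAG.fullmatch(tag) and parse_machine_tag(tag) == target:
                matches.append(name)
                break
    return matches


# Hypothetical servers and tag names, for illustration only.
servers = {
    "db-master": ["backup:rotation=daily", "production"],
    "web-1": ["backup:rotation=weekly"],
}
print(parse_machine_tag("backup:rotation=daily"))
print(servers_with_tag(servers, "backup:rotation=daily"))
```

      Keeping the three parts separate rather than matching on raw strings makes it easy to later filter on just a namespace or a namespace and predicate pair.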

      Amazon has been on a tear lately with few weeks going by without a new feature announcement. The most important news to come along in a long time has been the introduction of Virtual Private Clouds (VPC) and we’re pleased to support them in this release, which means that you can create subnets in your VPC and launch servers into them. We’re also now supporting the purchase of reserved instances straight from the RightScale dashboard, plus we’ll show what you’ve purchased.

      Finally we’ve improved the speed of the site across the board, especially for larger accounts with lots and lots of servers. We continue to appreciate feedback on anything that doesn’t work well or that we should enhance: use the feedback link on our site or email feedback@rightscale.com directly.

      I hope you’ll enjoy the new features as much as we do — yes, we eat our own dog food and manage RightScale using RightScale!

      Internal external private public hybrid virtual cloud

      I’d like an external private hybrid cloud, dry, with whole milk, please!

      Enterprises rise to the cloud, terminology takes off… As if we didn’t have enough cloud confusion already. But after some thinking it’s not all bad news: some of the terms do make sense. While many of the benefits associated with the cloud are independent of cloud type – internal, external, private, public – the type of cloud does determine regulatory compliance, security and financial benefits. The cloud end-user mostly shouldn’t have to care, but to IT these are important considerations.

      Note that I’m exclusively talking about infrastructure clouds (IaaS) here, like Amazon EC2, so all this is orthogonal to the SaaS vs. Platform cloud (PaaS) vs. Infrastructure cloud (IaaS) terminology axis.

      Many of the benefits of the cloud to central IT are independent of the exact nature of the cloud:

      • Automation increases reliability and system administrators’ efficiency
      • Self provisioning by end users reduces IT menial labor
      • Cost reduction by homogenizing and simplifying the infrastructure

      But when we get to regulatory, security and financial benefits internal/external and public/private cloud types come into play. Let me try to define:

      • An internal cloud is located in the enterprise datacenter; the enterprise owns the assets, which are capitalized
      • An external cloud is located at a service provider and charges are expensed
      • A private cloud is dedicated to an organization, it’s “single tenant” in that sense (but that’s a tricky nomenclature because a private cloud may be used by many internal tenants within the organization)
      • A public cloud is shared across many organizations that don’t even know about each other

      Several combinations of the above make sense and here are some examples:

      • An internal private cloud could be a Eucalyptus or (future) vCloud implementation in the datacenter of a large enterprise
      • An external private cloud could be a service provider, like perhaps IBM dedicating a number of racks in their facilities for a cloud they operate on an enterprise’s behalf
      • An external public cloud is what the cloud started as with Amazon EC2 and now emulated by others like RackSpace
      • An internal public cloud doesn’t make much sense to me, but I’m sure we’ll see some, perhaps it can make sense for renting out unused capacity, who knows…

      This nomenclature turns out to be useful in teasing out the benefits of these various types of clouds. For public vs. private clouds the two main distinguishing factors are isolation and elasticity. In a private cloud it is easier to draw a hard boundary around the servers, the storage, and the network used by an organization’s cloud resources. This may have advantages from a security compliance and audit point of view. On the flip side, public clouds will tend to have more elasticity than private clouds because of the increased scale and ability to balance across more disparate types of uses. The elasticity is a very important cloud characteristic because it underlies a number of the end-user benefits.

      Amazon’s Virtual Private Cloud (VPC) is an interesting middle ground between the strict public and private definitions. The VPC provides increased isolation between a VPC’s resources and those of other users, but Amazon isn’t very clear on the exact nature of this increased isolation. At the same time the VPC does not compromise elasticity and cost-effectiveness, which is very important. Werner Vogels argues that without the elasticity it’s not a cloud.

      The three main distinguishing benefits of internal vs. external clouds are about control, the nature of the costs and cloud locations. By outsourcing the cloud infrastructure to a service provider the typical cap-ex costs of computing infrastructure can be turned into variable costs that scale relative to the actual use of resources. As more and more service providers offer clouds across the globe it is also increasingly easy to place compute resources where they are needed, whether for latency reasons or for regulatory purposes. Internal clouds are bound to where the enterprise has or can summon physical resources.

      That leaves the word ‘hybrid’. At RightScale we’ve been using it to denote hybrid cloud uses where an organization makes use of different types of clouds, which is something we believe will be very common. Given the large application portfolios in many enterprises some will undoubtedly be good candidates for credit-card based self-provisioning in external public clouds while others will remain under close scrutiny of IT in internal private clouds for a long time. This type of hybrid use is where the RightScale service is very effective at providing a seamless experience across the many clouds.

      While all the concerns around the internal / external / private / public nature of a cloud are interesting, it is important not to lose track of the fact that a cloud is a means, not an end. The most important thing is to deliver the benefits of the cloud to its end users, those who will launch servers in the cloud and use the cloud on a daily basis. In the enterprise space this includes many constituencies across the organization outside of central IT thanks to the fact that the cloud moves the provisioning closer to the end user. Developers can launch dev servers in the cloud when they need them and shut them down again when they’re done. Test engineers can launch whole clusters for test runs and they go away automatically at the end of the run. Operations engineers can set up staging systems for short periods to engineer the roll out of the next release. Marketing support engineers can launch demo systems for events or important prospects, and in general the various business units are in more direct control of their compute resources. All these users are outside of central IT.

      The cloud end user benefits I see in the enterprise settings:

      • Self-provisioning by end users so they can decide when, what, and how much.
      • Increased flexibility and reduced planning thanks to the on-demand nature of the cloud
      • Reduced costs thanks to fewer idle servers and economies of scale and commoditization
      • Increased operational efficiency thanks to more automation from management platforms like RightScale

      It’s important to note that none of the end-user benefits are directly related to whether it’s a private, public, internal, or external cloud. End users should care about the elasticity and on-demand nature of the cloud as well as the automation offered by cloud management services like RightScale.

      Well, while writing this rather long blog entry the different terms have actually started to grow on me. They do make sense in the right context. But what I am left with is the worry that everything cloud is becoming yet more complex when one of the fundamental benefits of the cloud is simplicity and standardization. The need to simplify IT was also one of the top messages delivered by VMware CEO Paul Maritz at VMworld this year. We have to continue simplifying and standardizing clouds and cloud application architectures at the same time as the forces of enterprise IT try to pull it all in a thousand different directions.
