… while I’m at it, I might as well… Here is a short list of the top poor design decisions that I’ve seen in cloud APIs. Let me rephrase that… Here’s a short list of the top API features that got in our way or simply didn’t work for us. There may well be other use-cases where these features make sense.
- Listing of resources without the details, e.g., a list-servers call that doesn’t return all the details for each server. This makes it very expensive to poll for server state changes because the listing doesn’t have enough information and so one has to do a show-server for each individual server. Imagine polling for an account that has several thousand servers – ouch. It’s fine to have a “with details” flag in the request so one can get the bare list, but we’d always set that flag.
- Not returning a resource id on creation. Some APIs don’t give you a server id when you request a server to be launched; the response is just “ok, we’ll launch a server”. This means you end up guessing “is that new server that just appeared in the list the one I just launched?”
- Providing a task queue. Several APIs I’ve seen have a task queue that is supposed to provide updates on tasks that are in progress. E.g., you launch a server and you get a handle to a task descriptor. For us that’s just overhead. Include a state field in the resource itself and we’ll keep track of the state changes on the resource. So if mounting a volume takes a while, create the volume resource and set its state to “attaching” (or whatever is appropriate). Having a separate resource to say “that volume you created is attaching” means that the state of a resource now lives in several places.
- Lacking publishable images or the equivalent of EC2’s user_data (small amount of data that is passed to the launching server via the launch API call). I touched on these in my previous blog post.
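To illustrate the first and third points, here is a minimal sketch of how a client could detect state changes from a single detailed listing per poll, with no show-server call per server and no task queue. The server shape (`id` and `state` keys), the state names, and the `state_changes` helper are all assumptions made up for this example, not any particular cloud’s API.

```python
from typing import Dict, Iterator, List, Tuple

Server = Dict[str, str]  # hypothetical shape: {"id": ..., "state": ...}


def state_changes(
    previous: Dict[str, str], listing: List[Server]
) -> Iterator[Tuple[str, str, str]]:
    """Yield (server_id, old_state, new_state) for each server whose state changed.

    `previous` maps server id -> last seen state and is updated in place,
    so the same dict can be passed to every poll.
    """
    for server in listing:
        server_id, new_state = server["id"], server["state"]
        old_state = previous.get(server_id, "unknown")
        if old_state != new_state:
            yield server_id, old_state, new_state
        previous[server_id] = new_state


# Two consecutive detailed listings, as a list-servers call might return them.
first_poll = [{"id": "srv-1", "state": "pending"}, {"id": "srv-2", "state": "running"}]
second_poll = [{"id": "srv-1", "state": "running"}, {"id": "srv-2", "state": "running"}]

seen: Dict[str, str] = {}
list(state_changes(seen, first_poll))             # first poll: every server is new
changes = list(state_changes(seen, second_poll))  # second poll: only real transitions
print(changes)
```

With a bare listing, the same loop would need one extra request per server on every poll, which is what makes accounts with thousands of servers so expensive to watch.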
UPDATE: reviewing YACA (yet another cloud API) reminded me of another two sins:
- Not returning deleted resources in a “list resource” call. In particular, terminated servers must be returned in a list-servers call for a certain duration, probably at least an hour. Otherwise the client has to infer that a server self-terminated or failed when it no longer finds it in the result of list-servers calls. That inference is unsafe: we have seen multiple completely different clouds fail to list running servers. In the case of EC2, which lists terminated instances for a good amount of time, this resulted in error emails alerting us to the situation. In another cloud it resulted in servers being marked as terminated, which is an irreversible operation and often triggers alerts and automation. And then the servers “resurrected”. Ouch! Now combine this with the next sin:
- Pagination that goes page-wise instead of using a marker, e.g. where you get page 1 or the first 100 resources and then issue a query for “page 2” or “from 100 on”. Explain to me how a client can get a consistent resource listing when resources can be added and removed concurrently. This is particularly fun if the client has to infer deletion from the absence of a resource in the listing: was it deleted, or did it fall through the cracks between pages because a different resource was deleted concurrently with the listing? The proper way to do pagination is using markers like Amazon does it, but for a cloud API I actually don’t see the value in pagination. We always retrieve the whole list.
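The “fell through the cracks” problem is easy to reproduce. Below is a small sketch using an in-memory stand-in for a cloud; `FakeCloud`, `list_by_page`, and `list_after` are hypothetical names invented for this example, and the marker scheme (resume strictly after the last id seen, with ids in sorted order) is just one simple way to implement markers, not a description of Amazon’s implementation.

```python
from typing import List, Optional, Tuple

ALL_IDS = ["srv-a", "srv-b", "srv-c", "srv-d"]


class FakeCloud:
    """In-memory stand-in for a cloud API; ids are kept in sorted order."""

    def __init__(self, ids: List[str]) -> None:
        self.ids = sorted(ids)

    def list_by_page(self, page: int, size: int) -> List[str]:
        # Page-wise: positions shift whenever an earlier resource is deleted.
        start = page * size
        return self.ids[start:start + size]

    def list_after(
        self, marker: Optional[str], size: int
    ) -> Tuple[List[str], Optional[str]]:
        # Marker-based: resume strictly after the last id the client saw.
        rest = [i for i in self.ids if marker is None or i > marker]
        batch = rest[:size]
        next_marker = batch[-1] if len(rest) > size else None
        return batch, next_marker


# Page-wise listing with a deletion between the two requests.
cloud = FakeCloud(ALL_IDS)
paged = cloud.list_by_page(0, 2)
cloud.ids.remove("srv-a")            # concurrent deletion
paged += cloud.list_by_page(1, 2)    # srv-c has shifted into page 0 and is skipped

# Marker-based listing under the same deletion.
cloud = FakeCloud(ALL_IDS)
marked, marker = cloud.list_after(None, 2)
cloud.ids.remove("srv-a")            # concurrent deletion
while marker is not None:
    batch, marker = cloud.list_after(marker, 2)
    marked += batch

print("page-wise:", paged)
print("marker:   ", marked)
```

In the page-wise run the client never sees `srv-c` even though it exists the whole time, so a client that infers deletion from absence would wrongly mark it terminated. The marker run is still not a consistent snapshot (it reports the deleted `srv-a`), but it does not skip a live resource.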
If you’re working on a cloud API, please think twice if you’re doing one of the above. Again, I don’t know all the use cases, just ours.
Now here’s what I’d really like to see, and what we’re working on for internal purposes (it’s not easy): an event-based interface instead of a request-reply interface. Request-reply is fine if you have a system that sends commands to the cloud. It’s a problem when you build a system that reacts to changes in the cloud, because you have to keep polling all these resources. We run a good number of machines that do nothing but chew up 100% cpu polling EC2 to detect changes. Fortunately cpu cycles are cheap.
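To make the idea concrete, here is one possible shape for such an interface, sketched as an append-only event feed that a client reads from a cursor. This is purely an assumption for illustration: `Event`, `EventFeed`, `publish`, and `since` are hypothetical names, and it does not describe what we are building internally or what any cloud offers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Event:
    seq: int          # monotonically increasing position in the feed
    resource_id: str
    state: str


@dataclass
class EventFeed:
    """Hypothetical append-only feed a cloud could expose alongside its API."""

    events: List[Event] = field(default_factory=list)

    def publish(self, resource_id: str, state: str) -> Event:
        # Cloud side: record every state change as a new event.
        event = Event(len(self.events) + 1, resource_id, state)
        self.events.append(event)
        return event

    def since(self, cursor: int) -> List[Event]:
        # Client side: fetch only the events not seen yet.
        return [e for e in self.events if e.seq > cursor]


feed = EventFeed()
feed.publish("srv-1", "pending")
feed.publish("srv-1", "running")

cursor = 0
batch = feed.since(cursor)      # client catches up on everything so far
cursor = batch[-1].seq          # and remembers where it stopped

feed.publish("srv-1", "terminated")
new_events = feed.since(cursor)
print([(e.resource_id, e.state) for e in new_events])
```

The point of the sketch is that the work done by the client is proportional to the number of changes rather than to the number of resources, which is what full-list polling costs.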

This nomenclature turns out to be useful in teasing out the benefits of these various types of clouds. For public vs. private clouds the two main distinguishing factors are isolation and elasticity. In a private cloud it is easier to draw a hard boundary around the servers, the storage, and the network used by an organization’s cloud resources. This may have advantages from a security compliance and audit point of view. On the flip side, public clouds will tend to have more elasticity than private clouds because of their larger scale and their ability to balance across more disparate types of uses. Elasticity is a very important cloud characteristic because it underlies a number of the end-user benefits.
The three main distinguishing factors for internal vs. external clouds are control, the nature of the costs, and cloud location. By outsourcing the cloud infrastructure to a service provider, the typical cap-ex costs of computing infrastructure can be turned into variable costs that scale with the actual use of resources. As more and more service providers offer clouds across the globe, it is also increasingly easy to place compute resources where they are needed, whether for latency reasons or for regulatory purposes. Internal clouds are bound to where the enterprise has or can summon physical resources.
Developers can launch dev servers in the cloud when they need them and shut them down again when they’re done. Test engineers can launch whole clusters for test runs that go away automatically at the end of the run. Operations engineers can set up staging systems for short periods to engineer the rollout of the next release. Marketing support engineers can launch demo systems for events or important prospects, and in general the various business units are in more direct control of their compute resources. All these users are outside of central IT.