As many of you know, we’ve been working on supporting Windows on par with Linux within RightScale for close to a year. Well, the big moment has finally arrived, and we can announce the GA (General Availability) of our Windows support! By “support” we mean much more than being able to launch a Windows instance on EC2, which has been possible for a long time, so I thought it worthwhile to expand a little on what it took to make Windows behave nicely in the cloud and what the outcome is. This is also a good opportunity to examine whether there is a “Windows handicap” in the cloud, or whether it works just as smoothly as Linux.
First the upshot: across the board we now support Windows 2003 and 2008 on par with Linux. We provide RightImages that work well with RightScale out of the box and can be managed in the dashboard; we support extensible monitoring with associated alerts and automation; and we support ServerTemplates with Windows and even provide a few sample templates. Plus, if you have existing images, we provide a RightLink installer with documentation and a starter ServerTemplate that gives you monitoring out of the box. This will give you a good dose of the RightScale love (monitoring, scaling arrays, user access control, etc.).
The mountain of work that we’ve been chipping away at for the past two weeks is being able to release 40 Windows RightImages. This work wasn’t what we had imagined; it crept up on us little by little. When we release RightImages, whether for Windows or Linux, we strive to provide a consistent software environment across the RightImages of a generation and to ensure they work smoothly with RightScale. Both the Linux and Windows RightImages work great without RightScale as well, and many EC2 users have been using Linux RightImages for a long time. Here is what we had to do for the Windows RightImages:
- install RightLink for integration with RightScale
- ensure the images are automatable, meaning psexec can be used, which also involved getting consistent settings for the Windows Firewall, file sharing and a few other details
- install PowerShell 2.0 on all images, including the Windows 2003 ones, because according to everyone we’ve talked to it’s a must
- ensure all the right DLLs are installed to manage SQL Server from PowerShell
- increase the size of the root disk for the 2008 images because the standard 40GB is insufficient for a lot of interesting software installs (SharePoint being one example), and increase the pagefile to 1.5x RAM size (up to 8GB)
- ensure the server’s clock is successfully synchronized to NTP at boot and periodically thereafter
- install logic to determine when the server is fully ready after boot before starting any automation, which is not as simple as it may sound due to sysprep, double boot, and an EC2-specific service that runs at boot and sets a few things up
- ensure complete install of ASP.NET on Windows 2003
This list kept growing as we discovered issues. In addition, we are supporting all four EC2 regions (us-east, us-west, eu-west, ap-southeast) and it turns out we have to build each image from scratch in each region because it’s not possible to copy a Windows image from one region to another (licensing restrictions). We are supporting 10 images in each region:
- Windows 2003/2008 across i386/x64
- Windows w/SQLServer Express 2003/2008 across i386/x64
- Windows w/SQLServer 2003/2008 on x64
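To make the arithmetic behind the 40 images concrete, here is a small sketch that enumerates the build matrix from the lists above. The names (`REGIONS`, `FLAVORS`, `build_matrix`) are made up for illustration and are not taken from our actual build tooling.

```python
from itertools import product

# Regions and OS versions as listed in the text.
REGIONS = ["us-east", "us-west", "eu-west", "ap-southeast"]
OS_VERSIONS = ["2003", "2008"]

# (flavor, supported architectures), one entry per bullet above.
FLAVORS = [
    ("Windows", ["i386", "x64"]),
    ("Windows w/SQLServer Express", ["i386", "x64"]),
    ("Windows w/SQLServer", ["x64"]),
]


def images_per_region():
    """Every (flavor, version, arch) combination built in a single region."""
    return [
        (flavor, version, arch)
        for flavor, arches in FLAVORS
        for version, arch in product(OS_VERSIONS, arches)
    ]


def build_matrix():
    """Every image across all regions; each one is built from scratch."""
    return [(region, *image) for region in REGIONS for image in images_per_region()]


if __name__ == "__main__":
    print(len(images_per_region()), "images per region")
    print(len(build_matrix()), "images in total")
```

That works out to 4 + 4 + 2 = 10 images per region and 40 in total. Because images cannot be copied between regions, every row in the matrix is a separate build.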
Getting to the final images involved a fair number of iterations (we don’t know how many images we built in total, and frankly, we’d rather not be reminded), so we built automation to crank out the images. This makes them consistent and reproducible. Unfortunately, the AWS images we had to start from are not all automatable, so we also had to build some “intermediate” images by hand with just a few settings tweaked so we could target them with our automation. Some of our developers ended up with nightmares about clicking through endless circular install dialog boxes!
While we were launching lots of Windows instances for testing, we noticed that a fair number had their clock off by a large amount, like days. Digging into the issue, we discovered that, unlike with stock AWS Linux images, the Windows wall clock is not synchronized to the virtualization host’s time and that the initial NTP synchronization doesn’t always succeed, partially because the Windows NTP service is “challenged” but also because it is pointed at a public pool of servers. Since performing automation on a server that thinks it’s yesterday or tomorrow is a non-starter, we concluded that we had to beef up the time synchronization in two ways: by ensuring that we get an NTP sync before proceeding with any automation, and by running our own set of NTP servers so our customers always have in-cloud NTP servers available to synchronize with. We think this really is part of the famous “muck / undifferentiated heavy lifting” that Amazon prides itself on taking care of, but they politely declined, which is really a shame. We also noticed that the new HVM Linux images don’t lock their clocks to the host, so we will probably switch the way the clock sync works across all RightImages.
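The “no automation until the clock is sane” rule can be sketched as a simple gate. This is a minimal illustration, not the code that ships in RightLink: `wait_for_clock_sync` and its parameters are hypothetical, and `query_offset` stands in for whatever actually asks an NTP server how far off the local clock is (it is assumed to return the offset in seconds and to raise `OSError` when the server cannot be reached).

```python
import time
from typing import Callable, Optional, Sequence


def wait_for_clock_sync(
    query_offset: Callable[[str], float],  # hypothetical: offset in seconds
    servers: Sequence[str],
    max_offset_s: float = 1.0,
    attempts: int = 10,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Block until some server confirms the local clock is close enough.

    Returns the server that confirmed it, or None if every attempt failed.
    This only measures the offset; correcting the clock is assumed to be
    handled elsewhere (for example by the OS time service).
    """
    for attempt in range(attempts):
        for server in servers:
            try:
                offset = query_offset(server)
            except OSError:
                continue  # unreachable or timed out: try the next server
            if abs(offset) <= max_offset_s:
                return server
        if attempt < attempts - 1:
            sleep(delay_s)  # give the time service a chance to catch up
    return None
```

Automation would then start only when the function returns a server name. Listing several servers is the point of running our own in-cloud set: one unreachable server should not hold up a boot.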
Another interesting issue we discovered during testing was that automation at boot time would frequently fail. The root cause ended up being that our service, which was configured to launch at boot, was starting too early, when the server just wasn’t ready yet. The way Windows boots in the cloud is quite different from Linux or from the “normal” world. With Windows, each OS install generates a server key that is embedded in the registry and uniquely identifies the server, so when an image is booted many times, a fresh key needs to be generated for each instance, a number of things need to be updated (“sysprep”), and the server needs to be rebooted once. Towards the end of the second boot, a special Ec2Config service finalizes the config, including admin password and hostname. This means that any automation has to wait for the reboot plus the Ec2Config service to complete its changes, which is not trivial due to an oversight by AWS. It’s interesting how the security details of Windows (i.e. the server key) ripple down into the whole boot process, making it take twice as long as it should. The net result is that while Linux instance boot times on EC2 have come down from a typical 6-8 minutes back in 2006 to under a minute now when using EBS images, the Windows boot times are starting out around 10-15 minutes. Hopefully Microsoft can be sensitized to the notion that fast boot times are an important asset in the cloud, because they enable a lot of automation that is very painful if one has to wait so long for additional capacity or replacement servers to come online.
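In spirit, the fix is a readiness gate that polls a set of conditions and only lets automation proceed once all of them hold. The sketch below shows the shape of such a gate under stated assumptions: `wait_until_ready` is a hypothetical name, and the individual checks (for instance “sysprep has finished” or “Ec2Config has finished”) are passed in as plain callables, because how to detect each condition is exactly the non-trivial part and is not shown here.

```python
import time
from typing import Callable, Dict


def wait_until_ready(
    checks: Dict[str, Callable[[], bool]],  # hypothetical readiness probes
    timeout_s: float = 900.0,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll every check until all pass; return False if the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        # Collect the names of checks that still fail on this poll.
        pending = [name for name, check in checks.items() if not check()]
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A service started at boot would call this first and begin running scripts only on `True`. The default timeout is generous on purpose, given the 10-15 minute boot times mentioned above.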
The final and biggest piece is implementing all the automation support we offer in Linux for Windows as well. As part of this we ported Chef to Windows (we’re working with Opscode to feed the changes back into the mainline) and we built out support for PowerShell. This means that software can be installed and configured on Windows servers similar to the way it’s done in Linux. We find that larger software packages often need to be installed manually, but even in that case it’s nice to be able to automate as much as possible and choose which portions to run “ahead of time” and bake into a custom image and which portions to leave off for the actual server launches. To round out the automation we wrote a nice little monitoring plugin that speaks the collectd protocol and that offers a simple meta-language that can be used to query many of the WMI statistics available on Windows servers. And all this is available out of the box to all our customers.
One of the open questions we have concerns Windows updates. We will most certainly republish fresh RightImages when AWS updates its base set. What is not clear to us is whether we should be publishing “fully updated” RightImages on a regular schedule. Most of the customers we asked told us that they really want to carefully manage the exact set of updates on their servers. It’s going to be interesting to see how the whole update management with Windows servers plays out and how it will affect the amount of image rebuilding that everyone will have to do. We will definitely build our ServerTemplates directly on RightImages, or on RightImages augmented by just base installs of larger software packages, and do all the config using Chef/PowerShell. This way we minimize the work we have to do when we swap out the underlying Windows install or update level.
I must say that overall I’m very happy with where we’re ending up, which is that we’re getting Windows to a point where it definitely is usable cloud-style. What I mean by that is not just migrating a set of traditional servers into an equivalent set of servers in the cloud, but rather automating Windows servers for the cloud and leveraging the flexibility of the cloud to enable the business. The friction along the way certainly is higher than with Linux, whether it comes from license questions that crop up everywhere or from the mechanics that currently require double-booting, but it is totally possible, and Microsoft can, if it focuses on it, make it a lot better yet!