RightLink Agent Security Features and Upgrading from V4 RightImages

A fundamental problem in cloud management is “how do I get the remote instance to do what I want it to?” Taking this task on for a few systems is doable with a number of techniques; making it scale to many thousands is not quite as simple. At RightScale, we have been on the “bleeding edge” of this issue since the early days of cloud computing, and we have learned a lot along the way. One of those lessons has led to our implementation of the RightLink agent and the RightNet protocol.

Introduction to RightLink

RightLink is the instance-side agent that supports RightScale’s RightNet protocol. The agent provides an improved and secure ability to leverage RightScale to manage large numbers of instances in the cloud. In the RightScale architecture, we leverage a light-weight RightLink agent on every instance to support our latest automation features. Prior to RightLink, which was released a bit over a year ago, RightScale leveraged the command execution features of SSH to perform tasks on remote instances. With the introduction of RightNet and the RightLink agent, we are no longer reliant on SSH access for instance management.

The RightLink agent communicates with the core RightScale systems using the Advanced Message Queuing Protocol (AMQP). RightNet leverages AMQP’s Simple Authentication and Security Layer (SASL) support to perform a basic session authentication to ensure that the RightLink agent is talking to a legitimate RightScale core component (a broker in our lingo). This session authentication uses a shared key to authenticate the ends. After the session is authenticated, RightNet uses payload encryption (openssl with X509 certificates, PKCS7 envelopes and AES 256 CBC cipher for encryption) to protect that data while in transit, and to provide a much stronger authentication mechanism (public-private key versus only the shared key of the session). Both of these security features ensure that packets are properly segmented and protected in the highly multi-tenant environment of the cloud.
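To make the envelope idea a bit more concrete, here is a minimal sketch using the openssl command line. It is not RightNet’s actual wire format or key handling; it only shows the same building blocks (an X509 certificate, a PKCS7 envelope and AES 256 CBC). The file names, the “demo-agent” subject and the function names are all made up for illustration.

```shell
# Sketch only: PKCS#7 envelope with AES-256-CBC, keyed to an X.509 certificate.
# File names, the "demo-agent" subject and function names are hypothetical.

# Create a throwaway private key and self-signed certificate in directory $1.
make_demo_identity() {
    openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
        -subj "/CN=demo-agent" \
        -keyout "$1/agent.key" -out "$1/agent.crt" 2>/dev/null
}

# Encrypt stdin for the holder of certificate $1; DER envelope goes to stdout.
seal_payload() {
    openssl smime -encrypt -aes256 -binary -outform DER "$1"
}

# Decrypt a DER envelope from stdin using certificate $1 and private key $2.
open_payload() {
    openssl smime -decrypt -binary -inform DER -recip "$1" -inkey "$2"
}
```

Only the holder of the private key can open the envelope, which is what makes this a stronger check than a shared session key alone.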

All our version 5 (v5) RightImages (and Multi Cloud Images, MCIs) include the RightLink agent by default. We started releasing v5 images over a year ago, and have seen a large, but not complete, adoption. For those of you still on v4 images, I am going to try to give you a couple more security motivations that may encourage you down the upgrade path.

  1. Ability to restrict SSH access on the instance: Because RightLink does not use SSH, you can restrict access to the SSH service on Linux systems. With non-RightLink-enabled images (i.e., v4 and earlier by default), the RightScale platform ran scripts on the instance by ssh-ing into that instance directly. The need for the SSH port to be reachable from the RightScale platform usually meant that it was reachable from any IP address, which created some exposure to potential brute force attacks. I will say that by default, RightImages configured SSHD to support public-key authentication only, so the risk of brute force password guessing was not an issue. What was an issue was that any vulnerability found in the SSHD server would then be potentially exploitable by anyone on the Internet. With RightLink, this exposure can be mitigated.
  2. Managed SSH: In addition, v5 RightImages introduce a “Managed SSH Login” feature. This allows you to use a different SSH key for each user logging into a server. It can either use an SSH key uploaded by each user or the dashboard can generate a key for each user. When using EC2 you may still select an EC2 SSH Key when launching the instance; however, it’s only really necessary if you need to log in before RightLink starts to troubleshoot something in the bootstrap process. Note that the SSH connection is from your desktop system (wherever you are running the dashboard UI from, not RightScale) to your instance, thus working seamlessly with any SSH access restrictions you put in place.
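As a rough illustration of the first point, once RightScale no longer needs to reach port 22, SSH can be limited to the networks your own users connect from. The helper below is a sketch, not something shipped with RightLink: the function name is invented, the addresses in the usage note are documentation placeholders, and on EC2 you would more likely express the same policy in a security group.

```shell
# Print (not apply) iptables rules that allow SSH only from the given CIDRs.
# Usage: ssh_allow_rules CIDR [CIDR...]
# Hypothetical helper for illustration; review the output before applying it.
ssh_allow_rules() {
    for cidr in "$@"; do
        echo "iptables -A INPUT -p tcp --dport 22 -s $cidr -j ACCEPT"
    done
    # SSH traffic that matched none of the ACCEPT rules above is dropped.
    echo "iptables -A INPUT -p tcp --dport 22 -j DROP"
}
```

Running `ssh_allow_rules 203.0.113.0/24` prints one ACCEPT rule and the final DROP rule. Make sure your own address is covered before piping the output to a root shell, or you will lock yourself out.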

SPOILER-ALERT: one of the items we are working on for RightLink v5.8 (next version coming out) is a Managed SSH Login that will bind each RightScale authentication principal to a distinct, non-root Unix user whenever they log in via the dashboard. This is intended to improve the login auditing as well as enable each user to load a customized shell profile. We’d be very interested in your feedback as to the usefulness and desire of this specific feature.

Upgrade options

The cleanest and best way to move to v5 images is to find a v5 ServerTemplate, clone it and make the modifications needed to effectively duplicate the functionality you currently have. This will work like a charm if you did your scripts right and took a modular approach to deployment.

Next option is to change the RightImage (i.e. Multi Cloud Image, MCI) you’re using to a v5 one and relaunch. The V5 execution of RightScripts is almost fully compatible with v4 so, in theory, that’s all you need to do. The catch typically is that this brings updated versions of the OS and packages with it and may cause some incompatibilities. You will probably spend a bit more time troubleshooting this avenue.

Lastly, you can get RightNet support by RightLink enabling your v4 instance (see http://support.rightscale.com/12-Guides/RightLink/04-Creating_RightScale-enabled_Images_with_RightLink), and many might be motivated to go that route. I would encourage you to move to v5. While you’ll get the “not using ssh for command and control” benefit, you will miss many other benefits of the v5 image update.

Why Again?

Because there are some really cool features in v5:

  • Managed SSH
  • Bug fixes
  • Faster Execution of Operational Scripts
  • Added Chef Support in addition to RightScripts

More details can be found at http://support.rightscale.com/06-FAQs/FAQ_0180_-_What_are_the_differences_between_v4_and_v5_RightImages%3F

It will take a bit of effort, but I guarantee the improvements you gain will be worth it! My one-liner of advice to those RightScale customers with older versions: “if you’re one of those hanging onto v4 or earlier, you really should upgrade.”

Posted in Chef, EC2, Security | 12 Comments

Why Do-It-Yourself Cloud Computing Management Is a Temporary Fad

I recently called up my buddy who used to be vice president of marketing at SugarCRM. I asked him if he ever encountered companies that were building their own CRM solutions internally. “No, that’s dumb,” he said. “That’s why they came to Sugar, so they could use ours. It’s too much work to do it yourself.”

Building your own Salesforce.com? Yup, sounds like a lot of work. Yet here at RightScale, I see many companies trying to build their own cloud management solutions. Perhaps it is the DevOps mindset that has made cloud computing so popular: “If I can’t get approval, I’ll just do it myself on the side.” Or perhaps it is because we are still in the early stages of cloud, and people are experimenting and discovering what is possible internally versus what is available in the market.

I did an informal poll of our sales team, and here’s what they said were the top reasons companies try to make their own solutions rather than use a cloud management product:

  1. They want control, or the ability to highly customize their environment.
  2. PaaS and IaaS, as concepts, seem simple, easy to jump on. A “cloud computing management platform” seems like a complex paradigm to adopt.
  3. Because they can, and they want the challenge of exploring a new frontier.
  4. The cost of a cloud management solution is too high.

OK, so these appear to be valid reasons at first glance. But these statements are typically founded in misconceptions about cloud management solutions in general or RightScale in particular, which I’ll address here:

Control: RightScale is not a PaaS service. We let you get into everything – perhaps more so than we should. Change the images if you must, run custom scripts against our API, and export usage data to include in your own data warehouse. Fifty-two percent of the servers running on RightScale are controlled by completely custom ServerTemplates, not ones we provide. Our product philosophy is to let you “get under the hood” if you need to – so please do.

Complexity: Cloud management is complex, and I don’t argue that. What RightScale aims to do is provide a layer of abstraction that makes the difficult and mundane tasks, like auto-scaling, much easier. It is unfortunate that the term seems complex, because if anything, a cloud management solution can make managing your entire cloud infrastructure and applications so much easier.

Conquering the new frontier: You’re being told by your boss to “Learn cloud now – just figure it out.” You want to truly understand what’s possible, how to build it, and deliver on expectations. As you start down this path, you cobble together some tools to accomplish your first foray into the cloud. Unfortunately, technologists have a tendency to “reinvent the wheel” as they continue along their path to the cloud. We’re many steps ahead, and we’re happy to share what we’ve already learned.

Cost: Netflix is a poster child for DIY cloud, and has been forthcoming about its experience, which has helped grow this new paradigm. Netflix “designed its cloud architecture so that it has the option to move to an Amazon Web Services competitor” if needed, according to this NetworkWorld article. At a recent conference, Adrian Cockcroft, Cloud Architect for Netflix, mentioned that Netflix has 50+ engineers working on this cloud-independent solution. Doing some quick math, that’s about $8.3 MM per year (roughly $166,000 per engineer) that Netflix spends building and maintaining this platform. That could buy a lot of RightScale Enterprise Editions!

At the end of the day, we see many customers who come to us after they outgrow their own internal solutions. They eventually discover that there are just too many things to stitch together: configuration management, systems automation, monitoring, application automation, provisioning, user permissions, reporting…it goes on.

We have hundreds of employees and have spent many millions creating the most comprehensive cloud management platform in the world. And we designed our product to drive the same way no matter which cloud you choose. So while cloud management may seem like a fun weekend project to tackle, it’s not – please don’t try it at home.

Yes, Amazon is still the dominant cloud, but a tornado of new clouds is swirling. The next thing your boss will likely ask is, “So what if we wanted to use this other cloud instead?”

Update Feb 3, 2012: Since I published this post, I’ve received a lot of feedback regarding DIY in the cloud computing space.

A few of our customer developers pointed out that they actually appreciated learning the cloud through RightScale – it gave them both an understanding of the underlying IaaS cloud as well as insight into ideal cloud management frameworks. Forbes ran an article on how this extensive cloud computing knowledge is in high demand in IT and beyond, and we’re starting to see RightScale listed as a required skill on some of these job postings.

Next, I’ve heard from a few more large companies who have built their own internal cloud management solution. They also cited approximately 50 engineers in their cloud computing groups, so it seems this is the sweet spot for development and maintenance of a robust internal solution. Let’s not forget about the PaaS-like solutions we offer with our ServerTemplates in this regard – it is not just automated provisioning that these larger companies ultimately need to build.

I’m not saying you can’t “do it yourself” in cloud computing (or in anything for that matter). I just want to encourage developers to avoid the trap of #3 above – namely ignoring off-the-shelf solutions in the interest of personal discovery. It may work in the short term… until you hit one of the many walls that we’ve already had to plow through. At that point, you’ll either have to scale the solution and team, or re-architect for a product that offers the necessary solutions already.

Posted in Cloud Computing | 4 Comments

Ending the Year with a bang! 5 new clouds managed by RightScale

What a year it’s been!  We’ve released a lot of really cool features, including a MultiCloud API and many MultiCloud ServerTemplates.  To round out the year, last week, we launched 5 new public clouds that are available on the RightScale MultiCloud Management Platform: AWS South America in São Paulo, Datapipe, Logicworks, SoftLayer and Rackspace UK.  These new clouds offer choice for our users when they ask where workloads should be launched on the cloud.  With these latest additions, we span a total of 8 geographic areas with additional presence in Amsterdam, Dallas, Hong Kong, London, New York, São Paulo, Seattle, Singapore and Washington DC.

These clouds have been in the works for a little while, and I’m pleased they are now available in the RightScale platform for our customers.  When we integrate with a given cloud, we work hard to ensure a seamless experience across all the clouds we support.  We provide a generic interface to each of the clouds integrated within RightScale.  This is not to limit functionality from the clouds themselves, but rather to ensure all that cool functionality is usable.  If I’m using SoftLayer and Datapipe, I don’t want to deal with different storage solutions like volumes or instance based storage (or at least not until I’m ready to optimize the storage).  Likewise, keep networking off my plate…I don’t care whether it’s security groups or iptables.  Just make that infrastructure stuff work so that my app can run.

As a user, I want to easily port what I have in one resource pool to another resource pool.  For this purpose, RightScale has generic constructs for things like instances, instance types, images, volumes, volume snapshots, etc., that are exposed in our dashboard.  Then, in our ServerTemplates (stay tuned by the way, a release is imminent), we use Chef to abstract features for individual ServerTemplates that work, albeit very differently, across different resource pools.  Using the above example, someone launching servers in SoftLayer’s Amsterdam cloud and Datapipe’s Hong Kong cloud doesn’t have to worry about the differences between network configuration and storage management.  You can launch an entire 3-tier PHP architecture on both environments using ServerTemplates from the MultiCloud Marketplace.  We’ll take care of dealing with instance based storage in Amsterdam and set up the proper security groups for you in Hong Kong through the platform.

Why does RightScale spend so much time touting ‘MultiCloud’ and why should anyone care?

It’s a good question to ask actually.  I spend a lot of my time working with service providers and various companies looking to deliver infrastructure as a service for public consumption.   A number of people, our existing customers included, come to us and say “hey, I know I will have multiple clouds (if I don’t already)…help me make that happen.”  Analysts also agree – Forrester’s Holger Kisker touts “multi cloud becomes the norm” as his number 1 cloud computing prediction for 2012.

It’s real.  And it’s great validation for being the leader in “MultiCloud Management”.

Perhaps even more interesting (and contradictory if you think about it) is that the service providers say the same thing!  We describe how RightScale offers clouds to consumers and the choice consumers have to use what works best for their business needs.  And, IaaS providers are more than happy (okay, some take it as a challenge to deliver an even better service for their users. ;-) ).  In truth though, they recognize that cloud is a heterogeneous environment.  A single customer will use more than one cloud offering in a single environment.  Cost is one factor, and another I hear often is performance.  In some cases, geographic location is important and they “can’t get there with their current IaaS provider.”  It’s an opportunity for some to seize, and we’re partnering with the best to deliver the multi-cloud solutions our customers want.

Within RightScale, you can use any or all of the following clouds – all the Amazon regions, SoftLayer, Rackspace Cloud across US and UK, Datapipe, Logicworks as well as private cloud management with CloudStack and Eucalyptus.

I encourage you to click and try the new clouds on RightScale.  Use a new app or an existing one that’s already in cloud and as always, let us know what you think.

Posted in AWS, Cloud Computing, EC2, Releases

Applying Security Workarounds in the RightScale Universe

In a recent post I discussed some of the options for patch management in the RightScale platform. This time I will talk about what happens when a patch is not available through traditional patch channels from the vendor. This typically happens in one of two cases:

  1. A “workaround” or configuration “fix” is made available from the vendor of a package
  2. The vendor of a package applies a security patch to their distribution, but the patch has not been applied to the packages distributed by the operating system vendor

In both of these scenarios, updating from the vendors’ Security Repositories will not provide a fix; it is necessary to do some custom “configur-ating” to get the patch or workaround applied to instances. In both situations, a patch/fix is deployed with custom RightScripts to running instances, as well as to those that will be launched until the vendor package patch is released and ServerTemplates are updated.

I’ll walk through one possible way to accomplish this for the recent Apache HTTPd Denial of Service vulnerability. To refresh everyone’s memory, a vulnerability was found in various versions of Apache that allowed a remote attacker to consume all the CPU on the system the Apache server was running on. Workarounds were issued shortly after the vulnerability disclosure; then, about a week later, Apache released an official patch in the form of an updated version. We were running some HTTPd 2.2.x servers and the specific version we needed was HTTPd 2.2.20. At the time of the patch release, the Linux distros had not yet updated their packages, so we needed to implement an out of band patch (i.e., work around our normal process).

Here are the steps we took to update HTTPd running on a CentOS based image. We did the initial build on a test server because, as you will see, some hefty debugging was needed to get it right:

  1. Start by reviewing the applicable CVE and find information about the vulnerability.
  2. Pull down the sources into the test instance. The instance should be launched with the same ServerTemplate as current running instances that will need to be updated
    • curl -o /usr/src/redhat/SOURCES/httpd-2.2.20.tar.gz http://apache.cyberuse.com//httpd/httpd-2.2.20.tar.gz
    • curl -o /usr/src/redhat/SOURCES/httpd-2.2.20.tar.bz2 http://apache.cyberuse.com//httpd/httpd-2.2.20.tar.bz2
    • wget http://mirrors.servercentral.net/fedora/releases/test/16-Alpha/Fedora/source/SRPMS/httpd-2.2.19-4.fc16.src.rpm
  3. Run an rpmbuild to get a list of dependencies for the update
    • rpmbuild --rebuild httpd-2.2.19-4.fc16.src.rpm
  4. Install the dependencies
    • yum install xmlto libselinux-devel apr-devel apr-util-devel pcre-devel openssl-devel -y
  5. Update to the latest package available for CentOS
    • rpm -Uvh httpd-2.2.19-4.fc16.src.rpm --force --nomd5
  6. Start configuring to create our own rpm (this is where the hard part begins)
    • cd /usr/src/redhat/SPECS/
    • edit httpd.spec to add
      • Version: 2.2.19 => 2.2.20
      • Release: 10%{?dist}.1 => 1%{?dist}.0
  7. Build the rpm, expect a boatload of errors to walk through:
    • rpmbuild -ba httpd.spec -> Fix error -> repeat
  8. Once successful, the newly built package is in /usr/src/redhat/RPMS/
  9. Install the update and restart the server
    • rpm -Uvh httpd httpd-tools --nodeps
    • service httpd restart
  10. Take the newly created rpm and upload it as an attachment to be used by the RightScript
  11. Create a RightScript that performs the update and restarts the server
    • rpm -Uvh $RS_ATTACH_DIR/httpd*.rpm
    • service httpd restart
  12. Run the RightScript as an “Any” or “Operational” script to update servers in the deployment.
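The spec file edit in step 6 is easy to get wrong by hand, so it can help to script it. The following is only a sketch: the function name is made up, and it assumes the spec file has exactly one Version: and one Release: tag and that the new values contain no “/” or “&” characters.

```shell
# Rewrite the Version: and Release: tags of an RPM spec file in place.
# Usage: bump_spec SPEC_FILE NEW_VERSION NEW_RELEASE
# Hypothetical helper; assumes the new values contain no "/" or "&".
bump_spec() (
    spec=$1
    version=$2
    release=$3
    tmp=$(mktemp) || exit 1
    # Write the edited copy to a temp file, then copy the content back so the
    # original file keeps its ownership and permissions.
    sed -e "s/^Version:.*/Version: $version/" \
        -e "s/^Release:.*/Release: $release/" "$spec" > "$tmp" &&
        cat "$tmp" > "$spec"
    status=$?
    rm -f "$tmp"
    exit "$status"
)
```

For the update described above, the call would be `bump_spec /usr/src/redhat/SPECS/httpd.spec 2.2.20 '1%{?dist}.0'`, followed by the rpmbuild run in step 7.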

While this process is for CentOS, Ubuntu requires similar heavy lifting to get things functioning. This process took one of our Professional Services engineers about 4 hours to complete (obviously the most time was spent on step #7). This type of process takes a lot of hackery to backport a version into an srpm. It is not trivial, but can be done.

So basically the answer to “How?” is “RightScript”. Even though it is non trivial to get that custom rpm or package debugged, once it is, then the deployment to systems is very quick and painless using the RightScale platform.
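For reference, the RightScript from steps 11 and 12 could be fleshed out roughly as follows. This is a sketch under a few assumptions: RS_ATTACH_DIR is the attachment directory used in step 11, while the DRY_RUN switch and the function names are additions of mine so the commands can be previewed before they touch a server.

```shell
# Install the attached httpd RPMs and restart Apache.
# RS_ATTACH_DIR comes from the RightScript environment (see step 11);
# DRY_RUN is a hypothetical switch added for this sketch.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"      # preview only
    else
        "$@"
    fi
}

apply_httpd_fix() {
    # Expand the attachment glob into this function's positional parameters.
    set -- "$RS_ATTACH_DIR"/httpd*.rpm
    # If nothing matched, the pattern stays literal and the file does not exist.
    if [ ! -e "$1" ]; then
        echo "no httpd RPM found in $RS_ATTACH_DIR" >&2
        return 1
    fi
    run rpm -Uvh "$@" && run service httpd restart
}
```

Calling `apply_httpd_fix` with DRY_RUN=1 prints the rpm and service commands instead of running them, which is a cheap sanity check before running the script against a whole deployment.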

A final note: once the Linux distribution actually issues the patch, you should transition from “fix” mode to your standard “patch” mode for overall consistency. Remember that very little if any testing is given to “fixes” that are released, whereas a certain level of regression testing is typical for vendor released patches (i.e., distro packages or Windows Updates).

Posted in EC2, Security

Security Patching in the RightScale Universe

Security vulnerabilities happen. It is just a fact, not only in technology but in life in general. When we are made aware of those vulnerabilities, we need to “fix” things or mitigate them to the best extent possible. In IT, that is typically synonymous with installing security patches or workarounds. I know that many of our customers have questions about how to best do patch management using the RightScale platform. This post is the first part in helping you accomplish that task and focuses on cases where a vendor patch is available. In my next blog post, I’ll talk about best practices for applying workarounds or fixes when there is no vendor patch available.

Within the RightScale platform, there are 3 primary options that can be used to automate the patching of instances:

  1. Unfreeze Security Repositories and enable automatic updates on systems, hope that the updates don’t break anything.
  2. Manually unfreeze Security Repositories for test systems and update. Perform regression testing, then update & refreeze Security Repositories for production systems and apply updates. Do this regularly (say monthly or weekly).
  3. Update each ServerTemplate with the latest Security Repository. Regression test each updated ServerTemplate. During a scheduled maintenance period, force all servers to be relaunched with updated ServerTemplates.

Of course, there’s also always the option to hide underneath a pile of coats and hope it all works out for the best. It goes without saying that while many people de-facto implement this last option, it is not a viable long-term strategy! :)

Let’s dive into each of the options a bit more and look at some pros and cons, so you are in a better position to pick the one (or combination) that works best for you.

  • Unfreeze with automatic updates: Since many (most?) of the core Linux distributions have functionality to allow selecting of security updates only, you freeze all channels, and then set the security repo to /latest via a RightScript. You then configure the system to install those updates on an interval you desire (daily seems to be a good choice). For example, on Debian based systems, such as Ubuntu, security patches are broken out into a separate repository. For a given release it is possible to only automatically install updates from http://security.ubuntu.com/ubuntu/ instead of http://us.archive.ubuntu.com/ubuntu/, making this very easy to implement. Just unfreeze that repository and updates will apply as they are released.
    With CentOS you can run “yum update --security” and only install security related patches. Using this method allows rapid access to the latest security updates, with almost no work required to enable this behavior as the unattended upgrade packages do all the work for you.
    The downside is that if a broken package is released, say into Ubuntu-security, it could affect production. As a side note, when it comes to security patches, the industry at large has pretty much accepted that the risk of leaving security vulnerabilities unpatched outweighs the potential risk of patching them automatically. For example, Debian, Ubuntu and Windows 2003-2008 all ship this way. For those who determine that the risk of automatic patching is too great, there is …
  • Unfreeze in test, test it, update production: This option is to apply the security patches to instances in a test deployment, then after regression testing, deploy them to production using a RightScript to update repositories and perform an update on production servers. This has the advantage of some level of regression testing prior to deploying security patches in production. The downside is that there is a high manpower cost to perform the functional testing on a regular basis. There is also the fact that you should test the specific items that the security fix supposedly touched which involves a bunch of research. This is a non-trivial effort. It would likely require a special test environment dedicated to security testing. From a purely dogmatic standpoint, this is the way it should be done, but the pragmatist in me knows that for many organizations, the additional cost associated with this is not justified by the increase in risk posed by just installing security updates. I’d rather have patched systems, than people not doing it because it was not the absolute best way to go about it.
  • Update ServerTemplates and relaunch: This may be the cleanest and seemingly easiest approach. There is relatively little change in current operations, as many of you use this method currently. This also ensures that all packages are tested before being deployed in production. The upside is that systems are cleanly built, and ServerTemplates are updated more often. The downside to this is that your patch level is only as good as your latest ServerTemplate update, and while it works for servers that can be frequently updated (app servers, web servers, etc.), it really doesn’t work well for services that are infrequently updated, or difficult to relaunch (databases, load balancers, etc.). Further, it forces you to relaunch servers you wouldn’t otherwise relaunch during maintenance windows.
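To give the first option some shape, here is a small sketch of what a RightScript might drop onto a server to apply security-only updates daily. It makes a few assumptions that are not from the platform itself: the function names and the cron path are invented, the Debian/Ubuntu side relies on the unattended-upgrades package being installed with its allowed origins limited to the security repository, and the CentOS side relies on yum’s security plugin and on the repository publishing security metadata.

```shell
# Print the command that installs only security updates for a distro family.
# Usage: security_update_cmd debian|redhat
# Hypothetical helper; the commands assume the packages noted in the lead-in.
security_update_cmd() {
    case "$1" in
        debian) echo "unattended-upgrade" ;;
        redhat) echo "yum -y update --security" ;;
        *)      echo "unsupported family: $1" >&2; return 1 ;;
    esac
}

# Emit a daily cron script for the given family on stdout.
daily_security_job() {
    job_cmd=$(security_update_cmd "$1") || return 1
    printf '#!/bin/sh\n' 
    printf '# Intended for /etc/cron.daily/security-updates (hypothetical path)\n'
    printf '%s\n' "$job_cmd"
}
```

For example, `daily_security_job redhat` prints a three-line script ending in the yum command; redirect it to the cron directory and mark it executable to enable it.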

So, you may be asking “You use RightScale to manage RightScale, so how do you do it?” Well, at RightScale, we have chosen a hybrid approach of #1 & #2. Our default patching policy is “Unfreeze with automatic updates”. As stated earlier, there is some inherent risk in this stance, but we feel that getting critical fixes in quickly outweighs the incremental risk of an automatic update breaking something; taking too long to get a patch deployed is the bigger danger. In instances where the risk of any patch (security or not) breaking a system is too high, we use the “Unfreeze in test, test it, update production” patch policy. Further, we design our platform with mitigating controls to restrict access to systems and services that may not get the latest patches on a daily basis. This policy/stance works for us, and we think it is a reasonable one for others to start with (if you didn’t already have a stance).

I would be remiss if I did not point out that there are likely a myriad of other ways that you can perform security patching, but these are the ones you get “out of the box” with the RightScale platform. The specific approach you choose will be driven by your business requirements. Remember that you have options, so use them to develop a process that works for you and your organization. My next blog will be on deploying workarounds and non packaged fixes. Until then, Happy security “patching!”

Posted in Cloud Computing, EC2, Security | 3 Comments

RightScale Launches 3 Millionth Server

Here at RightScale, we’ve just passed the 3 million server milestone.  Driven by our growing customer and free-user base, and their ever-increasing cloud usage, the 3M mark represents a benchmark in the industry, and is noteworthy in three different ways.

First, 3 million is impressive in the data center business.  Many well-known hosting companies house between 50,000 and 100,000 servers, and estimates for the world’s largest computer companies with large data centers range up to 1 million.  (See the DataCenterKnowledge report here.)  It’s difficult to compare our statistic with these installations, since many may be running largely under a pre-cloud operational model.  Nevertheless, launching 3 million is quite a number by any comparative metric, and there’s no question that it was achieved only with new levels of automation and dynamic configuration that are core to RightScale.

The second reason 3M is worth noting has to do with how fast we got there.   After our founding in 2007, it took us about 27 months to reach 1M, another 12 months to reach 2M, and then just 6 months to reach 3M.  That’s at least twice as fast for each subsequent 1M servers.  Likewise, one year ago in Sept. 2010, we had launched 1.5M servers – and we doubled in the last 12 months.

The third reason this milestone matters is that the servers our users launch have increased in power, and persist for a longer duration, as each month passes.  In fact, since January this year server runtime has increased on average 30%. So the trend is clear: companies are running “bigger iron” in the cloud — and keeping it running longer — than ever before.  Here is a graph of the size distribution we recorded this summer:

Certainly, the growth rate we’re tracking for the quantity, power and longevity of servers launched on RightScale remains quite healthy and mirrors the broad adoption of cloud services industry-wide. But equally important is the range of customers driving this growth, representing a wide variety of industries, use cases and services powered by RightScale on the cloud. For example, during the last year:

  • media giant Pearson converted a traditional educational software offering to a SaaS based model that allowed faster onboarding of new customers;
  • consumer goods company American Girl (a division of Mattel) launched their virtual world with a major advertising push behind it and sailed smoothly through the holiday season;
  • online game company Zynga launched new games that consistently broke records;
  • and companies like Amdocs and Trader Media spoke at our User Conference last June about new enterprise services launched on both public and hybrid clouds.

All of these RightScale customers contributed toward the 3M milestone, and we continue to be dazzled by the solutions they achieve using cloud infrastructure. We’re looking forward to the next million servers launched by our customers, and the amazing services they’ll power with them.

Posted in AWS, Cloud Computing, Cloud.com, Eucalyptus, Rackspace | 3 Comments

Microsoft .NET Stack Released

This is the final post in our release series…following both our dashboard and ServerTemplates releases.  Today we’ll talk about a unique solution we now support…let’s get started…

Drum roll please…Introducing the auto-scaling, high availability .NET Stack on Amazon!

Some of you might think, ‘well Finally!’.  As a product manager, I empathize with that sentiment. It was certainly tricky getting this to hum on the cloud. I’m very proud of our team’s work, especially when reviewing the challenges they overcame to get this out the door.  Let’s take a deeper look into some of them.

Database Manager for SQL Server

When I first proposed the SQL Server Database Manager last year to our development team with what ultimately would become our first SQL Server ServerTemplate, I met with mixed reactions.  EVERYONE thought it was a great idea and would be useful to users.  However, there were a lot of reservations too…doing something like what we have for MySQL with master/slave replication is no easy feat.  Adding in Microsoft complexity with PowerShell as well as unexpected Windows behavior in the cloud, the solution seemed out of grasp. Some of the notable questions we asked ourselves:

  • How does MS Licensing work as existing orgs transition to the cloud?  Is SQL Server Standard good enough or will users demand Enterprise?
  • How will we back up data on SQL Server?  Native backups guarantee “sane” backups without service interruptions but take a long time.
  • What are the setup best practices and how do we implement them in the cloud?  Multiple Data / Log Volumes, default monitoring and alerts, backup scheduling, etc…all need to be considered.
  • How do we set up replication with SQL Server?  There are so many supported options, what’s best for cloud?

We started at the beginning, pushing licensing off to the likes of the service providers (Amazon) and focusing on prototyping our implementation.  One of our developers found that backups using the Volume Shadow Copy Service (VSS) were a better option than SQL Server native backups.  The following is an excerpt from his report:

Now that we had a clear understanding of how to proceed with backups, we tried to figure out the best practice configuration, focusing mainly on the volume configuration and management.  We encountered issues attaching EBS volumes before Windows was ‘ready’, which resulted in out-of-order drive letter assignment.  We solved that, and moved on only to find that Powershell was running as a 32-bit process on the x64 environments…great.  Fixed that too.  Solving those two issues actually got us pretty far and enabled our first release of a Beta ServerTemplate that supported standalone backup/restore functionality.

Going through the Beta of our backup/restore ServerTemplate, we learned enough to facilitate building a High Availability SQL Server Solution (that’s what we published recently!).  We focused on a few things:

  • SQL Server Mirroring (Set up, Monitoring and Alerting)
  • Authentication and Data transfer Encryption
  • Failover to the mirror

One key challenge in setting up the mirroring session was waiting.  We had to have not only the mirror server in operational state but also a set of database full and differential backups from the principal to initialize the database on the mirror.  We utilized the RightScale ability to tag servers and locate servers by tag to help automate this whole process.
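To make the "waiting" concrete, here is a minimal Python sketch of polling for servers by tag until the mirror is operational and the principal has its backups ready. It is not the RightScale tagging API: `find_by_tag` is an in-memory stand-in, and the tag names are invented for this example.

```python
import time


def find_by_tag(servers, tag):
    """Return names of servers carrying the given tag (in-memory stand-in)."""
    return sorted(name for name, tags in servers.items() if tag in tags)


def wait_for_tags(servers, required, timeout=5.0, interval=0.1):
    """Poll until every required tag is on at least one server, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        found = {tag: find_by_tag(servers, tag) for tag in required}
        if all(found.values()):
            return found
        if time.monotonic() >= deadline:
            missing = [tag for tag, names in found.items() if not names]
            raise TimeoutError(f"still waiting for: {missing}")
        time.sleep(interval)


# Hypothetical tag names, invented for this example.
servers = {
    "sql-principal": {"db:role=principal", "db:backup=ready"},
    "sql-mirror": {"db:role=mirror", "db:state=operational"},
}
ready = wait_for_tags(servers, ["db:state=operational", "db:backup=ready"])
print(ready)
```

In a real deployment the lookup would query live tags on each pass rather than a static dictionary, so the loop would pick up a mirror that becomes operational partway through the wait.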

I recommend you try out our ServerTemplate to see just how much we managed to take off your shoulders for the database management.

IIS Application Server

Phew.  When you think about the complexity of the Database Manager, IIS seems like a walk in the park!  But a lot of work went into this template too.  I went over and chatted with the dev lead whose team built this template to get the inside scoop of the challenges they had to overcome.  Here’s his list:

  • Built with the use case of the scalable app tier in mind
  • Integrate a front-end load balancing solution
  • Figure out how to get the app on the server
  • Figure out how to tell app servers where the db is when they are ready
  • Oh, and of course, best practice configuration of IIS App Servers (this is Microsoft after all)

Doesn’t sound too complicated does it?  Luckily, with standardized images and use of RightScale tags to discover the Database server, it worked pretty well.  Also, based in large part on the design, we were able to get this ServerTemplate to work equally well on both Rackspace and Amazon.  That actually was a cool product win, even though it created more work for our testing team (details of which I went into in my last post).

I encourage you to take a look at this ServerTemplate too.

Together, these two ServerTemplates with either the HAProxy Load Balancer ServerTemplate or Elastic Load Balancing from Amazon make up our .NET Stack.  Take them for a spin, and as always, please send us your feedback.  Enjoy!

Posted in EC2, Microsoft, Rackspace, Releases

RightScale Release: MultiCloud ServerTemplates and RightImages…

This is Part Two of our release series.  A couple weeks ago, we announced a lot of goodies and while we’re going to talk about some new stuff today, be sure to come back next week for more about our latest Windows ServerTemplate offerings…

It’s been a R.A.C.E. to the finish line this sprint, but we have some exciting news!  RightScale’s current ServerTemplate release showcases the entire PHP 3-tier stack across the Rackspace Cloud, Amazon’s Elastic Compute Cloud, Cloud.com’s CloudStack and Eucalyptus Systems (hence R.A.C.E).  These new HAProxy Load Balancer, PHP App Server and Database Manager for MySQL 5.1 ServerTemplates are available now in the MultiCloud MarketPlace.

Underneath all these templates, we’re releasing a new CentOS 5.6 MultiCloud Image with RightLink 5.7 for all EC2 regions, Rackspace Cloud Servers, Cloud.com’s CloudStack and Eucalyptus Systems.  For a complete list of ServerTemplates and MultiCloud Images we released, check out our latest release notes.

People often talk about developing for the cloud and associated challenges.  But what about building platforms on which other people develop their apps in the cloud?  It is very challenging building for the “general use case”, but it is even more important when you build for generality that it works well.  A lot of time and effort goes into designing and building our ServerTemplates.  During development, and especially before release, we conduct extensive manual and automated testing cycles where we put our templates through the wringer.  Below is a sample test matrix that represents our checklist for one ServerTemplate.

Notice that we test this one ServerTemplate across 2 separate images in 7 clouds each.  Considering that we released 9 ServerTemplates last week, this makes for quite a few permutations.  Of course, we also find bugs and have to retest rapidly.

Luckily we have help with a home-grown automation tool that we call “Virtual Monkey.” The monkey uses the RightScale API to create deployments with servers to test, launches them, runs tests against them, collects the results, shuts everything down, and cleans up. In most cases the tests include entire 3-tier application deployments so we can test the interactions between the servers and ensure things work end-to-end. All in all we launch hundreds of servers a day in various clouds to test these ServerTemplates and make sure they work before we let them loose in the wild.
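The lifecycle described above (launch, test, collect results, shut down, clean up) can be sketched in a few lines of Python. This is not Virtual Monkey and does not call the RightScale API: `FakeCloud` and its methods are hypothetical in-memory stand-ins, and the two tests are invented. The sketch only illustrates the point that cleanup has to run even when a test fails.

```python
class FakeCloud:
    """In-memory stand-in for a cloud API; every method here is hypothetical."""

    def __init__(self):
        self.running = set()

    def launch(self, name):
        self.running.add(name)

    def terminate(self, name):
        self.running.discard(name)


def run_deployment_test(cloud, servers, tests):
    """Launch servers, run each test, and always clean up afterwards."""
    results = {}
    launched = []
    try:
        for name in servers:
            cloud.launch(name)
            launched.append(name)
        for test in tests:
            try:
                test(cloud)
                results[test.__name__] = "pass"
            except AssertionError as exc:
                results[test.__name__] = f"fail: {exc}"
    finally:
        # Shut everything down even if a launch or a test blew up.
        for name in launched:
            cloud.terminate(name)
    return results


def all_tiers_up(cloud):
    assert {"lb", "app", "db"} <= cloud.running, "a tier is missing"


def cache_tier_up(cloud):
    assert "cache" in cloud.running, "no cache server"


cloud = FakeCloud()
results = run_deployment_test(cloud, ["lb", "app", "db"], [all_tiers_up, cache_tier_up])
print(results)
```

A failing test is recorded rather than raised, so the remaining tests still run and nothing is left running afterwards.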

Not all clouds are created equal…

If you’ve done anything on multiple clouds, you’ll appreciate the behind-the-scenes presented above.  When launching hundreds of servers and developing infrastructure-as-a-service agnostic solutions, small nuances can become big blockers to the functionality end users expect from the solution.  From security groups on Amazon versus iptables management on Rackspace to volume snapshot API response differences in Eucalyptus and CloudStack, the ServerTemplate needs to be aware of these differences and abstract them away.  Do our customers really care about these differences?  Of course not.  They just expect one solution to work on one cloud type just as well as it works on another cloud type.  You may have noticed that we utilized Chef for many of our recently released ServerTemplates.  Chef allows us to abstract the business logic away from the details of the cloud, making it easier to propagate the same solution to as many clouds as we can get our hands on.

Wait, each cloud has different profiles for instance types!

But Chef isn’t the only innovation that we use.  We also have to support many ‘by design’ cloud architecture differences. For example, pre-defined instance types.  Instance types in Amazon, while being the same across all regions within Amazon, do not align well to flavors in Rackspace.  Plus, in private clouds, you can either use the default instance types or custom configure to what your application demands.  How then does a specific ServerTemplate correctly configure a new instance for optimal performance?  Seems like that’s (yet another) real prerequisite for proper application setup.

In the specific case of MySQL, our ServerTemplate will auto-tune configuration parameters including innodb_additional_mem_pool_size and table_cache.  The tuning is based off of an instance’s available memory.  Of course this can be overridden on a per-server basis.  This mechanism extends well to our PHP App and Load Balancer ServerTemplates where you override parameter defaults specified in apache2.conf and haproxy_http.
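As a rough illustration of memory-based tuning with per-server overrides, consider the Python sketch below. The two parameter names come from the paragraph above; the memory tiers and the values are invented for the example and are not the rules our ServerTemplate actually applies.

```python
def tune_mysql(memory_mb, overrides=None):
    """Pick MySQL settings from available memory; per-server overrides win.

    The tiers and values are illustrative assumptions, not RightScale's rules.
    """
    if memory_mb < 2048:
        settings = {"innodb_additional_mem_pool_size": "8M", "table_cache": 256}
    elif memory_mb < 8192:
        settings = {"innodb_additional_mem_pool_size": "24M", "table_cache": 512}
    else:
        settings = {"innodb_additional_mem_pool_size": "50M", "table_cache": 1024}
    settings.update(overrides or {})
    return settings


small = tune_mysql(1700)
large = tune_mysql(15360, overrides={"table_cache": 2048})
print(small)
print(large)
```

Applying the overrides last is what lets a single server deviate from the computed defaults without changing the template itself.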

All of this is just the tip of the iceberg.  Check out the ServerTemplates for yourself and let us know what you think.

Posted in Chef, Cloud.com, EC2, Eucalyptus, OpenStack, Rackspace, Releases | 1 Comment

RightScale Release: New MultiCloud API, New Add Server Assistant, and Community Translations

This is the first of three posts you’ll be seeing regarding everything we’re releasing over the next couple of weeks. First, we had a new dashboard release tonight, which I’ll tell you about below. Next week, we’re releasing some sophisticated multi-cloud ServerTemplates that work across a few clouds. Finally, we’ll wrap up with how to create an auto-scaling Windows IIS/.NET application on top of our new mirrored Database Manager for SQL Server. Let’s get started…

Add Server and Add Server Array Assistants

First, we’ve created new assistants to simplify the process of creating a server or server array. It’s a big change, requested by our customers, designed to make your life easier in the day-to-day usage of the Dashboard. We worked with a few customers over the last few months to refine this flow, so we know it will be a welcome change. The previous process was getting a little disjointed after a few years of rapid cloud innovation! Learn more about the new assistants shown below.

 

Community Translations

Next, in May of this year, we launched our first language translation for the RightScale Dashboard: Japanese. Back then, we mentioned that we accomplished this through a platform we planned to make available to the community. That time is here. You will now notice a “Help Us Translate” link in the footer:

So how can you help us translate (and why would you)? First, the how. When you click on this link, you’ll be taken to a tool that will allow you to see all the phrases that need to be translated, translate these phrases, and vote on translations that others might have submitted. You can then link back to the Dashboard to see the translations in real-time! To get started, click on the link, read the instructions, and choose your language:

Now, why would you help us? Well, many of you have already offered out of the goodness of your heart, and we appreciate that. If you are someone that needs an incentive, we appreciate that too. That’s why we’re going to give the top translator for each of the following languages an Amazon Kindle: German, Chinese Simplified, Japanese, French, Spanish, and Korean. For how to get started, and for more information on this “Translation Showdown,” read the RightScale Dashboard Translation Guide.

New MultiCloud API

We’ve been incubating a new API with a few of our largest customers for over half a year now. This API is a complete redesign, and takes into account everything we have learned over the years on how to manage multiple clouds behind a single “pane of glass.” Or in this case, a single set of XML/JSON instructions.

We are making it available today as a public beta, supporting Cloud.com, Eucalyptus, and Rackspace. Not everything that is in API 1.0 is available in this new API yet, but it is burning a hole in our pocket, and will be extremely useful to our customers who want to begin automating their multi-cloud deployments. As we equalize the feature set between this new API and the 1.0 EC2 API, we will move AWS EC2 support into this new API and retire API 1.0.

One new feature here (available on all clouds) is the ability to provision and manage users via the API. You can now list all users in an account, add users, and set their permissions. Coming up in the October release, Enterprise plan customers will also be able to provision new accounts.

Learn more about the new API and how to get started with it.

Release Notes

As always, please read the Release Notes for a detailed list of changes made to the Dashboard and API.  Look out for the ServerTemplate & MultiCloud Image release next week – we have some great solutions coming up for both Linux and Windows cloud administrators.

Enjoy!

Posted in Releases | 2 Comments

Performing Security Testing in the Cloud

[This is Phil Cox's first blog post since he joined us as Director of Security and Compliance. We hope to have more from him to post in the near future! -Thorsten]

Security testing is one aspect of a security program that is often overlooked. Organizations that take security seriously understand that testing systems and applications is just smart business. We felt that one way we could help our customers is to describe the process, and nuances, that we go through during our testing. Since RightScale runs in the cloud, the information should help any RightScale customer accomplish the same tasks in their own environment.

Our process is basically broken down into the following steps:

  1. Identify instances and applications that will be tested
  2. Select tools and systems that will be used to perform the testing
  3. Coordinate with the cloud service provider to get authorization for testing
  4. Execute the test
  5. Communicate the results

Below I have outlined some of the practical details of each of these steps.

Identify Targets

Before we start testing, we identify what we want to test. For this particular test, we decided that we would include all of the systems that make up our platform, as well as the main dashboard application. Since we use RightScale to manage RightScale, and one of the main functions of our service is using ServerTemplates™ and RightScripts™ to ensure that systems are deployed consistently, there was a temptation to select a representative sample.

Since this was my first time testing RightScale since becoming the Director of Security and Compliance, we decided to test them all. We figured it was good practice, and it provided a “validation” of sorts that we were following the practices we champion. We did, however, decide to limit the testing to publicly addressable AWS IP addresses. (Note: Anyone trying to be PCI compliant in AWS will likely need to test private IPs as well.)

As for the application, we decided on the entire dashboard, and not just a portion (mostly because I wanted a good overview to have as a baseline).

Select Testing Tools

Along with determining which systems/instances and applications we were testing, we selected tools that would help us automate the testing. We had agreed that a primarily automated vulnerability test (with manual validation) was acceptable, but that the application scanning would require a more manual approach given the complexity of our application. To that end, we had the following basic selection criteria:

  • Vulnerability scanner: Number one criterion was its ability to appropriately identify vulnerabilities. We did not want a lot of false positives, but felt that false negatives would be much worse. A second criterion for the vulnerability scanner was the flexibility of its reporting mechanism.
  • Application testing: Number one criterion was our ability to use it, not what others think of it. A second criterion for the application testing tool was its ability to test against the framework of our application.

Given those “requirements” we chose three vulnerability scanners that we wanted to evaluate, in hopes of selecting one as the foundation for our ongoing testing program. Those were SAINT, NeXpose, and OpenVAS. Many will point out that there are other tools out there, and I agree, but these were tools I personally have history with, and one is free. We had to start somewhere.

As far as the application testing goes, I have used Burp Pro for a number of years and am a fan of it, so we selected it as our application testing tool of choice. It should be noted that a number of other tools have recently come out that may rival Burp Pro in functionality, but familiarity of use was important. We wanted to test the application, not the tool.

Where to Run Them?

Once we determined the tools that we wanted to use, we had to figure out where we wanted to run them:

  • SaaS
  • Instance in the same cloud
  • Instance in a different cloud
  • Traditional hosting environment
  • Physical system on our network

We chose the “Instance in the same cloud” for a couple of reasons:

  • Flexibility: We were able to install multiple tools to evaluate and test
  • Eating our own dog food: RightScale is all about configuring and managing systems, so what better way for us to help our customers be able to deploy scanning systems than to do it ourselves
  • Bandwidth cost: By using an instance within the same availability zones on AWS, bandwidth was not an issue
  • Access to internal IPs: By running in the same cloud (AWS region) we can test internal IP addresses

Once we decided to build our own, we downloaded a trial version of SAINT, the community version of NeXpose, and followed the Ubuntu installation directions for OpenVAS. Then we wrote some RightScripts to automate the majority of the install and we were “cooking with gas” so to speak.

Get Authorization from Cloud Provider

Once we had identified all the instances we were going to test, and had our testing sources (one in our case), we needed to get authorization from AWS to perform the testing, per the AWS usage agreement.

AWS provides a form that we filled out to request penetration testing of instances. We had to supply the AWS instance IDs and IPs that we obtained earlier, as well as the source of the testing. AWS uses this to create a ticket for the AWS security team, who subsequently whitelist the account so that the IDS systems do not trigger alerts during the testing. This prevents nasty emails about policy violations as well as port blocking, which would affect the test results.

AWS security responded within a couple of days with approval for the scanning. It is interesting to note that the authorization appears to apply mainly to vulnerability scanning. For all intents and purposes you should make this request for application-based scanning as well, but it has been my experience that testing the application does not cause abuse reports to be generated within AWS. While testing, launching, and relaunching the scanner, we did accidentally perform a number of scans from an IP address other than the one we provided to AWS, and we received two abuse notices.

Probably the biggest point to note with respect to testing instances running in AWS is that the instance size must be medium or greater. AWS policy does not allow pen testing, including port/service scanning, of smalls or below, presumably because they want to keep the testing from degrading the other VMs on the same host. Note that we were only testing in AWS; what you need to provide about your test targets will vary depending on your cloud service provider. For AWS, we provided the instance ID and the public IP of each instance to be tested, as well as the source of the testing.

For AWS, the quickest way to get the list of all AWS instance IDs and associated IPs is to use the rest_connection API. It can be used to programmatically generate a list of the instances and associated IP addresses that will be the targets of testing. We ignored the security groups in this test and hit all the “well known ports” that the tools scan. An alternative would be to only test the accessible ports.
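Once the instance list has been pulled, turning it into the target list for the authorization request is a small filtering job. The Python sketch below does not use rest_connection: it assumes the records have already been fetched into plain dictionaries, and the field names, instance IDs, IP addresses (taken from a documentation-only range), and the set of "too small" sizes are all assumptions made for illustration.

```python
# Sizes assumed to fall under "small or below"; adjust to your provider's policy.
TOO_SMALL = {"t1.micro", "m1.small"}


def build_targets(instances):
    """Split instance records into testable targets and skipped instances."""
    targets, skipped = [], []
    for inst in instances:
        if not inst.get("public_ip"):
            skipped.append((inst["id"], "no public IP"))
        elif inst["type"] in TOO_SMALL:
            skipped.append((inst["id"], "instance too small to test"))
        else:
            targets.append((inst["id"], inst["public_ip"]))
    return targets, skipped


# Made-up records standing in for whatever your API client returns.
instances = [
    {"id": "i-0000001", "type": "m1.large", "public_ip": "203.0.113.10"},
    {"id": "i-0000002", "type": "m1.small", "public_ip": "203.0.113.11"},
    {"id": "i-0000003", "type": "c1.medium", "public_ip": None},
]
targets, skipped = build_targets(instances)
for instance_id, ip in targets:
    print(f"{instance_id},{ip}")
```

Keeping the skipped list alongside the targets makes it easy to see which instances fell outside the test and why.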

Execute the Test

Once we obtained the authorization for the testing, we coordinated with the ops team to make sure they were ready for any potential problems. Once we got their “we are a go” signal, we commenced the testing. The general methodology looked something like this:

  1. A sequential vulnerability scan, using each of the scanners. For both SAINT and NeXpose, we utilized the “exploit” portion of the tools (when it existed) on any noted vulnerability. (Note that we performed multiple scans with each scanner over the course of our 3 weeks of testing.)
  2. General walk through and Burp Pro “passive” testing of the entire dashboard. The goal was to get an overall feel for the testing tool with the dashboard, basically doing a full manual spider of the site.
  3. Next we specifically performed testing of our session state mechanism, looking for entropy, manipulation, and injection flaws.
  4. We then stepped through each of the dashboard’s main function areas, “Reports,” “Manage,” “Design,” “Clouds” and “Settings,” looking for well-known attack vectors. In particular, we focused on identifying Cross-Site Scripting and Cross-Site Request Forgery (XSS and CSRF), injection, parameter manipulation, and other common web app exposures. See the OWASP testing guide for a good discussion of things that should be tested for in web applications.

Note that all testing we performed was done in both an authenticated state as well as an unauthenticated state.
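For the session state testing in step 3, a quick first-pass look at entropy can be scripted. The Python sketch below sums the Shannon entropy at each character position across a sample of tokens. This is a rough estimate that ignores correlations between positions, and it is not the procedure or tooling we used. The tokens are generated locally as stand-ins for captured session IDs, and the function assumes all tokens in a sample have the same length.

```python
import math
import secrets
from collections import Counter


def positional_entropy_bits(tokens):
    """Sum per-position Shannon entropy as a rough estimate of bits per token."""
    total_bits = 0.0
    for column in zip(*tokens):  # one tuple of characters per position
        counts = Counter(column)
        n = len(column)
        total_bits -= sum((c / n) * math.log2(c / n) for c in counts.values())
    return total_bits


# Stand-ins for captured session IDs: one random sample, one predictable sample.
random_tokens = [secrets.token_hex(16) for _ in range(500)]
weak_tokens = [f"session{n:04d}" for n in range(500)]

random_bits = positional_entropy_bits(random_tokens)
weak_bits = positional_entropy_bits(weak_tokens)
print(round(random_bits, 1), round(weak_bits, 1))
```

The sequential tokens score only about 9 bits (log2 of the 500 values sampled), while the random ones score well over 100. A low score is a reason to dig deeper; a high score alone does not prove the tokens are unpredictable.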

As stated earlier, we made the decision that the vulnerability scanning portion of our testing would be mostly automated, and the application testing mostly manual. It took us approximately 3 weeks to identify the systems, get the authorization, and perform the testing. About 2 weeks of that was dedicated to the manual app testing.

A Bit More on the Application

It could be argued that the bulk of “cloud” security testing should revolve around the application. This is not to say that keeping supporting services like Apache and MySQL patched is unimportant (it is important, just ask Sony), but much of the exposure to your data will come through the application. Taking the time to assess the mechanisms protecting the application is critical. For example:

  • Are the security groups appropriate?
  • Do you have appropriate controls on who can access API calls or make security related changes via the UI?
  • Does your authorization mechanism enforce appropriate controls via all interfaces?

Items like these will be critical for long-term protection of information. Make sure that you include them in your testing regimen.

Communicate Results

We are an Agile shop, so frequent communication is part of our culture, and we leveraged that to provide feedback from the testing to the appropriate engineering or ops teams as we uncovered potential threats. This allowed us to create records of our testing results and provided timely information to be fed into our sprint process. At the completion of the testing, we wrote a summary report and included details of the vulnerabilities from each of the tools as appendices. Even though the information had already been fed to the appropriate groups, including the details along with the final report allowed stakeholders to review the overall testing methodology and findings, as well as dig down into the details of any vulnerabilities found.

Your process may vary, and you may have a much more formal reporting requirement. The most important part is to get the appropriate information to the people who can get the system services or applications fixed in a timely manner.

Summary

The process of identifying targets, maintaining testing tools, coordinating with cloud service providers, and communicating those results should be formalized within your organization. Security testing should become an integral part of the IT culture. There will always be issues, as nothing is absolutely secure, but trying to stay ahead of the curve is a worthy cause. With a formal process, you can make it a regular occurrence, thus enhancing your security program and likely meeting many practical as well as compliance requirements.

One side note about the testing is that for all practical purposes, it was exactly the same methodology and tools that I have used previously in non-cloud environments. So I encourage you to roll up your sleeves and implement a testing program for your infrastructure and applications.

Posted in AWS, Cloud Computing, EC2